CN117197878B - Character facial expression capturing method and system based on machine learning - Google Patents

Character facial expression capturing method and system based on machine learning

Info

Publication number
CN117197878B
CN117197878B (application number CN202311465863.1A)
Authority
CN
China
Prior art keywords
face
feature map
training
semantic
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311465863.1A
Other languages
Chinese (zh)
Other versions
CN117197878A (en)
Inventor
郭勇 (Guo Yong)
苑朋飞 (Yuan Pengfei)
靳世凯 (Jin Shikai)
周浩 (Zhou Hao)
尚泽 (Shang Ze)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Nian Nian Beijing Technology Co ltd
Original Assignee
Zhongying Nian Nian Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Nian Nian Beijing Technology Co ltd filed Critical Zhongying Nian Nian Beijing Technology Co ltd
Priority to CN202311465863.1A
Publication of CN117197878A
Application granted
Publication of CN117197878B
Active legal status
Anticipated expiration


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a method and a system for capturing the facial expressions of a person based on machine learning, and relates to the technical field of facial expression capturing. First, a face image is acquired through a camera; then, feature extraction is performed on the face image through a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map; next, the face shallow feature map, the face middle layer feature map and the face deep feature map are fused to obtain a multi-scale semantic fusion face feature; finally, a character facial expression driving instruction for a digital person is generated based on the multi-scale semantic fusion face feature. In this way, the digital character's face may be driven with the user's facial expression data to enhance the immersion and interactivity of the virtual reality experience.

Description

Character facial expression capturing method and system based on machine learning
Technical Field
The present application relates to the field of facial expression capturing technologies, and more particularly, to a method and a system for capturing facial expressions of a person based on machine learning.
Background
Virtual Reality (VR) is a computer-generated simulated environment in which a user can personally feel and interact. Virtual reality man-machine interaction refers to interaction between a person and a computer through a virtual reality technology, and aims to provide an immersive, natural and visual interaction experience, so that a user can interact and operate with a virtual environment in a more natural manner. Traditional man-machine interaction modes, such as a keyboard, a mouse, a touch screen and the like, are not intuitive and natural enough for a virtual reality environment. Therefore, virtual reality human-machine interaction aims at developing a more intelligent and adaptive interaction mode, so that a user can interact with a virtual environment through own physical actions, voice, gestures, facial expressions and the like.
Character facial expression capture plays an important role in virtual reality human-machine interaction. By capturing the facial expression of the user, the virtual reality system can track the facial movement of the user in real time, reflect the emotional state of the user, and apply it to the facial model of the digital character, so that the virtual character can better establish emotional resonance with the user. In this way, the user can interact with the virtual character through natural facial expressions, such as smiling, blinking, frowning, etc. Such face-based interaction increases the freedom and realism of the virtual reality experience and improves the user's immersion in the virtual environment.
However, conventional facial expression capture schemes typically require the use of specific hardware devices, such as depth cameras, infrared sensors, and the like. These devices are costly and require additional installation and configuration, limiting their popularity and use among the public users. Furthermore, conventional schemes typically capture and recognize based on predefined facial expression models, limiting the ability of the user to personalize and diversify expressions, which may make it impossible for the user to freely express irregular or personalized facial expressions, resulting in limited feedback of the expression of the virtual character.
Accordingly, a machine learning based facial expression capture scheme for a person is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides a character facial expression capturing method and system based on machine learning. It can drive digital character faces with user facial expression data to enhance the immersive and interactive aspects of the virtual reality experience.
According to one aspect of the present application, there is provided a machine learning-based character facial expression capturing method, including:
acquiring a face image through a camera;
extracting features of the face image through a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map;
fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain multi-scale semantic fusion face features;
and generating a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face features.
According to another aspect of the present application, there is provided a machine learning based facial expression capture system for a person, comprising:
the face image acquisition module is used for acquiring face images through the camera;
the feature extraction module is used for extracting features of the face image through a face feature extractor based on a deep neural network model so as to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map;
the feature fusion module is used for fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain multi-scale semantic fusion face features;
and the instruction generation module is used for generating a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face characteristics.
Compared with the prior art, in the machine learning based character facial expression capturing method and system provided by the application, a face image is first acquired through a camera; feature extraction is then performed on the face image through a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map; the three feature maps are then fused to obtain a multi-scale semantic fusion face feature; and finally, a character facial expression driving instruction for the digital person is generated based on the multi-scale semantic fusion face feature. In this way, the digital character's face may be driven with the user's facial expression data to enhance the immersion and interactivity of the virtual reality experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below, which are not intended to be drawn to scale in terms of actual dimensions, with emphasis on illustrating the gist of the present application.
Fig. 1 is a flowchart of a machine learning based facial expression capturing method of a person according to an embodiment of the present application.
Fig. 2 is a schematic architecture diagram of a machine learning-based facial expression capturing method of a person according to an embodiment of the present application.
Fig. 3 is a flowchart of substep S130 of a machine learning based facial expression capture method of a person according to an embodiment of the present application.
Fig. 4 is a flowchart of substep S140 of the machine learning based facial expression capturing method of a person according to an embodiment of the present application.
Fig. 5 is a block diagram of a machine learning based character facial expression capture system according to an embodiment of the present application.
Fig. 6 is an application scenario diagram of a machine learning based facial expression capturing method of a person according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, are also within the scope of the present application.
As used in this application and in the claims, the terms "a," "an," and "the" are not specific to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Aiming at the above technical problems, the technical concept of the application is to collect face images of the user's face through a camera, introduce image processing and analysis algorithms at the back end to perform feature analysis of the face images, track the user's facial movements in real time, such as smiles, blinks and frowns, and apply them to the facial model of the digital character, so that the face of the digital character is driven by the user's facial expression data and the immersion and interactivity of the virtual reality experience are enhanced.
Fig. 1 is a flowchart of a machine learning based facial expression capturing method of a person according to an embodiment of the present application. Fig. 2 is a schematic architecture diagram of a machine learning-based facial expression capturing method of a person according to an embodiment of the present application. As shown in fig. 1 and 2, a machine learning-based facial expression capturing method for a person according to an embodiment of the present application includes the steps of: s110, acquiring a face image through a camera; s120, extracting features of the face image through a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map; s130, fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain multi-scale semantic fusion face features; and S140, generating a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face features.
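For illustration only, the flow of steps S110 to S140 can be summarized as the following minimal sketch. It assumes a PyTorch implementation, and the component names (extractor, fuse_deep_mid, fuse_mid_shallow, classifier) are hypothetical placeholders that correspond to the sketches given later in this description; none of these identifiers come from the patent itself.

```python
# Illustrative wiring of steps S110-S140; all identifiers are assumptions.
import cv2
import torch

def capture_expression(frame_bgr, extractor, fuse_deep_mid, fuse_mid_shallow, classifier):
    # S110: face image acquired by a camera (here: a BGR frame from OpenCV)
    img = cv2.resize(frame_bgr, (224, 224))
    x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    with torch.no_grad():
        # S120: shallow / middle / deep feature maps from the pyramid-based extractor
        shallow, middle, deep = extractor(x)

        # S130: two-stage semantic joint propagation (deep -> middle, then middle -> shallow)
        fused_middle = fuse_deep_mid(middle, deep)
        fused_shallow = fuse_mid_shallow(shallow, fused_middle)

        # S140: classify the fused features into an expression label
        logits = classifier(fused_shallow)
    label = int(logits.argmax(dim=1))
    return label  # later mapped to a driving instruction for the digital character
```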
Specifically, in the technical scheme of the application, a face image acquired by a camera is first obtained. Then, a convolutional neural network model, which performs well at implicit feature extraction from images, is used for feature mining of the face image. In particular, when performing face feature recognition and label detection, attention must be paid not only to the deep semantic features of the expression but also to feature information such as the edges, contours and textures of the face; these features play an important role in recognizing the facial expression and in driving the face of the digital character. Moreover, since the pyramid network is mainly designed to solve the multi-scale problem in target detection, semantic information of low-level and high-level features can be utilized simultaneously, and a good representation is obtained by fusing the features of different levels.
Based on the above, in the technical scheme of the application, the face image is passed through a face feature extractor based on a pyramid network to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map. By extracting the feature images of different layers in the face image, the image information can be expressed and analyzed in multiple layers. Therefore, the face image of the face of the user can be better understood, and an important basis is provided for the subsequent facial expression feature capture and the driving of the facial expression of the character of the digital person. In particular, the facial shallow feature map typically contains some low-level local features, such as edges, textures, and the like. These features have a descriptive effect on the overall shape and contour of the face. The facial middle layer feature map is more abstract, and can capture the position and shape information of higher facial features such as eyes, nose, mouth and the like. These features are important for face recognition and expression analysis. The deep facial feature map is more abstract and advanced, and can represent more complex facial features such as facial expressions, emotional states, and the like. The features can provide richer facial information for achieving the tasks of facial expression capturing, emotion recognition and the like, so that the character facial expression driving of the digital person can be accurately carried out later.
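The patent does not fix a particular backbone for the pyramid-network-based face feature extractor. The following sketch merely assumes a small three-stage convolutional network whose stage outputs play the roles of the face shallow, middle layer and deep feature maps; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PyramidFaceExtractor(nn.Module):
    """Sketch of a three-stage convolutional backbone exposing shallow, middle
    and deep feature maps, in the spirit of a feature pyramid. The exact
    backbone is not specified in the patent; this is an assumption."""
    def __init__(self):
        super().__init__()
        def stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.stage1 = stage(3, 64)     # shallow: edges, textures
        self.stage2 = stage(64, 128)   # middle: eyes, nose, mouth layout
        self.stage3 = stage(128, 256)  # deep: expression-level semantics

    def forward(self, x):
        shallow = self.stage1(x)        # 64  x H/2 x W/2
        middle = self.stage2(shallow)   # 128 x H/4 x W/4
        deep = self.stage3(middle)      # 256 x H/8 x W/8
        return shallow, middle, deep
```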
Further, a first semantic joint propagation module is used to fuse the face middle layer feature map and the face deep feature map to obtain a semantic fusion face middle layer feature map. In particular, the first semantic joint propagation module can learn and propagate the deep abstract semantic feature information of the face image and fuse it into the face middle layer feature map, so that the fused semantic fusion face middle layer feature map carries richer middle-layer semantic feature information about the user's face, thereby improving the accuracy with which the character facial expression of the digital person is driven.
It should be appreciated that the shallow feature map of a face typically contains low-level image features such as the edges and textures of the user's face, and these features are important for describing the details and local information of the face image. Therefore, in the technical scheme of the application, a second semantic joint propagation module is used to fuse the semantic fusion face middle layer feature map and the face shallow feature map to obtain the multi-scale semantic fusion face shallow feature map. It should be noted that the second semantic joint propagation module can learn and propagate the feature information in the semantic fusion face middle layer feature map and fuse it into the face shallow feature map, so that the fused multi-scale semantic fusion face shallow feature map has a more sufficient and accurate feature expression capability for the user's face. This allows important features of the face image to be captured better across different scales and helps improve the accuracy of driving the facial expression of the digital character.
Accordingly, in step S120, the deep neural network model is a pyramid network. It is worth mentioning that a pyramid network is a deep neural network model that processes input data and extracts features from it at different scales. A pyramid network is typically composed of multiple sub-networks, each responsible for processing input data at a different scale; the sub-networks may share parameters to increase the efficiency of the network, and each has its own feature extractor that extracts features at a different level. In the facial expression capturing method of the present application, the pyramid network is used to extract shallow, middle and deep features of the face image. Shallow features correspond to low-level features such as edges and textures; middle features correspond to higher-level features such as facial contours and expression cues; and deep features correspond to more abstract and semantic features such as representations of different facial expressions. By fusing the feature maps of different levels, the pyramid network can obtain multi-scale semantic fusion face features that contain information from different levels and can represent the facial expression more accurately. Based on these multi-scale semantic fusion face features, a character facial expression driving instruction for the digital person can be generated and used to control the facial expression of the digital person.
Accordingly, as shown in fig. 3, fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain a multi-scale semantic fusion face feature, including: s131, fusing the face middle layer feature map and the face deep feature map by using a first semantic joint propagation module to obtain a semantic fused face middle layer feature map; and S132, fusing the semantic fusion face middle layer feature map and the face shallow feature map by using a second semantic joint propagation module to obtain a multi-scale semantic fusion face shallow feature map as the multi-scale semantic fusion face feature. It should be understood that, in the facial expression capturing method of a person, the process of fusing the facial shallow feature map, the facial middle feature map, and the facial deep feature map involves two steps: s131 and S132. In step S131, the face middle layer feature map and the face deep layer feature map are fused through a first semantic joint propagation module, and the purpose of the module is to perform semantic fusion on features of different layers so as to obtain a face middle layer feature map with more characterization force and semantic information. In step S132, the semantic fusion face middle-layer feature map and the face shallow feature map are fused through the second semantic joint propagation module, and the purpose of the module is to combine the semantic fusion middle-layer feature map and the shallow feature map to obtain a multi-scale semantic fusion face shallow feature map, where the multi-scale semantic fusion face shallow feature map contains feature information from different layers, so that features and expressions of a face can be more comprehensively represented. Through the fusion operation of the two steps, the feature images of different layers can be combined to form the multi-scale semantic fusion face feature. The feature representation can better capture the expression and semantic information of the face, and provides more accurate and rich input for the generation of the facial expression driving instruction of the subsequent digital human character.
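The internal structure of the semantic joint propagation modules is not detailed in the patent. A common way to realize such a fusion step is to align the coarser, more semantic map to the finer map with a 1x1 convolution and bilinear upsampling and then fuse the two by concatenation and convolution; the sketch below illustrates this assumed design for both S131 and S132, with channel sizes following the backbone sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticJointPropagation(nn.Module):
    """Sketch of one fusion step (S131 or S132): propagate semantics from a
    coarser map into a finer map. The internal design (1x1 channel alignment,
    bilinear upsampling, concatenation) is an assumption, not taken from the patent."""
    def __init__(self, fine_ch, coarse_ch):
        super().__init__()
        self.align = nn.Conv2d(coarse_ch, fine_ch, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(fine_ch * 2, fine_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(fine_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, fine, coarse):
        # Bring the coarse (more semantic) map to the fine map's channels and size
        coarse = self.align(coarse)
        coarse = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([fine, coarse], dim=1))

# S131: deep -> middle, then S132: (fused) middle -> shallow
fuse_deep_mid = SemanticJointPropagation(fine_ch=128, coarse_ch=256)
fuse_mid_shallow = SemanticJointPropagation(fine_ch=64, coarse_ch=128)
```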
And then, the multi-scale semantic fusion facial shallow feature map passes through a classifier to obtain a classification result, wherein the classification result is used for representing facial expression labels. Specifically, in the technical solution of the present application, the classification label of the classifier is a facial expression label, so after the classification result is obtained, the facial movement of the user, such as smiling, blinking, frowning, etc., can be tracked in real time based on the classification result, and a role facial expression driving instruction of the digital person is generated, so as to enhance the immersion and interactivity of the virtual reality experience.
Accordingly, as shown in fig. 4, based on the multi-scale semantic fusion face feature, a role facial expression driving instruction of the digital person is generated, which includes: s141, enabling the multi-scale semantic fusion face shallow feature map to pass through a classifier to obtain a classification result, wherein the classification result is used for representing a face expression label; and S142, generating a character facial expression driving instruction of the digital person based on the classification result.
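The patent leaves open how the classification result is turned into a concrete driving instruction. One hypothetical realization maps each predicted expression label to blendshape weights of the digital character's face rig, as in the illustrative snippet below; the label names and weight values are assumptions and not taken from the patent.

```python
# Hypothetical mapping from a predicted expression label to a driving
# instruction for the digital character's face rig (e.g. blendshape weights).
EXPRESSION_TO_BLENDSHAPES = {
    "smile":   {"mouthSmileLeft": 0.8, "mouthSmileRight": 0.8, "cheekSquint": 0.3},
    "blink":   {"eyeBlinkLeft": 1.0, "eyeBlinkRight": 1.0},
    "frown":   {"browDownLeft": 0.7, "browDownRight": 0.7, "mouthFrown": 0.5},
    "neutral": {},
}

def make_driving_instruction(label: str) -> dict:
    """Wrap the blendshape weights in a simple instruction for the renderer."""
    return {"type": "set_expression",
            "blendshapes": EXPRESSION_TO_BLENDSHAPES.get(label, {})}
```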
More specifically, in step S141, passing the multi-scale semantic fusion face shallow feature map through a classifier to obtain a classification result, wherein the classification result is used to represent a facial expression label, includes: unfolding the multi-scale semantic fusion face shallow feature map into a classification feature vector as a row vector or a column vector; performing fully connected encoding on the classification feature vector using the fully connected layers of the classifier to obtain an encoded classification feature vector; and inputting the encoded classification feature vector into the Softmax classification function of the classifier to obtain the classification result.
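A minimal sketch of the three sub-steps just listed (unfolding, fully connected encoding, Softmax), again assuming PyTorch; the layer widths and the number of expression classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Sketch of the classifier: flatten the fused shallow feature map into a
    classification feature vector, encode it with fully connected layers, and
    apply Softmax. Layer widths and the number of classes are assumptions."""
    def __init__(self, in_features, num_expressions=7):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Flatten(),                     # unfold the feature map into a vector
            nn.Linear(in_features, 512),      # fully connected encoding
            nn.ReLU(inplace=True),
            nn.Linear(512, num_expressions),  # logits, one per expression label
        )

    def forward(self, fused_shallow):
        return self.encode(fused_shallow)     # logits

    def predict(self, fused_shallow):
        # Softmax maps the encoded vector to a probability per expression label
        return torch.softmax(self.forward(fused_shallow), dim=1)
```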
It should be appreciated that the role of the classifier is to learn classification rules and a classifier from known training data with given classes, and then to classify (or predict) unknown data. Logistic regression, SVM and the like are commonly used to solve classification problems. For multi-class classification, logistic regression or SVM can also be used, but multiple binary classifiers must be composed into a multi-class classifier, which is error-prone and inefficient; the commonly used multi-class method is therefore the Softmax classification function.
It is worth mentioning that fully connected encoding (Fully Connected Encoding) is an operation that maps input data into a higher-dimensional feature space. In the character facial expression capturing method, fully connected encoding is used to encode the multi-scale semantic fusion face shallow feature map and generate the encoded classification feature vector. Specifically, the multi-scale semantic fusion face shallow feature map is unfolded as a row vector or column vector and passed as input to the fully connected layers of the classifier. A fully connected layer is composed of a plurality of neurons, and each neuron is connected to all neurons of the previous layer. Each neuron multiplies the input features by weights and applies a nonlinear transformation through an activation function to generate the encoded classification feature vector. The purpose of fully connected encoding is to map the multi-scale semantic fusion face shallow feature map into a higher-dimensional feature space so as to capture a richer and more abstract feature representation. Through the nonlinear transformation of the fully connected layers, the encoded classification feature vector can better represent the features and semantic information of the facial expression. After the encoded classification feature vector is generated, it is input into the Softmax classification function of the classifier to compute the classification result. The Softmax classification function maps the encoded classification feature vector onto a probability distribution, with one probability value per class. Finally, the class with the highest probability is taken as the classification result and used to represent the expression label of the face. In other words, in the character facial expression capturing method, fully connected encoding maps the multi-scale semantic fusion face shallow feature map into a high-dimensional feature space to obtain an encoded classification feature vector with more representational and discriminative power, on the basis of which the classification and label representation of the facial expression are realized.
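Continuing the assumptions above, a short usage example of the ExpressionClassifier sketch: the fused shallow feature map is encoded and the Softmax output is reduced to the highest-probability expression label. The tensor shape (64 channels at 112x112) and the seven-class label set are assumptions for illustration only.

```python
import torch

# Assumed shapes: a fused shallow map of 64 channels at 112x112, 7 expression classes.
classifier = ExpressionClassifier(in_features=64 * 112 * 112, num_expressions=7)
fused_shallow = torch.randn(1, 64, 112, 112)

probs = classifier.predict(fused_shallow)   # Softmax probabilities, one per label
label_index = int(probs.argmax(dim=1))      # highest-probability expression label
```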
Further, in the technical solution of the present application, the machine learning based character facial expression capturing method also includes a training step: training the pyramid network based face feature extractor, the first semantic joint propagation module, the second semantic joint propagation module and the classifier. It should be understood that the purpose of the training step is to train these key components so that they can effectively extract face features from the input data, perform semantic fusion and classify expressions. The training step covers: 1. Training of the pyramid network based face feature extractor: by training the pyramid network, it learns how to extract shallow, middle and deep features from the input face image; during training, the network continuously adjusts its parameters through the back propagation algorithm, so that the extracted features better represent the structure and expression information of the face. 2. Training of the first semantic joint propagation module: this module performs semantic fusion of the face middle layer feature map and the face deep feature map; by training it, the module learns how to fuse features of different levels to obtain a face middle layer feature map with stronger representational power and richer semantic information. 3. Training of the second semantic joint propagation module: this module fuses the semantic fusion face middle layer feature map and the face shallow feature map to generate the multi-scale semantic fusion face shallow feature map; training it teaches the module how to combine the middle layer and shallow feature maps to obtain a more comprehensive multi-scale semantic fusion face feature. 4. Training of the classifier: the classifier maps the multi-scale semantic fusion face shallow feature map to an expression label; training it teaches the classifier how to associate facial features with the corresponding expressions, so that facial expressions can be classified accurately. By training these key components, the whole character facial expression capturing system achieves higher accuracy and robustness and can capture and recognize facial expression information more accurately.
More specifically, in one specific example, the training step includes: acquiring training data, wherein the training data comprises training face images and true values of face expression labels; the training face image passes through the face feature extractor based on the pyramid network to obtain a training face shallow feature map, a training face middle layer feature map and a training face deep feature map; fusing the training face middle layer feature map and the training face deep feature map by using the first semantic joint propagation module to obtain a training semantic fused face middle layer feature map; using the second semantic joint propagation module to fuse the training semantic fusion face middle-layer feature map and the training face shallow feature map to obtain a training multi-scale semantic fusion face shallow feature map; performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map to obtain an optimized training multi-scale semantic fusion face shallow feature map; the optimized training multi-scale semantic fusion face shallow feature map passes through the classifier to obtain a classification loss function value; and training the pyramid network-based face feature extractor, the first semantic joint propagation module, the second semantic joint propagation module, and the classifier based on the classification loss function value and by back propagation of gradient descent.
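The training flow listed above can be sketched as a single training iteration as follows. The point-by-point optimization is kept as an abstract pointwise_optimize hook because its exact formula is not reproduced here; everything else follows the listed steps, with cross entropy used as the classification loss as stated.

```python
import torch
import torch.nn as nn

def train_step(batch, extractor, fuse_deep_mid, fuse_mid_shallow, classifier,
               optimizer, pointwise_optimize):
    """One training iteration following the steps above. `pointwise_optimize`
    stands in for the patent's point-by-point optimization formula."""
    images, labels = batch                                    # training face images and label truth values

    shallow, middle, deep = extractor(images)                 # pyramid feature extraction
    fused_middle = fuse_deep_mid(middle, deep)                # first semantic joint propagation
    fused_shallow = fuse_mid_shallow(shallow, fused_middle)   # second semantic joint propagation

    optimized = pointwise_optimize(fused_shallow)             # point-by-point optimization
    logits = classifier(optimized)                            # classification head

    loss = nn.functional.cross_entropy(logits, labels)        # classification loss function value
    optimizer.zero_grad()
    loss.backward()                                           # back propagation of gradient descent
    optimizer.step()
    return loss.item()
```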
In particular, in the technical solution of the present application, because the semantic joint propagation modules are used to fuse the training face shallow feature map, the training face middle layer feature map and the training face deep feature map, the training multi-scale semantic fusion face shallow feature map can express hybrid face image semantic features of different scales and depths based on the pyramid network. However, this also introduces the image semantic differences among the training face shallow feature map, the training face middle layer feature map and the training face deep feature map into the training multi-scale semantic fusion face shallow feature map. Therefore, when the training multi-scale semantic fusion face shallow feature map is classified by the classifier, the class probability mapping is performed on the basis of the mixed image semantic feature scales and depths of its individual feature matrices, and the discretization of this mixed image semantic feature representation means that the efficiency of classification regression needs to be improved. Accordingly, when the training multi-scale semantic fusion face shallow feature map undergoes classification regression through the classifier, it is first optimized point by point, specifically: the training multi-scale semantic fusion face shallow feature map is optimized point by point using an optimization formula to obtain the optimized training multi-scale semantic fusion face shallow feature map, wherein the optimization formula computes each feature value of the optimized training multi-scale semantic fusion face shallow feature map from the corresponding feature value of the training multi-scale semantic fusion face shallow feature map, the global mean of all feature values of the training multi-scale semantic fusion face shallow feature map, and the maximum feature value of the training multi-scale semantic fusion face shallow feature map.
That is, by treating the global distribution feature parameters of the training multi-scale semantic fusion face shallow feature map as a regularized imitative function, the optimization uses the parameterized vector representation of the feature map's global distribution to simulate a cost function expressed in terms of the regression probability. In this way, the feature manifold representation of the training multi-scale semantic fusion face shallow feature map in the high-dimensional feature space models, through the parameter space of the classifier model, the point-wise regression characteristics of the classifier's weight matrix under the classification regression probability, so that a smooth parameter optimization trajectory of the feature map to be classified is captured under the scene geometry of the high-dimensional feature manifold, and the training efficiency of the training multi-scale semantic fusion face shallow feature map under the classification probability regression of the classifier is improved. In this way, the user's facial movements can be tracked in real time and applied to the facial model of the digital character, so that the digital character's face is driven by the user's facial expression data to enhance the immersion and interactivity of the virtual reality experience.
Further, passing the optimized training multi-scale semantic fusion face shallow feature map through the classifier to obtain the classification loss function value includes: the classifier processes the optimized training multi-scale semantic fusion face shallow feature map according to the following classification training formula to generate a training classification result, wherein the classification training formula is softmax{(W_N, B_N) : ... : (W_1, B_1) | Project(F)}, where Project(F) denotes projecting the optimized training multi-scale semantic fusion face shallow feature map into a vector, W_1 to W_N are the weight matrices of the fully connected layers of each layer, and B_1 to B_N represent the bias matrices of the fully connected layers of each layer; and calculating a cross entropy value between the training classification result and the true value as the classification loss function value.
In summary, a machine learning based character facial expression capture method is illustrated that can drive a digital character face with user facial expression data to enhance the immersive and interactive aspects of a virtual reality experience.
Fig. 5 is a block diagram of a machine learning based character facial expression capture system 100 according to an embodiment of the present application. As shown in fig. 5, a machine learning based character facial expression capture system 100 according to an embodiment of the present application includes: a face image acquisition module 110, configured to acquire a face image through a camera; the feature extraction module 120 is configured to perform feature extraction on the face image by using a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle feature map and a face deep feature map; the feature fusion module 130 is configured to fuse the face shallow feature map, the face middle feature map, and the face deep feature map to obtain a multi-scale semantic fusion face feature; and an instruction generating module 140, configured to generate a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face feature.
In one example, in the machine learning based character facial expression capture system 100 described above, the deep neural network model is a pyramid network.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective modules in the above-described machine learning-based person facial expression capturing system 100 have been described in detail in the above description of the machine learning-based person facial expression capturing method with reference to fig. 1 to 4, and thus, repetitive descriptions thereof will be omitted.
As described above, the machine learning-based person facial expression capturing system 100 according to the embodiment of the present application may be implemented in various wireless terminals, such as a server or the like having a machine learning-based person facial expression capturing algorithm. In one example, the machine learning based facial expression capture system 100 according to embodiments of the present application may be integrated into a wireless terminal as one software module and/or hardware module. For example, the machine learning based character facial expression capture system 100 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal; of course, the machine learning based character facial expression capture system 100 may also be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the machine-learning-based facial expression capture system 100 and the wireless terminal may be separate devices, and the machine-learning-based facial expression capture system 100 may be connected to the wireless terminal via a wired and/or wireless network and communicate interactive information in a agreed data format.
Fig. 6 is an application scenario diagram of a machine learning based facial expression capturing method of a person according to an embodiment of the present application. As shown in fig. 6, in this application scenario, first, a face image (e.g., D illustrated in fig. 6) is acquired by a camera, then, the face image is input into a server (e.g., S illustrated in fig. 6) in which a machine learning-based person facial expression capturing algorithm is deployed, wherein the server is capable of processing the face image using the machine learning-based person facial expression capturing algorithm to obtain a classification result for representing a facial expression tag, and then, based on the classification result, a digital person' S character facial expression driving instruction is generated.
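For illustration of this deployment scenario only, the terminal-side interaction might look like the following sketch; the server URL and the response format are assumptions rather than details specified in the patent.

```python
# Hypothetical client-side call for the scenario in Fig. 6: the terminal sends a
# captured face image to a server running the capture algorithm.
import cv2
import requests

cap = cv2.VideoCapture(0)                 # acquire a face image from the camera
ok, frame = cap.read()
cap.release()
if ok:
    _, jpg = cv2.imencode(".jpg", frame)
    resp = requests.post(
        "http://example-server/expression",  # hypothetical endpoint
        files={"image": ("face.jpg", jpg.tobytes(), "image/jpeg")},
    )
    instruction = resp.json()  # e.g. the character facial expression driving instruction
```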
Furthermore, those skilled in the art will appreciate that the various aspects of the invention are illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the present application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (6)

1. A machine learning-based character facial expression capturing method, comprising:
acquiring a face image through a camera;
extracting features of the face image through a face feature extractor based on a deep neural network model to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map;
fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain a multi-scale semantic fusion face feature, which comprises the following steps:
a first semantic joint propagation module is used for fusing the face middle-layer feature map and the face deep feature map to obtain a semantic fused face middle-layer feature map;
using a second semantic joint propagation module to fuse the semantic fusion face middle-layer feature map and the face shallow feature map to obtain a multi-scale semantic fusion face shallow feature map as the multi-scale semantic fusion face feature;
generating a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face features;
the method further comprises the training step of: training a face feature extractor, the first semantic joint propagation module, the second semantic joint propagation module and a classifier based on a pyramid network;
the training step comprises the following steps:
acquiring training data, wherein the training data comprises training face images and true values of face expression labels;
the training face image passes through the face feature extractor based on the pyramid network to obtain a training face shallow feature map, a training face middle layer feature map and a training face deep feature map;
fusing the training face middle layer feature map and the training face deep feature map by using the first semantic joint propagation module to obtain a training semantic fused face middle layer feature map;
using the second semantic joint propagation module to fuse the training semantic fusion face middle-layer feature map and the training face shallow feature map to obtain a training multi-scale semantic fusion face shallow feature map;
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map to obtain an optimized training multi-scale semantic fusion face shallow feature map;
the optimized training multi-scale semantic fusion face shallow feature map passes through the classifier to obtain a classification loss function value;
training the face feature extractor, the first semantic joint propagation module, the second semantic joint propagation module and the classifier based on the classification loss function value and through backward propagation of gradient descent;
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map to obtain an optimized training multi-scale semantic fusion face shallow feature map, including:
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map by using the following optimization formula to obtain the optimization training multi-scale semantic fusion face shallow feature map;
wherein the optimization formula computes each feature value of the optimized training multi-scale semantic fusion face shallow feature map from the corresponding feature value of the training multi-scale semantic fusion face shallow feature map, the global mean of all feature values of the training multi-scale semantic fusion face shallow feature map, and the maximum feature value of the training multi-scale semantic fusion face shallow feature map.
2. The machine learning based facial expression capture method of claim 1 wherein the deep neural network model is a pyramid network.
3. The machine learning based character facial expression capturing method of claim 2, wherein generating a digital human character facial expression driving instruction based on the multi-scale semantic fusion face features comprises:
the multi-scale semantic fusion facial shallow feature map is passed through a classifier to obtain a classification result, wherein the classification result is used for representing facial expression labels;
and generating a character facial expression driving instruction of the digital person based on the classification result.
4. The machine learning based facial expression capture method of claim 3 wherein passing the optimally trained multi-scale semantic fusion facial shallow feature map through the classifier to obtain a classification loss function value comprises:
the classifier processes the optimized training multi-scale semantic fusion face shallow feature map according to the following classification training formula to generate a training classification result;
wherein the classification training formula is softmax{(W_N, B_N) : ... : (W_1, B_1) | Project(F)}, where Project(F) denotes projecting the optimized training multi-scale semantic fusion face shallow feature map into a vector, W_1 to W_N are the weight matrices of the fully connected layers of each layer, and B_1 to B_N represent the bias matrices of the fully connected layers of each layer;
and calculating a cross entropy value between the training classification result and the true value as the classification loss function value.
5. A machine learning based character facial expression capture system, comprising:
the face image acquisition module is used for acquiring face images through the camera;
the feature extraction module is used for extracting features of the face image through a face feature extractor based on a deep neural network model so as to obtain a face shallow feature map, a face middle layer feature map and a face deep feature map;
the feature fusion module is used for fusing the face shallow feature map, the face middle feature map and the face deep feature map to obtain multi-scale semantic fusion face features;
the instruction generation module is used for generating a role facial expression driving instruction of the digital person based on the multi-scale semantic fusion face characteristics;
the system further includes a training module: training a face feature extractor, a first semantic joint propagation module, a second semantic joint propagation module and a classifier based on a pyramid network;
the training module is specifically configured to:
acquiring training data, wherein the training data comprises training face images and true values of face expression labels;
the training face image passes through the face feature extractor based on the pyramid network to obtain a training face shallow feature map, a training face middle layer feature map and a training face deep feature map;
fusing the training face middle layer feature map and the training face deep feature map by using the first semantic joint propagation module to obtain a training semantic fused face middle layer feature map;
using the second semantic joint propagation module to fuse the training semantic fusion face middle-layer feature map and the training face shallow feature map to obtain a training multi-scale semantic fusion face shallow feature map;
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map to obtain an optimized training multi-scale semantic fusion face shallow feature map;
the optimized training multi-scale semantic fusion face shallow feature map passes through the classifier to obtain a classification loss function value;
training the face feature extractor, the first semantic joint propagation module, the second semantic joint propagation module and the classifier based on the classification loss function value and through backward propagation of gradient descent;
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map to obtain an optimized training multi-scale semantic fusion face shallow feature map, including:
performing point-by-point optimization on the training multi-scale semantic fusion face shallow feature map by using the following optimization formula to obtain the optimization training multi-scale semantic fusion face shallow feature map;
wherein the optimization formula computes each feature value of the optimized training multi-scale semantic fusion face shallow feature map from the corresponding feature value of the training multi-scale semantic fusion face shallow feature map, the global mean of all feature values of the training multi-scale semantic fusion face shallow feature map, and the maximum feature value of the training multi-scale semantic fusion face shallow feature map.
6. The machine learning based facial expression capture system of claim 5, wherein the deep neural network model is a pyramid network.
CN202311465863.1A 2023-11-07 2023-11-07 Character facial expression capturing method and system based on machine learning Active CN117197878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311465863.1A CN117197878B (en) 2023-11-07 2023-11-07 Character facial expression capturing method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311465863.1A CN117197878B (en) 2023-11-07 2023-11-07 Character facial expression capturing method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN117197878A CN117197878A (en) 2023-12-08
CN117197878B true CN117197878B (en) 2024-03-05

Family

ID=88998325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311465863.1A Active CN117197878B (en) 2023-11-07 2023-11-07 Character facial expression capturing method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN117197878B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801293A (en) * 2019-01-08 2019-05-24 平安科技(深圳)有限公司 Remote Sensing Image Segmentation, device and storage medium, server
CN111639740A (en) * 2020-05-09 2020-09-08 武汉工程大学 Steel bar counting method based on multi-scale convolution neural network
CN111950515A (en) * 2020-08-26 2020-11-17 重庆邮电大学 Semantic feature pyramid network-based small face detection method
CN112232232A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Target detection method
CN112232236A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Pedestrian flow monitoring method and system, computer equipment and storage medium
CN113763381A (en) * 2021-09-28 2021-12-07 北京工业大学 Object detection method and system fusing image global information
CN116734750A (en) * 2023-08-15 2023-09-12 山西锦烁生物医药科技有限公司 Intelligent detection method and system for thickness of ice layer of ice rink based on optical fiber sensor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909784B (en) * 2017-02-24 2019-05-10 天津大学 Epileptic electroencephalogram (eeg) identification device based on two-dimentional time-frequency image depth convolutional neural networks
CN114445916A (en) * 2021-12-15 2022-05-06 厦门市美亚柏科信息股份有限公司 Living body detection method, terminal device and storage medium
CN115641564A (en) * 2022-10-31 2023-01-24 领目科技(上海)有限公司 Lightweight parking space detection method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801293A (en) * 2019-01-08 2019-05-24 平安科技(深圳)有限公司 Remote Sensing Image Segmentation, device and storage medium, server
CN111639740A (en) * 2020-05-09 2020-09-08 武汉工程大学 Steel bar counting method based on multi-scale convolution neural network
CN111950515A (en) * 2020-08-26 2020-11-17 重庆邮电大学 Semantic feature pyramid network-based small face detection method
CN112232232A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Target detection method
CN112232236A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Pedestrian flow monitoring method and system, computer equipment and storage medium
CN113763381A (en) * 2021-09-28 2021-12-07 北京工业大学 Object detection method and system fusing image global information
CN116734750A (en) * 2023-08-15 2023-09-12 山西锦烁生物医药科技有限公司 Intelligent detection method and system for thickness of ice layer of ice rink based on optical fiber sensor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗强 (Luo Qiang); 盖佳航 (Gai Jiahang); 郑宏宇 (Zheng Hongyu). Small-scale pedestrian detection based on multi-scale feature fusion. Software (软件), 2019, (12), full text. *

Also Published As

Publication number Publication date
CN117197878A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
Zhou et al. A lightweight convolutional neural network for real-time facial expression detection
CN117197878B (en) Character facial expression capturing method and system based on machine learning
CN113221663B (en) Real-time sign language intelligent identification method, device and system
CN111768438B (en) Image processing method, device, equipment and computer readable storage medium
CN115131849A (en) Image generation method and related device
Baddar et al. On-the-fly facial expression prediction using lstm encoded appearance-suppressed dynamics
Guetari et al. Real time emotion recognition in video stream, using B-CNN and F-CNN
Pei et al. Monocular 3d facial expression features for continuous affect recognition
CN113420606A (en) Method for realizing autonomous navigation of robot based on natural language and machine vision
Bie et al. Facial expression recognition from a single face image based on deep learning and broad learning
Yuvaraj et al. An Adaptive Deep Belief Feature Learning Model for Cognitive Emotion Recognition
Shurid et al. Bangla sign language recognition and sentence building using deep learning
Wei et al. Time-dependent body gesture representation for video emotion recognition
CN111259859B (en) Pedestrian re-recognition method based on combined variable picture generation
Kousalya et al. Prediction of Best Optimizer for Facial Expression Detection using Convolutional Neural Network
CN117576279B (en) Digital person driving method and system based on multi-mode data
Kurundkar et al. Real-Time Sign Language Detection
Sánchez et al. Facial expression recognition via transfer learning in cooperative game paradigms for enhanced social AI
CN117152843B (en) Digital person action control method and system
Bhuvaneshwari et al. A novel deep learning SFR model for FR-SSPP at varied capturing conditions and illumination invariant
Patel et al. Sign Language Detection using Deep Learning
Uddin et al. Deep Spatiotemporal Network Based Indian Sign Language Recognition from Videos
Khandelwal et al. COMPARATIVE ANALYSIS OF FACIAL EMOTION RECOGNITION.
Ambika et al. Mathematics for 2D face recognition from real time image data set using deep learning techniques
Singla et al. Gender Identification using Facial Features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 701, 7th floor, and 801, 8th floor, Building 1, Courtyard 8, Gouzitou Street, Changping District, Beijing, 102200

Applicant after: Zhongying Nian Nian (Beijing) Technology Co.,Ltd.

Address before: No. 6304, Beijing shunhouyu Business Co., Ltd., No. 32, Wangfu street, Beiqijia Town, Changping District, Beijing 102200

Applicant before: China Film annual (Beijing) culture media Co.,Ltd.

Country or region before: China

CB02 Change of applicant information
GR01 Patent grant