CN117894057A - Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis


Info

Publication number
CN117894057A
Authority
CN
China
Prior art keywords
face
model
dimensional
image
face image
Prior art date
Legal status
Granted
Application number
CN202410269906.7A
Other languages
Chinese (zh)
Other versions
CN117894057B (en)
Inventor
许迎科
胡少华
于佳辉
张毓桐
陈京凯
周和统
王中
吕海龙
来建波
Current Assignee
Binjiang Research Institute Of Zhejiang University
Zhejiang University ZJU
Original Assignee
Binjiang Research Institute Of Zhejiang University
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Binjiang Research Institute Of Zhejiang University, Zhejiang University ZJU filed Critical Binjiang Research Institute Of Zhejiang University
Priority to CN202410269906.7A
Publication of CN117894057A
Application granted
Publication of CN117894057B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G06V 40/174: Facial expression recognition
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/30: ICT specially adapted for calculating health indices; for individual health risk assessment


Abstract

The invention relates to the field of computer technology, and discloses a three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis. The method comprises the following steps: acquiring a first data set and a second data set, wherein the first data set comprises first face images and the second data set comprises second face images together with emotion marking results; training a face shape model with the first data set, wherein the face shape model builds a three-dimensional face model from first information in a first face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the first face image; and forming an emotion classification model from the face shape model trained on the first data set together with a classification model that determines an emotion classification result from the intermediate results generated by the face shape model, and training the emotion classification model with the second data set. The scheme can assist doctors in diagnosis.

Description

Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis
Technical Field
The invention relates to the technical field of computers, in particular to a three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis.
Background
In the professional medical field, the most common and most severe form of mood disorders (MD) is the major depressive episode, during which even the simplest activities require the patient's utmost effort. MD is typically characterized by loss of interest in things and lack of pleasure; another characteristic presentation is mania, in which the patient may experience extreme pleasure during any activity.
Unipolar affective disorder (major depressive disorder, MDD) refers to depression or mania occurring alone; among unipolar patients, the proportion with depression is far higher than the proportion with mania alone. Mania generally occurs frequently in adolescence and is usually influenced by factors such as environmental stress. Patients in whom depressive and manic states alternate may have bipolar disorder (BD).
Existing schemes rely on doctors observing patients manually; there is no approach in which a model monitors the emotion of a target person to assist doctors in diagnosis.
Disclosure of Invention
The invention provides a three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis, in which a model monitors the emotion of a target person so as to assist doctors in diagnosis.
In order to solve the technical problems, the invention is realized as follows:
In a first aspect, the present application provides a three-dimensional digital face processing method for emotion disorder assisted diagnosis, the method comprising: acquiring a first data set and a second data set, wherein the first data set comprises a first face image, the second data set comprises a second face image and an emotion marking result of the second face image, and the emotion marking result comprises: unipolar depression, bipolar disorder, and healthy; training a face shape model by adopting the first data set, wherein the face shape model is used for building a face three-dimensional model according to first information in a first face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the first face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features; and forming an emotion classification model according to the face shape model trained by the first data set and the classification model used for determining the emotion classification result according to the intermediate result generated by the face shape model, and training the emotion classification model by the second data set.
Preferably, the method further comprises: acquiring video data to be analyzed; inputting the video data into the trained emotion classification model for analysis, and determining an emotion classification result, wherein the emotion classification result comprises: unipolar depression, bipolar disorder, and healthy.
Preferably, the method further comprises: acquiring image data, and extracting a target face image in the image data, wherein the target face image comprises a first face image or a second face image; the face shape model completes training according to the following steps: modeling is carried out according to the target face image, and a face three-dimensional model is determined; a two-dimensional face image is generated according to the face three-dimensional model; and the face shape model is adjusted according to the consistency of the facial expression characteristics of the target face image with those of the two-dimensional face image.
Preferably, the modeling according to the target face image and determining the three-dimensional face model includes: performing coarse analysis according to the target face image to obtain first features, wherein the first features comprise facial rough features and illumination features; performing expression analysis according to the target face image and determining facial expression characteristics to determine second features, wherein the second features comprise facial expression features and jaw pose features, and further comprise a first rotation feature of the neck and a second rotation feature of the eyeballs; carrying out detail analysis according to the target face image to obtain third features, wherein the third features comprise facial detail features, including features of the following facial key points: both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw, and the zygomatic muscle; and determining the three-dimensional face model according to the first, second and third features.
Preferably, the performing detail analysis according to the target face image to obtain the third feature includes: carrying out detail analysis according to the target face image and related face images of the target face image, and determining the third feature, wherein the related face images comprise the frames immediately before and after the video frame corresponding to the target face image in the video data.
Preferably, after the two-dimensional face image is generated according to the three-dimensional face model, the method further includes: adjusting the three-dimensional face model according to the loss between the third feature of the target face image and the third feature of the two-dimensional face image.
Preferably, the intermediate result includes a face three-dimensional model modeled by a face shape model and two-dimensional image data generated based on the face three-dimensional model, and the classification model is used for: inputting the three-dimensional model of the human face into a first classifier, and determining a first analysis result; acquiring expression parameters according to two-dimensional image data corresponding to the face three-dimensional model, inputting the expression parameters into a second classifier, and determining a second analysis result; and carrying out joint analysis according to the first analysis result and the second analysis result to determine an emotion analysis result.
Preferably, the training of the emotion classification model with the second data set includes: acquiring an emotion analysis result corresponding to the second face image, and adjusting the emotion classification model in combination with the emotion marking result of the second face image.
In a second aspect, the present application provides a three-dimensional digital face processing apparatus for emotion disorder assisted diagnosis, the apparatus comprising: a data set acquisition module, used for acquiring a first data set and a second data set, wherein the first data set comprises a first face image, the second data set comprises a second face image and an emotion marking result of the second face image, and the emotion marking result comprises: unipolar depression, bipolar disorder, and healthy; a first model training module, used for training a face shape model by adopting the first data set, wherein the face shape model is used for building a face three-dimensional model according to first information in a first face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the first face image, the first information including: facial rough features, facial expression features, facial detail features, and illumination features; and a second model training module, used for forming an emotion classification model according to the face shape model trained by the first data set and the classification model used for determining the emotion classification result according to the intermediate result generated by the face shape model, and training the emotion classification model by the second data set.
In a third aspect, the present application provides an electronic device, comprising: a memory and at least one processor; the memory is used for storing computer execution instructions; the at least one processor is configured to execute computer-executable instructions stored in the memory, such that the at least one processor performs the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method according to the first aspect.
The method and the device can be applied to scenarios in which a doctor is assisted in diagnosis: the trained model is adopted to recognize the face images of a target person (such as a patient), the emotion analysis result of the target person is determined, and the result is given to the doctor to assist in diagnosis. Specifically, the model can be pre-trained with the unlabeled first data set, and the pre-trained model can then be transferred with the labeled second data set, so that the model forms the capability of identifying affective disorders. Specifically, the emotion classification model in this embodiment is composed of a face shape model and a classification model. The face shape model is used for establishing a face three-dimensional model according to the first information in the face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features. The classification model is used for determining an emotion classification result according to an intermediate result generated by the face shape model, where the intermediate result comprises the face three-dimensional model built by the face shape model and two-dimensional image data generated based on the face three-dimensional model. According to the scheme, the face shape model can be trained with the unlabeled first data set, the emotion classification model is formed from the face shape model trained on the first data set together with the classification model, and the emotion classification model is trained with the labeled second data set. Video data containing face images can then be analyzed with the trained emotion classification model to determine an emotion classification result that assists a doctor in diagnosis, wherein the emotion classification result comprises: unipolar depression, bipolar disorder, and healthy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a schematic illustration of the steps of a three-dimensional digital face processing method for emotion disorder assisted diagnosis according to an embodiment of the present application;
FIG. 2 is a flow chart of a three-dimensional digital face processing method for emotion disorder assisted diagnosis according to an embodiment of the present application;
Fig. 3 is a schematic structural view of a three-dimensional digital face processing apparatus for emotion disorder auxiliary diagnosis according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The method and the device can be applied to scenarios in which a doctor is assisted in diagnosis: the trained model is adopted to recognize the face images of a target person (such as a patient), the emotion analysis result of the target person is determined, and the result is given to the doctor to assist in diagnosis. Specifically, the model can be pre-trained with the unlabeled first data set, and the pre-trained model can then be transferred with the labeled second data set, so that the model forms the capability of identifying affective disorders.
Relevant surveys show that, clinically, about 25%-30% of patients are hospitalized for mixed episodes of acute mania, so it is important to identify the course or period of depressive or manic episodes. Clinical diagnosis requires judging whether the illness recurs; whether the patient fully recovers during the intermittent period between recurrences; whether some depressive features persist during the intermittent period between two episodes, which lasts at least two months; and whether the episodes are depressive or manic, among other factors. Because the diagnosis and rehabilitation period is long, a doctor cannot observe changes in the patient's condition throughout the whole period, so machine-assisted diagnosis of affective disorders is important in assisting doctors with diagnosis and treatment.
Therefore, during periods of recurrence, in order to alleviate the shortage of professional doctors, who cannot be at the patient's side at all times, and to conveniently determine whether the patient's illness has recurred within a certain time period, the auxiliary diagnosis method, system and device for bipolar disorder of the invention can be deployed at the place the patient visits most frequently, capturing and analyzing the patient's facial expressions in real time and thereby assisting doctors in diagnosing the condition.
The current routine clinical method for diagnosing the initial condition of patients with unipolar or bipolar disorder is as follows: when a professional doctor initially judges a patient's condition, the patient is given external stimulation, for example by playing videos of different emotions to the tester; the doctor preliminarily judges the degree of emotional disorder by observing how strongly the patient's facial expression reacts to the stimulation videos and then conducting follow-up conversation. With the increasing incidence of these diseases, the demand for specialized doctors has greatly increased, and during the diagnosis and treatment stage a doctor may inevitably overlook slight facial expressions, not easily perceived by the naked eye, that carry depressive or manic characteristics. The machine-assisted affective disorder diagnosis model provided by this scheme, trained on the large-scale diagnosis and treatment videos provided by a hospital, learns the facial micro-expressions of patients of different ages, sexes and professions and can rapidly give a diagnosis type, thereby improving diagnostic accuracy, relieving the diagnostic pressure on professional doctors to a certain extent, and assisting doctors in diagnosing the condition.
At present, there is no diagnosis of bipolar affective disorder based on facial features obtained after three-dimensional reconstruction and modeling of facial details and micro-expressions. Considering the complexity and diversity of the environment, such as illumination and subtle facial expressions, the application provides a three-dimensional-reconstruction-based method for restoring facial emotion, so that the emotion expressed by the reconstructed digital face stays consistent with that of the input image, the result of machine-assisted diagnosis is more accurate, and reliable auxiliary diagnosis information can be provided to doctors even in non-clinical environments.
The overall flow of the present invention is summarized as follows:
Specifically, first, the patient sits in front of a display, a computer plays videos under five different stimuli to the patient, and the patient watches the video content, wherein the five emotional stimuli are: happiness, anger, sadness, fear, and neutral. A camera is placed above the display, level with the tested patient, and records the whole process of the patient watching the videos in real time. The camera position must allow the facial features and expression changes of the complete face to be fully recorded, so that the system can capture facial details not observable by the naked eye. The invention extracts the human face in the data preprocessing stage of the model in order to eliminate interference from the environmental background and other factors and to focus only on the three-dimensional reconstruction of the patient's face, the data preprocessing being: single-frame images extracted frame by frame from the input video data are cropped, and each patient's face is extracted.
Second, the patient image is input to the coarse shape encoder and the trainable expression shape encoder, the relevant geometric and albedo model is then used as a fixed decoder, and a three-dimensional reconstruction of the patient's face is performed according to the regressed expression, identity shape, pose, and albedo parameters. An emotion consistency loss that assists in capturing fine emotion is used to regularize the emotion difference between the input image and the rendered patient face after both pass through a fixed emotion recognition network. In this stage, in order to learn the expression details of the patient's face, the invention focuses on both eye corners, the lower jaw, the upper and lower lips, the left and right lip corners, and both cheekbones through the loss function. In a further detail training stage, the images pass through a fixed expression encoder, and the detail encoder is adjusted using the regressed expression and jaw pose parameters to deepen facial wrinkles, so that the patient's micro-expressions become more detailed, which facilitates the doctor's diagnosis. The loss function mentioned above is:
L_total = λ_emo·L_emo + λ_pho·L_pho + λ_eye·L_eye + λ_lip·L_lip + λ_corner·L_corner + λ_zyg·L_zyg + λ_reg·L_reg
where L_emo represents the emotion loss, L_pho the photometric loss, L_eye the eye closure loss, L_lip the upper and lower lip closure loss, L_corner the left and right lip corner loss, L_zyg the left and right zygomatic muscle loss, and L_reg the expression regularizer, each weighted by its corresponding λ.
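For illustration only, the following Python sketch shows how such a weighted sum of loss terms could be assembled; the term keys, the lambda values, and the function name are hypothetical placeholders, not the patent's code.

import torch

def total_reconstruction_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Sum the per-term losses (emotion, photometric, eye/lip/zygomatic, regularizer),
    each multiplied by its lambda weight."""
    total = torch.zeros(())
    for name, value in losses.items():
        total = total + weights.get(name, 1.0) * value
    return total

# Example usage with placeholder scalar losses and weights:
losses = {
    "emotion": torch.tensor(0.8),      # emotion consistency loss
    "photometric": torch.tensor(0.5),
    "eye_closure": torch.tensor(0.1),
    "lip_closure": torch.tensor(0.2),
    "lip_corner": torch.tensor(0.15),
    "zygomatic": torch.tensor(0.05),
    "expr_reg": torch.tensor(0.01),    # expression regularizer
}
weights = {"emotion": 1.0, "photometric": 2.0, "eye_closure": 0.5,
           "lip_closure": 0.5, "lip_corner": 0.5, "zygomatic": 0.5, "expr_reg": 1e-4}
print(total_reconstruction_loss(losses, weights))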
Meanwhile, since the condition of a bipolar affective disorder patient is uncertain and involves the interaction of many factors, the project comprehensively considers information such as illumination and reflection in outdoor environments; incorporating albedo into the digital face reconstruction model improves the performance of the deep learning model during auxiliary diagnosis in outdoor environments, so the diagnosis scene is not limited to a consulting room.
For the bipolar disorder diagnosis stage, the project uses a face detector (e.g., the multi-task cascaded convolutional neural network MTCNN) to locate the patient's face, align the detected face, and further capture the facial geometry caused by facial expression using geometry-based global features, such as landmark points around the patient's nose, eyes and mouth, together with local texture features. The learned features are input into a backbone CNN for feature extraction, finally completing the auxiliary diagnosis of bipolar affective disorder. The coarse shape encoder mentioned in the previous paragraph plays a key role in the three-dimensional face reconstruction process: it is responsible for preliminary three-dimensional shape estimation, such as the contour of the face, the positions of the facial features, and the rough facial texture, providing an initial approximation for the subsequent finer shape estimation, fine facial wrinkles, and reconstruction of facial emotion. The trainable expression shape encoder mentioned in the previous paragraph is an important component used in three-dimensional face reconstruction for encoding expression information from an input image into three-dimensional shape changes. It allows the model to adjust the shape of the generated three-dimensional face during reconstruction according to the expression characteristics of the input image so as to reflect the facial expression more accurately. First, image input: the trainable expression shape encoder receives a face image containing expression information; expression coding is then carried out with a convolutional neural network (CNN), which learns how to extract expression information from the input image. Second, the encoded output: the output of the trainable expression shape encoder is a low-dimensional vector or code representing the expression. This code reflects the expression information in the input image so that subsequent steps can use it to adjust the shape of the generated three-dimensional face. Cooperation with the coarse shape encoder and other loss functions: the trainable expression shape encoder typically works in conjunction with other components, such as the coarse shape encoder and the photometric and keypoint losses. Together they provide the model with a comprehensive understanding of the shape and expression of the face and are used to optimize the generated three-dimensional face model.
The trainable expression shape encoder is used to enhance the expression perception capability of the three-dimensional face reconstruction model. It allows the model to better understand and recover expression changes of the face, thereby improving the realism and accuracy of the model in emotion analysis, facial animation, virtual reality and augmented reality applications. The geometric albedo model mentioned in the previous paragraph describes attributes of the face surface. Taking the surface albedo of the object into account during three-dimensional reconstruction allows the influence of illumination and shadow on the face to be simulated more accurately, makes the model suitable for application under different illumination conditions and environments, and enhances realism and facial recognition. The MTCNN face detector mentioned in the previous paragraph is a deep learning model for face detection with a multi-task cascaded convolutional network structure; it is an efficient and accurate face detector commonly used in real-time face detection applications. MTCNN consists of a series of cascaded deep convolutional neural networks, each responsible for a different task. These tasks typically include: candidate box generation, which generates candidate face boxes in the input image; candidate box refinement, which further refines and corrects the candidate boxes generated in the first stage to frame the position and shape of the face more accurately; and face classification, which classifies the refined candidate boxes to determine whether each box contains a face. This structure helps to improve the accuracy and efficiency of detection.
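As an illustrative aside, MTCNN-based detection could be sketched as follows, assuming the third-party facenet-pytorch package, which is not named in the patent; the input file name is also hypothetical.

from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=False)                  # single-face mode
img = Image.open("patient_frame.jpg")             # hypothetical input frame
boxes, probs = detector.detect(img)               # candidate generation, refinement, and scoring
if boxes is not None:
    x1, y1, x2, y2 = [int(v) for v in boxes[0]]
    face = img.crop((x1, y1, x2, y2))             # crop of the detected face for later alignment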
The overall flow of the present application is described below with reference to fig. 1:
S100: and (5) collecting facial features. With the above-described apparatus built, diagnostic videos of affective disorder patients are collected at local hospital-related departments to create a second data set. This phase is a second data set provided by the relevant hospital, where the second data set comprises three categories of patients: monophasic depression, bipolar disorder, and healthy controls. The data content is to play five kinds of stimulation videos to the testers respectively, and then the professional doctor carries out preliminary judgment on the illness degree according to the reaction of the testers.
S200: and (5) preprocessing data. According to the data provided by the hospital and provided by the professional doctor for manually labeling the data after diagnosis of the testers, a whole continuous video data of each tester is segmented and clipped into five small segments according to five emotion stimulus segments, and then a video frame extractor is used for extracting frames containing face information from the continuous video.
First, the cross-platform computer vision library OpenCV is used to load images containing a human face; OpenCV also provides a trained face detector based on a convolutional neural network. Detected faces are marked on the image with rectangular boxes for visualization. After the face is detected, subsequent data preprocessing can be performed, including face cropping, resizing, graying, histogram equalization, normalization and other operations, so that the face images used for diagnosis have a uniform size, which facilitates model recognition and judgment.
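A minimal preprocessing sketch under these assumptions is given below; it substitutes OpenCV's Haar cascade detector for the CNN-based detector mentioned above, and the crop size and file path are illustrative, not taken from the patent.

from typing import Optional
import cv2
import numpy as np

def preprocess_face(image_path: str, size: int = 224) -> Optional[np.ndarray]:
    img = cv2.imread(image_path)
    if img is None:
        return None
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)            # graying
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                                    # take the first detected face
    face = gray[y:y + h, x:x + w]                            # face cropping
    face = cv2.resize(face, (size, size))                    # uniform size for the model
    face = cv2.equalizeHist(face)                            # histogram equalization
    return face.astype(np.float32) / 255.0                   # normalization to [0, 1]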
Then a face detector (MTCNN) is used to detect the face position, and the face region of each frame is cropped according to the detected position, so that the model can focus on the facial expression and avoid the influence of environmental factors; the data containing the face information are then passed to the model for the next stage of processing. Regarding the video frame extractor: input: the whole video is the input to this module, typically provided as a video file or stream; the video may contain different scenes, actions, and objects. Frame-by-frame extraction: the module processes the video frame by frame, extracting the image of each frame from the video stream, typically including video decoding, frame sampling, and storage of each frame image. Image storage: each extracted frame is typically stored in some data structure so it can serve as input data for subsequent processing. Sampling rate: the module typically requires a frame sampling rate to be set in order to control the rate at which images are extracted from the video.
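The frame extractor described above could be sketched as follows, assuming OpenCV for decoding; the sampling interval, output directory, and function name are illustrative parameters rather than the patent's implementation.

import cv2
import os

def extract_frames(video_path: str, out_dir: str, every_n: int = 5) -> int:
    """Decode the video frame by frame and store every n-th frame as an image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:                              # frame sampling rate
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved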
S300: and (5) reconstructing a human face. The method specifically comprises the following steps: s301: data preparation. The collection of the present invention includes a first data set and a second data set: the first data set is a large-scale outdoor face image data set containing different expressions; the second data set includes hospital provided diagnostic videos of three classes of testers under the annotated five emotional stimuli in video form. For better auxiliary diagnosis, the second data set is formed by sectionally editing the whole section of tester data according to five types of stimulus, and reconstructing the three-dimensional digital human face of the tester and carrying out subsequent auxiliary diagnosis under each stimulus. S302:3DMM reconstruction and rendering. Wherein, the 3DMM: namely a 3D face shape model, is a computer vision and computer graphics model for three-dimensional face reconstruction and analysis. 3DMM is widely used to recover three-dimensional face shapes from two-dimensional images. The 3DMM is a mathematical model constructed based on a series of three-dimensional shape and texture data of a face, describing three-dimensional shape and texture attributes of the face, and can generate a new three-dimensional face model by applying changes in shape and texture parameters on the average shape and texture. This makes it possible to generate face models having different facial features and expressions. Can be used to estimate three-dimensional face shape and texture from two-dimensional images. In the part, the invention relates to the contour of the reconstructed three-dimensional face with the face, the shape parameters and the expression parameters of the face, the generated three-dimensional face is rendered into a two-dimensional image, the rendered two-dimensional image and the expression parameters are respectively input into a classification network, and the two-dimensional image and the expression parameters are combined to carry out auxiliary diagnosis of the double-emotion disorder of a tester.
Specifically, in the training phase, a single picture is first input into encoder 1 for regression. A low-dimensional latent code is regressed, comprising camera, reflectance (albedo), lighting, shape, pose and expression codes; a coarse shape is obtained through FLAME, a smooth-surfaced, high-precision 3D face model, a texture map is obtained, a 2D picture is rendered according to the camera and illumination parameters, and the difference between the input image and the generated 2D image is then minimized.
The coarse FLAME geometry is then enhanced with a detail map: an image is input, a 128-dimensional latent code controlling person-specific static facial detail is obtained by a trained encoder 2, and this code is concatenated with the expression and jaw pose parameters obtained from encoder 1 to capture the tester's dynamic expression wrinkle details; the resulting code is decoded by decoder 1 and converted to a normal map for subsequent rendering.
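The concatenation of the 128-dimensional detail code with the expression and jaw pose parameters could be sketched as follows; the tensor sizes, layer widths and module names are assumptions for illustration and do not reproduce the actual decoder.

import torch
import torch.nn as nn

class DetailDecoder(nn.Module):
    def __init__(self, detail_dim=128, expr_dim=50, jaw_dim=3):
        super().__init__()
        self.fc = nn.Linear(detail_dim + expr_dim + jaw_dim, 256)
        # a real decoder would upsample to a UV normal map; this is only a stand-in head
        self.head = nn.Linear(256, 3 * 64 * 64)

    def forward(self, detail_code, expr_params, jaw_pose):
        z = torch.cat([detail_code, expr_params, jaw_pose], dim=-1)  # concatenation step
        h = torch.relu(self.fc(z))
        return self.head(h).view(-1, 3, 64, 64)                      # coarse stand-in for a normal map

# Usage with random placeholders:
dec = DetailDecoder()
normal_map = dec(torch.randn(1, 128), torch.randn(1, 50), torch.randn(1, 3))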
The term FLAME as used herein refers to a smooth-surfaced, high-precision 3D face model used to represent the shape and expression of a face. A Diagonal Micro-Attention mechanism (DMA) is used to detect small differences between successive video frames in order to better understand their semantic information. DMA is an attention mechanism that detects small differences in a person's facial expression between two consecutive video frames; it attempts to understand the semantic information of successive frames, addressing the problem that similar contexts cannot otherwise be mapped onto the corresponding tokens. DMA is a variant of the diagonal attention mechanism, which uses a diagonal matrix when calculating attention weights; the diagonal attention mechanism is a special attention mechanism that restricts the attention weights to the diagonal, making the model more stable and reliable.
In addition, the FLAME model can model neck rotation, eyeball rotation, etc., making the expressions it represents richer. The parameters of the FLAME model include identity parameters, pose parameters, and facial expression parameters. In the detail rendering stage, the detail displacement model can generate mid-frequency surface detail maps.
Expression-dependent details are generated using the FLAME expression parameters together with the jaw pose and cheekbone pose parameters. In order to better reconstruct the tester's three-dimensional face, the invention introduces a new detail loss to distinguish person-specific details from facial wrinkles caused by expression changes. This separation allows the operator to control the expression parameters to synthesize realistic, person-specific wrinkles while also ensuring that the specific details of the face are not changed.
The model is first learned from unsupervised in-the-wild images (i.e., the first data set) and achieves state-of-the-art reconstruction accuracy on two baselines. On this basis, the model trained on the first data set is taken as pre-training model 1, which is introduced during the second training on the second data set so that the three-dimensional face reconstruction part of the invention is more universally applicable. The diagnosis scene thus differs from the conventional one: it is not limited to a consulting room but can also be applied to non-clinical scenes, capturing the tester's condition more widely in outdoor environments.
Three-dimensional reconstruction is performed on each image using a 3DMM to obtain a corresponding three-dimensional face model, which contains face shape and texture information (shape parameters and texture parameters). The 3DMM here is a statistical model for modeling and generating face shapes and textures, used to infer three-dimensional face shape and texture information from two-dimensional images. The reconstructed 3D face model is rendered as a 2D image with the same viewing angle and scale as the original input image, which can be achieved by perspective projection and rendering techniques. In the invention, the rendering module projects the reconstructed three-dimensional face into a two-dimensional image, using the input vertices, face (mesh facet) information and optional attribute information to generate the rendered image. The coarse shape can likewise be rendered as a 2D image. For each rendered 2D face image, a convolutional neural network (CNN) is used to extract a feature representation; these features can capture the expression information in the reconstructed face image.
S400: a multi-layer perceptron (MLP) network. These extracted image features are then further mapped to a higher level representation using an MLP network. The MLP may include a plurality of fully connected layers for learning nonlinear feature transformations. The input layer of the MLP network receives the raw image data and feature vectors and then gradually extracts abstract features that help to distinguish different faces through multiple hidden layers. Finally, the output layer uses these features to classify different faces. The training data and labels are used to adjust the weight and bias of the network so that the network can perform accurate face recognition.
S500: and (5) a classifier. The auxiliary diagnosis stage in the invention adopts a multi-mode model, the facial shape model output in the three-dimensional facial reconstruction is input into the expression classifier 1, the output facial expression parameters are input into the expression classifier 2, the classification results of the two are processed in a combined way, and then the final diagnosis is carried out.
Specifically: data preparation: comprises three-dimensional facial shape data and facial expression parameter data. And assigning a corresponding label to each data sample based on the diagnosis of the doctor in the second data set. Data preprocessing: performing operations such as dimension reduction, feature extraction, standardization and the like on the face shape data; and carrying out standardization and normalization processing on the facial expression parameter data. Two independent classifiers are constructed: the two classifiers are respectively trained, so that the face shape and the expression parameters can be respectively classified. And (5) jointly processing classification results: and carrying out joint processing on the output results of the two classifiers according to the classification diagnosis requirements. The two classification results are fused using a neural network to arrive at a final overall classification result, and the result is displayed by a computer at step 600.
Testing and optimizing. Supervised learning is performed using face images with 3DMM parameters to train the expression classifier. Cross-entropy loss or another suitable loss function may be used for training, and optimization algorithms such as gradient descent may be used to fine-tune the model parameters. In the test stage, the 3DMM parameters are combined with the reconstructed face image, and the expression category of the face image is predicted by the trained expression classifier. The performance of the model is evaluated using indexes such as accuracy, precision and recall. Specifically, data preprocessing is first performed on the input patient images: single-frame images extracted frame by frame from the input video data are cropped, and each patient's face is extracted. Second, the input patient image is fed into the coarse shape encoder and the trainable expression shape encoder, the relevant geometric and albedo model is then used as a fixed decoder, and three-dimensional reconstruction of the patient's face is performed according to the regressed expression, identity shape, pose and albedo parameters. An emotion consistency loss that assists in capturing fine emotion is used to regularize the emotion difference between the input image and the rendered patient face after both pass through a fixed emotion recognition network. In this stage, in order to learn the expression details of the patient's face, attention is focused on the eye corners, lower jaw, upper and lower lips, and left and right lip corners. In a further detail training stage, the images pass through a fixed expression encoder, and the detail encoder is adjusted using the regressed expression and jaw pose parameters to deepen facial wrinkles, so that the patient's micro-expressions become more detailed, which facilitates the doctor's diagnosis.
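A hedged training and evaluation sketch along these lines is given below; the model, data loaders and label convention are placeholders, and the use of scikit-learn metrics for accuracy, precision and recall is an assumption.

import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, precision_score, recall_score

def train_one_epoch(model, loader, optimizer, device="cpu"):
    model.train()
    criterion = nn.CrossEntropyLoss()                    # cross-entropy loss for classification
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()                                  # gradient descent step
        optimizer.step()

def evaluate(model, loader, device="cpu"):
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for features, labels in loader:
            logits = model(features.to(device))
            preds.extend(logits.argmax(dim=1).cpu().tolist())
            targets.extend(labels.tolist())
    return {
        "accuracy": accuracy_score(targets, preds),
        "precision": precision_score(targets, preds, average="macro"),
        "recall": recall_score(targets, preds, average="macro"),
    }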
Meanwhile, since the condition of a bipolar affective disorder patient is uncertain and involves the interaction of many factors, the project comprehensively considers information such as illumination and reflection in outdoor environments; incorporating albedo into the digital face reconstruction model improves the performance of the deep learning model during auxiliary diagnosis in outdoor environments, so the diagnosis scene is not limited to a consulting room. For the bipolar disorder diagnosis phase, the project uses the MTCNN face detector to locate the patient's face, align the detected face, and further capture the facial geometry caused by facial expressions using geometry-based global features, such as landmark points around the patient's nose, eyes and mouth, together with local texture features. The learned features are input into a backbone CNN for feature extraction, finally completing the auxiliary diagnosis of bipolar affective disorder.
The invention provides an emotion disorder diagnosis method and system based on 3D face reconstruction, which fills a gap in the prior art, improves the accuracy of bipolar affective disorder diagnosis, and addresses the shortage of medical staff for diagnosis. The intermittent period between recurrences in bipolar disorder patients is long, usually two to three months, and some depressive features may still persist during the intermittent period, whether the episodes are depressive or manic. Because of the long diagnosis and rehabilitation period, doctors cannot observe changes in the patient's condition over the whole period. The machine-assisted affective disorder diagnosis system provided by the invention can alleviate the shortage of doctors during the intermittent period, better assist doctors in diagnosing the condition in non-clinical scenes, and provide timely feedback to doctors. The invention can improve the accuracy of bipolar affective disorder diagnosis and solve the problems of manual diagnosis.
The invention provides a bipolar affective disorder auxiliary diagnosis system that can be used flexibly in both clinical and non-clinical scenes. First, the face of a tester is reconstructed in three dimensions by the face reconstruction module, separating the overall facial outline, the facial wrinkles and the facial emotion; the model then combines the geometric information of the three-dimensional face model with the estimated expression parameters and feeds them jointly into a multi-modal neural network. The trained emotion classification model performs inference on the input three-dimensional face model and facial expression parameters so as to identify the facial emotion.
The application provides a new detail continuity loss for separating face identity from expression (cheekbones, both eye corners, upper and lower lips, left and right lip corners, lower jaw, neck) and improving diagnostic accuracy. Existing three-dimensional face reconstruction models generally attend to the overall outline of the face, including its shape, size and curves; or to the eyebrows and the overall outline of the eyes; or to the change of mouth shape during speech. By observing visualized statistical plots of the face, the application refines the regions attended to by the digital face reconstruction model, introducing losses on both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw and the zygomatic muscles to focus on specific facial details, together with a detail continuity loss that separates face identity from expression, thereby ensuring continuity and consistency between the original image and the generated image. The detail continuity loss here helps to ensure that the generated image is visually consistent with the emotion expressed by the generated three-dimensional reconstruction. First, the three-dimensional face reconstruction focuses on key points near the eyes, lower jaw, upper and lower lips, left and right lip corners, and nose. Visual analysis of data from patients diagnosed with bipolar affective disorder by professional doctors, shown in the visual statistical plots used for auxiliary diagnosis, indicates that the affect is concentrated near the zygomatic region of the face, so a zygomatic loss is added on top of the original basis.
The specific form of the detail continuity loss includes: feature extraction, typically performed with a pretrained convolutional neural network (CNN); detail feature extraction, in which the feature representation of a blurred version is removed to obtain the detail features; and loss computation, in which the detail continuity loss between the original image and the generated reconstructed face image is calculated and added to the total loss, helping to generate a more vivid and more faithful reconstructed face image. Overall, the optimized loss function is:
L_total = λ_emo·L_emo + λ_pho·L_pho + λ_eye·L_eye + λ_lip·L_lip + λ_corner·L_corner + λ_zyg·L_zyg + λ_reg·L_reg
where L_emo represents the emotion loss, L_pho the photometric loss, L_eye the eye closure loss, L_lip the upper and lower lip closure loss, L_corner the left and right lip corner loss, L_zyg the left and right zygomatic muscle loss, and L_reg the expression regularizer, each weighted by its corresponding λ. Careful attention to both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw and the zygomatic muscle regions addresses the problem that fine facial wrinkles are not evident in face reconstruction, thereby improving the accuracy of machine-assisted diagnosis of bipolar affective disorder, making the overall reconstruction-plus-classification model more accurate, providing doctors with more precise guidance, and alleviating missed diagnoses and misdiagnoses caused by overlooked slight expression changes.
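For illustration, a detail-continuity-style feature loss could be sketched as follows; the choice of a VGG16 feature extractor and an L1 distance is an assumption, and the blur-removal step described above is omitted for brevity.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_backbone = vgg16(weights="DEFAULT").features[:16].eval()   # frozen pretrained feature extractor
for p in _backbone.parameters():
    p.requires_grad_(False)

def detail_continuity_loss(original: torch.Tensor, rendered: torch.Tensor) -> torch.Tensor:
    """original, rendered: (N, 3, H, W) images in [0, 1]; compares CNN features of the two."""
    return F.l1_loss(_backbone(original), _backbone(rendered))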
The application also provides a photometric loss that is robust to face occlusion, illumination, and pose change (with detail separation). The application provides a self-supervised framework for capturing monocular in-the-wild faces, which can analyze emotion from an image and reconstruct a highly expressive three-dimensional face. It solves the problems of inaccurate and unstable three-dimensional face reconstruction and emotion recognition caused in the prior art by the lack of diverse data sets, illumination changes and other factors. The emotion features, which mainly convey the reconstructed expression, are combined with the unique self-supervised framework, and large data sets of different expressions are used to learn and reconstruct the expressive face. The application also uses a novel photometric loss function that takes information such as illumination and reflection in outdoor environments into account, thereby improving reconstruction quality and robustness. The advantage of the application is that it can recover accurate three-dimensional geometry and appearance information, together with the corresponding emotional state, from a single view. It can also handle facial changes, occlusion, and self-occlusion under different emotional states, and can be used in various application scenarios, such as human-computer interaction, social media, and virtual reality.
The first data set described in the invention is a large-scale outdoor face data set containing rich expressions (used for the pre-training model). The second data set described in the invention consists of three categories of data collected at a local hospital: unipolar depression patients, bipolar disorder patients, and healthy controls. For this data set, the doctor observes each subject's response to five stimuli by playing five emotional stimulus videos (happiness, anger, sadness, fear, neutral) to the three categories of testers. In the invention, the first data set is used to train the deep learning model, which serves as the pre-training model; it learns a large number of outdoor facial features and completes the first round of training. After the pre-training model is selected, the pre-trained weights are loaded, the pre-trained model is used as part of the feature extractor, a new model framework is constructed, the newly constructed model is fine-tuned to fit the auxiliary diagnosis task of the invention, and the fine-tuned model is trained on the second data set to complete model transfer, building on the expression details the model has already learned from the large number of outdoor facial features. This training strategy learns a large number of outdoor features, solves the problem of diagnosis being restricted to a single scene, and enables real-time capture of patients in outdoor scenes, so that under conditions of staff shortage the model can monitor the condition of recovering patients in daily life and provide accurate unipolar and bipolar depression diagnosis results in a timely manner.
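The transfer strategy described above (load the weights pre-trained on the first data set, then fine-tune on the second) could be sketched as follows; the checkpoint path and the name of the classification head are hypothetical, not taken from the patent.

import torch

def build_finetune_model(model, checkpoint_path="pretrained_stage1.pth"):
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state, strict=False)        # reuse pre-trained reconstruction weights
    for name, param in model.named_parameters():
        if not name.startswith("classifier"):         # assumed name of the new classification head
            param.requires_grad = False               # freeze the backbone, fine-tune the head only
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    return model, optimizer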
Specifically, an embodiment of the present application provides a three-dimensional digital face processing method for emotion disorder auxiliary diagnosis, as shown in fig. 2, where the method includes:
Step 202, acquiring a first data set and a second data set, wherein the first data set comprises a first face image, the second data set comprises a second face image and an emotion marking result of the second face image, and the emotion marking result comprises: unipolar depression, bipolar disorder, and healthy.
Step 204, training a face shape model by adopting the first data set, wherein the face shape model is used for building a face three-dimensional model according to first information in a first face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the first face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features. The illumination features include lighting features and reflection features; the illumination feature can be determined from the illumination in the face image, or can be augmented by random generation.
Step 206, forming an emotion classification model according to the face shape model trained by the first data set and the classification model used for determining the emotion classification result according to the intermediate result generated by the face shape model, and training the emotion classification model by the second data set.
The method and the device can be applied to scenarios in which a doctor is assisted in diagnosis: the trained model is adopted to recognize the face images of a target person (such as a patient), the emotion analysis result of the target person is determined, and the result is given to the doctor to assist in diagnosis. Specifically, the model can be pre-trained with the unlabeled first data set, and the pre-trained model can then be transferred with the labeled second data set, so that the model forms the capability of identifying affective disorders. Specifically, the emotion classification model in this embodiment is composed of a face shape model and a classification model. The face shape model is used for establishing a face three-dimensional model according to the first information in the face image, and training of the face shape model is completed according to the consistency of the facial expression characteristics of a two-dimensional face image of the face three-dimensional model with those of the face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features. The classification model is used for determining an emotion classification result according to an intermediate result generated by the face shape model, where the intermediate result comprises the face three-dimensional model built by the face shape model and two-dimensional image data generated based on the face three-dimensional model. According to the scheme, the face shape model can be trained with the unlabeled first data set, the emotion classification model is formed from the face shape model trained on the first data set together with the classification model, and the emotion classification model is trained with the labeled second data set. Video data containing face images can then be analyzed with the trained emotion classification model to determine an emotion classification result that assists a doctor in diagnosis, wherein the emotion classification result comprises: unipolar depression, bipolar disorder, and healthy.
The trained emotion classification model can be used to analyze video data of a person of concern in order to assist diagnosis. Specifically, as an optional embodiment, the method further includes: acquiring video data to be analyzed; inputting the video data into the trained emotion classification model for analysis, and determining an emotion classification result, the emotion classification result including: unipolar depression, bipolar disorder, and healthy.
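A minimal inference sketch for this optional embodiment is given below, assuming OpenCV for frame reading and PyTorch for the classifier; the preprocess callable (face cropping and normalization) and the frame-averaging fusion are illustrative assumptions rather than details disclosed in the application.

    # Video-level inference sketch (OpenCV + PyTorch assumed; averaging per-frame
    # logits over the clip is an assumption made for illustration only).
    import cv2
    import torch

    LABELS = ["unipolar depression", "bipolar disorder", "healthy"]

    def classify_video(video_path, emotion_model, preprocess, device="cpu"):
        emotion_model.eval()
        cap = cv2.VideoCapture(video_path)
        logits_sum, n_frames = None, 0
        with torch.no_grad():
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                x = preprocess(frame).unsqueeze(0).to(device)   # crop / normalize one frame
                logits = emotion_model(x)
                logits_sum = logits if logits_sum is None else logits_sum + logits
                n_frames += 1
        cap.release()
        probs = torch.softmax(logits_sum / max(n_frames, 1), dim=-1)
        return LABELS[int(probs.argmax())], probs.squeeze(0).tolist()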
The first data set consists of images that contain both a face and background; in order to reduce the influence of the background, the face region can first be cropped out for subsequent processing, which prevents the model from paying excessive attention to the background. Moreover, training the face shape model with the first data set is unsupervised training, so training can be completed through the consistency between the modeled face three-dimensional model and the facial expression features of the original image. Specifically, as an optional embodiment, the method further includes: acquiring image data, and extracting a target face image from the image data, where the target face image includes a first face image or a second face image. The face shape model completes training according to the following steps: modeling according to the target face image and determining a face three-dimensional model; generating a two-dimensional face image according to the face three-dimensional model; and adjusting the face shape model according to the consistency between the facial expression features of the target face image and those of the two-dimensional face image.
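For the face-cropping pre-processing mentioned above, a simple stand-in using OpenCV's bundled Haar-cascade detector is sketched below; the application does not specify which face detector is used, so the detector choice, margin, and output size are illustrative assumptions.

    # Crop the face region so the model does not attend to the background.
    # The Haar-cascade detector is only a stand-in for an unspecified face detector.
    import cv2

    _detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def crop_face(bgr_image, margin=0.1, size=224):
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        faces = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                                        # no face in this image
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])     # keep the largest detection
        m = int(margin * max(w, h))
        x0, y0 = max(x - m, 0), max(y - m, 0)
        crop = bgr_image[y0:y + h + m, x0:x + w + m]
        return cv2.resize(crop, (size, size))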
The scheme performs rough analysis, expression analysis, and detail analysis in order to model the face three-dimensional model. In addition, the rotation of the neck and of the eyeballs can be modeled, which allows model training to proceed better. Specifically, as an optional embodiment, the modeling according to the target face image and determining the face three-dimensional model includes: performing rough analysis according to the target face image to obtain first features, where the first features include face rough features and illumination features; performing expression analysis according to the target face image and determining facial expression features to obtain second features, where the second features include facial expression features and mandibular posture features, and further include a first rotation feature of the neck and a second rotation feature of the eyeballs; performing detail analysis according to the target face image to obtain third features, where the third features include face detail features covering the following face key points: both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw, and the zygomatic muscles; and determining the face three-dimensional model according to the first features, the second features, and the third features.
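One possible way to organize the three groups of features described above is shown below as a small data container; the FLAME-style grouping and the dimensions in the comments are assumptions made for illustration, not parameters disclosed in the application.

    # Illustrative container for the predicted parameter groups; dimensions in the
    # comments are placeholders, not values taken from the application.
    from dataclasses import dataclass
    import torch

    @dataclass
    class FaceParams:
        shape: torch.Tensor       # first features: facial rough (identity) coefficients
        lighting: torch.Tensor    # first features: illumination coefficients
        expression: torch.Tensor  # second features: facial expression coefficients
        jaw_pose: torch.Tensor    # second features: mandibular posture (jaw rotation)
        neck_pose: torch.Tensor   # second features: first rotation feature (neck)
        eye_pose: torch.Tensor    # second features: second rotation feature (eyeballs)
        detail: torch.Tensor      # third features: detail code around the listed key points

        def as_vector(self) -> torch.Tensor:
            # Flatten all groups into one vector, e.g. as input to a downstream classifier.
            return torch.cat([t.reshape(-1) for t in (
                self.shape, self.lighting, self.expression, self.jaw_pose,
                self.neck_pose, self.eye_pose, self.detail)])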
The present approach may use a diagonal micro-attention mechanism to detect small differences between consecutive video frames, in order to better understand the semantic information carried by consecutive frames. Specifically, as an optional embodiment, the performing detail analysis according to the target face image to obtain the third features includes: performing detail analysis according to the target face image and related face images of the target face image, and determining the third features, where the related face images include the frames immediately before and after the video frame corresponding to the target face image in the video data.
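The exact formulation of the diagonal micro-attention mechanism is not given in this passage; the sketch below is one plausible reading, in which each spatial position of the current frame is compared only with the same position in the neighbouring frames (the "diagonal" of a frame-to-frame attention matrix) and is re-weighted by how strongly it changes. The module name and design are assumptions.

    # One interpretation of diagonal micro-attention: gate current-frame features by
    # their difference to the previous and next frames (assumption, not the exact module).
    import torch
    import torch.nn as nn

    class DiagonalMicroAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.proj = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, prev_feat, cur_feat, next_feat):
            # Larger frame-to-frame change at a position -> larger attention weight there.
            diff = (cur_feat - prev_feat).abs() + (next_feat - cur_feat).abs()
            gate = torch.sigmoid(self.proj(diff))          # (B, C, H, W), values in [0, 1]
            return cur_feat + cur_feat * gate              # residual re-weighting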
The application provides a new detail-continuity loss for separating face identity from expression (covering the cheekbones, both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw, and the neck) and for improving diagnostic accuracy. By inspecting visualized statistical plots of the face, the regions attended to by the digital face reconstruction model were refined: losses on both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw, and the zygomatic muscles make the model focus on these specific facial details, while the detail-continuity loss separates face identity from expression and thereby ensures continuity and consistency between the original image and the generated image. The detail-continuity loss helps to ensure that the emotion expressed by the generated image is visually consistent with the emotion expressed by the generated three-dimensional reconstruction. First, the three-dimensional face reconstruction focuses on the key points near the eyes, the lower jaw, the upper and lower lips, the left and right lip corners, and the nose. In addition, visual analysis of data from patients diagnosed with bipolar affective disorder by professional doctors showed, in the statistical plots used for auxiliary diagnosis, that emotional expression is concentrated near the cheekbones, so a loss on the zygomatic region was added on top of the original terms. Specifically, as an optional embodiment, after generating the two-dimensional face image according to the face three-dimensional model, the method further includes: adjusting the face three-dimensional model according to the loss between the third features of the target face image and the third features of the two-dimensional face image.
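A sketch of such a detail loss restricted to the listed key points, together with a detail-continuity term across adjacent frames, is given below; the 68-point landmark indices, the choice of L1 distance, and the exact form of the continuity term are illustrative assumptions rather than the application's formulation.

    # Detail loss over the listed key points plus a frame-to-frame continuity term.
    # The index set assumes a 68-point landmark layout and is illustrative only.
    import torch

    DETAIL_IDX = [36, 39, 42, 45,   # inner/outer corners of both eyes
                  51, 57, 48, 54,   # upper lip, lower lip, left/right lip corners
                  8,                # chin / lower jaw
                  1, 2, 14, 15]     # points near the zygomatic (cheekbone) region

    def detail_loss(pred_lmk, gt_lmk):
        # pred_lmk, gt_lmk: (B, 68, 2) landmarks from the rendered and original images.
        return (pred_lmk[:, DETAIL_IDX] - gt_lmk[:, DETAIL_IDX]).abs().mean()

    def detail_continuity_loss(pred_t, pred_prev, gt_t, gt_prev):
        # Keep the frame-to-frame motion of the generated details consistent with the
        # motion observed in the original frames (helps separate identity from expression).
        pred_motion = pred_t[:, DETAIL_IDX] - pred_prev[:, DETAIL_IDX]
        gt_motion = gt_t[:, DETAIL_IDX] - gt_prev[:, DETAIL_IDX]
        return (pred_motion - gt_motion).abs().mean()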
After modeling with the face shape model, classification can further be performed by the classification model. Specifically, as an optional embodiment, the intermediate result includes the face three-dimensional model built by the face shape model and two-dimensional image data generated based on the face three-dimensional model, and the classification model is used for: inputting the face three-dimensional model into a first classifier and determining a first analysis result; acquiring expression parameters from the two-dimensional image data corresponding to the face three-dimensional model, inputting the expression parameters into a second classifier, and determining a second analysis result; and performing joint analysis on the first analysis result and the second analysis result to determine the emotion analysis result. Training the model with the second data set is supervised training, which can be completed by using the difference between the labels corresponding to the data and the model's predictions. Specifically, as an optional embodiment, training the emotion classification model with the second data set includes: acquiring the emotion analysis result corresponding to the second face image, and adjusting the emotion classification model in combination with the emotion marking result of the second face image.
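The two-branch classification described here can be sketched as follows; the hidden layer sizes and the late-fusion rule (averaging the two sets of logits) are assumptions, since the application only states that the two analysis results are combined in a joint analysis.

    # Two classifiers (3D-model branch and expression-parameter branch) with a simple
    # late fusion; layer sizes and the averaging fusion are illustrative assumptions.
    import torch
    import torch.nn as nn

    class JointEmotionClassifier(nn.Module):
        def __init__(self, mesh_feat_dim, expr_dim, num_classes=3):
            super().__init__()
            self.mesh_head = nn.Sequential(        # first classifier: face 3D model features
                nn.Linear(mesh_feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
            self.expr_head = nn.Sequential(        # second classifier: expression parameters
                nn.Linear(expr_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

        def forward(self, mesh_features, expression_params):
            logits_3d = self.mesh_head(mesh_features)        # first analysis result
            logits_expr = self.expr_head(expression_params)  # second analysis result
            return (logits_3d + logits_expr) / 2             # joint analysis (assumed fusion)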
On the basis of the above embodiments, an embodiment of the present application further provides a three-dimensional digital face processing device for emotion disorder auxiliary diagnosis; as shown in fig. 3, the device includes:
The data set acquisition module 302 is configured to acquire a first data set and a second data set, where the first data set includes a first face image, the second data set includes a second face image and an emotion marking result of the second face image, and the emotion marking result includes: unipolar depression, bipolar disorder, and healthy.
The first model training module 304 is configured to train a face shape model with the first data set, where the face shape model is used for building a face three-dimensional model according to first information in a first face image, and training of the face shape model is completed according to the consistency between a two-dimensional face image generated from the face three-dimensional model and the facial expression features of the first face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features.
The second model training module 306 is configured to form an emotion classification model from the face shape model trained with the first data set and a classification model used for determining an emotion classification result according to an intermediate result generated by the face shape model, and to train the emotion classification model with the second data set.
The implementation of this embodiment is similar to that of the method embodiment; for specific implementation details, reference may be made to the method embodiment, and they are not repeated here.
On the basis of the above embodiments, the present application further provides an electronic device, including: a memory and at least one processor; the memory is configured to store computer-executable instructions; and the at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the above embodiments.
The embodiment of the invention also provides a computer-readable storage medium, on which a computer program is stored; when the program is executed by a processor, it implements the processes of the above data processing method embodiment and can achieve the same technical effects, which are not repeated here to avoid repetition. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present invention and is not intended to limit the present invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included in the scope of the claims of the present invention.

Claims (10)

1. A method of three-dimensional digital face processing for emotion disorder assisted diagnosis, the method comprising:
acquiring a first data set and a second data set, wherein the first data set comprises a first face image, the second data set comprises a second face image and an emotion marking result of the second face image, and the emotion marking result comprises: unipolar depression, bipolar disorder, and healthy;
Training a face shape model by adopting a first data set, wherein the face shape model is used for building a face three-dimensional model according to first information in a first face image, and completing training of the face shape model according to consistency of a two-dimensional face image of the face three-dimensional model and facial expression characteristics of the first face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features;
And forming an emotion classification model according to the face shape model trained by the first data set and the classification model used for determining the emotion classification result according to the intermediate result generated by the face shape model, and training the emotion classification model by the second data set.
2. The method according to claim 1, wherein the method further comprises:
acquiring video data to be analyzed;
inputting the video data into the trained emotion classification model for analysis, and determining an emotion classification result, wherein the emotion classification result comprises: unipolar depression, bipolar disorder, and healthy.
3. The method according to claim 1, wherein the method further comprises:
Acquiring image data, and extracting a target face image in the image data, wherein the target face image comprises a first face image or a second face image;
the face shape model is used for completing training according to the following steps:
modeling is carried out according to the target face image, and a face three-dimensional model is determined;
Generating a two-dimensional face image according to the face three-dimensional model;
and adjusting the face shape model according to the consistency of the facial expression characteristics of the target face image and the facial expression characteristics of the two-dimensional face image.
4. A method according to claim 3, wherein said modeling from the target face image to determine a three-dimensional model of the face comprises:
performing rough analysis according to a target face image to obtain first features, wherein the first features comprise face rough features and illumination features;
performing expression analysis according to the target facial image, and determining facial expression characteristics to determine second characteristics, wherein the second characteristics comprise: facial expression features and mandibular posture features; the second feature further comprises a first rotation feature of the neck and a second rotation feature of the eyeball;
performing detail analysis according to the target face image to obtain a third feature, wherein the third feature comprises: face detail features including features of the following face key points: both eye corners, the upper and lower lips, the left and right lip corners, the lower jaw, and the zygomatic muscle;
And determining the three-dimensional model of the face according to the first feature, the second feature and the third feature.
5. The method of claim 4, wherein the performing detail analysis according to the target face image to obtain a third feature includes:
And carrying out detail analysis according to the target face image and related face images of the target face image, and determining a third characteristic, wherein the related face images comprise images of the front frame and the rear frame of a video frame corresponding to the target face image in video data.
6. The method of claim 4, wherein after generating the two-dimensional face image from the three-dimensional model of the face, the method further comprises:
and adjusting the three-dimensional model of the face according to the loss between the third feature of the target face image and the third feature of the two-dimensional face image.
7. The method of claim 1, wherein the intermediate results include a face three-dimensional model modeled by a face shape model and two-dimensional image data generated based on the face three-dimensional model, the classification model being configured to:
Inputting the three-dimensional model of the human face into a first classifier, and determining a first analysis result;
Acquiring expression parameters according to two-dimensional image data corresponding to the face three-dimensional model, inputting the expression parameters into a second classifier, and determining a second analysis result;
and carrying out joint analysis according to the first analysis result and the second analysis result to determine an emotion analysis result.
8. The method of claim 7, wherein training the emotion classification model using the second data set comprises:
And acquiring an emotion analysis result corresponding to the second face image, and adjusting the emotion classification model by combining an emotion marking result of the second face image.
9. A three-dimensional digital face processing apparatus for emotion disorder assisted diagnosis, the apparatus comprising:
The data set acquisition module is used for acquiring a first data set and a second data set, wherein the first data set comprises a first face image, the second data set comprises a second face image and an emotion marking result of the second face image, and the emotion marking result comprises: unipolar depression, bipolar disorder, and healthy;
The first model training module is used for training a face shape model by adopting a first data set, wherein the face shape model is used for building a face three-dimensional model according to first information in a first face image, and training the face shape model according to consistency of a two-dimensional face image of the face three-dimensional model and facial expression characteristics of the first face image; the first information includes: facial rough features, facial expression features, facial detail features, and illumination features;
the second model training module is used for forming an emotion classification model according to the face shape model trained by the first data set and the classification model used for determining the emotion classification result according to the intermediate result generated by the face shape model, and training the emotion classification model by the second data set.
10. An electronic device, comprising: a memory and at least one processor;
the memory is used for storing computer execution instructions;
The at least one processor is configured to execute computer-executable instructions stored in the memory, such that the at least one processor performs the method of any one of claims 1-8.
CN202410269906.7A 2024-03-11 2024-03-11 Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis Active CN117894057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410269906.7A CN117894057B (en) 2024-03-11 2024-03-11 Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410269906.7A CN117894057B (en) 2024-03-11 2024-03-11 Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis

Publications (2)

Publication Number Publication Date
CN117894057A true CN117894057A (en) 2024-04-16
CN117894057B CN117894057B (en) 2024-06-04

Family

ID=90652000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410269906.7A Active CN117894057B (en) 2024-03-11 2024-03-11 Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis

Country Status (1)

Country Link
CN (1) CN117894057B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140242560A1 (en) * 2013-02-15 2014-08-28 Emotient Facial expression training using feedback from automatic facial expression recognition
CN104598913A (en) * 2013-10-30 2015-05-06 广州华久信息科技有限公司 Face-based emotional health promotion method and system
US20150242678A1 (en) * 2014-02-21 2015-08-27 Electronics And Telecommunications Research Institute Method and apparatus of recognizing facial expression using adaptive decision tree based on local feature extraction
CN109377557A (en) * 2018-11-26 2019-02-22 中山大学 Real-time three-dimensional facial reconstruction method based on single frames facial image
CN112164135A (en) * 2020-09-30 2021-01-01 山西智优利民健康管理咨询有限公司 Virtual character image construction device and method
US20220230632A1 (en) * 2021-01-21 2022-07-21 Accenture Global Solutions Limited Utilizing machine learning models to generate automated empathetic conversations
CN114219005A (en) * 2021-11-17 2022-03-22 太原理工大学 Depression classification method based on high-order spectral voice features
CN114187407A (en) * 2021-12-15 2022-03-15 华南师范大学 Three-dimensional face reconstruction method and device, electronic equipment and storage medium
WO2023198224A1 (en) * 2022-04-13 2023-10-19 四川大学华西医院 Method for constructing magnetic resonance image preliminary screening model for mental disorders
CN115471611A (en) * 2022-09-22 2022-12-13 杭州师范大学 Method for improving visual effect of 3DMM face model
CN116563506A (en) * 2023-05-10 2023-08-08 天津萨图芯科技有限公司 Three-dimensional table face restoration method, system and equipment based on XR equipment in live broadcast scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANGYU ZHU et al.: "Beyond 3DMM Space: Towards Fine-grained 3D Face Reconstruction", Computer Vision – ECCV 2020, 30 November 2020 (2020-11-30), pages 343 - 358 *
QI Hongli; HU Shaohua; WEI Ning; HU Chanchan; CHEN Jingkai; XU Yi: "A study of neurocognitive impairment in patients with bipolar II disorder during depressive episodes", Zhejiang Medical Journal, no. 09, 31 May 2016 (2016-05-31), pages 43 - 45 *

Also Published As

Publication number Publication date
CN117894057B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
Zhang et al. Facial expression analysis under partial occlusion: A survey
CN112990054B (en) Compact linguistics-free facial expression embedding and novel triple training scheme
CN110678875B (en) System and method for guiding a user to take a self-photograph
US12039454B2 (en) Microexpression-based image recognition method and apparatus, and related device
Dibeklioğlu et al. Combining facial dynamics with appearance for age estimation
Murtaza et al. Analysis of face recognition under varying facial expression: a survey.
Szwoch et al. Facial emotion recognition using depth data
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
CN111539911B (en) Mouth breathing face recognition method, device and storage medium
CN114120432A (en) Online learning attention tracking method based on sight estimation and application thereof
Cornejo et al. Emotion recognition from occluded facial expressions using weber local descriptor
KR102373608B1 (en) Electronic apparatus and method for digital human image formation, and program stored in computer readable medium performing the same
KR102373606B1 (en) Electronic apparatus and method for image formation, and program stored in computer readable medium performing the same
Murugappan et al. Facial expression classification using KNN and decision tree classifiers
Dadiz et al. Detecting depression in videos using uniformed local binary pattern on facial features
Kuznetsova et al. Using computer vision to analyze non-manual marking of questions in KRSL
CN113076918B (en) Video-based facial expression cloning method
Wang et al. Style transformed synthetic images for real world gaze estimation by using residual neural network with embedded personal identities
CN117894057B (en) Three-dimensional digital face processing method and device for emotion disorder auxiliary diagnosis
CN114841399A (en) Personality portrait generation method and system based on audio and video multi-mode feature fusion
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
Lou Facial Performance Capture from Visual Input and EMG Signals
Zeng et al. Research on the Integration of Computer Human-computer Interaction Technology and Multimedia Medical Big Data Interaction System
JP2023038870A (en) Impression evaluation method and impression evaluation system
Nguyen Enhanced facial behavior recognition and rehabilitation using 3D biomechanical features and deep learning approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant