CN112149651A - Facial expression recognition method, device and equipment based on deep learning - Google Patents

Facial expression recognition method, device and equipment based on deep learning

Info

Publication number
CN112149651A
CN112149651A (application no. CN202011345478.XA; granted as CN112149651B)
Authority
CN
China
Prior art keywords
image
facial
facial expression
network
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011345478.XA
Other languages
Chinese (zh)
Other versions
CN112149651B (en
Inventor
李天驰
孙悦
王帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dianmao Technology Co Ltd
Original Assignee
Shenzhen Dianmao Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dianmao Technology Co Ltd filed Critical Shenzhen Dianmao Technology Co Ltd
Priority to CN202011345478.XA priority Critical patent/CN112149651B/en
Publication of CN112149651A publication Critical patent/CN112149651A/en
Application granted granted Critical
Publication of CN112149651B publication Critical patent/CN112149651B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method, device and equipment based on deep learning, wherein the method comprises the following steps: acquiring an original facial expression image, and inputting the original facial expression image into a generative adversarial network; acquiring a synthesized face image generated by the generative adversarial network from the current facial expression image; constructing a facial expression recognition network, training the facial expression recognition network on the original face data and the synthesized face data, and generating a target recognition network; acquiring a face image to be recognized, inputting the face image to be recognized into the target recognition model, obtaining an output result of the target recognition model, and determining the facial expression in the face image according to the output result. The embodiment of the invention is based on a generative adversarial network and performs facial expression synthesis and recognition jointly in a unified framework, which speeds up facial expression recognition and improves recognition accuracy.

Description

Facial expression recognition method, device and equipment based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a facial expression recognition method, device and equipment based on deep learning.
Background
With the rise of the artificial intelligence industry, facial expression recognition based on deep learning has attracted growing attention. In a live-streamed online class in particular, analyzing the facial expressions of students in the video reveals their current attentiveness, which helps teachers manage the class and teach. In recent years, deep learning has achieved excellent results in many computer vision tasks such as image classification and face recognition, and facial expression recognition methods based on deep learning have been developed as well. However, the performance of these methods in unconstrained environments is still far from satisfactory. One reason is that public facial expression databases typically contain only a small amount of training data. Although a rich variety of face images is available on the internet, manually labeling these images is time-consuming and expensive, so training a deep neural network with a limited amount of training data is not trivial. To alleviate the problems caused by insufficient training data, researchers have tried training neural networks with auxiliary data to achieve effective facial expression recognition. For example, some researchers proposed a deep convolutional neural network (CNN) structure for a mixed database (consisting of 7 different facial expression databases) to achieve comparable results on each database. However, because of the bias between these databases, such a training strategy can cause over-fitting and degrade performance on the target database. Other researchers first pre-train CNNs on large-scale face recognition databases or other large image databases, and then fine-tune them on the target facial expression database.
However, because such deep networks are pre-trained on high-capacity, large-scale data, designing an appropriate fine-tuning strategy often requires considerable effort.
Existing methods based on generative adversarial networks can generate realistic face images with the same identity as an input face image. Because these synthesized images resemble real images, they can be varied under different conditions. However, some problems remain unsolved when such synthetic images are used directly to train a deep neural network. For example, generating high-quality images that approximate the distribution of real images often demands considerable effort from the generative adversarial network, especially when only limited training data is available. Furthermore, even when the quality of the synthesized images is high (even indistinguishable to the human eye), there is no guarantee that they effectively improve the performance of the deep neural network, because the inherent data bias between synthetic and real images can be large from the recognition network's point of view.
Therefore, in the prior art, the training strategy in deep-learning-based face recognition methods can cause over-fitting, degrade performance on the target database, and reduce face recognition accuracy.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
In view of the defects of the prior art, the present invention aims to provide a facial expression recognition method, apparatus and device based on deep learning, so as to solve the technical problems in the prior art that the training strategy in deep-learning-based face recognition methods causes over-fitting, degrades performance on the target database, and reduces face recognition accuracy.
The technical scheme of the invention is as follows:
a facial expression recognition method based on deep learning, the method comprising:
acquiring an original facial expression image, and inputting the original facial expression image into a generative adversarial network;
acquiring a synthesized face image generated by the generative adversarial network from the current facial expression image;
constructing a facial expression recognition network, training the facial expression recognition network on the original face data and the synthesized face data, and generating a target recognition network;
acquiring a face image to be recognized, inputting the face image to be recognized into the target recognition model, obtaining an output result of the target recognition model, and determining the facial expression in the face image according to the output result.
Further, before the acquiring of the original facial expression image and inputting it into the generative adversarial network, the method includes:
constructing an initial generative adversarial network, and training the initial generative adversarial network in advance to obtain the trained generative adversarial network.
Further preferably, the generative adversarial network comprises a generator and two discriminators;
the constructing of the facial expression recognition network, training it on the original face data and the synthesized face data, and generating the target recognition network then further comprises:
training the generator of the generative adversarial network with the classification loss that the synthesized face image obtains from the facial expression recognition network.
Further preferably, the acquiring of the synthesized face image generated by the generative adversarial network from the current facial expression image includes:
separating, by the generator of the generative adversarial network, the image information of the current facial expression image through a custom graph convolution layer, and generating the synthesized face image through encoding and decoding.
Preferably, the generating of the synthesized face image through encoding and decoding includes:
extracting image features through a convolutional neural network and compressing the image into a predetermined number of feature vectors to complete the encoding;
restoring a low-level image from the feature vectors through deconvolution layers to generate the synthesized face image and complete the decoding.
Further, after training the generator of the generative adversarial network with the classification loss that the synthesized face image obtains from the facial expression recognition network, the method further comprises:
guiding a back-propagation algorithm with the real data, so as to supervise the learning of the features of the synthesized face image.
Further, the obtaining of the output result of the target recognition model and the determining of the facial expression in the face image according to the output result include:
predicting the class of the face image to be recognized through the target recognition model, and determining the facial expression in the face image according to the predicted class.
Another embodiment of the present invention provides a facial expression recognition device based on deep learning, including:
an original image acquisition module, configured to acquire an original facial expression image and input it into a generative adversarial network;
a synthesized face image generation module, configured to acquire a synthesized face image generated by the generative adversarial network from the current facial expression image;
a target recognition network training module, configured to construct a facial expression recognition network, train it on the original face data and the synthesized face data, and generate a target recognition network;
a facial expression recognition module, configured to acquire a face image to be recognized, input it into the target recognition model, obtain an output result of the target recognition model, and determine the facial expression in the face image according to the output result.
Another embodiment of the present invention provides a facial expression recognition apparatus based on deep learning, the apparatus comprising at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described deep-learning-based facial expression recognition method.
Another embodiment of the present invention also provides a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the above-described deep-learning-based facial expression recognition method.
Advantageous effects: the embodiment of the invention is based on a generative adversarial network and performs facial expression synthesis and recognition jointly in a unified framework, which speeds up facial expression recognition and improves recognition accuracy.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a facial expression recognition method based on deep learning according to a preferred embodiment of the present invention;
FIG. 2 is a functional block diagram of an embodiment of a deep learning-based facial expression recognition apparatus according to the present invention;
fig. 3 is a schematic diagram of a hardware structure of a facial expression recognition device based on deep learning according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. Embodiments of the present invention are described below with reference to the accompanying drawings.
The embodiment of the invention provides a facial expression recognition method based on deep learning. Referring to fig. 1, fig. 1 is a flowchart illustrating a method for recognizing facial expressions based on deep learning according to a preferred embodiment of the present invention. As shown in fig. 1, it includes the steps of:
Step S100, acquiring an original facial expression image, and inputting the original facial expression image into a generative adversarial network;
Step S200, acquiring a synthesized face image generated by the generative adversarial network from the current facial expression image;
Step S300, constructing a facial expression recognition network, training the facial expression recognition network on the original face data and the synthesized face data, and generating a target recognition network;
Step S400, acquiring a face image to be recognized, inputting the face image to be recognized into the target recognition model, obtaining an output result of the target recognition model, and determining the facial expression in the face image according to the output result.
In specific implementation, the embodiment of the invention is mainly used for facial expression recognition in live-streamed online courses. The embodiment performs facial expression synthesis and recognition jointly in a unified framework, thereby improving the performance of each task. The method involves a two-stage learning process. First, a generative adversarial network (GAN) synthesizes face images with different facial expressions as a preprocessing step. Second, the recognition network is trained jointly in a unified framework with the pre-trained model, and the face image to be recognized is classified by the jointly trained recognition network to obtain the type of facial expression. A generative adversarial network is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions in recent years. The model produces reasonably good output through the mutual game between (at least) two modules in the framework: a generative model and a discriminative model.
The method first feeds the existing facial expression data into the generative adversarial network, which generates new and richer facial expression data from it. All of the data is then fed into the facial expression recognition network, which has two tasks: training a facial expression recognition model, and judging whether the new data produced by the generative adversarial network is beneficial for training that model, feeding the result back to the generative adversarial network. The two modules thus interact and constrain each other, forming a closed loop. In this way, facial expression synthesis and recognition are performed in a unified framework, and the performance of each task improves.
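The closed loop described above can be sketched with simple linear stand-ins for both networks. This is purely illustrative: the dimensions, learning rates and the additive-offset "generator" are assumptions for demonstration, not the patent's actual architecture. A linear softmax classifier plays the recognition network, and the classification loss on the synthetic samples is back-propagated into the generator's parameter.

```python
# Toy sketch of the synthesis/recognition closed loop (all names,
# dimensions and learning rates are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 3                      # feature dimension, number of expression classes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, y):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

# "Real" data, and a generator parameterised by a simple additive offset.
X_real = rng.normal(size=(30, D))
y_real = rng.integers(0, C, 30)
offset = np.zeros(D)             # generator parameter (toy stand-in for G)
W = np.zeros((D, C))             # recognition-network parameters

for _ in range(50):
    X_syn = X_real + offset      # "synthesis" step
    X = np.vstack([X_real, X_syn])
    y = np.concatenate([y_real, y_real])
    # Train the recognizer on real + synthetic data (one gradient step).
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1
    W -= 0.1 * X.T @ p / len(y)
    # Feed the synthetic-sample classification loss back to the generator.
    p_syn = softmax(X_syn @ W)
    p_syn[np.arange(len(y_real)), y_real] -= 1
    offset -= 0.05 * (p_syn @ W.T).mean(axis=0)

loss = ce_loss(W, X_real, y_real)
```

After the loop, the recognizer's loss on the real data is below the uniform-prediction starting point, showing that training on the combined real and generator-adjusted data still fits the real distribution.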
Further, before acquiring an original facial expression image and inputting it into the generative adversarial network, the method includes:
constructing an initial generative adversarial network, and training the initial generative adversarial network in advance to obtain the trained generative adversarial network.
In specific implementation, an initial generative adversarial network is constructed for synthesizing facial expressions. This network, composed of one generator G and two discriminators D, is trained in advance to generate synthesized face images, thereby producing the target adversarial network. To improve the quality of the generated face images, both the adversarial learning between the generator and the discriminators and the content learning between the real image and the generated image must be performed.
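As a structural sketch only of the one-generator/two-discriminator setup described above (plain matrices stand in for the actual convolutional networks, and all sizes are assumed), the loss terms might be assembled as follows: an adversarial term on the image, a second adversarial term on the latent code, and a content term keeping the synthesis close to the real input.

```python
# Structural sketch (illustrative, not the patent's networks): one
# generator G plus two discriminators, with adversarial and content terms.
import numpy as np

rng = np.random.default_rng(1)
IMG, Z = 16, 4                    # flattened image size, latent size (toy)

W_enc = rng.normal(scale=0.1, size=(IMG, Z))   # encoder half of G
W_dec = rng.normal(scale=0.1, size=(Z, IMG))   # decoder half of G
w_d = rng.normal(scale=0.1, size=IMG)          # discriminator 1: real vs fake image
w_z = rng.normal(scale=0.1, size=Z)            # discriminator 2: latent vs prior

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(x):
    return np.tanh(x @ W_enc) @ W_dec          # encode then decode

x_real = rng.normal(size=(8, IMG))
x_fake = generator(x_real)

# Adversarial loss on images: D1 scores real high and fake low.
d_loss = (-np.log(sigmoid(x_real @ w_d) + 1e-9).mean()
          - np.log(1 - sigmoid(x_fake @ w_d) + 1e-9).mean())
# Second discriminator compares the latent code with a prior sample.
z = np.tanh(x_real @ W_enc)
z_prior = rng.normal(size=z.shape)
dz_loss = (-np.log(sigmoid(z_prior @ w_z) + 1e-9).mean()
           - np.log(1 - sigmoid(z @ w_z) + 1e-9).mean())
# Content loss keeps the synthesized image close to the real input.
content_loss = np.mean((x_fake - x_real) ** 2)
```

Training would alternate minimizing the generator's objective (fooling both discriminators while keeping the content loss small) with maximizing the discriminators' accuracy.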
Further, the generative adversarial network comprises a generator and two discriminators;
constructing the facial expression recognition network, training it on the original face data and the synthesized face data, and generating the target recognition network then further comprises:
training the generator of the generative adversarial network with the classification loss that the synthesized face image obtains from the facial expression recognition network.
In specific implementation, the recognition network R is introduced and trained jointly with the pre-trained generative adversarial module. The recognition network R performs facial expression classification, taking as input both the real face images and the synthesized face images produced by the generative adversarial module. At the same time, the generator of the module is trained with the classification loss that the synthesized images obtain from the recognition network. In this way, the generative adversarial module learns to generate face images that benefit the training of the recognition network.
Further, acquiring the synthesized face image generated by the generative adversarial network from the current facial expression image comprises:
separating, by the generator of the generative adversarial network, the image information of the current facial expression image through a custom graph convolution layer, and generating the synthesized face image through encoding and decoding.
In specific implementation, the generator of the generative adversarial network mainly comprises an encoding module and a decoding module. More specifically, the generator separates the style and content of an image through a custom graph convolution layer, and generates a new image through encoding and decoding.
Further, generating the synthesized face image through encoding and decoding comprises:
extracting image features through a convolutional neural network and compressing the image into a predetermined number of feature vectors to complete the encoding;
restoring a low-level image from the feature vectors through deconvolution layers to generate the synthesized face image and complete the decoding.
In specific implementation, the encoding module is a residual network composed of convolutional layers and residual blocks; it extracts image features through the convolutional neural network and compresses the image into a fixed number of feature vectors.
The decoding module is likewise composed of convolutional layers and residual blocks; it restores low-level features from the feature vectors using deconvolution layers and finally generates an image.
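The encode-compress-decode path described above can be sketched minimally. Plain matrix projections here stand in for the convolutional, residual and deconvolution layers named in the text, and the image size and vector counts are assumed for illustration: the image is reduced to a fixed number of feature vectors, and an image of the original size is then restored from them.

```python
# Minimal encode/decode sketch (illustrative dimensions; matrices stand in
# for the conv / residual / deconvolution layers described in the text).
import numpy as np

rng = np.random.default_rng(2)
H = W = 8                         # toy image size
N_VEC, VEC_DIM = 4, 8             # the "predetermined number of feature vectors"

enc = rng.normal(scale=0.1, size=(H * W, N_VEC * VEC_DIM))
dec = rng.normal(scale=0.1, size=(N_VEC * VEC_DIM, H * W))

def encode(img):
    # Stand-in for conv layers + residual blocks: project the image down
    # and reshape into N_VEC feature vectors of length VEC_DIM.
    return np.tanh(img.reshape(-1) @ enc).reshape(N_VEC, VEC_DIM)

def decode(feats):
    # Stand-in for the deconvolution layers: restore an H x W image
    # from the compressed feature vectors.
    return (feats.reshape(-1) @ dec).reshape(H, W)

img = rng.normal(size=(H, W))
feats = encode(img)
recon = decode(feats)
```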
Further, after training the generator of the generative adversarial network with the classification loss that the synthesized face image obtains from the facial expression recognition network, the method further comprises:
guiding a back-propagation algorithm with the real data, so as to supervise the learning of the features of the synthesized face image.
In specific implementation, to mitigate the effect of the data bias between real and synthetic images, an intra-class loss is introduced to reduce intra-class variation between them. In particular, a specially designed RDBP (Real Data Back Propagation) algorithm is proposed: RDBP supervises the learning of the features of the synthesized face images with the features of the real face images. The recognition network can thus make full use of the facial expression information in both the real and the synthetic images, significantly improving facial expression recognition performance.
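One plausible form of the intra-class loss idea behind RDBP (the exact formulation is not given in this text, so this is an assumption): pull each synthetic image's feature toward the mean feature of real images of the same class, treating the real features as fixed supervision so gradients flow only into the synthetic branch.

```python
# Sketch of an intra-class loss in the spirit of RDBP (assumed form, not
# the patent's exact algorithm): real-class feature means act as fixed
# targets for the synthetic features of the same class.
import numpy as np

def intra_class_loss(feat_real, feat_syn, labels_real, labels_syn):
    """Mean squared distance from each synthetic feature to the mean
    real feature of its class; real features are constants (no gradient)."""
    loss, n = 0.0, 0
    for c in np.unique(labels_syn):
        real_c = feat_real[labels_real == c]
        if len(real_c) == 0:
            continue                          # no real supervision for this class
        center = real_c.mean(axis=0)          # fixed target from real data
        diff = feat_syn[labels_syn == c] - center
        loss += (diff ** 2).sum()
        n += diff.shape[0]
    return loss / max(n, 1)

feat_real = np.array([[0.0, 0.0], [2.0, 2.0]])   # class-0 mean is (1, 1)
feat_syn = np.array([[1.0, 1.0], [3.0, 1.0]])
loss = intra_class_loss(feat_real, feat_syn,
                        labels_real=np.array([0, 0]),
                        labels_syn=np.array([0, 0]))
```

Here the first synthetic feature sits exactly on the class mean (zero contribution) and the second is distance 2 away along one axis, giving an average loss of 2.0.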
Further, obtaining the output result of the target recognition model and determining the facial expression in the face image according to the output result comprise:
predicting the class of the face image to be recognized through the target recognition model, and determining the facial expression in the face image according to the predicted class.
In specific implementation, one discriminator provides two outputs: one judges through adversarial learning whether the input image is "real" or "fake", and the other predicts the class of the input image. The other discriminator judges whether its input is a latent face representation or a random vector sampled from a prior distribution.
Thus the class of the face image to be recognized can be predicted by the recognition model, and the facial expression in the face image can be determined from that class. Facial expressions include, but are not limited to, happy, sad, surprised, afraid, angry, disgusted, and neutral.
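The final prediction step can be sketched as a mapping from class scores to an expression label. The seven labels follow the list in the text; the score vector and softmax post-processing are illustrative assumptions about what the recognition model outputs.

```python
# Sketch of class prediction -> expression label (labels from the text;
# the score values and softmax step are illustrative assumptions).
import numpy as np

EXPRESSIONS = ["happy", "sad", "surprised", "afraid",
               "angry", "disgust", "neutral"]

def predict_expression(scores):
    """Map a vector of 7 class scores to an expression name and confidence."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    probs = e / e.sum()                       # softmax over the 7 classes
    return EXPRESSIONS[int(probs.argmax())], float(probs.max())

label, confidence = predict_expression([0.1, 0.0, 2.5, -1.0, 0.3, 0.0, 0.2])
```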
As described above, the facial expression recognition method based on deep learning performs facial expression synthesis and recognition in a unified framework and improves the performance of each task. The method involves a two-stage learning process. First, the facial expressions are passed to the generative adversarial module for preprocessing, producing synthesized face images with different facial expressions. Second, the recognition network is trained jointly in a unified framework with the pre-trained model, so that the accuracy of facial expression recognition is remarkably improved.
It should be noted that the above steps do not necessarily follow a fixed order; as those skilled in the art will understand from this description, in different embodiments the steps may be executed in different orders, in parallel, or interchangeably.
Another embodiment of the present invention provides a facial expression recognition apparatus based on deep learning, as shown in fig. 2, the apparatus 1 includes:
an original image acquisition module 11, configured to acquire an original facial expression image and input it into a generative adversarial network;
a synthesized face image generation module 12, configured to acquire a synthesized face image generated by the generative adversarial network from the current facial expression image;
a target recognition network training module 13, configured to construct a facial expression recognition network, train it on the original face data and the synthesized face data, and generate a target recognition network;
a facial expression recognition module 14, configured to acquire a face image to be recognized, input it into the target recognition model, obtain an output result of the target recognition model, and determine the facial expression in the face image according to the output result.
The specific implementation is shown in the method embodiment, and is not described herein again.
Another embodiment of the present invention provides a facial expression recognition device based on deep learning, as shown in fig. 3, the device 10 includes:
one or more processors 110 and a memory 120, where one processor 110 is illustrated in fig. 3, the processor 110 and the memory 120 may be connected by a bus or other means, and the connection by the bus is illustrated in fig. 3.
Processor 110 is operative to implement various control logic of apparatus 10, which may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a single chip, an ARM (Acorn RISC machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. Also, the processor 110 may be any conventional processor, microprocessor, or state machine. Processor 110 may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The memory 120 is a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions corresponding to the deep learning-based facial expression recognition method in the embodiment of the present invention. The processor 110 executes various functional applications and data processing of the device 10, i.e., implementing the deep learning based facial expression recognition method in the above-described method embodiments, by running non-volatile software programs, instructions and units stored in the memory 120.
The memory 120 may include a storage program area and a storage data area, wherein the storage program area may store an application program required for operating the device, at least one function; the storage data area may store data created according to the use of the device 10, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more units are stored in the memory 120, and when executed by the one or more processors 110, perform the method for deep learning based facial expression recognition in any of the above-described method embodiments, e.g., performing the above-described method steps S100-S400 in fig. 1.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, e.g., to perform method steps S100-S400 of fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described herein are intended to comprise one or more of these and/or any other suitable types of memory.
Another embodiment of the present invention provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method for deep learning based facial expression recognition of the above-described method embodiment. For example, the method steps S100 to S400 in fig. 1 described above are performed.
The above-described embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, or by hardware alone. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, can be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or of some parts of the embodiments.
Conditional language such as "can," "might," or "may," unless specifically stated otherwise or otherwise understood within the context as used, is generally intended to convey that certain embodiments can include certain features, elements, and/or operations while other embodiments do not. Thus, such conditional language is not generally intended to imply that features, elements, and/or operations are in any way required for one or more embodiments, or that one or more embodiments must include logic for deciding, with or without input or prompting, whether such features, elements, and/or operations are included or are to be performed in any particular embodiment.
What has been described in this specification and the drawings includes examples that can provide a deep learning-based facial expression recognition method and apparatus. It is, of course, not possible to describe every conceivable combination of components and/or methodologies for purposes of describing the various features of the disclosure, but it can be appreciated that many further combinations and permutations of the disclosed features are possible. It is therefore evident that various modifications can be made to the disclosure without departing from its scope or spirit. Other embodiments of the disclosure may, in addition or in the alternative, be apparent from consideration of the specification and drawings and from practice of the disclosure as presented herein. It is intended that the examples set forth in this specification and the drawings be considered in all respects as illustrative and not restrictive. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (10)

1. A facial expression recognition method based on deep learning is characterized by comprising the following steps:
acquiring an original facial expression image, and inputting the original facial expression image into a generative adversarial network;
acquiring a synthesized facial image generated by the generative adversarial network from the current facial expression image;
constructing a facial expression recognition network, training the facial expression recognition network on the original facial data and the synthesized facial data, and generating a target recognition model;
acquiring a facial image to be recognized, inputting the facial image to be recognized into the target recognition model, acquiring an output result of the target recognition model, and determining the facial expression in the facial image according to the output result.
2. The deep learning-based facial expression recognition method according to claim 1, wherein before the acquiring an original facial expression image and inputting the original facial expression image into the generative adversarial network, the method comprises:
constructing an initial generative adversarial network, and pre-training the initial generative adversarial network to obtain the trained generative adversarial network.
3. The deep learning-based facial expression recognition method according to claim 2, wherein the generative adversarial network comprises a generator and two discriminators;
the constructing a facial expression recognition network, training the facial expression recognition network on the original facial data and the synthesized facial data, and generating a target recognition model further comprises:
training the generator of the generative adversarial network according to the classification loss that the synthesized facial image obtains from the facial expression recognition network.
4. The deep learning-based facial expression recognition method according to claim 3, wherein the acquiring a synthesized facial image generated by the generative adversarial network from the current facial expression image comprises:
separating, by the generator of the generative adversarial network, the image information of the current facial expression image through a custom graph convolution layer, and generating the synthesized facial image through encoding and decoding.
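The patent does not specify its "custom graph convolution layer"; as a hedged illustration, one common graph-convolution form propagates node features through a normalized adjacency matrix, H' = D^(-1/2)(A+I)D^(-1/2) H W. The landmark graph, feature matrix, and weights below are all invented for the sketch:

```python
import numpy as np

# Three facial landmarks modeled as graph nodes (hypothetical example).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency of the landmark graph
A_hat = A + np.eye(3)                    # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt # symmetric normalization

H = np.arange(6, dtype=float).reshape(3, 2)  # per-node input features
W = np.eye(2)                                # hypothetical layer weights
H_next = A_norm @ H @ W                      # one graph-convolution step
```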
5. The deep learning-based facial expression recognition method according to claim 4, wherein the generating the synthesized facial image through encoding and decoding comprises:
extracting image features through a convolutional neural network and compressing the image into a predetermined number of feature vectors to complete the encoding;
restoring a low-level image from the feature vectors through deconvolution layers to generate the synthesized facial image, thereby completing the decoding.
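The encode/decode cycle of claim 5 can be sketched with linear projections standing in for the convolution and deconvolution layers; the image size (64x64), feature count (128), and projection matrices are illustrative assumptions, not the patented architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 64x64 grayscale face, flattened to a 4096-dim vector.
image = rng.standard_normal(64 * 64)

# "Encoding": compress the image into a predetermined number of
# feature values (here 128), standing in for the CNN feature extractor.
W_enc = rng.standard_normal((128, 64 * 64)) / np.sqrt(64 * 64)
features = W_enc @ image            # shape (128,)

# "Decoding": restore an image-sized output from the feature vector,
# standing in for the deconvolution (transposed-convolution) layers.
W_dec = rng.standard_normal((64 * 64, 128)) / np.sqrt(128)
reconstruction = W_dec @ features   # shape (4096,)
```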
6. The method according to claim 5, wherein after the training the generator of the generative adversarial network according to the classification loss that the synthesized facial image obtains from the facial expression recognition network, the method further comprises:
guiding the back-propagation algorithm with real data to supervise the learning of the features of the synthesized facial image.
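Claims 3 and 6 together suggest that the generator's training signal combines an adversarial term, the classification loss reported by the recognition network on the synthesized image, and a supervision term against real data. A minimal arithmetic sketch, where the weighting hyperparameters `lambda_cls` and `lambda_sup` are invented for illustration:

```python
# Hedged sketch of a combined generator objective; the weights are
# illustrative hyperparameters, not values from the patent.
def generator_loss(adv_loss, cls_loss, sup_loss,
                   lambda_cls=1.0, lambda_sup=10.0):
    return adv_loss + lambda_cls * cls_loss + lambda_sup * sup_loss

total = generator_loss(adv_loss=0.5, cls_loss=0.2, sup_loss=0.03)
# 0.5 + 1.0*0.2 + 10.0*0.03 = 1.0
```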
7. The deep learning-based facial expression recognition method according to claim 6, wherein the acquiring an output result of the target recognition model and determining the facial expression in the facial image according to the output result comprises:
predicting the class of the facial image to be recognized through the target recognition model, and determining the facial expression in the facial image according to the predicted class.
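The class prediction of claim 7 typically amounts to taking the argmax of per-class scores. A self-contained sketch with invented logits (three classes here purely for illustration):

```python
import math

# Illustrative: the recognition model outputs one score (logit) per
# expression class; the predicted expression is the softmax argmax.
scores = [0.1, 2.3, -0.5]                       # hypothetical logits
exps = [math.exp(s) for s in scores]
total_exp = sum(exps)
probs = [e / total_exp for e in exps]           # softmax probabilities
predicted = max(range(len(probs)), key=probs.__getitem__)  # → index 1
```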
8. A facial expression recognition apparatus based on deep learning, characterized in that the apparatus comprises:
an original image acquisition module, configured to acquire an original facial expression image and input the original facial expression image into a generative adversarial network;
a synthesized facial image acquisition module, configured to acquire a synthesized facial image generated by the generative adversarial network from the current facial expression image;
a target recognition network training module, configured to construct a facial expression recognition network, train the facial expression recognition network on the original facial data and the synthesized facial data, and generate a target recognition model;
a facial expression recognition module, configured to acquire a facial image to be recognized, input the facial image to be recognized into the target recognition model, acquire an output result of the target recognition model, and determine the facial expression in the facial image according to the output result.
9. A facial expression recognition device based on deep learning, characterized in that the device comprises: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the deep learning-based facial expression recognition method according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of deep learning based facial expression recognition according to any one of claims 1-7.
CN202011345478.XA 2020-11-25 2020-11-25 Facial expression recognition method, device and equipment based on deep learning Active CN112149651B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011345478.XA CN112149651B (en) 2020-11-25 2020-11-25 Facial expression recognition method, device and equipment based on deep learning


Publications (2)

Publication Number Publication Date
CN112149651A true CN112149651A (en) 2020-12-29
CN112149651B CN112149651B (en) 2021-05-07

Family

ID=73887446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011345478.XA Active CN112149651B (en) 2020-11-25 2020-11-25 Facial expression recognition method, device and equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN112149651B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766105A (en) * 2021-01-07 2021-05-07 北京码牛科技有限公司 Image conversion method and device applied to image code joint acquisition system
CN112784700A (en) * 2021-01-04 2021-05-11 北京小米松果电子有限公司 Method, device and storage medium for displaying face image
CN113269145A (en) * 2021-06-22 2021-08-17 中国平安人寿保险股份有限公司 Expression recognition model training method, device, equipment and storage medium
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446667A (en) * 2018-04-04 2018-08-24 北京航空航天大学 Based on the facial expression recognizing method and device for generating confrontation network data enhancing
CN108446609A (en) * 2018-03-02 2018-08-24 南京邮电大学 A kind of multi-angle human facial expression recognition method based on generation confrontation network
CN110222668A (en) * 2019-06-17 2019-09-10 苏州大学 Based on the multi-pose human facial expression recognition method for generating confrontation network
US10504268B1 (en) * 2017-04-18 2019-12-10 Educational Testing Service Systems and methods for generating facial expressions in a user interface
CN111191564A (en) * 2019-12-26 2020-05-22 三盟科技股份有限公司 Multi-pose face emotion recognition method and system based on multi-angle neural network
CN111274987A (en) * 2020-02-10 2020-06-12 广东九联科技股份有限公司 Facial expression recognition method and facial expression recognition device
CN111401339A (en) * 2020-06-01 2020-07-10 北京金山云网络技术有限公司 Method and device for identifying age of person in face image and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Tingting et al., "Improved facial expression recognition algorithm based on GAN and its application", Journal of Jilin University (Science Edition) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784700A (en) * 2021-01-04 2021-05-11 北京小米松果电子有限公司 Method, device and storage medium for displaying face image
CN112784700B (en) * 2021-01-04 2024-05-03 北京小米松果电子有限公司 Method, device and storage medium for displaying face image
CN112766105A (en) * 2021-01-07 2021-05-07 北京码牛科技有限公司 Image conversion method and device applied to image code joint acquisition system
CN113269145A (en) * 2021-06-22 2021-08-17 中国平安人寿保险股份有限公司 Expression recognition model training method, device, equipment and storage medium
CN113269145B (en) * 2021-06-22 2023-07-25 中国平安人寿保险股份有限公司 Training method, device, equipment and storage medium of expression recognition model
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium
CN113762138B (en) * 2021-09-02 2024-04-23 恒安嘉新(北京)科技股份公司 Identification method, device, computer equipment and storage medium for fake face pictures

Also Published As

Publication number Publication date
CN112149651B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112149651B (en) Facial expression recognition method, device and equipment based on deep learning
US20220004744A1 (en) Human posture detection method and apparatus, device and storage medium
US11475276B1 (en) Generating more realistic synthetic data with adversarial nets
KR102306658B1 (en) Learning method and device of generative adversarial network for converting between heterogeneous domain data
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111914782A (en) Human face and detection method and device of feature points of human face, electronic equipment and storage medium
WO2022052530A1 (en) Method and apparatus for training face correction model, electronic device, and storage medium
CN113704419A (en) Conversation processing method and device
Krichen Generative adversarial networks
CN115457169A (en) Voice-driven human face animation generation method and system
Zhang et al. Gazev: Gan-based zero-shot voice conversion over non-parallel speech corpus
CN113505797A (en) Model training method and device, computer equipment and storage medium
CN115423908A (en) Virtual face generation method, device, equipment and readable storage medium
DE102022131824A1 (en) Visual speech recognition for digital videos using generative-adversative learning
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
Kaddoura A Primer on Generative Adversarial Networks
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
CN116071472B (en) Image generation method and device, computer readable storage medium and terminal
CN112364737A (en) Facial expression recognition method, device and equipment for live webcast lessons
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
Kasi et al. A deep learning based cross model text to image generation using DC-GAN
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
JP2024521645A (en) Unsupervised Learning of Object Representations from Video Sequences Using Spatiotemporal Attention
CN113822790A (en) Image processing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant