CN112380898A - Method, device and equipment for recognizing facial expressions in live lessons - Google Patents
- Publication number
- CN112380898A (Application CN202011065779.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- data set
- training data
- initial
- facial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/21—Server components or server architectures
- H04N21/218—Source of audio or video content, e.g. local disk arrays
- H04N21/2187—Live feed
Abstract
The invention discloses a method, an apparatus, and a device for recognizing facial expressions in a live lesson, wherein the method comprises the following steps: acquiring an original training data set of original facial expressions, and performing ROI (region of interest) processing on the original training data set to generate a target training data set; constructing an initial facial expression recognition model based on a cross-layer-connected convolutional neural network, and training the initial facial expression recognition model on the target training data set to generate a target facial expression recognition model; and acquiring a facial image to be recognized, inputting it into the target facial expression recognition model, and generating a facial expression recognition result. By adopting ROI processing during image processing, the embodiment of the invention expands the training data set, improves the accuracy of facial expression recognition, and enhances the robustness of the trained model.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a device for recognizing facial expressions in a live lesson.
Background
With the rise of the artificial intelligence industry, facial expression recognition technology based on deep learning has attracted growing attention. In a live webcast class in particular, the current attentiveness of a student can be inferred by analyzing the student's facial expressions in the live video, which helps teachers manage the class and teach. Current facial expression recognition methods have two main problems. First, although many and varied facial expression data sets exist, most of their expressions are shot by a camera from a single angle and the number of expression images is small, so the trained model carries a degree of uncertainty, generalizes weakly to unseen data, and has low robustness. Second, the traditional LeNet-5 convolutional neural network was designed for handwritten digit recognition and does not take low-level detail features into account during feature extraction, so vanishing or exploding gradients easily arise as the network grows deeper. As a result, facial expression recognition models for live webcast classes in the prior art suffer from poor robustness and low recognition accuracy.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
In view of the foregoing deficiencies of the prior art, an object of the present invention is to provide a method, an apparatus and a device for recognizing facial expressions in a live broadcast class, which aim to solve the technical problems of poor robustness and low recognition accuracy of a facial expression recognition model in a live broadcast class in the prior art.
The technical scheme of the invention is as follows:
a method of identifying facial expressions in a live lesson, the method comprising:
acquiring an original training data set of original facial expressions, and performing ROI (region of interest) processing on the original training data set to generate a target training data set;
constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, and training the initial facial expression recognition model according to a target training data set to generate a target facial expression recognition model;
and acquiring a facial image to be recognized, inputting the image to be recognized into the target facial expression recognition model, and generating a facial expression recognition result.
Further, before the acquiring an original training data set of original facial expressions and performing ROI processing on the original training data set to generate a target training data set, the method further comprises:
the method comprises the steps of obtaining student images in a live broadcast course through a camera, carrying out face recognition on the student images, and generating facial images to be recognized.
Further preferably, the performing face recognition on the student image to generate a face image to be recognized includes:
the student images are identified through a face identification algorithm, and facial images to be identified, including eyes, a nose and a mouth, are generated according to the identification result.
Further preferably, the acquiring an original training data set of an original facial expression, performing ROI processing on the original training data set, and generating a target training data set includes:
acquiring an initial image from the original training data set of original facial expressions, dividing the initial image into four equal parts, and scaling each part so that its size matches that of the initial image, thereby generating four equally divided images;
performing upper-half occlusion and lower-half occlusion on the initial image, respectively, to generate two occluded images;
mirroring the initial image to generate a mirror image;
center-focusing and scaling the initial image to generate a focused image;
generating an extended training data set corresponding to the initial image from the initial image, the four equally divided images, the two occluded images, the mirror image, and the focused image;
after the ROI processing operations have been performed on all initial images of the original training data set, generating an extended training data set corresponding to each initial image in the original training data set, and generating a target training data set from the extended training data sets.
Preferably, the performing upper-half occlusion and lower-half occlusion on the initial image, respectively, to generate two occluded images includes:
occluding the upper half and the lower half of the initial image, respectively, with an image processing tool, to generate an upper-half occluded image and a lower-half occluded image.
Further, the mirroring the initial image to generate a mirror image includes:
mirroring the initial image about a coordinate axis, and recalculating the annotated feature points according to the mirroring principle, to generate a mirror image.
Further, the center-focusing and scaling the initial image to generate a focused image includes:
center-focusing the initial image and scaling the focused image so that the scaled image is the same size as the initial image, and recalculating the annotated feature points according to the scaling ratio, to generate a focused image.
Another embodiment of the present invention provides an apparatus for recognizing a facial expression in a live lesson, the apparatus comprising:
the ROI processing module is used for acquiring an original training data set of original facial expressions, and performing ROI processing on the original training data set to generate a target training data set;
the model training module is used for constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, training the initial facial expression recognition model according to a target training data set and generating a target facial expression recognition model;
and the facial expression recognition module is used for acquiring a facial image to be recognized, inputting the image to be recognized into the target facial expression recognition model and generating a facial expression recognition result.
Another embodiment of the present invention provides a device for identifying facial expressions in a live lesson, the device comprising: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method of identifying facial expressions in a live lesson.
Yet another embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the above-described method of identifying facial expressions in a live class.
Beneficial effects: by adopting ROI processing during image processing, the embodiment of the invention expands the training data set, improves the accuracy of facial expression recognition, and enhances the robustness of the trained model.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a preferred embodiment of a method for identifying facial expressions in a live lesson according to the present invention;
FIG. 2 is a schematic structural diagram of a cross-layer connected convolutional neural network according to a preferred embodiment of the present invention;
FIG. 3 is a functional block diagram of an apparatus for recognizing facial expressions in a live lesson according to an embodiment of the present invention;
fig. 4 is a diagram illustrating a hardware configuration of an apparatus for recognizing facial expressions in a live lesson according to a preferred embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and effects of the present invention clearer and more explicit, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. Embodiments of the present invention are described below with reference to the accompanying drawings.
The embodiment of the invention provides a method for identifying facial expressions in a live lesson. Referring to fig. 1, fig. 1 is a flowchart illustrating a method for recognizing facial expressions in a live lesson according to a preferred embodiment of the present invention. As shown in fig. 1, it includes the steps of:
s100, acquiring an original training data set of original facial expressions, and performing ROI (region of interest) processing on the original training data set to generate a target training data set;
s200, constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, and training the initial facial expression recognition model according to a target training data set to generate a target facial expression recognition model;
and step S300, acquiring a facial image to be recognized, inputting the image to be recognized into the target facial expression recognition model, and generating a facial expression recognition result.
In a specific implementation, the embodiment of the invention addresses the problems of facial expression recognition in live video lessons in the prior art. An expression data set is processed based on the region-of-interest (ROI) idea to generate a target training data set; the LeNet-5 neural network is then improved with a cross-layer connection method so that low-level network features are also taken into account, an initial facial expression recognition model is constructed, and this initial model is trained on the target training data set to generate a target facial expression recognition model. A facial image to be recognized is acquired and input into the target facial expression recognition model to obtain the facial expression recognition result. This algorithm not only improves the accuracy of facial expression recognition but also enhances the robustness of the trained model.
Fig. 2 is a schematic structural diagram of the cross-layer-connected convolutional neural network. As shown in fig. 2, the network comprises 1 input layer, 3 convolutional layers, 2 pooling layers, 1 fully connected layer, and 1 output layer. The Input layer receives expression images of 32 × 32 pixels. Layer1 is a convolutional layer with 6 feature maps: the 32 × 32 input is convolved with 6 convolution kernels of 5 × 5 pixels to obtain 28 × 28 feature maps. Layer2 is a pooling layer: the 28 × 28 feature maps are pooled into 14 × 14 feature maps. Layer3 is a convolutional layer with 16 feature maps: the 14 × 14 maps from the previous layer are convolved with 16 kernels of 5 × 5 pixels to obtain 10 × 10 feature maps. Layer4 is a pooling layer: the 10 × 10 feature maps are pooled into 5 × 5 feature maps. Layer5 is a convolutional layer with 120 feature maps: the 5 × 5 maps from the previous layer are convolved with 120 kernels of 5 × 5 pixels to obtain 1 × 1 feature maps. Layer6 is a fully connected layer with 84 units. The Output layer outputs 7 expression types: happy, sad, surprised, afraid, angry, disgusted, and neutral.
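As a quick sanity check (not from the patent text), the feature-map sizes stated for the Fig. 2 network follow from the standard valid-convolution and 2 × 2 non-overlapping pooling size formulas. The cross-layer connections themselves are not detailed in the description and are omitted here:

```python
def conv_out(size, kernel=5):
    """Output size of a valid (no-padding, stride-1) convolution."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Output size of non-overlapping pooling with the given window."""
    return size // window

s = 32                       # Input: 32 x 32 expression image
s = conv_out(s); l1 = s      # Layer1: 6 kernels of 5 x 5    -> 28 x 28
s = pool_out(s); l2 = s      # Layer2: pooling               -> 14 x 14
s = conv_out(s); l3 = s      # Layer3: 16 kernels of 5 x 5   -> 10 x 10
s = pool_out(s); l4 = s      # Layer4: pooling               -> 5 x 5
s = conv_out(s); l5 = s      # Layer5: 120 kernels of 5 x 5  -> 1 x 1
print(l1, l2, l3, l4, l5)    # 28 14 10 5 1
```

The 1 × 1 × 120 output of Layer5 then feeds the 84-unit fully connected Layer6 and the 7-way output layer.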
Further, before obtaining an original training data set of an original facial expression and performing ROI processing on the original training data set to generate a target training data set, the method includes:
the method comprises the steps of obtaining student images in a live broadcast course through a camera, carrying out face recognition on the student images, and generating facial images to be recognized.
In a specific implementation, before ROI processing is carried out, the face must first be detected by face recognition and cropped so that it fills the image area as much as possible, in order to reduce error.
Further, the face recognition is carried out on the student image, and a face image to be recognized is generated, wherein the face recognition comprises the following steps:
the student images are identified through a face identification algorithm, and facial images to be identified, including eyes, a nose and a mouth, are generated according to the identification result.
In particular implementation, the key point of the ROI setting scheme is to recognize a facial expression by detecting changes in eyes, nose, and mouth. Therefore, the image after face recognition must be a face image including eyes, nose, and mouth.
Further, acquiring an original training data set of an original facial expression, performing ROI processing on the original training data set, and generating a target training data set, including:
acquiring an initial image from the original training data set of original facial expressions, dividing the initial image into four equal parts, and scaling each part so that its size matches that of the initial image, thereby generating four equally divided images;
performing upper-half occlusion and lower-half occlusion on the initial image, respectively, to generate two occluded images;
mirroring the initial image to generate a mirror image;
center-focusing and scaling the initial image to generate a focused image;
generating an extended training data set corresponding to the initial image from the initial image, the four equally divided images, the two occluded images, the mirror image, and the focused image;
after the ROI processing operations have been performed on all initial images of the original training data set, generating an extended training data set corresponding to each initial image in the original training data set, and generating a target training data set from the extended training data sets.
In a specific implementation, ROI processing is first applied to the training images: each training image is processed into several specific images, which together form a new database. The new database is then fed into the improved cross-layer-connected convolutional neural network module and trained to obtain the facial expression recognition model.
The image is first cut: the original image is divided equally into four quadrants, which contain the complete left-eye region, the complete right-eye region, and the two halves of the mouth region, respectively. Each cropped image is then scaled so that the new image is the same size as the original; its annotation category remains that of the original image, and the annotated feature points are recalculated according to the scaling ratio. This yields 4 new data items, recorded as the four equally divided images.
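The quartering step can be sketched as follows. The function name, the nearest-neighbour rescaling, the (row, col) point format, and the assumption of even image dimensions are all illustrative choices, not taken from the patent:

```python
import numpy as np

def quarter_images(image, points):
    """Split an image into four equal quadrants and scale each quadrant
    back to the original size, rescaling annotated feature points by the
    same factor. `points` maps a point name to (row, col) coordinates.
    Assumes even height and width; uses nearest-neighbour upscaling."""
    h, w = image.shape[:2]
    quadrants = [(0, h // 2, 0, w // 2), (0, h // 2, w // 2, w),
                 (h // 2, h, 0, w // 2), (h // 2, h, w // 2, w)]
    results = []
    for r0, r1, c0, c1 in quadrants:
        crop = image[r0:r1, c0:c1]
        # Nearest-neighbour upscale back to (h, w): each quadrant pixel
        # is repeated twice along both axes.
        scaled = crop.repeat(2, axis=0).repeat(2, axis=1)[:h, :w]
        # A feature point (r, c) inside this quadrant maps to
        # ((r - r0) * 2, (c - c0) * 2) in the rescaled image.
        new_points = {k: ((r - r0) * 2, (c - c0) * 2)
                      for k, (r, c) in points.items()
                      if r0 <= r < r1 and c0 <= c < c1}
        results.append((scaled, new_points))
    return results
```

Each returned image has the same size as the original, and only the feature points that fall inside a given quadrant are kept for it, matching the per-quadrant label recalculation described above.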
In total, 8 kinds of processing are applied to the facial expression library, which, together with the original image, yields 9 different ROI regions per image. The reconstructed data set is therefore 9 times the size of the original, which greatly enriches the diversity of the samples. The expansion is effective because the different ROI regions are interrelated and complement one another, which strengthens the reliability of the predicted target. Most importantly, the data set produced by this method captures the detailed characteristics of an expression more finely and accurately than the original data, so the model can focus on the emotional expression conveyed by even tiny changes in the face.
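The 9x expansion factor follows directly from the variant counts listed in the steps above (the data-set size used below is a hypothetical example, not from the patent):

```python
# Variants produced per training image by the ROI scheme described above.
variants_per_image = {
    "original": 1,
    "quadrants": 4,        # four equally divided, rescaled images
    "occlusions": 2,       # upper-half and lower-half occluded images
    "mirror": 1,
    "center_focused": 1,
}
expansion_factor = sum(variants_per_image.values())
print(expansion_factor)                   # 9

original_size = 1000                      # hypothetical original data-set size
print(original_size * expansion_factor)   # 9000 images after expansion
```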
Further, the performing upper-half occlusion and lower-half occlusion on the initial image, respectively, to generate two occluded images includes:
occluding the upper half and the lower half of the initial image, respectively, with an image processing tool, to generate an upper-half occluded image and a lower-half occluded image.
In a specific implementation, the original image is then occluded: the upper half and the lower half are occluded in turn using the OpenCV image processing tool, so that each resulting image is the same size as the original, its annotation category remains that of the original, and the coordinates of the annotated feature points need not be recalculated. This yields 2 new data items, recorded as the upper-half occluded image and the lower-half occluded image.
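The occlusion step can be sketched as follows. The patent uses an OpenCV tool; plain NumPy indexing is used here as an equivalent, and the constant fill value is an assumption:

```python
import numpy as np

def occlude_halves(image, fill=0):
    """Return (upper-occluded, lower-occluded) copies of `image`.
    The occluded half is filled with a constant, so the image size and
    the annotated feature-point coordinates are unchanged and labels
    need no recalculation."""
    h = image.shape[0]
    upper = image.copy()
    upper[: h // 2] = fill      # hide the upper half (eye region)
    lower = image.copy()
    lower[h // 2 :] = fill      # hide the lower half (mouth region)
    return upper, lower
```

Because only pixel values change, these two variants are the cheapest to produce: unlike the quartered, mirrored, and focused images, no coordinate remapping is needed.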
Further, the mirroring the initial image to generate a mirror image includes:
mirroring the initial image about a coordinate axis, and recalculating the annotated feature points according to the mirroring principle, to generate a mirror image.
In a specific implementation, the original image is mirrored about a coordinate axis; the mirrored image is the same size as the original, its annotation category remains that of the original, and the annotated feature points are recalculated using the mirroring principle. This yields 1 new data item, called the mirror image.
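The mirroring step can be sketched as below. The patent does not specify the axis; a horizontal flip about the vertical axis (the usual choice for face images) is assumed here, under which a feature point at column c maps to column w - 1 - c:

```python
import numpy as np

def mirror_image(image, points):
    """Mirror `image` about its vertical axis (horizontal flip) and
    recompute annotated feature points: (row, col) -> (row, w - 1 - col).
    The image size and annotation category are unchanged."""
    w = image.shape[1]
    mirrored = image[:, ::-1]
    new_points = {k: (r, w - 1 - c) for k, (r, c) in points.items()}
    return mirrored, new_points
```

Note that for labels that distinguish left from right (e.g. left eye vs. right eye), the point names would also need to be swapped after mirroring; the sketch leaves names untouched.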
Further, the center-focusing and scaling the initial image to generate a focused image includes:
center-focusing the initial image and scaling the focused image so that the scaled image is the same size as the initial image, and recalculating the annotated feature points according to the scaling ratio, to generate a focused image.
In a specific implementation, the original image is center-focused and the focused image is scaled so that the new image is the same size as the original; its annotation category remains that of the original, and the annotated feature points are recalculated according to the scaling ratio. This yields 1 new data item, recorded as the focused image.
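The center-focusing step can be sketched as a centred crop followed by an upscale back to the original size. The crop fraction is a hypothetical choice (the patent does not give it), and nearest-neighbour upscaling with even dimensions is assumed:

```python
import numpy as np

def center_focus(image, points, crop_frac=0.5):
    """Crop a centred region (crop_frac of each dimension), scale it
    back to the original size (nearest-neighbour), and rescale the
    annotated feature points by the same factor. Points outside the
    crop are dropped."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    r0, c0 = (h - ch) // 2, (w - cw) // 2
    crop = image[r0 : r0 + ch, c0 : c0 + cw]
    fy, fx = h // ch, w // cw  # integer scale factors (2 for crop_frac=0.5)
    scaled = crop.repeat(fy, axis=0).repeat(fx, axis=1)[:h, :w]
    # A point (r, c) inside the crop maps to ((r - r0) * fy, (c - c0) * fx).
    new_points = {k: ((r - r0) * fy, (c - c0) * fx)
                  for k, (r, c) in points.items()
                  if r0 <= r < r0 + ch and c0 <= c < c0 + cw}
    return scaled, new_points
```

With crop_frac=0.5 this magnifies the central face region by a factor of 2, emphasising eyes, nose, and mouth, in line with the ROI idea described above.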
As the above method embodiment shows, the invention provides a method for recognizing facial expressions in a live lesson. Based on the region-of-interest (ROI) idea, 8 kinds of processing are applied to the expression data set to form a new database; this database is generated specifically for training the model, and its image characteristics are particularly beneficial to improving the accuracy of the expression recognition model. The second key point of the invention is the use of a cross-layer connection method to improve the LeNet-5 neural network: low-level network features are also taken into account, vanishing or exploding gradients are avoided as the network deepens, the accuracy of facial expression recognition is improved, and the robustness of the trained model is enhanced.
It should be noted that the above steps do not necessarily follow a fixed order; as those skilled in the art will understand from the description of the embodiments, in different embodiments the steps may be executed in different orders, in parallel, or interchangeably.
Another embodiment of the present invention provides an apparatus for recognizing facial expressions in a live lesson, as shown in fig. 3, the apparatus 1 including:
the ROI processing module 11 is configured to acquire an original training data set of an original facial expression, perform ROI processing on the original training data set, and generate a target training data set;
the model training module 12 is used for constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, training the initial facial expression recognition model according to a target training data set, and generating a target facial expression recognition model;
and the facial expression recognition module 13 is configured to acquire a facial image to be recognized, input the image to be recognized into the target facial expression recognition model, and generate a facial expression recognition result.
The specific implementation is shown in the method embodiment, and is not described herein again.
Another embodiment of the present invention provides an apparatus for recognizing facial expressions in a live lesson, as shown in fig. 4, the apparatus 10 including:
one or more processors 110 and a memory 120, where one processor 110 is illustrated in fig. 4, the processor 110 and the memory 120 may be connected by a bus or other means, and fig. 4 illustrates a connection by a bus as an example.
The memory 120, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions corresponding to a method for identifying facial expressions in a live lesson in embodiments of the present invention. The processor 110 executes various functional applications and data processing of the device 10, i.e. implements the method of identifying facial expressions in a live lesson in the above-described method embodiments, by running non-volatile software programs, instructions and units stored in the memory 120.
The memory 120 may include a storage program area and a storage data area, wherein the storage program area may store an application program required for operating the device, at least one function; the storage data area may store data created according to the use of the device 10, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more units are stored in the memory 120, which when executed by the one or more processors 110, perform the method of identifying facial expressions in a live lesson in any of the method embodiments described above, e.g., performing the method steps S100-S300 in fig. 1 described above.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100-S300 of fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described herein are intended to comprise one or more of these and/or any other suitable types of memory.
Another embodiment of the present invention provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of identifying facial expressions in a live class of the above-described method embodiment. For example, the method steps S100 to S300 in fig. 1 described above are performed.
The above-described embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the part contributing to the related art, can be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disk, and which includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts thereof.
Conditional language such as "can," "might," or "may" is generally intended to convey that a particular embodiment can include certain features, elements, and/or operations while other embodiments do not, unless specifically stated otherwise or otherwise understood within the context as used. Thus, such conditional language is not generally intended to imply that features, elements, and/or operations are in any way required for one or more embodiments, or that one or more embodiments must include logic for deciding, with or without input or prompting, whether such features, elements, and/or operations are included or are to be performed in any particular embodiment.
What has been described in this specification and the accompanying drawings includes examples of methods and apparatuses that can provide recognition of facial expressions in a live class. It is, of course, not possible to describe every conceivable combination of components and/or methodologies for purposes of describing the various features of the disclosure, but it can be appreciated that many further combinations and permutations of the disclosed features are possible. It is therefore evident that various modifications can be made to the disclosure without departing from its scope or spirit. In addition, or in the alternative, other embodiments of the disclosure may be apparent from consideration of the specification and drawings and from practice of the disclosure as presented herein. The examples set forth in this specification and the drawings are to be considered in all respects as illustrative and not restrictive. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Claims (10)
1. A method of recognizing facial expressions in a live lesson, the method comprising:
acquiring an original training data set of original facial expressions, and performing ROI (region of interest) processing on the original training data set to generate a target training data set;
constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, and training the initial facial expression recognition model according to a target training data set to generate a target facial expression recognition model;
and acquiring a facial image to be recognized, inputting the facial image to be recognized into the target facial expression recognition model, and generating a facial expression recognition result.
2. The method of claim 1, wherein the acquiring a facial image to be recognized comprises:
acquiring student images in a live lesson through a camera, and performing face recognition on the student images to generate the facial image to be recognized.
3. The method of claim 2, wherein the performing face recognition on the student image to generate the facial image to be recognized comprises:
the student images are identified through a face identification algorithm, and facial images to be identified, including eyes, a nose and a mouth, are generated according to the identification result.
4. The method of claim 3, wherein the acquiring an original training data set of original facial expressions and performing ROI processing on the original training data set to generate a target training data set comprises:
acquiring an initial image in the original training data set of original facial expressions, dividing the initial image into four equal parts, and scaling each of the four parts to the size of the initial image to generate four equally divided images;
performing upper-half occlusion and lower-half occlusion processing on the initial image, respectively, to generate two occlusion images;
performing mirror processing on the initial image to generate a mirror image;
performing center focusing and scaling on the initial image to generate a focused image;
generating an extended training data set corresponding to the initial image according to the initial image, the equally divided images, the occlusion images, the mirror image and the focused image; and
after the ROI processing operation is performed on all initial images in the original training data set, generating the extended training data sets corresponding to all the initial images, and generating the target training data set according to the extended training data sets.
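For illustration only (this sketch is not part of the claims), the extension step of claim 4 can be written in a few lines of NumPy. The function names `augment` and `resize_nn` are hypothetical, nearest-neighbour resizing stands in for whatever scaling the claimed "image processing tool" uses, and zeroed pixels stand in for the occlusion mask:

```python
import numpy as np

def resize_nn(img, h, w):
    """Nearest-neighbour resize of a 2-D array to (h, w)."""
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[np.ix_(ys, xs)]

def augment(img):
    """Extended set for one initial image per claim 4:
    initial + 4 resized quarters + 2 half-occlusions + mirror + focused."""
    h, w = img.shape
    hy, hx = h // 2, w // 2
    out = [img]
    # four equal quarters, each scaled back to the initial size
    for q in (img[:hy, :hx], img[:hy, hx:], img[hy:, :hx], img[hy:, hx:]):
        out.append(resize_nn(q, h, w))
    # upper-half and lower-half occlusion (zero fill is an assumption)
    top = img.copy(); top[:hy] = 0
    bot = img.copy(); bot[hy:] = 0
    out += [top, bot]
    # horizontal mirror
    out.append(img[:, ::-1].copy())
    # centre "focus": crop the middle half of each dimension, scale back up
    c = img[h // 4:h - h // 4, w // 4:w - w // 4]
    out.append(resize_nn(c, h, w))
    return out

example = np.arange(48 * 48, dtype=float).reshape(48, 48)
extended = augment(example)
print(len(extended))  # 9 images per initial image
```

Run over every initial image, this yields the extended training data set of the final step of the claim; a 9x expansion of a labelled expression data set is the kind of ROI-based augmentation the cited Sun et al. literature also describes.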
5. The method of claim 4, wherein the generating two occlusion images after respectively performing top half occlusion and bottom half occlusion on the initial image comprises:
performing an occlusion operation on the upper half and the lower half of the initial image, respectively, using an image processing tool, to generate an upper-half occlusion image and a lower-half occlusion image.
6. The method of claim 5, wherein mirroring the initial image to generate a mirror image comprises:
performing mirror processing on the initial image about a coordinate axis, and recalculating the labeled feature points according to the mirroring principle to generate the mirror image.
7. The method of claim 6, wherein the step of center focusing and zooming the initial image to generate a focused image comprises:
performing center focusing on the initial image, scaling the focused initial image to the same size as the initial image, and recalculating the labeled feature points according to the scaling to generate the focused image.
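Again for illustration only, the feature-point recalculation of claims 6 and 7 reduces to simple coordinate maps. The function names are made up, and the crop geometry (a quarter-margin centre crop followed by a 2x scale back to the original size) is an assumption consistent with the augmentation described above, not a detail taken from the patent:

```python
def mirror_landmarks(points, width):
    """Reflect labelled (x, y) feature points about the vertical centre axis,
    matching a horizontal mirror of a width-pixel-wide image (claim 6)."""
    return [(width - 1 - x, y) for (x, y) in points]

def focus_landmarks(points, height, width):
    """Map feature points through a centre crop that removes a quarter margin
    on each side, then a 2x resize back to (height, width) (claim 7)."""
    return [((x - width // 4) * 2, (y - height // 4) * 2) for (x, y) in points]

print(mirror_landmarks([(10, 20)], 48))     # [(37, 20)]
print(focus_landmarks([(20, 20)], 48, 48))  # [(16, 16)]
```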
8. An apparatus for recognizing facial expressions in a live lesson, the apparatus comprising:
the ROI processing module is used for acquiring an original training data set of original facial expressions, and performing ROI processing on the original training data set to generate a target training data set;
the model training module is used for constructing an initial facial expression recognition model based on a cross-layer connection convolutional neural network, training the initial facial expression recognition model according to a target training data set and generating a target facial expression recognition model;
and the facial expression recognition module is used for acquiring a facial image to be recognized, inputting the facial image to be recognized into the target facial expression recognition model, and generating a facial expression recognition result.
9. An apparatus for recognizing facial expressions in a live lesson, the apparatus comprising: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of recognizing facial expressions in a live lesson of any one of claims 1-7.
10. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of recognizing facial expressions in a live lesson of any one of claims 1-7.
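The "cross-layer connection convolutional neural network" named in claims 1 and 8 is not elaborated in this text; the cited non-patent literature (Li et al., cross-connected LeNet-5) suggests feeding pooled features from an earlier layer directly to the classifier alongside the deeper features, which mitigates gradient vanishing. A minimal NumPy sketch of that idea, with made-up kernel sizes and random weights, might look like:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D cross-correlation of a single-channel image with kernel k."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def pool2(x):
    """2x2 average pooling (assumes even spatial dimensions)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
face = rng.random((32, 32))  # stand-in for a cropped face region

f1 = pool2(np.maximum(conv2d(face, rng.standard_normal((5, 5))), 0))  # 14x14
f2 = pool2(np.maximum(conv2d(f1, rng.standard_normal((3, 3))), 0))    # 6x6

# cross-layer connection: the classifier sees BOTH the shallow and deep maps
features = np.concatenate([f1.ravel(), f2.ravel()])
print(features.shape)  # (232,)
```

In a trained model the concatenated feature vector would feed a fully connected softmax layer over the expression classes; a framework implementation would realize the same skip connection with a concatenation layer.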
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011065779.7A CN112380898A (en) | 2020-09-30 | 2020-09-30 | Method, device and equipment for recognizing facial expressions in live lessons |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112380898A true CN112380898A (en) | 2021-02-19 |
Family
ID=74581001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011065779.7A Pending CN112380898A (en) | 2020-09-30 | 2020-09-30 | Method, device and equipment for recognizing facial expressions in live lessons |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380898A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107330420A (en) * | 2017-07-14 | 2017-11-07 | 河北工业大学 | The facial expression recognizing method of rotation information is carried based on deep learning |
CN107491726A (en) * | 2017-07-04 | 2017-12-19 | 重庆邮电大学 | A kind of real-time expression recognition method based on multi-channel parallel convolutional neural networks |
EP3324333A2 (en) * | 2016-11-21 | 2018-05-23 | Samsung Electronics Co., Ltd. | Method and apparatus to perform facial expression recognition and training |
Non-Patent Citations (3)
Title |
---|
SUN Xiao et al.: "Facial Expression Recognition Based on ROI-KNN Convolutional Neural Network", Acta Automatica Sinica, vol. 42, no. 06, 15 June 2016 (2016-06-15), pages 2 *
LI Yong et al.: "Facial Expression Recognition Based on a Cross-Connected LeNet-5 Network", Acta Automatica Sinica, vol. 44, no. 01, 15 January 2018 (2018-01-15), pages 2 *
GUO Xingang et al.: "Facial Expression Recognition Algorithm Based on a Connected Convolutional Neural Network", Journal of Changchun University of Technology, vol. 41, no. 04, 15 August 2020 (2020-08-15), pages 2-5 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11842487B2 (en) | Detection model training method and apparatus, computer device and storage medium | |
US11348249B2 (en) | Training method for image semantic segmentation model and server | |
US11908244B2 (en) | Human posture detection utilizing posture reference maps | |
WO2020238560A1 (en) | Video target tracking method and apparatus, computer device and storage medium | |
KR102591961B1 (en) | Model training method and device, and terminal and storage medium for the same | |
WO2021017261A1 (en) | Recognition model training method and apparatus, image recognition method and apparatus, and device and medium | |
Wen et al. | End-to-end detection-segmentation system for face labeling | |
US20230049533A1 (en) | Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product | |
CN109960742B (en) | Local information searching method and device | |
WO2020215573A1 (en) | Captcha identification method and apparatus, and computer device and storage medium | |
CN109508638A (en) | Face Emotion identification method, apparatus, computer equipment and storage medium | |
CN106326853B (en) | Face tracking method and device | |
WO2021068325A1 (en) | Facial action recognition model training method, facial action recognition method and apparatus, computer device, and storage medium | |
CN107886474A (en) | Image processing method, device and server | |
CN112149651B (en) | Facial expression recognition method, device and equipment based on deep learning | |
CN112633423B (en) | Training method of text recognition model, text recognition method, device and equipment | |
US20220215558A1 (en) | Method and apparatus for three-dimensional edge detection, storage medium, and computer device | |
CN113469092B (en) | Character recognition model generation method, device, computer equipment and storage medium | |
WO2021169642A1 (en) | Video-based eyeball turning determination method and system | |
CN112836653A (en) | Face privacy method, device and apparatus and computer storage medium | |
CN109711356A (en) | A kind of expression recognition method and system | |
CN111275051A (en) | Character recognition method, character recognition device, computer equipment and computer-readable storage medium | |
CN112115860A (en) | Face key point positioning method and device, computer equipment and storage medium | |
CN113269013A (en) | Object behavior analysis method, information display method and electronic equipment | |
WO2021073150A1 (en) | Data detection method and apparatus, and computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||