CN112364737A - Facial expression recognition method, device and equipment for live webcast lessons - Google Patents
- Publication number
- CN112364737A (application CN202011193684.3A)
- Authority
- CN
- China
- Prior art keywords
- facial expression
- neural network
- convolutional neural
- network model
- expression recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a facial expression recognition method, device and equipment for live webcast lessons. The method comprises: constructing an initial convolutional neural network model, optimizing it, and building a deep convolutional neural network model with a dual-channel fully-connected layer; acquiring facial expression training samples, training the deep convolutional neural network model on them, and generating a facial expression recognition model; and collecting facial expression images during a live webcast lesson, inputting them into the facial expression recognition model, and generating facial expression recognition results. By taking into account the influence of fully-connected layers of different scales on the expression of high-level semantic image features, the embodiment of the invention designs a fused dual-channel fully-connected layer, which strengthens the feature expression capability of the deep convolutional neural network model and improves facial expression recognition accuracy.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a facial expression recognition method, device and equipment for live webcast lessons.
Background
With the rise of the artificial intelligence industry, facial expression recognition based on deep learning has attracted growing attention. In a live webcast class in particular, analyzing the facial expressions of students in the live video reveals their current attentiveness, which helps teachers manage the class and teach. In recent years, deep learning has achieved excellent results in many computer vision tasks such as image classification and face recognition, and facial expression recognition methods based on deep learning have followed. The feature information in a DCNN is distributed hierarchically across the network: the lower layers mainly capture texture and corner features, which are local features of the image, while the higher layers capture class-specific features better suited to complex tasks that require global features. As the layers deepen, features become more complex and global. The features extracted by the fully-connected layer are generally regarded as high-level features. Traditional DCNNs such as LeNet and AlexNet use a single-channel fully-connected layer, which retains only some of the "important" features of the last pooling layer and discards those deemed "less important". The features extracted by such a fully-connected layer are therefore limited in image expression capability, and expression recognition accuracy is low.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
In view of the defects of the prior art, the present invention aims to provide a facial expression recognition method, device and equipment for live webcast lessons, to solve the technical problem that prior facial expression recognition methods adopt a single-channel fully-connected layer, so that the extracted features are limited in image expression capability and facial expression recognition accuracy is low.
The technical scheme of the invention is as follows:
a facial expression recognition method for a live webcast session, the method comprising:
constructing an initial convolutional neural network model, optimizing the initial convolutional neural network model, and constructing a deep convolutional neural network model of a double-channel full-connection layer;
acquiring a facial expression training sample, training a deep convolutional neural network model according to the facial expression training sample, and generating a facial expression recognition model;
and acquiring a facial expression image in a network live broadcast course, inputting the facial expression image into a facial expression recognition model, and generating a facial expression recognition result.
Further, the constructing an initial convolutional neural network model, optimizing the initial convolutional neural network model, and constructing a deep convolutional neural network model of a dual-channel fully-connected layer includes:
constructing an initial convolutional neural network model, continuously convolving hidden layers of the initial convolutional neural network by adopting a minimum-scale convolution kernel, and then pooling;
optimizing the network internal structure of the pooled convolutional neural network model, and constructing a deep convolutional neural network model of a double-channel full-connection layer.
Further preferably, the optimizing the network internal structure of the pooled convolutional neural network model includes:
and optimizing the network internal structure of the pooled convolutional neural network model according to the Maxout activating function and the Dropout algorithm.
Further preferably, the obtaining of the facial expression training sample, training the deep convolutional neural network model according to the facial expression training sample, and generating the facial expression recognition model includes:
acquiring a facial expression training sample, and training a deep convolutional neural network model according to the facial expression training sample;
and in the training process, learning is carried out by adopting an A-Softmax algorithm, and a facial expression recognition model is generated according to the learning result.
Preferably, the constructing an initial convolutional neural network model, performing continuous convolution on the hidden layer of the initial convolutional neural network by using a minimum-scale convolution kernel, and then performing pooling includes:
constructing an initial convolutional neural network model, obtaining the convolutional layers of the convolutional neural network model, and applying a zero-value padding technique in the convolutional layers;
performing continuous convolution on the padded convolutional layers with a minimum-scale convolution kernel, followed by pooling.
Further, performing continuous convolution on hidden layers of the initial convolutional neural network by using a minimum-scale convolution kernel, and then performing pooling, includes:
and continuously convolving hidden layers of the initial convolutional neural network by using a filter convolution kernel of 3x3, and then pooling.
Further, the convolutional neural network is provided with a fully-connected fusion layer, and the network internal structure of the pooled convolutional neural network model is optimized according to the Maxout activation function and the Dropout algorithm, and the method further comprises the following steps:
and optimizing the fully-connected fusion layer of the pooled convolutional neural network model according to the Maxout activating function and the Dropout algorithm.
Another embodiment of the present invention provides a facial expression recognition device for a live webcast session, the device comprising:
the model building module is used for building an initial convolutional neural network model, optimizing the initial convolutional neural network model and building a deep convolutional neural network model of a double-channel full-connection layer;
the model training module is used for acquiring a facial expression training sample, training the deep convolutional neural network model according to the facial expression training sample and generating a facial expression recognition model;
and the facial expression recognition module is used for acquiring a facial expression image in a live network course, inputting the facial expression image into the facial expression recognition model and generating a facial expression recognition result.
Another embodiment of the present invention provides facial expression recognition equipment for a live webcast lesson, the equipment comprising: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described facial expression recognition method for a live webcast lesson.
Another embodiment of the present invention also provides a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the above-mentioned facial expression recognition method for a live webcast session.
Advantageous effects: by taking into account the influence of fully-connected layers of different scales on the expression of high-level semantic image features, the embodiment of the invention designs a fused dual-channel fully-connected layer, which strengthens the feature expression capability of the deep convolutional neural network model and improves facial expression recognition accuracy.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart illustrating a preferred embodiment of a facial expression recognition method for live online lessons according to the present invention;
fig. 2 is a schematic network structure diagram of a face recognition model according to a specific application embodiment of the facial expression recognition method for live webcast lessons in the present invention;
FIG. 3a is a schematic diagram of parameters of each network layer of a face recognition model in the prior art;
fig. 3b is a schematic diagram of parameters of each network layer in a specific application embodiment of the facial expression recognition method for the live webcast lesson according to the present invention;
FIG. 4 is a functional block diagram of an embodiment of a facial expression recognition apparatus for live online lessons according to the present invention;
fig. 5 is a schematic diagram of a hardware structure of a facial expression recognition device for live webcast lessons according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. Embodiments of the present invention are described below with reference to the accompanying drawings.
The embodiment of the invention provides a facial expression recognition method for a live webcast course. Referring to fig. 1, fig. 1 is a flowchart illustrating a facial expression recognition method for live webcast lessons according to a preferred embodiment of the present invention. As shown in fig. 1, it includes the steps of:
s100, constructing an initial convolutional neural network model, optimizing the initial convolutional neural network model, and constructing a deep convolutional neural network model of a double-channel full-connection layer;
s200, obtaining a facial expression training sample, training a deep convolutional neural network model according to the facial expression training sample, and generating a facial expression recognition model;
step S300, collecting facial expression images in a live online course, inputting the facial expression images into a facial expression recognition model, and generating a facial expression recognition result.
In a specific implementation, the facial expression recognition algorithm of the embodiment recognizes the expressions of students in a live webcast class; from the recognized expressions, the students' attentiveness can be inferred, which facilitates subsequent monitoring of how well they follow the class.
An initial convolutional neural network model is obtained, its hidden layers are continuously convolved with a minimum-scale convolution kernel and then pooled; the internal structure of the network is optimized, and a deep convolutional neural network model with a dual-channel fully-connected layer is constructed. Features of collected face images are fed into the deep convolutional neural network model for training, yielding a trained facial expression recognition model. Facial expression images are then collected by an image acquisition device, the face image to be recognized is input into the facial expression recognition model, and the recognized facial expression is produced.
Further, an initial convolutional neural network model is constructed, the initial convolutional neural network model is optimized, and a deep convolutional neural network model of a double-channel full-connection layer is constructed, and the method comprises the following steps:
constructing an initial convolutional neural network model, continuously convolving hidden layers of the initial convolutional neural network by adopting a minimum-scale convolution kernel, and then pooling;
optimizing the network internal structure of the pooled convolutional neural network model, and constructing a deep convolutional neural network model of a double-channel full-connection layer.
In a specific implementation, an initial convolutional neural network model is constructed, continuous convolution is applied to the hidden layers with a small-scale convolution kernel followed by (max + average) pooling, the internal structure of the network is optimized, and the traditional single-channel fully-connected layer is improved to construct a DCNN (Deep Convolutional Neural Network) model with a dual-channel fully-connected layer.
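As a rough sketch of such a dual-channel fully-connected DCNN, the following PyTorch module uses two parallel fully-connected channels whose outputs are concatenated into a fusion layer. The concrete sizes (48x48 grayscale input, 32/64/128 filters, 256-unit channels fused to 512, 7 expression classes) are assumptions read off figs. 2 and 3b rather than stated in this text, and ReLU with plain max pooling stands in for the patent's Maxout activation and combined (max + average) pooling.

```python
import torch
import torch.nn as nn

class TwoChannelFCNet(nn.Module):
    """Sketch of a DCNN with a dual-channel fully-connected (FC) layer."""

    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),     # C1, zero-padded
            nn.MaxPool2d(2),                               # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),    # C2 \ two successive
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),    # C3 / 3x3 convolutions
            nn.MaxPool2d(2),                               # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),   # C4 \ two successive
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),  # C5 / 3x3 convolutions
            nn.MaxPool2d(2),                               # 12 -> 6
        )
        flat = 128 * 6 * 6
        # Two parallel FC channels F1 and F2 (256 units each).
        self.fc1 = nn.Sequential(nn.Linear(flat, 256), nn.ReLU(), nn.Dropout(0.5))
        self.fc2 = nn.Sequential(nn.Linear(flat, 256), nn.ReLU(), nn.Dropout(0.5))
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        # F3: parameter-free fusion of the two channels by concatenation.
        fused = torch.cat([self.fc1(h), self.fc2(h)], dim=1)
        return self.classifier(fused)
```

A forward pass on a batch of 48x48 grayscale face crops, `TwoChannelFCNet()(torch.zeros(2, 1, 48, 48))`, yields a `(2, 7)` logit tensor.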
Further, optimizing the network internal structure of the pooled convolutional neural network model, including:
and optimizing the network internal structure of the pooled convolutional neural network model according to the Maxout activating function and the Dropout algorithm.
In a specific implementation, the internal structure of the network is optimized by combining the Maxout activation function with the Dropout technique.
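As an illustration of the Maxout activation mentioned above, the sketch below implements a single Maxout layer in NumPy: each output unit takes the maximum over k affine "pieces", so the layer learns its own piecewise-linear activation rather than using a fixed ReLU. All sizes here are illustrative, not taken from the patent.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: x (n_in,), W (k, n_out, n_in), b (k, n_out) -> (n_out,)."""
    z = np.einsum("koi,i->ko", W, x) + b   # k affine "pieces"
    return z.max(axis=0)                   # element-wise max over the pieces

rng = np.random.default_rng(0)
k, n_in, n_out = 3, 8, 4                   # illustrative sizes
W = rng.normal(size=(k, n_out, n_in))
b = rng.normal(size=(k, n_out))
x_in = rng.normal(size=n_in)
y = maxout(x_in, W, b)                     # shape (n_out,)
```

Because the max is taken over learned affine maps, a Maxout layer can approximate ReLU (one piece fixed at zero) or any convex piecewise-linear activation, which is why the patent pairs it with Dropout.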
Further, acquiring a facial expression training sample, training the deep convolutional neural network model according to the facial expression training sample, and generating a facial expression recognition model, including:
acquiring a facial expression training sample, and training a deep convolutional neural network model according to the facial expression training sample;
and in the training process, learning is carried out by adopting an A-Softmax algorithm, and a facial expression recognition model is generated according to the learning result.
In a specific implementation, the A-Softmax loss is used during training: the angle serves as the distance measure, and combining the angular distance with the learned features enhances their discriminability. This markedly improves network performance and feature extraction capability while reducing the number of parameters in the training process, yielding an effective facial expression recognition model.
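The angular-margin idea behind the A-Softmax loss can be sketched as follows: class weight vectors are normalized so each logit becomes ||x||·cos(θ_j), and the target class's cosine is replaced by the monotone surrogate ψ(θ) = (−1)^k cos(mθ) − 2k for θ ∈ [kπ/m, (k+1)π/m], which enforces an angular margin governed by m. This is a simplified sketch (no bias terms, no annealing between Softmax and A-Softmax as used in practice).

```python
import numpy as np

def a_softmax_logits(x, W, target, m=4):
    """x: (d,) feature; W: (C, d) class weights; target: target class index.

    Returns the C logits with the A-Softmax angular margin applied to the
    target class (simplified sketch: no bias, no annealing).
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # ||w_j|| = 1
    xnorm = np.linalg.norm(x)
    cos = Wn @ x / xnorm                                # cos(theta_j)
    logits = xnorm * cos
    theta = np.arccos(np.clip(cos[target], -1.0, 1.0))
    k = np.floor(theta * m / np.pi)
    psi = (-1.0) ** k * np.cos(m * theta) - 2.0 * k     # monotone surrogate
    logits[target] = xnorm * psi
    return logits
```

Since ψ(θ) ≤ cos(θ), the target logit is penalized unless the feature lies well inside its class's angular region, which is what pushes classes apart on the hypersphere.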
Further, constructing an initial convolutional neural network model, continuously convolving hidden layers of the initial convolutional neural network by adopting a minimum-scale convolution kernel, and then pooling, wherein the pooling comprises the following steps:
constructing an initial convolutional neural network model, obtaining the convolutional layers of the convolutional neural network model, and applying a zero-value padding technique in the convolutional layers;
performing continuous convolution on the padded convolutional layers with a minimum-scale convolution kernel, followed by pooling.
In a specific implementation, the dual-channel convolutional neural network, as shown in fig. 2, comprises 5 convolutional layers, 3 pooling layers, a fully-connected fusion layer, and a dual-channel fully-connected layer. A zero-value padding technique is applied at the convolutional layers, and two successive convolution operations are performed at the C2, C3 convolutional layers and at the C4, C5 convolutional layers, respectively. Combining max pooling with average pooling retains more diversified feature information.
Further, performing continuous convolution on hidden layers of the initial convolutional neural network by using a minimum-scale convolution kernel, and then performing pooling, wherein the pooling comprises:
and continuously convolving hidden layers of the initial convolutional neural network by using a filter convolution kernel of 3x3, and then pooling.
In a specific implementation, as shown in fig. 2, two successive convolution operations with 3x3 filters are performed at the C2, C3 convolutional layers and at the C4, C5 convolutional layers, respectively. Combining max pooling with average pooling retains more diversified feature information.
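One plausible reading of combining max and average pooling, averaging the two responses per window, can be sketched as below. The 50/50 combination is an assumption; the patent only states that the two poolings are combined, not how.

```python
import numpy as np

def max_avg_pool2x2(x):
    """2x2 pooling that averages the max-pool and mean-pool responses.

    x: (H, W) with even H, W -> (H/2, W/2). The 50/50 weighting is an
    assumption; the text only says max and average pooling are combined.
    """
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))   # gather each 2x2 window
    return 0.5 * (blocks.max(axis=2) + blocks.mean(axis=2))
```

For `x = np.arange(16.).reshape(4, 4)`, the top-left 2x2 window [[0, 1], [4, 5]] pools to 0.5 x (5 + 2.5) = 3.75: the max keeps the strongest activation while the mean preserves the window's overall response, which is the "more diversified feature information" the text refers to.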
Further, the convolutional neural network is provided with a fully-connected fusion layer, and the network internal structure of the pooled convolutional neural network model is optimized according to the Maxout activation function and the Dropout algorithm, and the method further comprises the following steps:
and optimizing the fully-connected fusion layer of the pooled convolutional neural network model according to the Maxout activating function and the Dropout algorithm.
In a specific implementation, the Dropout technique is applied to the convolutional layers and the fully-connected layer to prevent overfitting, and batch normalization is added after the convolutional layers to improve the generalization of the DCNN model.
As can be seen from fig. 3a and fig. 3b, fig. 3a shows the network layer parameters of the DCNN model before modification, and fig. 3b shows those of the modified TCNN model. Using successive convolutions reduces the trainable parameters of the C2 and C3 layers by (5x5x24x24 - 3x3x24x24x2) x 64 = 258,048, and those of the C4 and C5 layers by (5x5x12x12 - 3x3x12x12x2) x 128 = 129,024, for a total reduction of 387,072 trainable parameters. Because the improved TCNN model uses two channels at the fully-connected layer, the number of trainable parameters there increases by 256 + 256 = 512. In fig. 3b, the F3 layer is a feature fusion layer formed by fusing the F1 and F2 fully-connected layers, so its parameters are not trainable. Overall, the TCNN model optimizes the number of parameters in the network and reduces its trainable parameters.
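The parameter counts in the preceding paragraph can be checked by reproducing the arithmetic directly:

```python
# Parameters saved by replacing one 5x5 convolution with two successive 3x3
# convolutions, at the layer sizes stated in the description above.
saved_c2_c3 = (5 * 5 * 24 * 24 - 3 * 3 * 24 * 24 * 2) * 64    # C2/C3 layers
saved_c4_c5 = (5 * 5 * 12 * 12 - 3 * 3 * 12 * 12 * 2) * 128   # C4/C5 layers
total_saved = saved_c2_c3 + saved_c4_c5
added_fc = 256 + 256   # extra parameters from the second FC channel
```

This reproduces the figures in the text: 258,048 + 129,024 = 387,072 parameters saved, against only 512 added at the dual-channel fully-connected layer.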
The embodiments of the present invention provide a facial expression recognition method for live webcast lessons that improves network performance, increases feature extraction capability, and reduces the number of parameters in the training process. The method performs (max + average) pooling after continuous convolution with a small-scale convolution kernel in the hidden layers, optimizes the internal network structure by combining the Maxout activation function with the Dropout technique, and improves the conventional single-channel fully-connected layer to construct a DCNN model with a dual-channel fully-connected layer. The A-Softmax loss is used during training, with the angle as the distance measure, combining angular distance and the learned features to enhance discriminability. An effective facial expression recognition model is thereby obtained.
According to the embodiment of the invention, the influence of full-connection layers with different scales on the high-level semantic feature expression capability of the image is fully considered, the dual-channel fusion full-connection layer is designed, and the feature expression capability of the DCNN model is enhanced.
A Maxout activation function replaces the traditional ReLU activation function at the dual-channel fully-connected layer, enabling the network to express more accurate high-dimensional feature information.
To satisfy the criterion that ideal facial features for FER should have a maximum intra-class distance smaller than the minimum inter-class distance, the A-Softmax loss is used during training, allowing the TCNN to learn facial features with a geometrically interpretable angular margin.
It should be noted that, a certain order does not necessarily exist between the above steps, and those skilled in the art can understand, according to the description of the embodiments of the present invention, that in different embodiments, the above steps may have different execution orders, that is, may be executed in parallel, may also be executed interchangeably, and the like.
Another embodiment of the present invention provides a facial expression recognition apparatus for live online lessons, as shown in fig. 4, the apparatus 1 includes:
the model building module 11 is used for building an initial convolutional neural network model, optimizing the initial convolutional neural network model and building a deep convolutional neural network model of a double-channel full-connection layer;
the model training module 12 is used for acquiring a facial expression training sample, training the deep convolutional neural network model according to the facial expression training sample, and generating a facial expression recognition model;
and the facial expression recognition module 13 is configured to collect facial expression images in a live network course, input the facial expression images into a facial expression recognition model, and generate a facial expression recognition result.
The specific implementation is shown in the method embodiment, and is not described herein again.
Another embodiment of the present invention provides a facial expression recognition apparatus for a live webcast session, as shown in fig. 5, the apparatus 10 includes:
one or more processors 110 and a memory 120. Fig. 5 takes one processor 110 as an example; the processor 110 and the memory 120 may be connected by a bus or other means, a bus connection being illustrated in fig. 5.
The memory 120 is a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions corresponding to the facial expression recognition method for live webcast lessons in the embodiment of the present invention. The processor 110 executes various functional applications and data processing of the device 10, namely, implements the facial expression recognition method for live webcast lessons in the above-described method embodiments, by running the nonvolatile software programs, instructions, and units stored in the memory 120.
The memory 120 may include a storage program area and a storage data area, wherein the storage program area may store an application program required for operating the device, at least one function; the storage data area may store data created according to the use of the device 10, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more units are stored in the memory 120, and when executed by the one or more processors 110, perform the facial expression recognition method for webcast lessons in any of the above-described method embodiments, e.g., performing the above-described method steps S100 to S300 in fig. 1.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100-S300 of fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory components or memory of the operating environment described herein are intended to comprise one or more of these and/or any other suitable types of memory.
Another embodiment of the present invention provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the facial expression recognition method for webcast lessons of the above-described method embodiment. For example, the method steps S100 to S300 in fig. 1 described above are performed.
The above-described embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions essentially or contributing to the related art can be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Conditional language such as "can," "might," or "may," unless specifically stated otherwise or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, particular features, elements, and/or operations. Thus, such conditional language is not generally intended to imply that features, elements, and/or operations are in any way required for one or more embodiments, or that one or more embodiments must include logic for deciding, with or without input or prompting, whether such features, elements, and/or operations are included or are to be performed in any particular embodiment.
What has been described herein in the specification and drawings includes examples that can provide a facial expression recognition method and apparatus for a live webcast session. It will, of course, not be possible to describe every conceivable combination of components and/or methodologies for purposes of describing the various features of the disclosure, but it can be appreciated that many further combinations and permutations of the disclosed features are possible. It is therefore evident that various modifications can be made to the disclosure without departing from the scope or spirit thereof. In addition, or in the alternative, other embodiments of the disclosure may be apparent from consideration of the specification and drawings and from practice of the disclosure as presented herein. It is intended that the examples set forth in this specification and the drawings be considered in all respects as illustrative and not restrictive. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Claims (10)
1. A facial expression recognition method for a live webcast course is characterized by comprising the following steps:
constructing an initial convolutional neural network model, optimizing the initial convolutional neural network model, and constructing a deep convolutional neural network model with a dual-channel fully-connected layer;
acquiring a facial expression training sample, training the deep convolutional neural network model according to the facial expression training sample, and generating a facial expression recognition model;
and acquiring a facial expression image in the live webcast course, inputting the facial expression image into the facial expression recognition model, and generating a facial expression recognition result.
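For illustration only, and not as part of the claimed subject matter, the three steps of claim 1 can be sketched in Python. A toy linear classifier stands in for the claimed dual-channel deep convolutional neural network, and every name below (`build_model`, `train`, `recognize`, the class count) is a hypothetical assumption:

```python
import numpy as np

N_CLASSES = 7  # assumed number of expression categories, not stated in the claims

def build_model(n_features, n_classes, seed=0):
    # Step 1 stand-in: "constructing" a model (here one linear layer,
    # instead of the claimed dual-channel deep CNN).
    rng = np.random.default_rng(seed)
    return {"W": rng.normal(0.0, 0.01, (n_features, n_classes)),
            "b": np.zeros(n_classes)}

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(model, X, y, lr=0.1, epochs=50):
    # Step 2 stand-in: plain cross-entropy gradient descent; the patent
    # instead trains with an A-Softmax loss (claim 4).
    for _ in range(epochs):
        grad = softmax(X @ model["W"] + model["b"])
        grad[np.arange(len(y)), y] -= 1.0   # softmax prob minus one-hot label
        model["W"] -= lr * X.T @ grad / len(y)
        model["b"] -= lr * grad.mean(axis=0)
    return model

def recognize(model, X):
    # Step 3: feed facial-expression features through the trained model
    # and return one predicted expression label per sample.
    return softmax(X @ model["W"] + model["b"]).argmax(axis=1)
```

The real method would replace the linear layer with the deep CNN of claims 2 to 7; only the build/train/recognize control flow mirrors the claim.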
2. The facial expression recognition method for the live webcast course according to claim 1, wherein the constructing of the initial convolutional neural network model, the optimizing of the initial convolutional neural network model, and the constructing of the deep convolutional neural network model with the dual-channel fully-connected layer comprises:
constructing the initial convolutional neural network model, performing successive convolutions on the hidden layers of the initial convolutional neural network with a minimum-scale convolution kernel, and then pooling;
and optimizing the internal network structure of the pooled convolutional neural network model, and constructing the deep convolutional neural network model with the dual-channel fully-connected layer.
3. The facial expression recognition method for the live webcast course according to claim 2, wherein the optimizing of the internal network structure of the pooled convolutional neural network model comprises:
optimizing the internal network structure of the pooled convolutional neural network model according to a Maxout activation function and a Dropout algorithm.
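As an informal illustration of the two optimization components named in claim 3, the NumPy sketch below implements a generic Maxout activation and inverted Dropout; the tensor shapes and the piece count k are assumptions, not taken from the patent:

```python
import numpy as np

def maxout(x, W, b):
    # Maxout: compute k affine pieces and keep the element-wise maximum.
    # x: (batch, in_dim); W: (in_dim, k, out_dim); b: (k, out_dim).
    pieces = np.einsum("bi,iko->bko", x, W) + b
    return pieces.max(axis=1)

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero each unit with probability `rate` during
    # training and rescale survivors so the expected activation is unchanged.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At inference time `training=False` makes dropout the identity, the usual convention, so no rescaling is needed when the model is deployed.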
4. The facial expression recognition method for the live webcast course according to claim 3, wherein the acquiring of the facial expression training sample, the training of the deep convolutional neural network model according to the facial expression training sample, and the generating of the facial expression recognition model comprises:
acquiring the facial expression training sample, and training the deep convolutional neural network model according to the facial expression training sample;
and, in the training process, learning with an A-Softmax algorithm, and generating the facial expression recognition model according to the learning result.
5. The facial expression recognition method for the live webcast course according to claim 4, wherein the constructing of the initial convolutional neural network model and the pooling after successive convolutions of the hidden layers of the initial convolutional neural network with the minimum-scale convolution kernel comprises:
constructing the initial convolutional neural network model, obtaining a convolutional layer of the convolutional neural network model, and applying a zero-value padding technique in the convolutional layer;
and performing successive convolutions on the convolutional layers of the padded convolutional neural network with the minimum-scale convolution kernel, and then pooling.
6. The facial expression recognition method for the live webcast course according to claim 5, wherein the pooling after the successive convolutions of the hidden layers of the initial convolutional neural network with the minimum-scale convolution kernel comprises:
successively convolving the hidden layers of the initial convolutional neural network with 3×3 convolution kernels, and then pooling.
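A minimal single-channel NumPy sketch of the convolution scheme in claims 5 and 6: zero-value padding so each 3×3 convolution preserves spatial size, two successive 3×3 convolutions (a 5×5 receptive field with fewer parameters than one 5×5 kernel), then 2×2 max pooling. Channel counts, strides, and biases are omitted as assumptions, and the operation is cross-correlation, as in common deep-learning frameworks:

```python
import numpy as np

def conv3x3(img, kernel):
    # "Same" 3x3 convolution: zero-pad by 1 pixel (the 0-value filling
    # of claim 5) so the output keeps the input's spatial size.
    p = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    return out

def max_pool2x2(img):
    # Non-overlapping 2x2 max pooling; odd trailing rows/columns dropped.
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv_block(img, k1, k2):
    # Two successive 3x3 convolutions, then pooling (claim 6's scheme).
    return max_pool2x2(conv3x3(conv3x3(img, k1), k2))
```

Stacking two 3×3 kernels uses 18 weights against 25 for one 5×5 kernel while covering the same receptive field, which is the usual rationale for minimum-scale kernels.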
7. The facial expression recognition method for the live webcast course according to claim 6, wherein the convolutional neural network is provided with a fully-connected fusion layer, and the optimizing of the internal network structure of the pooled convolutional neural network model according to the Maxout activation function and the Dropout algorithm further comprises:
optimizing the fully-connected fusion layer of the pooled convolutional neural network model according to the Maxout activation function and the Dropout algorithm.
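One possible reading of the dual-channel fully-connected fusion layer touched on by claims 2 and 7 is two parallel fully-connected channels, each passed through Maxout and Dropout, whose outputs are concatenated in the fusion layer. The NumPy sketch below is a hypothetical interpretation, not the architecture disclosed in the patent:

```python
import numpy as np

def dual_channel_fusion(feat, params, rate, rng, training=True):
    # feat: (batch, in_dim) pooled CNN features.
    # params: one (Wp, bp) pair per channel, Wp: (in_dim, k, out_dim),
    # bp: (k, out_dim); two pairs give the assumed dual-channel layout.
    channels = []
    for Wp, bp in params:
        # Maxout over k affine pieces (claim 7's activation).
        z = np.max(np.einsum("bi,iko->bko", feat, Wp) + bp, axis=1)
        # Inverted Dropout during training only.
        if training and rate > 0.0:
            mask = rng.random(z.shape) >= rate
            z = z * mask / (1.0 - rate)
        channels.append(z)
    # Fusion by concatenating the channel outputs.
    return np.concatenate(channels, axis=1)
```

A subsequent classifier (e.g. the A-Softmax layer of claim 4) would consume the fused vector; concatenation is only one plausible fusion choice, with element-wise summation being another.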
8. A facial expression recognition apparatus for a live webcast course, the apparatus comprising:
a model building module, configured to build an initial convolutional neural network model, optimize the initial convolutional neural network model, and build a deep convolutional neural network model with a dual-channel fully-connected layer;
a model training module, configured to acquire a facial expression training sample, train the deep convolutional neural network model according to the facial expression training sample, and generate a facial expression recognition model;
and a facial expression recognition module, configured to acquire a facial expression image in the live webcast course, input the facial expression image into the facial expression recognition model, and generate a facial expression recognition result.
9. A facial expression recognition device for a live webcast course, the device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the facial expression recognition method for the live webcast course of any one of claims 1-7.
10. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the facial expression recognition method for the live webcast course of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011193684.3A CN112364737A (en) | 2020-10-30 | 2020-10-30 | Facial expression recognition method, device and equipment for live webcast lessons |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112364737A true CN112364737A (en) | 2021-02-12 |
Family
ID=74514219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011193684.3A Pending CN112364737A (en) | 2020-10-30 | 2020-10-30 | Facial expression recognition method, device and equipment for live webcast lessons |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364737A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106529503A (en) * | 2016-11-30 | 2017-03-22 | South China University of Technology | Method for recognizing face emotion by using integrated convolutional neural network |
CN109272107A (en) * | 2018-08-10 | 2019-01-25 | Guangdong University of Technology | A method of improving the number of parameters of deep layer convolutional neural networks |
KR20190123372A (en) * | 2018-04-12 | 2019-11-01 | Gachon University Industry-Academic Cooperation Foundation | Apparatus and method for robust face recognition via hierarchical collaborative representation |
Non-Patent Citations (1)
Title |
---|
ZHANG, LINLIN: "Research on Facial Expression Recognition Based on Convolutional Neural Networks", China National Knowledge Infrastructure (CNKI), no. 09, 15 September 2019 (2019-09-15), pages 1 - 4 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113569975A (en) * | 2021-08-04 | 2021-10-29 | South China Normal University | Sketch work rating method and device based on model fusion |
CN113688714A (en) * | 2021-08-18 | 2021-11-23 | South China Normal University | Method, device, equipment and storage medium for identifying multi-angle facial expressions |
CN113688714B (en) * | 2021-08-18 | 2023-09-01 | South China Normal University | Multi-angle facial expression recognition method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||