CN107194347A - Method for micro-expression detection based on a Facial Action Coding System - Google Patents
- Publication number
- CN107194347A (application CN201710356981.7A)
- Authority
- CN
- China
- Prior art keywords
- layer
- expression
- micro
- training
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/176—Dynamic expression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a method for micro-expression detection based on the Facial Action Coding System. Its main contents include: visualizing CNN filters, the network architecture and training, transfer learning, and micro-expression detection. The process first establishes a robust emotion classification model, analyses the model learned by the proposed network, visualizes the filters trained by the proposed network on different emotion classification tasks, and applies the model to micro-expression detection. The present invention improves the recognition rate of existing methods in micro-expression detection, demonstrates the strong correlation between the features produced by the unsupervised learning process and the action units used in facial expression analysis methods, shows through high cross-dataset and cross-task scores the generalization ability of FACS-based features, recognizes facial expressions and infers emotional states more accurately, improves the validity and accuracy of its applications in many fields, and promotes the development of artificial intelligence.
Description
Technical field
The present invention relates to the field of expression recognition, and in particular to a method for micro-expression detection based on a Facial Action Coding System.
Background art
Expression recognition is widely used in human-computer interaction, social gaming, psychological research, driver assistance and other fields, to automatically recognize facial expressions and infer emotional states. Typical applications include triggering a camera shutter when the photographed person smiles, automatically changing a game player's avatar expression, analyzing viewers' reactions to multimedia advertising, detecting pain and distress in patients, and detecting driver drowsiness. Facial expressions play a significant role in human communication and behavior. Although existing methods reach a certain accuracy in observing and analysing object features, most current methods consider only local information and ignore spatial consistency, which introduces estimation errors and prevents some targets in special scenes from being accurately recognized and detected.
The present invention proposes a method for micro-expression detection based on the Facial Action Coding System, using CNNs to visualize the feature maps of emotion detection. A robust emotion classification model is first established, the model learned by the proposed network is analysed, and the filters trained by the proposed network are visualized on different emotion classification tasks; then, while providing high-accuracy scores, the generalization ability of the features based on the Facial Action Coding System (FACS) is demonstrated across datasets and across tasks, and the model is applied to micro-expression detection. The present invention improves the recognition rate of existing methods in micro-expression detection, demonstrates the strong correlation between the features produced by the unsupervised learning process and the action units used in facial expression analysis methods, shows through high cross-dataset and cross-task scores the generalization ability of FACS-based features, improves the recognition rate of micro-expression detection, recognizes facial expressions and infers emotional states more accurately, improves the validity and accuracy of its applications in many fields, and promotes the development of artificial intelligence.
Content of the invention
To address the insufficient recognition rate of existing methods, the present invention improves the recognition rate of existing methods in micro-expression detection, demonstrates the strong correlation between the features produced by the unsupervised learning process and the action units used in facial expression analysis methods, shows through high cross-dataset and cross-task scores the generalization ability of FACS-based features, improves the recognition rate of micro-expression detection, recognizes facial expressions and infers emotional states more accurately, improves the validity and accuracy of its applications in many fields, and promotes the development of artificial intelligence.
To solve the above problems, the present invention provides a method for micro-expression detection based on the Facial Action Coding System, whose main contents include:
(1) visualizing CNN filters;
(2) the network architecture and training;
(3) transfer learning;
(4) micro-expression detection.
Wherein, the visualization of CNN filters is performed as follows. After a robust emotion classification model is established, the model learned by the proposed network is analysed, and the filters trained by the proposed network are visualized on different emotion classification tasks. The lower layers provide low-level Gabor-like filters, while the intermediate and higher layers close to the output provide high-level, human-readable features. Using this method, the features of the trained network can be inspected: a feature is visualized through the input that maximizes the activation of the corresponding filter, together with the pixels responsible for that response. Analysing the trained model shows that the network's feature maps bear a strong similarity to specific facial regions and motions, and that these regions and motions have a significant correlation with the action units that define the Facial Action Coding System (FACS).
Further, FACS is the Facial Action Coding System. First, 7 principal universal emotions are determined, satisfying the requirement that their expressed meaning is invariant across different cultural environments; they are labeled with the corresponding affective states, namely happiness, sadness, surprise, fear, disgust, anger and contempt, and are widely used in cognitive computing. FACS is an anatomically based system that describes all observable facial actions for each emotion. Using FACS as a methodological measurement system, any expression can be described by the action units (AUs) it activates and their activation intensities; each action unit describes a group of facial muscles that together produce one specific motion.
Further, for the CNN filters, the following method is used to match the putative AU represented by a filter with the AU labels in the actual dataset:
(1) given a convolutional layer l and a filter j, the activation output is denoted F_{l,j};
(2) the N input images with maximal activation are extracted, i* = argmax_i F_{l,j}(i);
(3) for each input i, the manually annotated AU labels are A_{i,u}, where A_{i,u} = 1 if action unit u is present in i;
(4) the correlation between filter j and the presence of action unit u is P_{j,u}, defined as the fraction of the N top images in which u is present, P_{j,u} = (1/N) Σ_i A_{i,u}.
A large number of top-layer neurons are found to produce no effective output for any input; the number of active neurons in the last convolutional layer is about 30% of the feature map size (60 out of 256). This number of active neurons is approximately the vocabulary size of the FACS action units, and these neurons can identify the corresponding facial expressions.
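The matching procedure above can be sketched in NumPy. Everything here is a synthetic stand-in: the activations `F`, the annotations `A` and the helper `au_correlation` are ours, and the definition of P_{j,u} as the fraction of the top-N images containing AU u is a reconstruction of the elided formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the patent's quantities: mean activation F[i, j]
# of filter j on image i, and manual AU annotations A[i, u].
n_images, n_filters, n_aus, N = 200, 8, 5, 20
F = rng.random((n_images, n_filters))
A = rng.integers(0, 2, (n_images, n_aus))     # A[i, u] = 1 if AU u is present in i

def au_correlation(F, A, j, N):
    """P[j, u]: fraction of the N images that maximally activate filter j
    in which action unit u is annotated as present."""
    top = np.argsort(F[:, j])[-N:]            # step (2): top-N activating images
    return A[top].mean(axis=0)                # step (4): P[j, u] = (1/N) * sum_i A[i, u]

P = np.stack([au_correlation(F, A, j, N) for j in range(n_filters)])
print(P.shape)  # one score per (filter, AU) pair
```

A filter whose row of `P` has one entry close to 1 would then be interpreted as representing that action unit.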
Wherein, regarding the Gabor filters: a Gabor filter is a linear filter used for edge detection. The frequency and orientation representations of Gabor filters are close to those of the human visual system, and they are commonly used for texture representation and description. In the spatial domain, a 2-D Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel function; it attains optimal localization simultaneously in the spatial and frequency domains and closely resembles human biological visual characteristics, so it can describe well the local structural information corresponding to spatial frequency (scale), spatial position and orientation. Gabor filters are self-similar, that is, every Gabor filter can be generated from a mother wavelet by dilation and rotation. In practical applications, Gabor filters can extract relevant features at different scales and orientations in the frequency domain.
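The 2-D Gabor filter described above (a sinusoidal plane wave multiplied by a Gaussian envelope, generated by rotation and dilation) can be sketched as follows; the parameter names and values are illustrative, not taken from the patent.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """2-D Gabor filter: a sinusoidal carrier times a Gaussian envelope,
    oriented at angle theta (the standard textbook formulation)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength + psi)
    return envelope * carrier

# A small bank at several orientations, matching the "dilation and rotation" remark.
bank = [gabor_kernel(15, wavelength=6.0, theta=t, sigma=3.0)
        for t in np.linspace(0, np.pi, 4, endpoint=False)]
print(bank[0].shape)
```

Convolving an image with such a bank extracts oriented edge and texture responses at the chosen scale.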
Wherein, the network architecture and training realize a simple, classical feed-forward convolutional neural network. The structure of each network is as follows: an input layer receives a grayscale or RGB image, and the input passes through 3 convolutional blocks, each consisting of a filter layer, a non-linearity (activation) and a max-pooling layer. Each of the 3 convolutional blocks has a rectified linear unit (ReLU) activation function and a 2x2 pooling layer. The convolutional layers contain more filters (neurons) the deeper the layer, yielding 64, 128 and 256 feature maps respectively, with each filter spanning 5x5 pixels. The convolutional blocks are followed by a fully connected layer with 512 hidden neurons, and the output of this hidden layer is passed to the output layer. The output size depends on the task: 8 outputs are used for emotion classification and up to 50 for AU labels, and the activation of the output layer changes accordingly. To reduce overfitting, dropout layers are applied after the last convolutional layer and after the fully connected layer, with probabilities 0.25 and 0.5 respectively; a dropout probability p means that the output of each neuron is set to 0 with probability p.
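The layer-by-layer shapes implied by this architecture can be checked with a small sketch. The 96x96 grayscale input and the use of 'same' padding for the 5x5 convolutions are assumptions for illustration, since the patent does not fix the input size.

```python
# Shape walk-through of the described network.
def conv_block(h, w, c_in, c_out, k=5, pool=2):
    # 5x5 'same' convolution keeps H x W; 2x2 max pooling halves each side.
    params = c_out * (c_in * k * k + 1)       # weights + biases of the conv layer
    return h // pool, w // pool, c_out, params

h, w, c = 96, 96, 1                           # assumed grayscale input
total = 0
for c_out in (64, 128, 256):                  # the three convolutional blocks
    h, w, c, p = conv_block(h, w, c, c_out)
    total += p
    print(f"block -> {h}x{w}x{c}  ({p} conv params)")

flat = h * w * c                              # flattened features entering the FC layer
fc_params = flat * 512 + 512                  # 512-unit fully connected layer
out_params = 512 * 8 + 8                      # e.g. 8 emotion-classification outputs
print("fc params:", fc_params, "output params:", out_params)
```

Under these assumptions the spatial resolution shrinks 96, 48, 24, 12 while the channel count grows 64, 128, 256, so almost all parameters live in the fully connected layer.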
Further, for network training, the networks are trained with the ADAM optimizer, with a learning rate of 10^-3 and a decay rate of 10^-5. To generalize the model as much as possible, a combination of random flips and affine transformations (for example rotation, shifting and scaling) is used for data augmentation, generating synthetic data from the images to enlarge the training set.
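A minimal augmentation sketch along these lines might look as follows; a small `np.roll` translation stands in for the full affine transformations (rotation, scaling) mentioned above, and all data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(img, rng):
    """Random horizontal flip plus a small random translation; a cheap
    stand-in for the rotation/scaling transforms described in the text."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # random horizontal flip
    dy, dx = rng.integers(-3, 4, size=2)          # random small shift
    return np.roll(img, (dy, dx), axis=(0, 1))

faces = rng.random((10, 48, 48))                  # synthetic grayscale "faces"
augmented = np.stack([augment(f, rng) for f in faces])
print(augmented.shape)
```

Applying `augment` afresh every epoch effectively multiplies the size of the training set without storing extra images.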
Wherein, transfer learning aims to reuse, for a new task, a model trained in advance on different data. Neural network models usually require a large training set; in some cases, however, the size of the training set is insufficient for proper training. Transfer learning allows the convolutional layers to be used as a pre-trained feature extractor, and only the output layer is altered or modified according to the current task: the first layers are regarded as predefined features, while the final, task-defining layer is adjusted by learning on the available training set.
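The recipe of a frozen feature extractor plus a retrained output layer can be illustrated with a toy NumPy sketch; the random projection `W_frozen` stands in for the pre-trained convolutional layers, and the data and labels are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the frozen, pre-trained convolutional layers: a fixed random
# projection followed by ReLU (assumption for illustration only).
W_frozen = rng.standard_normal((64, 512)) * 0.1

def features(x):
    return np.maximum(x @ W_frozen, 0.0)      # frozen layers: never updated

# Only the new output layer is trained for the current task.
W_out = np.zeros((512, 3))
x = rng.standard_normal((128, 64))            # small new-task dataset
y = rng.integers(0, 3, 128)

for _ in range(50):                           # plain softmax-regression steps
    f = features(x)
    logits = f @ W_out
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0            # gradient of cross-entropy loss
    W_out -= 0.01 * f.T @ p / len(y)          # update only the output layer

acc = ((features(x) @ W_out).argmax(axis=1) == y).mean()
print(round(float(acc), 2))
```

Only `W_out` ever changes, which is exactly why a small task-specific training set suffices.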
Wherein, for micro-expression detection: micro-expressions are more spontaneous and subtle facial movements, composed of the same facial motions (those defined by the FACS action units) but with different intensities; a micro-expression typically lasts only about 0.5 seconds. To detect the meaning they convey, each micro-expression is decomposed into 3 phases: onset, apex and offset, which describe respectively the beginning, peak and end of the motion. The FACS-based feature extractor is applied to the task of automatic micro-expression detection. The dataset used therefore contains 256 spontaneous micro-expressions filmed at 200 fps; all videos are labeled with onset, apex and offset frames and with the expression conveyed, and AU codes are added to the apex frames. The expressions were captured by showing subjects video segments intended to trigger the desired responses.
Further, for the micro-expression detection network, frames are first selected from the training data sequences for network training: for each video, only the onset, apex and offset frames and the first and last frames of the sequence are taken, the latter to account for the neutral pose. A CNN is first trained to detect emotion; then the convolutional layers of the trained network are combined with a long short-term memory (LSTM) network whose input is connected to the first fully connected layer of the CNN feature extractor. The LSTM used contains only one LSTM layer and one output layer, with recurrent dropout applied after the LSTM layer.
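The single LSTM layer that consumes the CNN's 512-dimensional fully connected features can be sketched as one standard LSTM cell step in NumPy; the weights and per-frame features below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell (gates in i, f, g, o order); a minimal
    stand-in for the single LSTM layer following the CNN feature extractor."""
    n = h.shape[-1]
    z = np.concatenate([x, h], axis=-1) @ W + b
    i, f, g, o = z[:n], z[n:2*n], z[2*n:3*n], z[3*n:]
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # new cell state
    h = sigmoid(o) * np.tanh(c)                    # new hidden state
    return h, c

feat_dim, hidden = 512, 32                         # 512: CNN fully connected output
W = rng.standard_normal((feat_dim + hidden, 4 * hidden)) * 0.01
b = np.zeros(4 * hidden)

h = c = np.zeros(hidden)
for frame_feat in rng.standard_normal((5, feat_dim)):   # onset..offset frame features
    h, c = lstm_step(frame_feat, h, c, W, b)
print(h.shape)
```

The final hidden state `h` would then feed the output layer that classifies the micro-expression.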
Brief description of the drawings
Fig. 1 is a system flow chart of the method for micro-expression detection based on the Facial Action Coding System according to the present invention.
Fig. 2 shows the filter visualization process of the method according to the present invention.
Fig. 3 shows the main expressions of the method according to the present invention.
Fig. 4 shows the action unit coding of the method according to the present invention.
Fig. 5 shows example images from the datasets of the method according to the present invention.
Embodiment
It should be noted that, provided there is no conflict, the embodiments in this application and the features of the embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system flow chart of the method for micro-expression detection based on the Facial Action Coding System according to the present invention. It mainly comprises visualizing CNN filters, the network architecture and training, transfer learning, and micro-expression detection.
Wherein, the visualization of CNN filters is performed as follows. After a robust emotion classification model is established, the model learned by the proposed network is analysed, and the filters trained by the proposed network are visualized on different emotion classification tasks. The lower layers provide low-level Gabor-like filters, while the intermediate and higher layers close to the output provide high-level, human-readable features. Using this method, the features of the trained network can be inspected: a feature is visualized through the input that maximizes the activation of the corresponding filter, together with the pixels responsible for that response. Analysing the trained model shows that the network's feature maps bear a strong similarity to specific facial regions and motions, and that these regions and motions have a significant correlation with the action units that define the Facial Action Coding System (FACS).
Further, FACS is the Facial Action Coding System. First, 7 principal universal emotions are determined, satisfying the requirement that their expressed meaning is invariant across different cultural environments; they are labeled with the corresponding affective states, namely happiness, sadness, surprise, fear, disgust, anger and contempt, and are widely used in cognitive computing. FACS is an anatomically based system that describes all observable facial actions for each emotion. Using FACS as a methodological measurement system, any expression can be described by the action units (AUs) it activates and their activation intensities; each action unit describes a group of facial muscles that together produce one specific motion.
Further, for the CNN filters, the following method is used to match the putative AU represented by a filter with the AU labels in the actual dataset:
(1) given a convolutional layer l and a filter j, the activation output is denoted F_{l,j};
(2) the N input images with maximal activation are extracted, i* = argmax_i F_{l,j}(i);
(3) for each input i, the manually annotated AU labels are A_{i,u}, where A_{i,u} = 1 if action unit u is present in i;
(4) the correlation between filter j and the presence of action unit u is P_{j,u}, defined as the fraction of the N top images in which u is present, P_{j,u} = (1/N) Σ_i A_{i,u}.
A large number of top-layer neurons are found to produce no effective output for any input; the number of active neurons in the last convolutional layer is about 30% of the feature map size (60 out of 256). This number of active neurons is approximately the vocabulary size of the FACS action units, and these neurons can identify the corresponding facial expressions.
Wherein, regarding the Gabor filters: a Gabor filter is a linear filter used for edge detection. The frequency and orientation representations of Gabor filters are close to those of the human visual system, and they are commonly used for texture representation and description. In the spatial domain, a 2-D Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel function; it attains optimal localization simultaneously in the spatial and frequency domains and closely resembles human biological visual characteristics, so it can describe well the local structural information corresponding to spatial frequency (scale), spatial position and orientation. Gabor filters are self-similar, that is, every Gabor filter can be generated from a mother wavelet by dilation and rotation. In practical applications, Gabor filters can extract relevant features at different scales and orientations in the frequency domain.
Wherein, the network architecture and training realize a simple, classical feed-forward convolutional neural network. The structure of each network is as follows: an input layer receives a grayscale or RGB image, and the input passes through 3 convolutional blocks, each consisting of a filter layer, a non-linearity (activation) and a max-pooling layer. Each of the 3 convolutional blocks has a rectified linear unit (ReLU) activation function and a 2x2 pooling layer. The convolutional layers contain more filters (neurons) the deeper the layer, yielding 64, 128 and 256 feature maps respectively, with each filter spanning 5x5 pixels. The convolutional blocks are followed by a fully connected layer with 512 hidden neurons, and the output of this hidden layer is passed to the output layer. The output size depends on the task: 8 outputs are used for emotion classification and up to 50 for AU labels, and the activation of the output layer changes accordingly. To reduce overfitting, dropout layers are applied after the last convolutional layer and after the fully connected layer, with probabilities 0.25 and 0.5 respectively; a dropout probability p means that the output of each neuron is set to 0 with probability p.
Further, for network training, the networks are trained with the ADAM optimizer, with a learning rate of 10^-3 and a decay rate of 10^-5. To generalize the model as much as possible, a combination of random flips and affine transformations (for example rotation, shifting and scaling) is used for data augmentation, generating synthetic data from the images to enlarge the training set.
Wherein, transfer learning aims to reuse, for a new task, a model trained in advance on different data. Neural network models usually require a large training set; in some cases, however, the size of the training set is insufficient for proper training. Transfer learning allows the convolutional layers to be used as a pre-trained feature extractor, and only the output layer is altered or modified according to the current task: the first layers are regarded as predefined features, while the final, task-defining layer is adjusted by learning on the available training set.
Wherein, for micro-expression detection: micro-expressions are more spontaneous and subtle facial movements, composed of the same facial motions (those defined by the FACS action units) but with different intensities; a micro-expression typically lasts only about 0.5 seconds. To detect the meaning they convey, each micro-expression is decomposed into 3 phases: onset, apex and offset, which describe respectively the beginning, peak and end of the motion. The FACS-based feature extractor is applied to the task of automatic micro-expression detection. The dataset used therefore contains 256 spontaneous micro-expressions filmed at 200 fps; all videos are labeled with onset, apex and offset frames and with the expression conveyed, and AU codes are added to the apex frames. The expressions were captured by showing subjects video segments intended to trigger the desired responses.
Further, for the micro-expression detection network, frames are first selected from the training data sequences for network training: for each video, only the onset, apex and offset frames and the first and last frames of the sequence are taken, the latter to account for the neutral pose. A CNN is first trained to detect emotion; then the convolutional layers of the trained network are combined with a long short-term memory (LSTM) network whose input is connected to the first fully connected layer of the CNN feature extractor. The LSTM used contains only one LSTM layer and one output layer, with recurrent dropout applied after the LSTM layer.
Fig. 2 shows the filter visualization process of the method for micro-expression detection based on the Facial Action Coding System according to the present invention. After a robust emotion classification model is established, the model learned by the network is analysed, and the filters trained by the proposed network are visualized on different emotion classification tasks. The lower layers provide low-level Gabor-like filters, while the intermediate and higher layers close to the output provide high-level, human-readable features. In feature visualization, a feature is described through the input that maximizes the activation of the required filter along the pixels responsible for the response.
Fig. 3 shows the main expressions of the method according to the present invention. From left to right they are disgust, fear, joy, surprise, sadness and anger, the main facial expressions; their generally expressed meaning does not change across different cultures, satisfying the requirements of simplicity and generality.
Fig. 4 shows the action unit coding of the method according to the present invention. The Facial Action Coding System (FACS) is an anatomically based system that describes all observable facial actions for each emotion. Using FACS as a methodological measurement system, any expression can be described by the action units it activates and their activation intensities. Each action unit describes a group of facial muscles that together produce one specific motion. The system includes 44 facial action units describing actions such as "opening the mouth" or "narrowing the eyes"; 20 further action units have since been added to account for the motion of the head and eyes.
Fig. 5 shows example images from the datasets of the method according to the present invention. A common model structure based on CNNs is obtained on various datasets, and the relation between these models and FACS is studied. To examine the generalization ability of the learned models, transfer learning is used to understand how these models perform on other datasets. To understand the common properties of state-of-the-art CNN-based models in facial expression recognition (FER), these methods are applied to numerous datasets; the figure shows a selection of example images.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the present invention can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention. Therefore, the appended claims are intended to be construed to include the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Claims (10)
1. A method for micro-expression detection based on a Facial Action Coding System, characterised in that it mainly comprises: visualizing CNN filters (one); a network architecture and training (two); transfer learning (three); and micro-expression detection (four).
2. The visualization of CNN filters (one) according to claim 1, characterised in that, after a robust emotion classification model is established, the model learned by the proposed network is analysed and the filters trained by the proposed network are visualized on different emotion classification tasks; the lower layers provide low-level Gabor-like filters, while the intermediate and higher layers close to the output provide high-level, human-readable features; using the above method, the features of the trained network can be inspected, a feature being visualized through the input that maximizes the activation of the corresponding filter together with the pixels responsible for that response; analysing the trained model shows that the network's feature maps bear a strong similarity to specific facial regions and motions, and that these regions and motions have a significant correlation with the action units that define the Facial Action Coding System (FACS).
3. The FACS according to claim 2, characterised in that the Facial Action Coding System first determines 7 principal universal emotions, satisfying the requirement that their expressed meaning is invariant across different cultural environments; they are labeled with the corresponding affective states, namely happiness, sadness, surprise, fear, disgust, anger and contempt, and are widely used in cognitive computing; FACS is an anatomically based system that describes all observable facial actions for each emotion; using FACS as a methodological measurement system, any expression can be described by the action units (AUs) it activates and their activation intensities, each action unit describing a group of facial muscles that together produce one specific motion.
4. The CNN filters according to claim 2, characterised in that the following method is used to match the putative AU represented by a filter with the AU labels in the actual dataset:
(1) given a convolutional layer l and a filter j, the activation output is denoted F_{l,j};
(2) the N input images with maximal activation are extracted, i* = argmax_i F_{l,j}(i);
(3) for each input i, the manually annotated AU labels are A_{i,u}, where A_{i,u} = 1 if action unit u is present in i;
(4) the correlation between filter j and the presence of action unit u is P_{j,u}, defined as the fraction of the N top images in which u is present, P_{j,u} = (1/N) Σ_i A_{i,u};
a large number of top-layer neurons are found to produce no effective output for any input; the number of active neurons in the last convolutional layer is about 30% of the feature map size (60 out of 256), which is approximately the vocabulary size of the FACS action units, and these neurons can identify the corresponding facial expressions.
5. The Gabor filter according to claim 2, characterised in that a Gabor filter is a linear filter used for edge detection; the frequency and orientation representations of Gabor filters are close to those of the human visual system, and they are commonly used for texture representation and description; in the spatial domain, a 2-D Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel function, attains optimal localization simultaneously in the spatial and frequency domains, and closely resembles human biological visual characteristics, so it can describe well the local structural information corresponding to spatial frequency (scale), spatial position and orientation; Gabor filters are self-similar, that is, every Gabor filter can be generated from a mother wavelet by dilation and rotation; in practical applications, Gabor filters can extract relevant features at different scales and orientations in the frequency domain.
6. The network architecture and training (2) according to claim 1, characterized in that a simple classical feed-forward convolutional neural network is implemented. The structure of each network is as follows: an input layer receives a grayscale or RGB image; the input passes through 3 convolutional blocks, each consisting of a filter layer, a non-linearity (activation) and a max-pooling layer. Each of the 3 convolutional blocks uses the rectified linear unit (ReLU) activation function and 2×2 pooling. The number of filters (neurons) grows with depth, yielding 64, 128 and 256 feature maps respectively, with each filter covering 5×5 pixels. The convolutional blocks are followed by a fully connected layer with 512 hidden neurons, whose output is passed to the output layer. The output size depends on the task: 8 for emotion classification and up to 50 for AU labeling, and the activation of the output layer can vary accordingly. To reduce over-fitting, dropout layers are used after the last convolutional layer and the fully connected layer, with probabilities 0.25 and 0.5 respectively; a dropout probability p means the output of each neuron is set to 0 with probability p.
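The spatial sizes produced by the three blocks can be traced with a small helper. The claim does not state the input resolution or padding, so the 48×48 input and valid (unpadded) 5×5 convolutions below are assumptions for illustration:

```python
def conv_block_out(size, kernel=5, pool=2, padding=0):
    """Spatial size after one block: 5x5 convolution then 2x2 max-pooling."""
    conv = size - kernel + 1 + 2 * padding
    return conv // pool

def network_shapes(input_size=48, maps=(64, 128, 256)):
    """Trace (feature maps, height, width) through the 3 convolutional blocks."""
    shapes, size = [], input_size
    for m in maps:
        size = conv_block_out(size)
        shapes.append((m, size, size))
    return shapes

shapes = network_shapes(48)
```

Under these assumptions a 48×48 input yields blocks of (64, 22, 22), (128, 9, 9) and (256, 2, 2), after which the 256 maps are flattened into the 512-neuron fully connected layer.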
7. The network training according to claim 6, characterized in that the network is trained with the ADAM optimizer, with a learning rate of 10^-3 and a decay rate of 10^-5. To generalize the model as much as possible, data augmentation is performed on the images with a combination of random flips and affine transformations such as rotation, translation and scaling, and the generated synthetic data enlarges the training set.
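A single ADAM update can be sketched in NumPy with the stated learning rate of 10^-3. The 10^-5 decay is omitted here for brevity, and the beta/epsilon values are the commonly used defaults, not values from the claim:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update (bias-corrected first and second moment estimates)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad**2       # second moment (uncentered var)
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy usage: minimize f(x) = (x - 3)^2 starting from x = 0
theta = np.array(0.0)
m = np.array(0.0)
v = np.array(0.0)
for t in range(1, 5001):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
```

With the effective step size bounded by the learning rate, the iterate converges close to the minimum at x = 3.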
8. The transfer learning (3) according to claim 1, characterized in that transfer learning aims to use, for a new task, a model trained in advance on different data. Neural network models usually require large training sets; in some cases, however, the available training set is too small for proper training. Transfer learning allows the convolutional layers to be used as a pre-trained feature extractor, with only the output layer changed or modified for the current task: the first layers are treated as predefined features, and the final task-defining layer is tuned by training on the available training set.
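The idea of freezing pre-trained layers and training only the output layer can be illustrated with a toy NumPy example, in which a fixed random ReLU projection stands in for the pre-trained convolutional feature extractor. Everything here is illustrative and not the patent's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# "pre-trained" feature extractor: frozen weights, never updated below
W_frozen = rng.normal(size=(10, 16))
def features(x):
    return np.maximum(x @ W_frozen, 0)   # fixed ReLU features

# small labelled set for the new task
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# only the new output layer (a logistic head) is trained
F = features(X)                          # frozen features, computed once
w_out = np.zeros(16)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(F @ w_out)))   # sigmoid output
    grad = F.T @ (p - y) / len(y)        # gradient of the logistic loss
    w_out -= 0.02 * grad                 # update the head only

acc = ((F @ w_out > 0) == (y == 1)).mean()
```

Despite the feature extractor being fixed, training the small head alone already separates the classes reasonably well on this toy data, which is the point of reusing pre-trained layers when labelled data is scarce.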
9. The micro-expression detection (4) according to claim 1, characterized in that micro-expressions are a more spontaneous and subtle kind of facial movement. They are composed of the same facial movements, the action units defined by FACS, differing in intensity, and a micro-expression often lasts only 0.5 seconds. To detect its meaning, each micro-expression is therefore decomposed into 3 phases: onset, apex and offset, describing respectively the beginning, the peak and the end of the motion. The FACS-based feature extractor is applied to the task of automatically detecting micro-expressions. For this, a dataset containing 256 spontaneous micro-expressions filmed at 200 fps is used; all videos are annotated with onset, apex and offset as well as the conveyed expression, AU codes are added for the apex frame, and the expressions are captured by showing the subjects video clips that trigger the desired responses.
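The onset/apex/offset decomposition can be sketched as a small frame-selection helper; the frame numbers in the example are made up for illustration:

```python
def key_frames(num_frames, onset, apex, offset):
    """Indices used for training: the annotated onset, apex and offset
    frames plus the first and last frames of the sequence, which are
    taken as the neutral position."""
    frames = {0, onset, apex, offset, num_frames - 1}
    return sorted(frames)

# a hypothetical 120-frame clip annotated with onset=40, apex=55, offset=70
selected = key_frames(120, 40, 55, 70)   # [0, 40, 55, 70, 119]
```

Only these five frames per clip are fed to the network, as described in claim 10 below.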
10. The micro-expression detection network according to claim 9, characterized in that the network is first trained on selected frames from the training sequences: for each video, only the onset, apex and offset frames are taken, plus the first and last frames of the sequence to account for the neutral position. A CNN is first trained to detect emotion; the convolutional layers of the trained network are then combined with a long short-term memory network (LSTM) whose input is connected to the first fully connected layer of the CNN feature extractor. The LSTM used contains only one LSTM layer and one output layer, with recurrent dropout applied after the LSTM layer.
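A single LSTM step over per-frame CNN features can be sketched in NumPy. The hidden size, weight scales and sequence length are illustrative; only the 512-dimensional input, matching the fully connected layer of claim 6, comes from the text:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a single LSTM layer fed by CNN features.

    x    : (d,)  feature vector from the CNN's first fully connected layer
    h, c : (n,)  hidden state and cell state
    W : (4n, d), U : (4n, n), b : (4n,)  stacked gate parameters (i, f, o, g)
    """
    n = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = (1 / (1 + np.exp(-z[k*n:(k+1)*n])) for k in range(3))  # gates
    g = np.tanh(z[3*n:])        # candidate cell values
    c = f * c + i * g           # cell update: forget old, admit new
    h = o * np.tanh(c)          # new hidden state
    return h, c

# run the per-frame CNN features of a 5-frame clip through the LSTM
rng = np.random.default_rng(0)
d, n, T = 512, 32, 5
W = rng.normal(0, 0.05, (4*n, d))
U = rng.normal(0, 0.05, (4*n, n))
b = np.zeros(4*n)
h = np.zeros(n)
c = np.zeros(n)
for x in rng.normal(size=(T, d)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The final hidden state h would then be passed to the output layer (with recurrent dropout applied during training, omitted in this sketch).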
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710356981.7A CN107194347A (en) | 2017-05-19 | 2017-05-19 | A kind of method that micro- expression detection is carried out based on Facial Action Coding System |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107194347A true CN107194347A (en) | 2017-09-22 |
Family
ID=59874841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710356981.7A Withdrawn CN107194347A (en) | 2017-05-19 | 2017-05-19 | A kind of method that micro- expression detection is carried out based on Facial Action Coding System |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107194347A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105031938A (en) * | 2015-07-07 | 2015-11-11 | 安徽瑞宏信息科技有限公司 | Intelligent toy based on target expression detection and recognition method |
Non-Patent Citations (1)
Title |
---|
RAN BREUER et al.: "A Deep Learning Perspective on the Origin of Facial Expressions", published online at https://arxiv.org/abs/1705.01842 * |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107862598A (en) * | 2017-09-30 | 2018-03-30 | 平安普惠企业管理有限公司 | Long-range the interview measures and procedures for the examination and approval, server and readable storage medium storing program for executing |
CN108876812A (en) * | 2017-11-01 | 2018-11-23 | 北京旷视科技有限公司 | Image processing method, device and equipment for object detection in video |
CN107679526A (en) * | 2017-11-14 | 2018-02-09 | 北京科技大学 | A kind of micro- expression recognition method of face |
CN107862292B (en) * | 2017-11-15 | 2019-04-12 | 平安科技(深圳)有限公司 | Personage's mood analysis method, device and storage medium |
CN107862292A (en) * | 2017-11-15 | 2018-03-30 | 平安科技(深圳)有限公司 | Personage's mood analysis method, device and storage medium |
WO2019095571A1 (en) * | 2017-11-15 | 2019-05-23 | 平安科技(深圳)有限公司 | Human-figure emotion analysis method, apparatus, and storage medium |
CN108052982A (en) * | 2017-12-22 | 2018-05-18 | 北京联合网视文化传播有限公司 | A kind of emotion detection method and system based on textures expression |
CN108052982B (en) * | 2017-12-22 | 2021-09-03 | 深圳市云网拜特科技有限公司 | Emotion detection method and system based on chartlet expression |
US11106896B2 (en) | 2018-03-26 | 2021-08-31 | Intel Corporation | Methods and apparatus for multi-task recognition using neural networks |
WO2019183758A1 (en) * | 2018-03-26 | 2019-10-03 | Intel Corporation | Methods and apparatus for multi-task recognition using neural networks |
CN108416235A (en) * | 2018-03-30 | 2018-08-17 | 百度在线网络技术(北京)有限公司 | The anti-peeping method, apparatus of display interface, storage medium and terminal device |
US11388465B2 (en) | 2018-08-10 | 2022-07-12 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for upscaling a down-scaled image by selecting an improved filter set for an artificial intelligence model |
US11825033B2 (en) | 2018-08-10 | 2023-11-21 | Samsung Electronics Co., Ltd. | Apparatus and method with artificial intelligence for scaling image data |
CN110830849A (en) * | 2018-08-10 | 2020-02-21 | 三星电子株式会社 | Electronic device, method of controlling electronic device, and method for controlling server |
CN109165608A (en) * | 2018-08-30 | 2019-01-08 | 深圳壹账通智能科技有限公司 | The micro- expression recognition method of multi-angle of view, device, storage medium and computer equipment |
CN109766461A (en) * | 2018-12-15 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Photo management method, device, computer equipment and medium based on micro- expression |
CN109711310A (en) * | 2018-12-20 | 2019-05-03 | 北京大学 | A kind of infant's attachment type automatic Prediction system and its prediction technique |
CN109583431A (en) * | 2019-01-02 | 2019-04-05 | 上海极链网络科技有限公司 | A kind of face Emotion identification model, method and its electronic device |
CN109800771A (en) * | 2019-01-30 | 2019-05-24 | 杭州电子科技大学 | Mix spontaneous micro- expression localization method of space-time plane local binary patterns |
CN109840513A (en) * | 2019-02-28 | 2019-06-04 | 北京科技大学 | A kind of micro- expression recognition method of face and identification device |
CN110175565A (en) * | 2019-05-27 | 2019-08-27 | 北京字节跳动网络技术有限公司 | The method and apparatus of personage's emotion for identification |
CN110232102A (en) * | 2019-06-13 | 2019-09-13 | 哈尔滨工程大学 | A kind of personnel's relational model modeling method based on transfer learning |
CN110232102B (en) * | 2019-06-13 | 2020-12-04 | 哈尔滨工程大学 | Personnel relation model modeling method based on transfer learning |
CN110472564A (en) * | 2019-08-14 | 2019-11-19 | 成都中科云集信息技术有限公司 | A kind of micro- Expression Recognition depression method of two-way LSTM based on feature pyramid network |
WO2021196721A1 (en) * | 2020-03-30 | 2021-10-07 | 上海商汤临港智能科技有限公司 | Cabin interior environment adjustment method and apparatus |
CN111439267A (en) * | 2020-03-30 | 2020-07-24 | 上海商汤临港智能科技有限公司 | Method and device for adjusting cabin environment |
CN111439267B (en) * | 2020-03-30 | 2021-12-07 | 上海商汤临港智能科技有限公司 | Method and device for adjusting cabin environment |
CN111783543A (en) * | 2020-06-02 | 2020-10-16 | 北京科技大学 | Face activity unit detection method based on multitask learning |
CN111783543B (en) * | 2020-06-02 | 2023-10-27 | 北京科技大学 | Facial activity unit detection method based on multitask learning |
CN111666911A (en) * | 2020-06-13 | 2020-09-15 | 天津大学 | Micro-expression data expansion method and device |
CN111950373B (en) * | 2020-07-13 | 2024-04-16 | 南京航空航天大学 | Method for micro expression recognition based on transfer learning of optical flow input |
CN111950373A (en) * | 2020-07-13 | 2020-11-17 | 南京航空航天大学 | Method for recognizing micro-expressions through transfer learning based on optical flow input |
CN112115779A (en) * | 2020-08-11 | 2020-12-22 | 浙江师范大学 | Interpretable classroom student emotion analysis method, system, device and medium |
CN112147573A (en) * | 2020-09-14 | 2020-12-29 | 山东科技大学 | Passive positioning method based on amplitude and phase information of CSI (channel State information) |
CN112017986A (en) * | 2020-10-21 | 2020-12-01 | 季华实验室 | Semiconductor product defect detection method and device, electronic equipment and storage medium |
CN112380924B (en) * | 2020-10-26 | 2023-09-15 | 华南理工大学 | Depression tendency detection method based on facial micro expression dynamic recognition |
CN112380924A (en) * | 2020-10-26 | 2021-02-19 | 华南理工大学 | Depression tendency detection method based on facial micro-expression dynamic recognition |
CN113515702A (en) * | 2021-07-07 | 2021-10-19 | 北京百度网讯科技有限公司 | Content recommendation method, model training method, device, equipment and storage medium |
CN113935377A (en) * | 2021-10-13 | 2022-01-14 | 燕山大学 | Pipeline leakage aperture identification method combining feature migration with time-frequency diagram |
CN113935377B (en) * | 2021-10-13 | 2024-05-07 | 燕山大学 | Pipeline leakage aperture identification method combining characteristic migration with time-frequency diagram |
WO2023098912A1 (en) * | 2021-12-02 | 2023-06-08 | 新东方教育科技集团有限公司 | Image processing method and apparatus, storage medium, and electronic device |
CN114565964A (en) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | Emotion recognition model generation method, recognition method, device, medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107194347A (en) | A kind of method that micro- expression detection is carried out based on Facial Action Coding System | |
Breuer et al. | A deep learning perspective on the origin of facial expressions | |
Linsley et al. | Learning what and where to attend | |
Elgammal et al. | Can: Creative adversarial networks, generating" art" by learning about styles and deviating from style norms | |
Ambadar et al. | Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions | |
El Hammoumi et al. | Emotion recognition in e-learning systems | |
Dewan et al. | A deep learning approach to detecting engagement of online learners | |
Pathar et al. | Human emotion recognition using convolutional neural network in real time | |
Lake | Towards more human-like concept learning in machines: Compositionality, causality, and learning-to-learn | |
Carvalhais et al. | Recognition and use of emotions in games | |
Tavares et al. | Crowdsourcing facial expressions for affective-interaction | |
Glowinski et al. | Body, space, and emotion: A perceptual study | |
ALISAWI et al. | Real-Time Emotion Recognition Using Deep Learning Methods: Systematic Review | |
Chandrasekharan et al. | Ideomotor Design: using common coding theory to derive novel video game interactions | |
Xu et al. | Spontaneous visual database for detecting learning-centered emotions during online learning | |
Madhu et al. | Convolutional Siamese networks for one-shot malaria parasite recognition in microscopic images | |
Kousalya et al. | Prediction of Best Optimizer for Facial Expression Detection using Convolutional Neural Network | |
Liliana et al. | The Fuzzy Emotion Recognition Framework Using Semantic-Linguistic Facial Features | |
Sadek et al. | Intelligent real-time facial expression recognition from video sequences based on hybrid feature tracking algorithms | |
Sun | Neural Networks for Emotion Classification | |
Melgare et al. | Investigating Emotion Style in Human Faces Using Clustering Methods | |
Veesam et al. | Deep neural networks for automatic facial expression recognition | |
Portaz et al. | Towards personalised learning of psychomotor skills with data mining | |
Sharmila | Depression level calculation for predicting child psychometric retardation using DepressNet approach through GPU accelerated Google cloud platform | |
Nawaf et al. | Human Emotion Identification based on Deep Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20170922 |