CN108805087B - Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system


Info

Publication number
CN108805087B
CN108805087B
Authority
CN
China
Prior art keywords
emotion
emotion recognition
time
human body
subsystem
Prior art date
Legal status
Active
Application number
CN201810612592.0A
Other languages
Chinese (zh)
Other versions
CN108805087A (en)
Inventor
俞旸
凌志辉
Current Assignee
Nanjing Xinktech Information Technology Co ltd
Original Assignee
Nanjing Xinktech Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Xinktech Information Technology Co ltd filed Critical Nanjing Xinktech Information Technology Co ltd
Priority to CN201810612592.0A
Publication of CN108805087A
Application granted
Publication of CN108805087B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system, which comprises data acquisition equipment and output equipment, and is characterized in that: an emotion analysis software system comprehensively analyzes and reasons over the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a time-sequence-based semantic fusion association judgment subsystem. The invention goes beyond emotion recognition in five individual modalities: it innovatively uses a deep neural network to make a comprehensive judgment over the single-modality information after neural-network encoding, deep association and understanding, which greatly improves accuracy and makes the system suitable for most general interrogation and interaction scenarios.

Description

Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
Technical Field
The invention relates to the technical field of emotion recognition, and in particular to a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system, involving machine learning, deep learning, computer vision, natural language processing, speech recognition, human action recognition, non-contact physiological detection, and the like.
Background
Emotion recognition is a technology for judging emotion changes of a person, and mainly infers the psychological state of the person by collecting external expression and behavior changes of the person. In modern society, emotion recognition technology is widely applied to aspects of intelligent equipment development, sales guidance robots, health management, advertising marketing and the like. Emotion is a state that combines human feelings, thoughts and behaviors, and includes a human psychological response to external or self-stimulation and also includes a physiological response accompanying such a psychological response. In various human-machine interaction systems (e.g., robots, interrogation systems, etc.), human-machine interaction becomes more friendly and natural if the system can recognize the emotional state of a human. Therefore, emotion analysis and recognition are important interdisciplinary research subjects in the fields of neuroscience, psychology, cognitive science, computer science, artificial intelligence and the like.
Over the long history of emotion research, many different methods have been used. In recent years, with the application and popularization of electroencephalogram (EEG) acquisition equipment, the rapid development of signal processing and machine learning technologies, and the great improvement in computer data-processing capacity, EEG-based emotion recognition has become a hot topic in neural engineering and biomedical engineering.
Emotion recognition methods differ according to the emotion-induction method used, and common approaches fall into two categories: recognition based on non-physiological signals and recognition based on physiological signals. Emotion recognition based on non-physiological signals mainly covers facial expressions and voice tone. Facial expression recognition exploits the correspondence between expressions and emotions: people produce specific facial-muscle movements and expression patterns in specific emotional states; for example, when happy, the corners of the mouth turn up and ring-shaped wrinkles appear around the eyes, while anger brings frowning, widened eyes, and so on. At present, facial expression recognition is mostly implemented with image-recognition methods. Voice-tone recognition relies on the different ways people speak in different emotional states; for example, the tone of speech is cheerful when one is happy and dull when one is irritable. Non-physiological-signal methods have the advantage of simple operation and no need for special equipment. Their disadvantage is that the reliability of emotion recognition cannot be guaranteed, because people can mask their true emotions by faking facial expressions and voice tones, and such disguise is often hard to detect. In addition, methods based on non-physiological signals are often difficult to apply to disabled people suffering from certain specific diseases.
Because electroencephalogram signals are very weak, they must be amplified with a high-gain amplifier during acquisition. Commercial EEG amplifiers are currently bulky and not portable. Recently, chip-scale EEG amplifiers have appeared that can effectively solve the size problem, but their cost remains high and they are still some distance from practical use.
It is therefore clear that emotion recognition methods based on physiological signals all require complex and expensive signal measurement and acquisition systems to obtain accurate biological signals, and they cannot be applied in a wide range of scenes; in particular, they are not applicable to special scenarios, such as criminal investigation and interrogation, where covert measurement is required.
Because emotion is an individual's subjective, conscious experience of and feeling about external stimuli, with both psychological and physiological components, it is desirable to infer the internal feeling from an individual's behavior or physiological changes rather than observing it directly, and such emotion recognition methods are currently the more widely advocated ones. Within this class of methods, most emotion recognition is essentially recognition of the meaning of facial expressions, carried out mainly by means of the movement of the large muscle groups of the face; it does not integrate human expressions, spoken words, body states, voice tone, physiological characteristics, and the like.
An example of the prior art is the multi-modal intelligent emotion perception system of publication No. CN107220591A. That technology provides a multi-modal intelligent emotion perception system comprising an acquisition module, a recognition module and a fusion module; the recognition module comprises an expression-based emotion recognition unit, a voice-based emotion recognition unit, a behavior-based emotion recognition unit and a physiological-signal-based emotion recognition unit; each unit recognizes multi-modal information to obtain emotion components, which comprise emotion type and emotion intensity, and the fusion module fuses the emotion components of the recognition module to achieve accurate perception of human emotion.
Disclosure of Invention
Aiming at the problems in the prior art, the invention innovatively provides a method and system for recognizing emotion by synthesizing five major modalities: human facial expression, text, voice, posture and physiological signals. Compared with similar patents (for example, publication No. CN107220591A), the invention makes fundamental breakthroughs in the following aspects.
1. Wearable equipment is not required; the proposed scheme only needs to acquire video and sound signals.
2. Physiological-signal features are extracted through an innovative non-contact micro-feature amplification approach, which greatly reduces cost and improves the ease of use of the product.
3. On top of basic text emotion analysis, the invention also provides comprehensive emotion analysis over multiple rounds of dialogue. This not only improves the emotion analysis of each local dialogue unit but also provides a comprehensive understanding of emotion across the whole dialogue.
4. On the basis of action recognition, the invention also introduces emotion recognition based on human body posture, in which the main posture of a person is recognized as changes of key skeletal nodes.
5. When the individual modalities are combined into the overall emotion judgment, the invention innovatively provides time-sequence-based emotion correspondence, association and reasoning built on recurrent neural networks (RNNs).
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system comprises data acquisition equipment and output equipment, and is characterized in that: the emotion analysis software system comprehensively analyzes and infers the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a semantic fusion association judgment subsystem based on time sequence.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: each RNN (recurrent neural network) organizes, in time order, the intermediate neural-network representations produced by the emotion understanding of each single modality, where the unit at each time step takes as its input the output of the corresponding time step of the intermediate layer of that single-modality subsystem's network; the per-time-step outputs of each single-modality RNN are passed to the multi-modal fusion association judgment RNN, which at each time step gathers the current outputs of all the single-modality RNNs, and after the modalities are integrated, its output at each time step is the final emotion judgment result for that time step.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion semantics of the single modalities are aligned on the time axis and trained with the time sequence as the common reference, thereby achieving automatic cross-modal association and correspondence over time and, finally, fused comprehensive emotion recognition, understanding and reasoning.
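As an illustration only (the patent publishes no code), the sketch below shows how time-aligned single-modality RNNs feeding a fusion RNN could look in PyTorch; the modality dimensions, the GRU cell choice and all names are assumptions made for this example.

```python
# Illustrative sketch only. It assumes each single-modality subsystem already
# emits a per-time-step feature vector; dimensions and the GRU choice are
# assumptions, not the patent's published architecture.
import torch
import torch.nn as nn

class TemporalFusionEmotionNet(nn.Module):
    def __init__(self, modality_dims, hidden_dim=128, num_emotions=7):
        super().__init__()
        # One recurrent encoder per modality (face, speech, text, posture, physiology).
        self.modality_rnns = nn.ModuleList(
            [nn.GRU(d, hidden_dim, batch_first=True) for d in modality_dims]
        )
        # Fusion RNN consumes the concatenated per-time-step outputs of all modalities.
        self.fusion_rnn = nn.GRU(hidden_dim * len(modality_dims), hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, modality_seqs):
        # modality_seqs: list of tensors, each (batch, time, feature_dim), time-aligned.
        per_modality = [rnn(seq)[0] for rnn, seq in zip(self.modality_rnns, modality_seqs)]
        fused_in = torch.cat(per_modality, dim=-1)   # (batch, time, hidden * num_modalities)
        fused_out, _ = self.fusion_rnn(fused_in)     # (batch, time, hidden)
        return self.classifier(fused_out)            # per-time-step emotion logits

# Usage: 5 modalities with illustrative feature sizes, 10 aligned time steps.
model = TemporalFusionEmotionNet([256, 64, 300, 14, 8])
seqs = [torch.randn(2, 10, d) for d in [256, 64, 300, 14, 8]]
logits = model(seqs)   # (2, 10, 7): one emotion prediction per time step
```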
The time sequence semantic fusion association judgment subsystem of the multi-modal-based emotion recognition system is further characterized in that: the emotion analysis software system further comprises an emotion recognition subsystem based on facial image expressions, an emotion recognition subsystem based on voice signals, an emotion analysis subsystem based on text semantics, an emotion recognition subsystem based on human body gestures, an emotion recognition subsystem based on physiological signals and a multi-round conversation semantic understanding subsystem.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on facial image expression exploits the fact that specific expression patterns are produced in specific emotional states; based on the motion information of dynamic image sequences and expression images, a region-based optical flow model and a reference optical flow algorithm, it effectively obtains motion-field information from complex backgrounds and multi-pose expression sequences;
the emotion analysis subsystem based on the text semantics has the advantages that the text emotion analysis can be divided into three levels of words, sentences and chapters, the word-based method is to analyze emotion characteristic words, and the polarity of the words is judged or the similarity of the word semantics is calculated according to a threshold value; the sentence-based method comprises the steps of sampling emotion labels for each sentence, extracting evaluation words or acquiring evaluation phrases for analysis; the method based on the chapters is to analyze the overall emotional tendency of the chapters on the basis of sentence emotional tendency analysis;
the emotion recognition subsystem based on human body postures extracts typical examples of various emotion states of a human body, discriminates and analyzes subtle differences of similar emotions for each posture, establishes a feature library, and extracts physical motion information from the characteristics as judgment basis according to motion properties such as duration time, frequency and the like of human body actions to recognize;
the emotion analysis subsystem based on text semantics is an emotion recognition method improved based on a deep Convolutional Neural Network (CNN), the subsystem utilizes vocabulary semantic vectors generated in a target field to carry out emotion classification on texts in a problem field, the input of the emotion analysis subsystem is sentences or documents expressed by a matrix, each line of the matrix corresponds to a word segmentation element, each line is a vector expressing a word, and the vectors are all in the form of word templates (expressed by high-dimensional vectors), and are obtained from the last module or are indexed according to the words in a word list;
the second layer of the subsystem is a convolutional neural network layer;
the third layer of the subsystem is a time-based convergence layer, the incidence relation of the characteristic information extracted from the previous convolutional layer on a time axis is found out, and the corresponding change on the time dimension in each characteristic matrix in the previous layer is summarized and induced, so that more concentrated characteristic information is formed;
the fourth layer of the subsystem is the last full-connection prediction layer, and the method comprises the steps of firstly, performing full arrangement and combination on the concentrated characteristic information obtained from the previous layer, and searching all possible corresponding weight combinations so as to find a coaction mode among the concentrated characteristic information and the concentrated characteristic information; the next internal layer is a Dropout layer, which means that weights of some hidden layer nodes of the network are randomly made to be out of work during model training, the nodes which are out of work are temporarily regarded as not part of the network structure, but the weights of the nodes are kept (only temporarily not updated), because the nodes can work again when a sample is input next time, the next internal layer is tanh (hyperbolic function), which is a nonlinear logistic transformation, and the last internal layer is softmax, which is a common activation function in multi-classification and is based on logistic regression, and the probability of each possible class needing to be predicted is sharpened, so that the predicted class is distinguished;
the emotion recognition subsystem based on human body gestures is characterized in that emotion extraction based on motion recognition is performed according to a data input source, firstly, motion data are represented and modeled, and then, emotion modeling is performed to obtain two sets of representation data related to motions and emotions; then, the continuous action is accurately identified by using the existing action identification method based on the motion data to obtain the action information of the data; matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process; the method specifically comprises the following steps:
● human body modeling
First, the joint points of the human body are modeled; the human body is regarded as an intrinsically connected rigid system comprising bones and joint points, whose relative motion produces the changes of human posture that we ordinarily describe as actions. Among the many joints of the human body, according to how strongly they influence emotion, the fingers and toes are ignored and the spine is abstracted into three joints (neck, chest and abdomen), yielding a human body model in which the upper body comprises the head, neck, chest, abdomen, two upper arms and two forearms, and the lower body comprises two thighs and two lower legs;
● emotional state extraction
For each of the selected emotional states, the way a normal human body expresses it is considered and the bodily reaction is analyzed in detail. Because the human body is abstracted as a rigid model, the movement of the body's center of gravity is considered first and is divided into forward, backward and natural states. Besides the movement of the center of gravity, the rotation of the joint points also produces motion changes; the joint points related to emotion include the head, chest, shoulders and elbows, with corresponding motions being the bending of the head, the rotation of the chest, the swing and stretch directions of the upper arms, and the bending of the elbows. Combined with the movement of the center of gravity, these parameters comprise seven degrees of freedom in total and express the motion of a person's upper body.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on facial image expression is an ensemble model built on VGG16 and RESNET50.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: in the emotion recognition subsystem based on voice signals, acoustic parameters such as fundamental frequency, duration, voice quality and clarity serve as emotional-speech feature quantities; establishing an emotional speech database and continually extracting new speech feature quantities is the basic method of speech emotion recognition.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on voice signals is a model that performs emotion recognition on speech signals with a neural-network MLP (multi-layer perceptron). First, the continuous speech signal is segmented into discrete small sound units; these units partially overlap, so that the model can better analyze the current unit while being aware of the preceding and following context units. The model then extracts the speech energy curve information; next, the subsystem extracts fundamental frequency (pitch) curve information, with tonal features characterized and constructed from the fundamental-frequency features, and the fundamental-frequency curve is extracted by an autocorrelation method.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on physiological signals is a non-contact physiological-signal emotion recognition system. The physiological mechanisms of emotion include emotion perception (electroencephalogram) and emotional bodily and physiological reactions (electrocardiogram, heart rate, electromyography, galvanic skin response, respiration, vascular pressure, and so on). Emotion perception is the main mechanism of emotion generation, and the different physiological reactions of the brain are reflected in EEG signals; because of the particularity of these signals, recognition is performed over three kinds of features (time domain, frequency domain and time-frequency domain), with time-frequency mean spectral entropy, fractal dimension and the like used as feature quantities measuring brain activity;
the emotion recognition subsystem based on the physiological signal utilizes the change of light rays when blood flows in a human body in emotion recognition of the physiological signal: when the heart beats, blood can pass through the blood vessel, the more the blood volume passing through the blood vessel is, the more light absorbed by the blood is, the less light is reflected by the surface of human skin, and the heart rate is estimated through time-frequency analysis of the image;
the first step is to carry out spatial filtering on a video sequence to obtain base bands with different spatial frequencies;
secondly, performing band-pass filtering on each baseband in a time domain to extract the interested part of the variation signals;
and thirdly, amplifying and synthesizing, and counting the number of the peak values of the signal change, namely the physiological heart rate of the person is approximated.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the multi-round dialogue semantic understanding subsystem adds an emotion recognition attention mechanism to the traditional seq2seq language generation model for the input utterance of the current round, and adds emotion tracking over the previous rounds of dialogue, on the time axis, in dialogue management. Each current user utterance is fed into a bidirectional LSTM encoder; the current input, discriminated into different emotional states, is then combined with the encoder output of the user utterance just produced and fed into the decoder, so that the decoder has both the user utterance and the current emotion, and the resulting system dialogue response is an output personalized to the current emotional state of the user. An emotion-aware Information State Update (ISU) strategy is used, in which the dialogue state is updated whenever new information arrives; each update of the dialogue state is deterministic, so the same previous system state, the same system action and the same user emotional state necessarily produce the same state at the current moment.
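The following is a minimal, illustrative sketch of conditioning a seq2seq decoder on the recognized emotion state as described above; the vocabulary size, dimensions and the way the emotion vector is concatenated with the encoder summary are assumptions, not the patent's exact model.

```python
# Sketch only: conditions the response decoder on both the encoded user
# utterance and the recognized emotion. Sizes and the concatenation scheme
# are assumptions made for illustration.
import torch
import torch.nn as nn

class EmotionAwareSeq2Seq(nn.Module):
    def __init__(self, vocab_size=5000, emb=128, hidden=256, num_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.emotion_embed = nn.Embedding(num_emotions, hidden)
        # Decoder input per step: previous token embedding + encoder summary + emotion vector.
        self.decoder = nn.LSTM(emb + 2 * hidden + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, user_tokens, emotion_id, response_tokens):
        enc_out, _ = self.encoder(self.embed(user_tokens))          # (B, T, 2H)
        utterance_vec = enc_out.mean(dim=1)                         # simple utterance summary
        emotion_vec = self.emotion_embed(emotion_id)                # (B, H)
        context = torch.cat([utterance_vec, emotion_vec], dim=-1)   # (B, 3H)
        dec_emb = self.embed(response_tokens)                       # (B, T_out, emb)
        context_rep = context.unsqueeze(1).expand(-1, dec_emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([dec_emb, context_rep], dim=-1))
        return self.out(dec_out)                                    # logits over vocabulary

model = EmotionAwareSeq2Seq()
logits = model(torch.randint(0, 5000, (2, 12)),   # user utterance token ids
               torch.tensor([3, 0]),              # recognized emotion per dialogue
               torch.randint(0, 5000, (2, 9)))    # teacher-forced response tokens
```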
Beneficial effects: the invention goes beyond emotion recognition in five individual modalities and innovatively uses a deep neural network to make a comprehensive judgment over the information of several single modalities after neural-network encoding, deep association and understanding, thereby greatly improving accuracy while lowering the requirements on environment and hardware, and finally widening the application range to most common application scenarios, and in particular to special scenarios such as criminal investigation and interrogation.
Drawings
Fig. 1 is a schematic diagram of a multi-modal-based emotion recognition system according to an embodiment of the present invention.
Fig. 2 is a flow chart of a multimodal-based emotion recognition system according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a VGG16 model according to an embodiment of the present invention.
Fig. 4 is a diagram of the core residual architecture in the RESNET50 model according to an embodiment of the present invention.
Fig. 5 is a diagram of an integrated ensemble model architecture according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of segmenting a continuous speech signal into discrete small sound units in the present invention.
Fig. 7 is a schematic diagram of the change of Short Term Energy (STE) in sound waves according to the present invention.
FIG. 8 is a schematic diagram of the fundamental frequency information of a person in the present invention.
Fig. 9 is a diagram of the architecture of the MLP (multi-layer perceptron) neural network deep-learning model used in the present invention.
FIG. 10 is a diagram of textual emotion analysis using a core module based on a deep convolutional neural network as employed in the present invention.
Fig. 11 is a diagram of the application of the convolutional neural network combined with the syntax tree in emotion analysis.
Fig. 12 is a general flowchart of the human body posture detection proposed by the present invention.
FIG. 13 is a diagram of 13 main segments of human body models identified in human body posture detection according to the present invention.
Fig. 14 illustrates the physiological phenomenon on which the present invention relies: the greater the amount of blood in the blood vessels, the more light is absorbed by the blood and the less light is reflected from the skin surface.
FIG. 15 is a diagram showing the process and results of amplifying a cosine wave by a factor of α according to the method of the present invention in the human biometric sensing process.
Fig. 16 is a general flow chart of the present invention in multi-round interactive emotion recognition (a process of a cyclic multi-round interactive understanding).
FIG. 17 is an attention mechanism diagram of the present invention incorporating emotion recognition based on the traditional seq2seq language generation model for the input utterance in the current round.
Fig. 18 is a schematic diagram of the present invention for updating the dialogue state based on the emotional perception of previous rounds in a multi-round dialogue.
Fig. 19 is a body architecture diagram for performing comprehensive judgment of multiple single-mode information by neural network coding, depth association and understanding using a deep neural network according to the present invention.
FIG. 20 is a system diagram of the overall product of the invention.
Detailed Description
The invention is further explained in detail below with reference to the figures and the embodiments.
Any emotion is produced together with certain changes in the body, such as facial expressions, muscle tension and visceral activity. Emotion recognition that directly uses changes in these signals is the so-called basic recognition method, also called single-modality emotion recognition; the main current modalities include facial images, speech, text, posture, physiological signals, and so on. The invention provides a method and a system for more complete and accurate emotion recognition by fusing, corresponding and reasoning over the computer's emotion understanding in each single modality.
The system of the multi-modal-based emotion recognition system proposed in this embodiment is composed of the following parts (fig. 1 is a schematic diagram of the multi-modal-based emotion recognition system according to the embodiment of the present invention):
- Hardware part: the data acquisition equipment includes a camera, a microphone, a heartbeat-detecting wristband, multi-point human-posture sensors, robot sensor acquisition systems, and the like; the output equipment includes a display, speakers, earphones, a printer, a robot interaction system, and the like.
- Software part: the data obtained by the data acquisition equipment are comprehensively analyzed and reasoned over. The system consists of 7 subsystems (the 7 modules shown in Fig. 1): multi-modal emotion recognition based on facial image expression, voice signal, text semantics, human body posture and physiological signal, plus multi-round dialogue semantic understanding and time-sequence-based multi-modal emotion semantic fusion association judgment.
1. Emotion recognition based on the facial expression image.
The facial expression recognition method is based on the fact that people produce specific expression patterns in specific emotional states. Template-based methods and neural-network methods are the most common approaches to expression recognition in static images, but recognition from a single picture does not necessarily achieve a high recognition rate. The invention proposes a brand-new neural network based on dynamic image sequences: the method takes the motion information of expression images into account, and a region-based optical flow model together with a reference optical flow algorithm can effectively obtain motion-field information from complex backgrounds and multi-pose expression sequences.
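By way of illustration only, a dense motion field between two consecutive expression frames can be computed as sketched below with OpenCV's Farneback optical flow; the patent's own region-based optical flow model is not published, so this is a generic stand-in with assumed parameter values.

```python
# Generic stand-in for the optical-flow step: dense Farneback flow between two
# consecutive expression frames. Parameter values are illustrative assumptions.
import cv2
import numpy as np

def expression_motion_field(prev_frame, next_frame):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # flow[y, x] = (dx, dy) displacement of each pixel between the two frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return magnitude, angle   # per-pixel motion strength and direction

# Toy usage with synthetic frames:
frame_a = np.zeros((128, 128, 3), dtype=np.uint8)
frame_b = np.roll(frame_a, 2, axis=1)
mag, ang = expression_motion_field(frame_a, frame_b)
```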
2. Emotion recognition based on speech signals
Speech is an important means by which humans express emotion, and acoustic parameters such as fundamental frequency, duration, voice quality and clarity are the main feature quantities of emotional speech. Establishing an emotional speech database and continually extracting new speech feature quantities is the basic method of speech emotion recognition. Support vector machines and Dempster-Shafer evidence theory can also be used to extract speech emotion features. Speech signals vary markedly between individuals, and the traditional speech-analysis approach requires building a huge speech corpus, which makes recognition difficult. The present invention proposes enhanced emotion recognition of speech signals based on a traditional speech-recognition-style neural network.
3. Text-based emotion recognition
Text emotion analysis can be divided into three levels: word, sentence and document. The word-based method mainly analyzes emotion feature words, judging word polarity or computing word-semantic similarity against a threshold; the sentence-based method attaches an emotion label to each sentence and extracts evaluation words or evaluation phrases for analysis; the document-based method analyzes the overall emotional tendency of a passage on the basis of sentence-level analysis. Text-based emotion recognition depends heavily on the selection of emotion feature words; even when a corpus is built and every word is given an emotion label, many words have multiple senses, which must be considered when the corpus is constructed, and the appearance of new words can also greatly disturb the accuracy of text emotion-tendency recognition. These traditional corpus-based methods are therefore simple and accurate but require a great deal of manpower to construct the corpus in advance and are not suited to cross-domain transfer. With the deep-learning-based method proposed by the invention, one model can automatically learn, in depth, from different data in different domains and scenarios and thereby perform automatic emotion recognition.
4. Emotion recognition based on human body posture
The limb movement characteristics of a human body contain rich emotional information. The emotion recognition based on human body gestures mainly comprises the steps of extracting typical examples of various emotional states of the human body, carrying out discriminant analysis on each gesture to obtain nuances of similar emotions, and establishing a feature library. The emotion recognition based on human motion characteristics mainly takes motion properties such as duration, frequency and the like of human motion as judgment bases, and physical motion information is extracted from the motion properties for recognition. Many gestures or movements do not have obvious emotional characteristics and are often not fully resolved during recognition, thus this approach has great limitations. The invention proposes a deeper level of emotion recognition by fusing human body gestures with other signals.
5. Emotion recognition based on physiological signals
Physiological changes are rarely under a person's subjective control, so emotion recognition based on physiological signals yields more objective results. The physiological mechanisms of emotion include emotion perception (electroencephalogram) and emotional bodily and physiological reactions (electrocardiogram, heart rate, electromyography, galvanic skin response, respiration, vascular pressure, and so on). Emotion perception is the main mechanism of emotion generation; the different physiological reactions of the brain can be reflected in EEG signals, which, because of their particularity, can be recognized through time-domain, frequency-domain and time-frequency-domain features, with time-frequency mean spectral entropy, fractal dimension and the like also usable as feature quantities measuring brain activity. Although physiological signals carry accurate emotion information, their intensity is very weak; collected ECG signals, for example, suffer strong electromyographic interference, so the extraction process is demanding. In practice the sources of interference are so numerous that it is difficult to remove artifacts from physiological signals effectively. The invention provides a method for automatically detecting physiological reactions such as heartbeat and respiration based on changes of blood and skin color in the human face.
Building on the above 5 single-modality emotion recognition subsystems, the invention proposes that the emotion semantics of the single modalities be aligned on the time axis and trained with the time sequence as the common reference, thereby achieving automatic cross-modal association over time and, finally, fused comprehensive emotion recognition, understanding and reasoning. Fig. 2 is a flow chart of the multimodal-based emotion recognition system according to an embodiment of the present invention.
The following is a detailed description of the modules one by one.
1. Emotion recognition based on facial expression images:
the conventional method of recognizing a facial expression image based on computer vision can be roughly classified into the following procedures.
The first image preprocessing is mainly used for eliminating interference factors such as face detection and face graying. The second expression feature extraction is mainly based on the feature extraction of a static image and the image feature extraction of a dynamic sequence, and feature dimension reduction is performed before expression recognition is performed. And finally, the expression recognition is mainly to select a proper classification algorithm to classify the expression characteristics after the dimension reduction.
Conventional classification algorithms include:
● skin color based detection method
Skin-color detection can be based on a Gaussian mixture model or a histogram model; experiments show that the Gaussian mixture model performs better than the histogram model.
● statistical model-based method
Artificial neural networks: and adopting a plurality of neural networks to carry out different-angle face detection.
Based on the probability model: the face is detected by estimating the conditional probabilities of the face image and the non-face image.
A support vector machine: and judging the human face and the non-human face by adopting a hyperplane of a support vector machine.
● detection method based on heuristic model
Deformation model: the deformed template is matched with the head top contour line and the left and right face contour lines.
Mosaic drawing: and dividing the face area into a plurality of mosaic blocks, and verifying by using a group of rules and edge features.
Deep-learning methods using artificial neural networks have improved greatly in recent years, thanks to easier access to large-scale data and to large-scale GPU computation, and have been shown to outperform most of the conventional methods above. This embodiment proposes the following ensemble model based on VGG16 and RESNET50.
First, the VGG16 model architecture of the present embodiment is shown in fig. 3:
next, the core residual architecture in the RESNET50 model of this embodiment is shown in fig. 4:
finally, the comprehensive ensemble model architecture based on the above 2 architectures proposed in this embodiment is shown in fig. 5:
through statistics of results on public experimental data (as shown in the following table), the model provided by the embodiment reaches the current most advanced level, and the operation efficiency is extremely high.
System                                         Accuracy    Precision    Recall
Baseline system based on SVM                   31.8%       43.7%        54.2%
Industry mainstream system based on VGG16      59.2%       70.1%        69.5%
Industry mainstream system based on RESNET50   65.1%       76.5%        74.8%
The algorithm proposed by the invention        67.2%       79.4%        78.2%
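As an illustration of one simple way to combine the two backbones, the sketch below averages the class probabilities of torchvision's VGG16 and ResNet50 adapted to emotion classes; the actual fused architecture of Fig. 5 is not reproduced here, and the averaging scheme and class count are assumptions.

```python
# Sketch only: a probability-averaging ensemble of VGG16 and ResNet50 adapted
# to expression classes. The patent's exact fusion architecture (Fig. 5) is
# not reproduced; torchvision backbones are used as stand-ins.
import torch
import torch.nn as nn
from torchvision import models

class ExpressionEnsemble(nn.Module):
    def __init__(self, num_emotions=7):
        super().__init__()
        self.vgg = models.vgg16()
        self.vgg.classifier[6] = nn.Linear(4096, num_emotions)              # replace final layer
        self.resnet = models.resnet50()
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, num_emotions)

    def forward(self, face_batch):                                          # (B, 3, 224, 224)
        p_vgg = torch.softmax(self.vgg(face_batch), dim=-1)
        p_res = torch.softmax(self.resnet(face_batch), dim=-1)
        return (p_vgg + p_res) / 2                                          # averaged class probabilities

model = ExpressionEnsemble()
probs = model(torch.randn(2, 3, 224, 224))
```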
2. Emotion recognition based on speech signals:
the traditional speech emotion recognition research is developed without leaving the support of emotion speech databases. The high quality of the emotion speech library directly determines the performance of the emotion recognition system obtained by training the emotion speech library. At present, the emotion voice library types existing in the field are various, a unified establishment standard is not provided, and the emotion voice library can be divided into 3 categories of a performance type, a guidance type and a natural type according to the type of the excited emotion; the method can be divided into two categories of identification type and synthesis type according to application targets; it can be divided into English, German and Chinese according to different languages.
In these methods, the acoustic features used for speech emotion recognition can be roughly divided into three types: prosodic features, spectrum-based correlation features, and voice-quality features. They are usually extracted frame by frame but participate in emotion recognition in the form of global feature statistics. The unit of global statistics is usually an acoustically independent sentence or word, and the commonly used statistics include extrema, range of extrema, variance, and so on. The common features are:
● Prosodic features refer to variations of pitch, duration, speed and stress that ride on top of the semantic content; they are a structural arrangement of how speech is delivered. Prosodic features are also called suprasegmental or paralinguistic features; their power to distinguish emotion is widely accepted by researchers in speech emotion recognition and they are very commonly used, the most frequent being duration, fundamental frequency (pitch) and energy.
● Spectrum-based correlation features are regarded as the manifestation of the correlation between changes of vocal-tract shape and the motion of the articulators, and they have been successfully used in speech-signal-processing fields including speech recognition and speaker recognition. By studying the spectral features of emotional speech, Nwe et al. found that the emotional content of speech has a significant effect on the distribution of spectral energy across frequency bands; for example, speech expressing happiness shows high energy in the high-frequency band, while speech expressing sadness shows markedly lower energy in the same band. In recent years more and more researchers have applied spectrum-related features to speech emotion recognition, where they help improve recognition performance; the emotion-discriminating power of spectral features is not negligible. Linear spectral features are among those used in speech emotion recognition tasks.
● Voice-quality features are subjective evaluation measures of speech, used to gauge whether the voice is pure, clear, easy to identify, and so on. Acoustic manifestations affecting voice quality include breathiness, tremor and choking, and they often appear when the speaker is emotionally agitated and finds it hard to suppress the emotion. In speech-emotion-recognition experiments, listeners consistently judged changes of voice quality to be closely related to the expression of emotion in speech. The acoustic features generally used to measure voice quality are formant frequency and bandwidth, frequency and amplitude perturbation (jitter and shimmer), glottal parameters, and the like.
On this basis, the invention provides a model that performs emotion recognition on speech signals with a neural-network MLP (multi-layer perceptron). First, the continuous speech signal is segmented into discrete small sound units (as shown in Fig. 6). These units partially overlap, allowing the model to better analyze the current unit while being aware of the preceding and following context units. The model then extracts the speech energy curve information, since energy plays a very important role both in speech recognition and in emotion recognition; in happy and angry states, for example, speech energy is significantly higher than in sad states. Fig. 7 shows the short-term energy (STE) changes in the sound wave used to capture emotional changes such as joy and anger.
Next, the system extracts fundamental frequency (pitch) curve information. In most languages, tonal features play a very important role in speech recognition, and tonal features can be characterized and constructed from fundamental-frequency features; however, finding a reliable and effective fundamental-frequency extraction method for practical environments is difficult. This embodiment adopts an autocorrelation method to extract the fundamental-frequency curve. Fig. 8 shows the fundamental-frequency information of an angry speaker extracted with the autocorrelation method in this embodiment.
In addition, the proposed system extracts further important information from the speech, such as Mel-frequency cepstral coefficients (MFCC) and formant frequencies. Finally, the system uses an MLP (multi-layer perceptron) neural network for deep learning (the model architecture is shown in Fig. 9: the MLP neural network adopted in this embodiment performs deep learning of voiceprint emotion).
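A minimal numpy sketch of the frame-level processing described above (overlapping segmentation, short-term energy, autocorrelation-based pitch) is given below; the frame sizes and sampling rate are assumptions, and MFCC/formant extraction and the MLP classifier itself are omitted.

```python
# Sketch only: overlapping framing, short-term energy, and autocorrelation
# pitch estimation. Frame length, hop and sampling rate are assumed values.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):            # 25 ms / 10 ms at 16 kHz
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def short_term_energy(frames):
    return (frames ** 2).sum(axis=1)

def pitch_autocorrelation(frame, sr=16000, fmin=60, fmax=400):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)              # plausible pitch-period range
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag                                      # fundamental frequency in Hz

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 150 * t)                     # toy 150 Hz "voiced" signal
frames = frame_signal(speech)
energies = short_term_energy(frames)
print(pitch_autocorrelation(frames[0]))                  # roughly 150 Hz
```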
3. Text-based emotion recognition:
the embodiment provides an emotion recognition method based on deep convolutional neural network CNN improvement. The module performs emotion classification of the text in the problem domain using the lexical semantic vectors generated in the target domain. The core of this module is also a deep convolutional neural network system (as shown in fig. 10).
Its input is a sentence or document represented as a matrix. Each row of the matrix corresponds to a token, typically a word but possibly a character; that is, each row is a vector representing one word. Typically these vectors are word embeddings (high-dimensional vector representations) obtained from the previous module, but they can also be one-hot vectors indexing the word in the vocabulary. For example, a sentence of 10 words represented with 100-dimensional word vectors gives a 10 x 100 matrix as input.
The second layer of the module is a convolutional layer, and this step is a significant improvement of this embodiment. The conventional operation (the first type of convolution window in Fig. 10) is as follows: if the convolution window width is m (the figure uses a window size of 3), take m consecutive words (the example in Fig. 10 is "order Beijing") and concatenate their word vectors into an m x d-dimensional vector x(i:i+m-1), where d is the word-vector dimension. The vector x(i:i+m-1) is then multiplied by a convolution kernel w (also a vector): ci = f(w · x(i:i+m-1) + b). Sliding the window gives c = [c1, c2, ..., c(n-m+1)], and the maximum of c is taken as one value; assuming there are K convolution kernels, a K-dimensional vector is finally obtained. These conventional convolution windows cover only m consecutive words. The max-selection operation is performed so that sentences of different lengths can be handled: no matter how long the sentence is or how wide the convolution kernel is, a fixed-length vector representation is obtained, and taking the maximum also distills the most important feature information, on the assumption that the largest value represents the most salient feature. Extensive experiments have shown that this convolutional network model suits many tasks and is very effective, and unlike traditional methods it needs no elaborate feature engineering or syntactic parse trees; moreover, feeding the model pre-trained word vectors works much better than randomly initialized ones, and pre-trained word vectors are now the standard input in deep learning. Compared with the conventional convolution window, this embodiment proposes also convolving m words that are syntactically continuous. These m words may not be literally adjacent in the text (the example in Fig. 10 is "hotel booking", marked in red), but they form a continuous semantic structure in the syntax. Take the sentence "John hit the ball" shown in Fig. 11: with a convolution window size of 3 there are two complete 3-word windows, "John hit the" and "hit the ball", yet neither embodies the complete core semantics of the sentence. If the words of a "continuous" window are instead determined from the parse tree, the two convolution windows become "John hit ball" and "hit the ball", both of which clearly express more complete and reasonable semantics. These new parse-tree-based convolution windows are combined with the traditional windows, and the maximum is selected over all of them jointly. The feature information obtained in this way makes it easier for the model to grasp the meaning of a piece of text.
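Purely as an illustration of how "syntactically continuous" windows could be gathered from a parse, the snippet below uses spaCy as a stand-in parser (assuming the en_core_web_sm model is installed); the patent does not name a specific parsing tool, and the grouping rule here is a simplification.

```python
# Illustration only: gather 3-word windows that are connected in the
# dependency parse, in addition to ordinary sliding windows. spaCy is a
# stand-in; the patent does not specify a parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_windows(sentence, m=3):
    doc = nlp(sentence)
    windows = []
    for token in doc:
        # A head together with its children forms one connected piece of the tree.
        group = sorted([token] + list(token.children), key=lambda t: t.i)
        if len(group) >= m:
            windows.append([t.text for t in group[:m]])
    return windows

print(syntactic_windows("John hit the ball"))
# -> [['John', 'hit', 'ball']] with a typical parse, matching the example above
```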
The third layer of the module is a pooling layer over time. Words and phrases in text are strongly related through their temporal order, and the main objective of this layer is to find, among the features extracted by the previous convolutional layer, the correlations along the time axis. The main mining process summarizes the corresponding changes along the time dimension within each feature matrix of the previous layer, thereby forming more condensed feature information.
The fourth layer of the module is the final fully connected prediction layer, which actually contains several internal layers. The first fully permutes and combines the condensed feature information from the previous layer and searches over all possible weight combinations to find the patterns of joint action among the features. The next internal layer is a Dropout layer: Dropout means that during model training the weights of some hidden-layer nodes are randomly disabled; the disabled nodes can temporarily be regarded as not being part of the network structure, but their weights are retained (just not updated for the moment), because they may become active again the next time a sample is input. The next internal layer is tanh (hyperbolic tangent), a nonlinear logistic transformation. The last internal layer is softmax, an activation function commonly used in multi-class classification and based on logistic regression; it sharpens the probability of each class to be predicted, making the predicted class stand out.
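A compact, illustrative PyTorch sketch of this four-layer pipeline (embedding, convolution windows, max-over-time pooling, fully connected layer with dropout, tanh and softmax) follows; all dimensions and the set of window widths are assumptions, and the syntax-tree windows discussed above are not included here.

```python
# Sketch only: standard text-CNN layer stack mirroring the four layers
# described above. Vocabulary size, window widths and filter counts are
# illustrative assumptions.
import torch
import torch.nn as nn

class TextEmotionCNN(nn.Module):
    def __init__(self, vocab=20000, emb=100, kernels=(2, 3, 4), filters=64, classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList([nn.Conv1d(emb, filters, k) for k in kernels])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(filters * len(kernels), classes)

    def forward(self, token_ids):                        # (B, T) integer token ids
        x = self.embed(token_ids).transpose(1, 2)        # (B, emb, T) for Conv1d
        # Convolution + max-over-time pooling for each window width.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = self.fc(torch.cat(pooled, dim=1))            # fully connected combination
        z = torch.tanh(self.dropout(z))                  # dropout, then nonlinear tanh
        return torch.softmax(z, dim=-1)                  # probability per emotion class

model = TextEmotionCNN()
probs = model(torch.randint(0, 20000, (4, 32)))          # 4 sentences of 32 tokens
```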
4. Emotion recognition based on human body gestures:
the invention provides a method for extracting emotion based on human posture action and change. The emotion extraction technology based on motion recognition is that according to a data input source, motion data are characterized and modeled firstly, and then emotion modeling is carried out, so that 2 sets of characterization data about motion and emotion are obtained. Then, the continuous motion is accurately recognized by using the existing motion recognition method based on the motion data, and the motion information of the data is obtained. And matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process. The specific flow is shown in fig. 12.
The system mainly comprises the following steps.
Human body modeling
First, the joint points of the human body are modeled; the human body can be regarded as an intrinsically connected rigid system. It contains bones and joint points, and their relative motion produces the changes of human posture that we ordinarily describe as actions. Among the many joints of the human body, the following treatment is applied according to how strongly each joint influences emotion:
1) the fingers and toes are ignored. The hand information only indicates anger when a fist is made, and the ordinary movement data cannot be used for simulating and estimating strength under the condition of no pressure sensor, so that the hand information is considered to be small in quantity, low in importance and required to be properly simplified. For toes, the amount of relevant information is almost zero. Therefore, the present embodiment simplifies the hand and the foot into one point in order to reduce the extraneous interference.
2) The spine of the human body is abstracted into 3 joints of the neck, chest and abdomen. The range of motion available to the spine is large and the composition of bones is complex and cumbersome. These 3 points with distinct position differences were selected on the spine to make a spine simulation.
From the above steps, a manikin can be summarized, wherein the upper body comprises the head, the neck, the chest, the abdomen, 2 big arms and 2 small arms, and the lower body comprises 2 thighs and 2 small legs. This model includes 13 rigid bodies and 9 degrees of freedom, as shown in fig. 13.
Emotional State extraction
For each of the selected emotional states, the way a normal human body expresses it is acted out, and the limb reaction is analyzed in detail.
Because the human body is abstracted as a rigid-body model, the first parameter to consider is the movement of the body's center of gravity. The center of gravity can move in extremely varied ways and can be described in many forms, but the description needed for emotion is more specific and precise than a general description of its movement: the center of gravity is encoded into three cases, namely forward, backward and natural.
In addition to the movement of the center of gravity, the rotation of the joint points that undergo motion changes is considered next. The emotion-related joint points include the head, chest, shoulders and elbows (the lower body expresses emotion only to a very limited extent and is therefore left untreated for now). The corresponding motions are the bending of the head, the rotation of the chest, the swing and stretch directions of the upper arms and the bending of the elbows; combined with the center-of-gravity movement above, these parameters comprise seven degrees of freedom in total and suffice to express the motion of a person's upper body. This parameter set can serve as a simple emotion evaluation criterion. Referring to Ackerman's experiment with a sample of 61 people, each emotion in the emotion set can be represented by these rotation and center-of-gravity parameters. The sign of each value indicates the direction of movement of the part relative to the coordinate system: under the right-hand-rule coordinate system, a positive value indicates that the part moves forward and a negative value indicates that it moves in the negative direction.
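As a rough illustration of how the seven posture parameters could be matched against an emotion template table, the following sketch compares an observed parameter vector with per-emotion templates by Euclidean distance. The template values, the parameter ordering and the emotion set are invented placeholders, not Ackerman's experimental data.

```python
import numpy as np

# order: head bend, chest rotation, L/R upper-arm swing, L/R elbow bend,
#        centre of gravity (-1 backward, 0 natural, +1 forward)
EMOTION_TEMPLATES = {
    "joy":     np.array([ 0.2, 0.1,  0.6,  0.6, 0.3, 0.3,  1.0]),
    "sadness": np.array([-0.5, 0.0, -0.2, -0.2, 0.1, 0.1, -1.0]),
    "anger":   np.array([ 0.1, 0.3,  0.4,  0.4, 0.8, 0.8,  1.0]),
}

def match_emotion(pose_params: np.ndarray) -> str:
    """Return the template emotion closest (Euclidean distance) to the observed pose."""
    return min(EMOTION_TEMPLATES,
               key=lambda e: np.linalg.norm(pose_params - EMOTION_TEMPLATES[e]))

print(match_emotion(np.array([0.15, 0.1, 0.5, 0.5, 0.3, 0.2, 1.0])))  # -> "joy"
```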
5. Emotion recognition based on physiological signals
Emotion recognition from physiological signals exploits the change in light reflection as blood flows through the human body: with every heartbeat blood passes through the vessels, and the larger the volume of blood in the vessels, the more light is absorbed by the blood and the less light is reflected from the skin surface. The heart rate can therefore be estimated by time-frequency analysis of the video images (as shown in fig. 14, which illustrates this phenomenon).
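A minimal sketch of how such a raw pulse trace might be obtained from video is shown below: the green channel is averaged over a fixed face region in every frame. The fixed ROI, the choice of the green channel and the function name are illustrative assumptions rather than the patent's exact implementation.

```python
import cv2
import numpy as np

def raw_pulse_signal(video_path: str, roi=(100, 100, 200, 200)) -> np.ndarray:
    """Average the green channel over a fixed face ROI, one sample per frame."""
    x, y, w, h = roi                              # assumed fixed face region
    cap = cv2.VideoCapture(video_path)
    samples = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        patch = frame[y:y + h, x:x + w]
        samples.append(patch[:, :, 1].mean())     # index 1 = green channel in BGR frames
    cap.release()
    return np.asarray(samples)
```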
The so-called Lagrangian view analyzes the image by tracking the motion trajectories of pixels of interest (particles). In 2005, Liu et al. first proposed a motion magnification technique for images that clusters the feature points of a target, tracks the motion trajectories of these points over time, and finally increases their motion amplitude. The Lagrangian approach, however, has the following disadvantages:
the motion trajectory of each particle must be accurately tracked and estimated, which consumes considerable computing resources;
particles are tracked independently without considering the image as a whole, so the image easily becomes inconsistent, degrading the magnified result;
the target's motion is magnified by modifying particle trajectories, and because the particles move, their original positions must be filled in with background, which further increases the complexity of the algorithm.
Unlike the Lagrangian view, the Eulerian view does not explicitly track or estimate particle motion; instead it fixes the viewpoint in one place, e.g., over the entire image. The whole image is assumed to be changing, but the change signals differ in frequency, amplitude and other characteristics, and the change signal of interest in this embodiment lies among them. Magnifying the "variation" therefore becomes a matter of isolating and enhancing the frequency band of interest. The technical details are set out below.
1) Spatial filtering
The first step of the Eulerian video magnification technique (hereinafter EVM) employed in this embodiment is to spatially filter the video sequence to obtain basebands of different spatial frequencies. This is done because:
It contributes to noise reduction. Images exhibit different signal-to-noise ratios (SNRs) at different spatial frequencies; in general, the lower the spatial frequency, the higher the SNR. To prevent distortion, the basebands should therefore use different magnification factors: the top layer, i.e., the image with the lowest spatial frequency and highest SNR, can use the largest magnification factor, and the factors decrease level by level;
It facilitates approximation of the image signal. Images with high spatial frequencies (such as the original video frames) can be difficult to approximate with a Taylor series expansion; the approximation breaks down and direct magnification is visibly distorted. For this case the present embodiment reduces distortion by introducing a lower limit on the spatial wavelength: if the spatial wavelength of the current baseband falls below this limit, the magnification factor is reduced.
Since the purpose of spatial filtering is simply to pool a number of adjacent pixels, it can be implemented with a low-pass filter, and downsampling can be performed at the same time to speed up the computation. Readers familiar with image processing will recognize that the combination of these two operations is an image pyramid: in practice, linear EVM performs a multi-resolution decomposition using a Laplacian or Gaussian pyramid.
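A minimal sketch of this spatial-filtering step, assuming OpenCV's pyrDown is used to build a Gaussian pyramid (the number of levels is an arbitrary illustrative choice):

```python
import cv2

def gaussian_pyramid(frame, levels=4):
    """Return [frame, down1, down2, ...] from highest to lowest spatial frequency."""
    pyramid = [frame]
    for _ in range(levels - 1):
        frame = cv2.pyrDown(frame)   # Gaussian blur + 2x downsample in one call
        pyramid.append(frame)
    return pyramid
```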
2) Time domain filtering
After the basebands of different spatial frequencies are obtained, each baseband is band-pass filtered in the time domain to extract the part of the varying signal that is of interest. For example, to amplify a heart-rate signal, a 0.4-4 Hz (24-240 bpm) passband may be chosen, covering the range of human heart rates. There are many kinds of band-pass filters; the ideal band-pass filter, the Butterworth band-pass filter and the Gaussian band-pass filter are common. Which one to choose depends on the purpose of the magnification. If the magnified result will undergo subsequent time-frequency analysis (for example, extracting a heart rate or analyzing a vibration frequency), a narrow-passband filter such as the ideal band-pass filter should be selected, because it cuts out exactly the frequency band of interest and avoids amplifying other bands. If no time-frequency analysis is needed, a wide-passband filter such as a Butterworth band-pass filter or a second-order IIR filter can be selected, which better suppresses ringing artifacts.
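The two filter families mentioned above could be sketched as follows for a 0.4-4 Hz heart-rate band; the sampling rate fs is the video frame rate, and the filter order and implementation details are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butterworth_bandpass(signal, fs, low=0.4, high=4.0, order=2):
    """Wide-passband option: Butterworth filter applied along the time axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal, axis=0)

def ideal_bandpass(signal, fs, low=0.4, high=4.0):
    """Narrow-passband option: zero all FFT bins outside the band of interest."""
    spectrum = np.fft.rfft(signal, axis=0)
    freqs = np.fft.rfftfreq(signal.shape[0], d=1.0 / fs)
    spectrum[(freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spectrum, n=signal.shape[0], axis=0)
```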
3) Amplification and Synthesis
The first two steps isolate the part that is "changing", i.e., they answer the question of what the "change" is. The next question is how to magnify it. An important basis is that the result of the band-pass filtering step is an approximation of the change of interest.
Fig. 15 demonstrates the process and result of magnifying a cosine wave by a factor α with the above method. The black curve is the original signal f(x); the blue curve is the shifted signal f(x + δ(t)); the cyan curve is the Taylor-series approximation f(x) + δ(t)·∂f(x)/∂x; and the green curve is the separated change component B(x, t). This component is amplified by α and added back to the original signal, and the red curve in fig. 15 is the magnified signal f(x) + (1 + α)·B(x, t).
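The first-order Taylor argument can be checked numerically on a synthetic cosine; the shift δ and gain α below are arbitrary illustrative values, not parameters from the patent.

```python
import numpy as np

x = np.linspace(0, 4 * np.pi, 1000)
delta, alpha = 0.2, 2.0
f = np.cos(x)                              # original signal f(x)
f_shifted = np.cos(x + delta)              # changed signal f(x + delta)
taylor = f + delta * np.gradient(f, x)     # f(x) + delta * df/dx
change = taylor - f                        # separated change component B(x, t)
amplified = f + (1 + alpha) * change       # magnified signal f(x) + (1 + alpha) * B(x, t)

# Small residual -> the first-order approximation of the shift is good for small delta.
print(float(np.max(np.abs(f_shifted - taylor))))
```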
Finally, deep learning is used to optimize the spatio-temporal filtering. Under the assumption that the frequency of the signal change caused by the heartbeat approximates the heart rate, the RGB information is converted into the YIQ (NTSC) color space, both color spaces are processed, and the signal is extracted with a suitable band-pass filter. Counting the number of peaks in the signal change then approximates the person's physiological heart rate.
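Putting the pieces together, a hedged end-to-end sketch of the heart-rate estimation might look like the following; the Butterworth filter, the peak-spacing constraint and the magnification factor are assumptions standing in for the deep-learning-optimized filtering described above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def estimate_heart_rate(raw_signal, fs, low=0.4, high=4.0, alpha=50.0):
    """Return (bpm, amplified_signal) from a raw per-frame brightness trace."""
    raw_signal = np.asarray(raw_signal, dtype=float)
    b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
    variation = filtfilt(b, a, raw_signal)                     # B(x, t): the change of interest
    amplified = raw_signal + alpha * variation                 # ~ f(x) + (1 + alpha) * B(x, t)
    peaks, _ = find_peaks(variation, distance=int(fs * 0.25))  # peaks >= 0.25 s apart (<= 240 bpm)
    bpm = len(peaks) / (len(raw_signal) / fs / 60.0)
    return bpm, amplified

# Example usage with the hypothetical ROI extractor sketched earlier:
# bpm, _ = estimate_heart_rate(raw_pulse_signal("subject.mp4"), fs=30)
```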
6. Semantic and emotional understanding based on multiple rounds of conversation
Traditional semantic understanding mostly ignores the interactive environment, or at best considers a single round of question and answer. Mainstream research in sentiment analysis still relies on traditional machine learning algorithms such as SVM, information entropy and CRF. Machine-learning-based sentiment analysis has the advantage of being able to model a variety of features, but manually annotated words are used as features, and the scarcity of such corpora is often the performance bottleneck.
Once "interaction" is involved, sentiment and emotion analysis becomes much more difficult. First, interaction is a continuous process rather than a short, fixed snapshot, which fundamentally changes how emotion judgments are evaluated. Without interaction, as with product reviews, judging the sentiment polarity is already valuable and is a well-defined classification task. In a conversation, however, the emotional state changes continuously, so analyzing any single sentence in isolation is not meaningful, and the problem is no longer a simple classification task. For a continuous process a simple solution is to add a gain-and-decay function, but such a function is very hard to get right, has little theoretical grounding, and is difficult even to evaluate. Second, interaction hides most of the state information. What is openly visible is less than 5%, just the tip of the iceberg (much as in a hidden Markov model), and each party assumes by default that the other already knows a great deal: the relationship between the communicating parties, their goals and needs, emotional states, social relations, the environment, what was discussed before, common sense, personality, values, and so on. One then observes that the more information two people share, the harder the analysis becomes, because the hidden state plays a larger role and has higher dimensionality. Different pairs of people communicate in different patterns, and the pattern varies with other environmental information (time, place, relationship status, each other's mood, shared experience, personal chat habits, etc.). Even for the same pair of people the communication pattern changes dynamically; for example, two people in love communicate differently as the relationship warms or cools. Third, interaction involves jumps of information. A person speaking alone is usually logical and coherent, but chatting is quite another matter and can be highly jumpy. These unpredictable leaps of information increase the difficulty of sentiment analysis exponentially.
These three aspects are the main reasons why emotion analysis becomes so much harder once interaction is involved. The first changes the evaluation methodology, which becomes complicated and lacks a reference standard. The second and third mean that the data are far too sparse for machine learning (the observable state consists only of text, expressions and the like, while most of the state is hidden), and with the added jumps, achieving high accuracy by purely statistical means is conceivably difficult.
The invention therefore makes a key improvement to dialog management, strengthening language understanding and the attention mechanism over emotion words, so that the basic semantics and the emotions in multi-turn dialogs can be captured effectively. The overall flow (shown in fig. 16) is a cyclic process of multi-turn interactive understanding.
The innovations of this embodiment lie mainly in two aspects: first, an emotion-recognition attention mechanism is added over the input utterances of the current turn on top of the traditional seq2seq language generation model; second, emotion tracking over the previous dialog turns, ordered in time, is added to dialog management.
The architecture of the first module is shown in fig. 17: an emotion-recognition attention mechanism is added over the current turn's input utterance on top of the traditional seq2seq language generation model.
In this architecture, each utterance spoken by the current user is fed into a bidirectional LSTM encoder which, unlike a conventional language generation model, applies attention to the emotion expressed in the current sentence. The recognized current emotional state is then combined with the encoder output of the user utterance and fed into the decoder, so that the decoder sees both the user's words and the current emotion; the system's dialog response is therefore personalized and specific to the user's current emotional state.
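A schematic PyTorch sketch of this conditioning is given below: a bidirectional LSTM encoder, a simple attention layer over the encoder states, and the recognized emotion embedded and concatenated into every decoder step. All dimensions, layer choices and names are assumptions used for illustration, not the patent's actual network.

```python
import torch
import torch.nn as nn

class EmotionAwareSeq2Seq(nn.Module):
    def __init__(self, vocab, emb=128, hid=256, n_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid, 1)               # attention scores over encoder steps
        self.emotion_embed = nn.Embedding(n_emotions, hid)
        self.decoder = nn.LSTMCell(emb + 2 * hid + hid, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, src, tgt, emotion_id):
        enc, _ = self.encoder(self.embed(src))          # (B, T_src, 2*hid)
        weights = torch.softmax(self.attn(enc), dim=1)  # emotion-word attention weights
        context = (weights * enc).sum(dim=1)            # (B, 2*hid) attended encoder summary
        emo = self.emotion_embed(emotion_id)            # (B, hid) current emotional state
        h = c = enc.new_zeros(src.size(0), self.out.in_features)
        logits = []
        for t in range(tgt.size(1)):                    # teacher-forced decoding
            step_in = torch.cat([self.embed(tgt[:, t]), context, emo], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)               # (B, T_tgt, vocab)
```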
The second innovation of the invention targets emotion recognition in multi-turn dialog and is a simple dialog state update method: the Emotion-Aware Information State Update (ISU) policy. This policy updates the dialog state whenever new information appears; specifically, whenever the user, the system or any other participant in the conversation generates new information, the dialog state is updated, and the update takes the emotions perceived in the previous turns into account. See fig. 18 for details.
FIG. 18 shows that the dialog state s_{t+1} at time t+1 depends on the state s_t and the system behavior a_t at the previous time t, together with the user behavior and emotion o_{t+1} observed at the current time t+1. This can be written as:

s_{t+1} ← s_t + a_t + o_{t+1}
When the dialog state is updated, each update is assumed to be deterministic: the same previous system state, the same system behavior and the same current user emotional state always produce the same current system state.
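A minimal sketch of such a deterministic emotion-aware state update is shown below; the state fields are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DialogState:
    turn: int
    last_system_act: str
    user_emotion: str            # emotion recognized at the current turn
    history: Tuple               # (utterance, emotion) pairs from earlier turns

def update_state(s_t: DialogState, a_t: str, o_t1: Tuple[str, str]) -> DialogState:
    """s_{t+1} <- f(s_t, a_t, o_{t+1}); identical inputs always yield the same output."""
    utterance, emotion = o_t1
    return DialogState(
        turn=s_t.turn + 1,
        last_system_act=a_t,
        user_emotion=emotion,
        history=s_t.history + ((utterance, emotion),),
    )
```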
7. Multimodal emotion semantic fusion based on time sequence:
In recent years, with the development of multi-source heterogeneous information fusion, features drawn from multiple classes of reference emotional states can be fused. By letting different types of signals support one another and fusing their complementary information, the quality of information processing is not merely averaged across the data sources but exceeds that of any single member and can be improved substantially. The concept of multimodal emotion analysis has appeared at recent international conferences on affective computing and intelligent interaction, and researchers have begun to study emotion recognition by exploiting the complementarity between emotional information from multiple channels, such as facial expressions, speech, eye movements, gestures and physiological signals, i.e., multimodal emotion recognition. Compared with recognition from a single signal, multimodal information fusion undoubtedly improves recognition accuracy. To improve both the recognition rate and the robustness of emotion recognition, different data sources must be selected for different application environments, and effective theories and methods must be applied to each data source to develop efficient and stable recognition algorithms; these remain hot topics for future research in the field.
At present, a few systems have begun to integrate one or two single modalities for emotion detection, for example in the following categories:
● emotion recognition based on visual and auditory perception
The most common multimodal recognition methods combine vision and hearing: both kinds of features are easy to acquire, and speech emotion recognition and facial expression recognition complement each other in recognition performance. Cross-cultural multimodal perception research supported in Japan has examined the relationship between facial expressions and emotional vocalizations when emotions are expressed. One such system adaptively adjusts the weights of the speech and facial-action feature parameters in bimodal emotion recognition and achieves an emotion recognition rate above 84%. With vision and hearing as input states and asynchronous constraints applied at the state layer, the fusion method improves the recognition rate by 12.5% and 11.6%, respectively.
● emotion recognition based on multiple physiological signals
Fusion of multiple physiological signals also has many applications. In 2004, Lee et al. used multiple physiological signals, including heart rate, skin temperature change and electrodermal activity, to monitor people's stress status. Other work mainly extracts useful features from ECG and heart-rate signals for emotion category recognition. Wu Fuqiu et al. extracted and classified features from three physiological signals: ECG, respiration and body temperature. Canento et al. combined various emotion-related physiological characteristics such as ECG, blood volume pulse, electrodermal activity and respiration for emotion recognition. Wagner et al. obtained a 92% fusion recognition rate by fusing physiological parameters from four channels: electromyography, electrocardiography, skin resistance and respiration. In the literature, recognition accuracy has been raised from 30% to 97.5% through multi-physiological-signal fusion.
● emotion recognition based on voice electrocardio combination
For the combination of speech and ECG, the literature fuses the speech signal and the ECG signal using weighted fusion and feature-space transformation. The single-modality emotion classifiers based on ECG and on speech achieve average recognition rates of 71% and 80% respectively, while the multimodal classifier reaches over 90%.
This embodiment goes beyond emotion recognition in five separate single modalities: it innovatively uses a deep neural network to jointly judge the information of the single modalities after neural-network encoding and deep association and understanding. This greatly improves accuracy, lowers the requirements on environment and hardware, and ultimately widens the range of application to most common scenarios, in particular special scenarios such as criminal investigation and interrogation.
The main architecture of the model is shown in fig. 19: the deep neural network jointly judges the information of the several single modalities after neural-network encoding and deep association and understanding.
The overall architecture treats emotion recognition as a judgment about the current time point made from all the related expressions, actions, text, speech and physiology before and after it on a continuous time axis. The method is therefore built on the classical seq2seq neural network. Seq2Seq was proposed in 2014, and its main ideas were first described independently in two papers: "Sequence to Sequence Learning with Neural Networks" from the Google Brain team and "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" from Yoshua Bengio's team. The two papers propose similar solutions to the machine translation problem, from which Seq2Seq emerged. The core idea of Seq2Seq is to map an input sequence to an output sequence through a deep neural network model (usually an LSTM, a long short-term memory recurrent network), in a process consisting of two stages: encoding the input and decoding the output. When the basic seq2seq model is applied to emotion recognition on a continuous time axis, it needs specific innovative changes to solve the problem well. Beyond what an ordinary seq2seq model must handle, emotion recognition must attend to several key features: 1. the relationships between different time points within each single modality; 2. the intrinsic effects and relationships among the modalities at the same time point; 3. the integrated emotion recognition across all modalities. None of these is addressed by the prior art.
Specifically, the model first contains five recurrent neural networks (RNNs), implemented in the practical system as long short-term memory (LSTM) networks. Each RNN organizes, in time order, the intermediate neural-network representations of one single-modality emotion understanding subsystem: the neural network unit at each time point (one blue bar in fig. 19) comes from the output, at the corresponding time point, of the intermediate layer of that single-modality subsystem's neural network. The per-time-point output of each RNN (one blue bar in fig. 19) is fed to the multimodal fusion association judgment RNN, so each time point of the multimodal RNN aggregates the outputs of all single-modality RNNs at that time point. After the multimodal synthesis, the output at each time point is the final emotion judgment for that time point (orange arrow in fig. 19).
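The following PyTorch sketch mirrors this description at a schematic level: one LSTM per modality consumes that modality's per-time-step intermediate features, and a fusion LSTM consumes the concatenated per-step outputs to emit an emotion prediction at every time step. Feature dimensions, hidden sizes and the emotion count are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class MultimodalFusionRNN(nn.Module):
    def __init__(self, feat_dims, hid=128, n_emotions=7):
        super().__init__()
        # one LSTM per modality: e.g. face, speech, text, posture, physiology
        self.modality_rnns = nn.ModuleList(
            [nn.LSTM(d, hid, batch_first=True) for d in feat_dims]
        )
        self.fusion_rnn = nn.LSTM(hid * len(feat_dims), hid, batch_first=True)
        self.classifier = nn.Linear(hid, n_emotions)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each (B, T, feat_dims[i]), time-aligned
        per_modality = [rnn(x)[0] for rnn, x in zip(self.modality_rnns, modality_feats)]
        fused_in = torch.cat(per_modality, dim=-1)   # (B, T, hid * n_modalities)
        fused, _ = self.fusion_rnn(fused_in)
        return self.classifier(fused)                # per-time-step emotion logits

# Example: model = MultimodalFusionRNN([512, 128, 300, 7, 16])
```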
The application scenario of the invention's software and hardware design is to provide professional analysts in the field of psychological counseling with a software tool for analyzing and studying a person's expressions and psychological-emotional changes. The whole system comprises four parts: micro-expression analysis software, dedicated analysis equipment, a high-definition camera and a printer.
Fig. 20 is a system architecture diagram of the overall product of the invention.
The face of the person being analyzed is recorded in real time by the high-definition camera, which provides a video stream accessible over the network. The dedicated analysis equipment hosts the product of the invention; the software interface is opened simply by double-clicking the software's shortcut icon, and while the program is running, the video address and the expression alarm threshold can be configured and managed as needed. The invention records, analyzes and evaluates the facial expression and heart-rate data of the person during the psychological counseling session and provides a data analysis report when the session ends. The operator can print the analysis result as a document with the printer for easy archiving.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system comprises data acquisition equipment and output equipment, and is characterized in that: the emotion analysis software system comprehensively analyzes and infers the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a semantic fusion association judgment subsystem based on time sequence;
each RNN recurrent neural network organizes the representation form of the middle neural network understood by each single-mode emotion according to a time sequence, wherein one neural network unit at each time point is output from the corresponding time point of the middle layer of the neural network of the single-mode subsystem; the output of the neural network passing through the single time point of each RNN recurrent neural network is transmitted to a multi-mode fusion association judgment RNN recurrent neural network, the neural network output of the current time point of each single-mode RNN recurrent neural network is collected at each time point of the multi-mode RNN recurrent neural network, and after the multi-modes are integrated, the output of each time point is the final emotion judgment result of the time point;
and training the emotion semantics under the single mode after aligning the time sequence by taking the time sequence as a reference, thereby realizing cross-modal automatic association correspondence on the time sequence and finally fused comprehensive emotion recognition, understanding and reasoning judgment.
2. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system as recited in claim 1, wherein: the emotion analysis software system further comprises an emotion recognition subsystem based on facial image expressions, an emotion recognition subsystem based on voice signals, an emotion analysis subsystem based on text semantics, an emotion recognition subsystem based on human body gestures, an emotion recognition subsystem based on physiological signals and a multi-round conversation semantic understanding subsystem.
3. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the emotion recognition subsystem based on facial image expression exploits the specific expression patterns generated under specific emotional states and, based on the motion information of a dynamic image sequence and expression images, a regional optical flow model and a reference optical flow algorithm, effectively obtains motion field information from complex backgrounds and multi-pose expression sequences;
the emotion analysis subsystem based on the text semantics has the advantages that the text emotion analysis can be divided into three levels of words, sentences and chapters, the word-based method is to analyze emotion characteristic words, and the polarity of the words is judged or the similarity of the word semantics is calculated according to a threshold value; the sentence-based method comprises the steps of sampling emotion labels for each sentence, extracting evaluation words or acquiring evaluation phrases for analysis; the method based on the chapters is to analyze the overall emotional tendency of the chapters on the basis of sentence emotional tendency analysis;
the emotion recognition subsystem based on human body postures extracts typical examples of various emotion states of a human body, discriminates and analyzes subtle differences of similar emotions for each posture, establishes a feature library, and extracts physical motion information from the subtle differences to recognize according to the duration and frequency motion properties of human body actions as judgment bases;
the emotion analysis subsystem based on text semantics is an emotion recognition method improved based on a deep Convolutional Neural Network (CNN), the subsystem utilizes vocabulary semantic vectors generated in a target field to carry out emotion classification on texts in a problem field, the input of the emotion analysis subsystem is sentences or documents represented by a matrix, each line of the matrix corresponds to a word segmentation element, each line is a vector representing a word, and the vectors are all in the form of word entries and are obtained from the last module or are indexed according to the word in a word list;
the second layer of the subsystem is a convolutional neural network layer;
the third layer of the subsystem is a time-based convergence layer, the incidence relation of the characteristic information extracted from the previous convolutional layer on a time axis is found out, and the corresponding change on the time dimension in each characteristic matrix in the previous layer is summarized and induced, so that more concentrated characteristic information is formed;
the fourth layer of the subsystem is the final fully connected prediction layer, in which the condensed feature information obtained from the previous layer is first fully permuted and combined, searching all possible weight combinations so as to find the patterns of joint action among the condensed features; the next internal layer is a Dropout layer, meaning that during model training the weights of some hidden-layer nodes are randomly deactivated; the deactivated nodes are temporarily treated as not belonging to the network structure, but their weights are retained, since they may become active again when the next sample is input; the following internal layer is tanh, a nonlinear logistic transformation; and the last internal layer is softmax, an activation function commonly used in multi-class classification and based on logistic regression, which sharpens the probability of each candidate class so that the predicted class stands out;
the emotion recognition subsystem based on human body gestures is characterized in that emotion extraction based on motion recognition is performed according to a data input source, firstly, motion data are represented and modeled, and then, emotion modeling is performed to obtain two sets of representation data related to motions and emotions; then, the continuous action is accurately identified by using the existing action identification method based on the motion data to obtain the action information of the data; matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process; the method specifically comprises the following steps:
human body modeling
Firstly, modeling joint points of a human body, regarding the human body as a rigid system with intrinsic relation, and comprising bones and the joint points, wherein the relative motion of the bones and the joint points forms the change of the posture of the human body, namely describing actions at ordinary times, in a plurality of joint points of the human body, according to the lightness and the heaviness of the influence on the emotion, fingers and toes are ignored, the spine of the human body is abstracted into three joints of a neck, a chest and an abdomen, and a human body model is summarized, wherein the upper half body comprises a head, a neck, a chest, an abdomen, two big arms and two small arms, and the lower half body comprises two thighs and two crus;
emotional state extraction
For the selected multiple emotional states, the expression of each emotional state is carried out under the normal condition of the human body, and the body reaction is analyzed in detail; because the human body is abstracted into a rigid model, the gravity center of the human body moves firstly and is divided into a forward state, a backward state and a natural state; in addition to the movement of the center of gravity, followed by the rotation of the joint points, the human body undergoes motion changes, and the joint points related to emotion include the head, the chest, the shoulders and the elbows, and the corresponding motions are the bending of the head, the rotation of the chest, the swinging and stretching directions of the upper arm, and the bending of the elbows, which parameters, in combination with the movement of the center of gravity, include seven degrees of freedom in total, expressing the motion of the upper half of a person.
4. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system as recited in claim 3, wherein: the emotion recognition subsystem based on facial image expression is an ensemble model based on VGG16 and RESNET 50.
5. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: in the emotion recognition subsystem based on the voice signals, the acoustic parameters of fundamental frequency, duration, tone quality and definition are emotion voice characteristic quantities, an emotion voice database is established, and new voice characteristic quantities are continuously extracted to be a basic method for voice emotion recognition.
6. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 5, wherein: the emotion recognition subsystem based on the voice signals is a model for performing emotion recognition on the voice signals based on a neural network MLP, firstly, continuous voice signals are segmented to obtain discrete sound tiny units, and the tiny units are partially overlapped, so that the model can better analyze the current unit and know the previous and next context voice units; then extracting voice energy curve information by the model; and then, extracting fundamental frequency curve information by the subsystem, describing and constructing the tone characteristic by the fundamental frequency characteristic, and extracting a fundamental frequency curve by adopting an autocorrelation method.
7. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the emotion recognition subsystem based on the physiological signals is a non-contact type physiological signal emotion recognition system, the physiological mechanism of emotion comprises emotion perception and physical physiological reaction of emotion, the emotion perception is a main emotion generation mechanism, different physiological reactions of the brain are reflected through electroencephalogram signals, due to the particularity of the signals, recognition is carried out through three characteristics of time domain, frequency domain and time-frequency domain, and time-frequency average spectrum entropy and fractal dimension are used as characteristic quantities for measuring brain activities;
the emotion recognition subsystem based on the physiological signal utilizes the change of light rays when blood flows in a human body in emotion recognition of the physiological signal: when the heart beats, blood can pass through the blood vessel, the more the blood volume passing through the blood vessel is, the more light absorbed by the blood is, the less light is reflected by the surface of human skin, and the heart rate is estimated through time-frequency analysis of the image;
the first step is to carry out spatial filtering on a video sequence to obtain base bands with different spatial frequencies;
secondly, performing band-pass filtering on each baseband in a time domain to extract the interested part of the variation signals;
and thirdly, amplifying and synthesizing, and counting the number of the peak values of the signal change, namely the physiological heart rate of the person is approximated.
8. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the multi-round conversation semantic understanding subsystem adds an emotion-recognition attention mechanism, on top of the traditional seq2seq language generation model, over the input utterance of the current turn, and adds emotion tracking over the previous dialog turns in time sequence to dialog management; each utterance spoken by the current user is input into a bidirectional LSTM encoder, the inputs discriminating the current emotional state are then combined with the encoder output of the user utterance just generated and input together into a decoder, so that the decoder has both the user's utterance and the current emotion, and the generated system dialog response is an output personalized and specific to the current user's emotional state; the emotion-aware information state update strategy updates the conversation state whenever there is new information; when the conversation state is updated, each update is deterministic, so that the same previous-moment system state, the same system behavior and the same current-moment user emotional state necessarily produce the same current-moment system state.
CN201810612592.0A 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system Active CN108805087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810612592.0A CN108805087B (en) 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system

Publications (2)

Publication Number Publication Date
CN108805087A CN108805087A (en) 2018-11-13
CN108805087B true CN108805087B (en) 2021-06-15

Family

ID=64085933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810612592.0A Active CN108805087B (en) 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system

Country Status (1)

Country Link
CN (1) CN108805087B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109598217A (en) * 2018-11-23 2019-04-09 南京亨视通信息技术有限公司 A kind of system that the micro- Expression analysis of human body face is studied and judged
CN109558935A (en) * 2018-11-28 2019-04-02 黄欢 Emotion recognition and exchange method and system based on deep learning
CN109620205B (en) * 2018-12-26 2022-10-28 上海联影智能医疗科技有限公司 Electrocardiogram data classification method and device, computer equipment and storage medium
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN109903837A (en) * 2019-03-05 2019-06-18 浙江强脑科技有限公司 Psychological detection method, device and computer readable storage medium
CN110008676B (en) * 2019-04-02 2022-09-16 合肥智查数据科技有限公司 System and method for multi-dimensional identity checking and real identity discrimination of personnel
CN109960723B (en) * 2019-04-12 2021-11-16 浙江连信科技有限公司 Interaction system and method for psychological robot
CN110147729A (en) * 2019-04-16 2019-08-20 深圳壹账通智能科技有限公司 User emotion recognition methods, device, computer equipment and storage medium
CN110321781B (en) * 2019-05-06 2021-10-26 苏宁金融服务(上海)有限公司 Signal processing method and device for non-contact measurement
CN110188669B (en) * 2019-05-29 2021-01-19 华南理工大学 Air handwritten character track recovery method based on attention mechanism
CN110232412B (en) * 2019-05-30 2020-12-11 清华大学 Human gait prediction method based on multi-mode deep learning
CN110267052B (en) * 2019-06-19 2021-04-16 云南大学 Intelligent barrage robot based on real-time emotion feedback
CN110334626B (en) * 2019-06-26 2022-03-04 北京科技大学 Online learning system based on emotional state
CN110458201B (en) * 2019-07-17 2021-08-24 北京科技大学 Object-oriented classification method and classification device for remote sensing image
CN110569869A (en) * 2019-07-23 2019-12-13 浙江工业大学 feature level fusion method for multi-modal emotion detection
CN110755092B (en) * 2019-09-02 2022-04-12 中国航天员科研训练中心 Non-contact emotion monitoring method with cross-media information fusion function
CN110598608B (en) * 2019-09-02 2022-01-14 中国航天员科研训练中心 Non-contact and contact cooperative psychological and physiological state intelligent monitoring system
CN110704588B (en) * 2019-09-04 2023-05-30 平安科技(深圳)有限公司 Multi-round dialogue semantic analysis method and system based on long-short-term memory network
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111222009B (en) * 2019-10-25 2022-03-22 汕头大学 Processing method of multi-modal personalized emotion based on long-time memory mechanism
US11915123B2 (en) 2019-11-14 2024-02-27 International Business Machines Corporation Fusing multimodal data using recurrent neural networks
CN110801227B (en) * 2019-12-09 2021-07-20 中国科学院计算技术研究所 Method and system for testing three-dimensional color block obstacle based on wearable equipment
CN111160163B (en) * 2019-12-18 2022-04-01 浙江大学 Expression recognition method based on regional relation modeling and information fusion modeling
CN111325292B (en) * 2020-03-11 2023-05-02 中国电子工程设计院有限公司 Object behavior recognition method and device
CN111401268B (en) * 2020-03-19 2022-11-15 内蒙古工业大学 Multi-mode emotion recognition method and device for open environment
CN111444863B (en) * 2020-03-30 2023-05-23 华南理工大学 Driver emotion recognition method based on camera and adopting 5G vehicle-mounted network cloud assistance
CN111462752B (en) * 2020-04-01 2023-10-13 北京思特奇信息技术股份有限公司 Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN111680550A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Emotion information identification method and device, storage medium and computer equipment
CN111584073B (en) * 2020-05-13 2023-05-09 山东大学 Method for constructing diagnosis models of benign and malignant lung nodules in various pathological types
CN111814609B (en) * 2020-06-24 2023-09-29 厦门大学 Micro-expression recognition method based on deep forest and convolutional neural network
CN112001444A (en) * 2020-08-25 2020-11-27 斑马网络技术有限公司 Multi-scene fusion method for vehicle
CN112466336B (en) * 2020-11-19 2023-05-05 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium based on voice
CN112287893B (en) * 2020-11-25 2023-07-18 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112579745B (en) * 2021-02-22 2021-06-08 中国科学院自动化研究所 Dialogue emotion error correction system based on graph neural network
CN112579762B (en) * 2021-02-24 2021-06-08 之江实验室 Dialogue emotion analysis method based on semantics, emotion inertia and emotion commonality
CN113095428B (en) * 2021-04-23 2023-09-19 西安交通大学 Video emotion classification method and system integrating electroencephalogram and stimulus information
CN113139525B (en) * 2021-05-21 2022-03-01 国家康复辅具研究中心 Multi-source information fusion-based emotion recognition method and man-machine interaction system
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113643724B (en) * 2021-07-06 2023-04-28 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
CN113672731B (en) * 2021-08-02 2024-02-23 北京中科闻歌科技股份有限公司 Emotion analysis method, device, equipment and storage medium based on field information
CN113707275B (en) * 2021-08-27 2023-06-23 郑州铁路职业技术学院 Mental health estimation method and system based on big data analysis
CN114287938B (en) * 2021-12-13 2024-02-13 重庆大学 Method and equipment for obtaining safety interval of human body parameters in building environment
CN114926837B (en) * 2022-05-26 2023-08-04 东南大学 Emotion recognition method based on human-object space-time interaction behavior
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method
CN117473304A (en) * 2023-12-28 2024-01-30 天津大学 Multi-mode image labeling method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170178346A1 (en) * 2015-12-16 2017-06-22 High School Cube, Llc Neural network architecture for analyzing video data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107456218A (en) * 2017-09-05 2017-12-12 清华大学深圳研究生院 A kind of mood sensing system and wearable device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on Multimodal Physiological Signal Fusion and Emotion Recognition Based on SAE and LSTM RNN"; Li Youjun; Huang Jiajin; Wang Haiyuan; Zhong Ning; Journal on Communications; 2017-12-25; Vol. 38, No. 12; pp. 109-120 *

Also Published As

Publication number Publication date
CN108805087A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
Abdullah et al. Multimodal emotion recognition using deep learning
Jiang et al. A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Zadeh et al. Memory fusion network for multi-view sequential learning
Poria et al. A review of affective computing: From unimodal analysis to multimodal fusion
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN112766173B (en) Multi-mode emotion analysis method and system based on AI deep learning
Al Osman et al. Multimodal affect recognition: Current approaches and challenges
CN113591525A (en) Driver road rage recognition method with deep fusion of facial expressions and voice
Schels et al. Multi-modal classifier-fusion for the recognition of emotions
Kim et al. Multimodal affect classification at various temporal lengths
Du et al. A novel emotion-aware method based on the fusion of textual description of speech, body movements, and facial expressions
Gladys et al. Survey on Multimodal Approaches to Emotion Recognition
Zaferani et al. Automatic personality traits perception using asymmetric auto-encoder
CN115145402A (en) Intelligent toy system with network interaction function and control method
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Meghjani et al. Bimodal information analysis for emotion recognition
Kumar et al. Depression detection using stacked autoencoder from facial features and NLP
Shukla et al. Deep ganitrus algorithm for speech emotion recognition
Dineshkumar et al. Deperssion Detection in Naturalistic Environmental Condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant