CN108805087B - Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system


Info

Publication number
CN108805087B
CN108805087B
Authority
CN
China
Prior art keywords
emotion
emotion recognition
time
human body
subsystem
Prior art date
Legal status
Active
Application number
CN201810612592.0A
Other languages
Chinese (zh)
Other versions
CN108805087A (en)
Inventor
俞旸
凌志辉
Current Assignee
Nanjing Xinktech Information Technology Co ltd
Original Assignee
Nanjing Xinktech Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Xinktech Information Technology Co ltd filed Critical Nanjing Xinktech Information Technology Co ltd
Priority to CN201810612592.0A
Publication of CN108805087A
Application granted
Publication of CN108805087B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system, which comprises data acquisition equipment and output equipment, and is characterized in that: an emotion analysis software system comprehensively analyzes and reasons over the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a time-sequence-based semantic fusion association judgment subsystem. The invention goes beyond emotion recognition in five individual modalities: it innovatively uses a deep neural network to make a comprehensive judgment over the single-modality information after neural-network encoding, deep association and understanding, which greatly improves accuracy and makes the system suitable for most general interrogation and interaction scenarios.

Description

Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
Technical Field
The invention relates to the technical field of emotion recognition, and in particular to a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system, involving machine learning, deep learning, computer vision, natural language processing, speech recognition, human action recognition, non-contact physiological detection, and the like.
Background
Emotion recognition is a technology for judging emotion changes of a person, and mainly infers the psychological state of the person by collecting external expression and behavior changes of the person. In modern society, emotion recognition technology is widely applied to aspects of intelligent equipment development, sales guidance robots, health management, advertising marketing and the like. Emotion is a state that combines human feelings, thoughts and behaviors, and includes a human psychological response to external or self-stimulation and also includes a physiological response accompanying such a psychological response. In various human-machine interaction systems (e.g., robots, interrogation systems, etc.), human-machine interaction becomes more friendly and natural if the system can recognize the emotional state of a human. Therefore, emotion analysis and recognition are important interdisciplinary research subjects in the fields of neuroscience, psychology, cognitive science, computer science, artificial intelligence and the like.
Over the long history of emotion research, many different methods have been used. In recent years, with the application and popularization of electroencephalogram (EEG) acquisition equipment, the rapid development of signal processing and machine learning technologies, and the great improvement in computer data-processing capacity, EEG-based emotion recognition has become a hot topic in neural engineering and biomedical engineering.
Emotion recognition methods differ according to the emotion-induction method used, and common approaches fall into two categories: recognition based on non-physiological signals and recognition based on physiological signals. Emotion recognition based on non-physiological signals mainly covers facial expressions and voice tone. Facial expression recognition exploits the correspondence between expressions and emotions: people produce specific facial-muscle movements and expression patterns in specific emotional states; for example, when happy, the corners of the mouth turn up and ring-shaped wrinkles appear around the eyes, while anger brings frowning, widened eyes, and so on. At present, facial expression recognition is mostly implemented with image-recognition methods. Voice-tone recognition relies on the different ways people speak in different emotional states; for example, the tone of speech is cheerful when one is happy and dull when one is irritable. Non-physiological-signal methods have the advantage of simple operation and no need for special equipment. Their disadvantage is that the reliability of emotion recognition cannot be guaranteed, because people can mask their true emotions by faking facial expressions and voice tones, and such disguise is often hard to detect. In addition, methods based on non-physiological signals are often difficult to apply to disabled people suffering from certain specific diseases.
Because electroencephalogram signals are very weak, they must be amplified with a high-gain amplifier during acquisition. Commercial EEG amplifiers are currently bulky and not portable. Recently, chip-scale EEG amplifiers have appeared that can effectively solve the size problem, but their cost remains high and they are still some distance from practical use.
It is therefore clear that emotion recognition methods based on physiological signals all require complex and expensive signal measurement and acquisition systems to obtain accurate biological signals, and they cannot be applied in a wide range of scenes; in particular, they are not applicable to special scenarios, such as criminal investigation and interrogation, where covert measurement is required.
Because emotion is an individual's subjective, conscious experience of and feeling about external stimuli, with both psychological and physiological components, it is desirable to infer the internal feeling from an individual's behavior or physiological changes rather than observing it directly, and such emotion recognition methods are currently the more widely advocated ones. Within this class of methods, most emotion recognition is essentially recognition of the meaning of facial expressions, carried out mainly by means of the movement of the large muscle groups of the face; it does not integrate human expressions, spoken words, body states, voice tone, physiological characteristics, and the like.
An example of the prior art is the multi-modal intelligent emotion perception system of publication No. CN107220591A. That technology provides a multi-modal intelligent emotion perception system comprising an acquisition module, a recognition module and a fusion module; the recognition module comprises an expression-based emotion recognition unit, a voice-based emotion recognition unit, a behavior-based emotion recognition unit and a physiological-signal-based emotion recognition unit; each unit recognizes multi-modal information to obtain emotion components, which comprise emotion type and emotion intensity, and the fusion module fuses the emotion components of the recognition module to achieve accurate perception of human emotion.
Disclosure of Invention
Aiming at the problems in the prior art, the invention innovatively provides a method and system for recognizing emotion by synthesizing five major modalities: human facial expression, text, voice, posture and physiological signals. Compared with similar patents (for example, publication No. CN107220591A), the invention makes fundamental breakthroughs in the following aspects.
1. Wearable equipment is not required; the proposed scheme only needs to acquire video and sound signals.
2. Physiological-signal features are extracted through an innovative non-contact micro-feature amplification approach, which greatly reduces cost and improves the ease of use of the product.
3. On top of basic text emotion analysis, the invention also provides comprehensive emotion analysis over multiple rounds of dialogue. This not only improves the emotion analysis of each local dialogue unit but also provides a comprehensive understanding of emotion across the whole dialogue.
4. On the basis of action recognition, the invention also introduces emotion recognition based on human body posture, in which the main posture of a person is recognized as changes of key skeletal nodes.
5. When the individual modalities are combined into the overall emotion judgment, the invention innovatively provides time-sequence-based emotion correspondence, association and reasoning built on recurrent neural networks (RNNs).
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system comprises data acquisition equipment and output equipment, and is characterized in that: the emotion analysis software system comprehensively analyzes and infers the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a semantic fusion association judgment subsystem based on time sequence.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: each RNN (recurrent neural network) organizes, in time order, the intermediate neural-network representations produced by the emotion understanding of each single modality, where the unit at each time step takes as its input the output of the corresponding time step of the intermediate layer of that single-modality subsystem's network; the per-time-step outputs of each single-modality RNN are passed to the multi-modal fusion association judgment RNN, which at each time step gathers the current outputs of all the single-modality RNNs, and after the modalities are integrated, its output at each time step is the final emotion judgment result for that time step.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion semantics of the single modalities are aligned on the time axis and trained with the time sequence as the common reference, thereby achieving automatic cross-modal association and correspondence over time and, finally, fused comprehensive emotion recognition, understanding and reasoning.
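As an illustration only (the patent publishes no code), the sketch below shows how time-aligned single-modality RNNs feeding a fusion RNN could look in PyTorch; the modality dimensions, the GRU cell choice and all names are assumptions made for this example.

```python
# Illustrative sketch only. It assumes each single-modality subsystem already
# emits a per-time-step feature vector; dimensions and the GRU choice are
# assumptions, not the patent's published architecture.
import torch
import torch.nn as nn

class TemporalFusionEmotionNet(nn.Module):
    def __init__(self, modality_dims, hidden_dim=128, num_emotions=7):
        super().__init__()
        # One recurrent encoder per modality (face, speech, text, posture, physiology).
        self.modality_rnns = nn.ModuleList(
            [nn.GRU(d, hidden_dim, batch_first=True) for d in modality_dims]
        )
        # Fusion RNN consumes the concatenated per-time-step outputs of all modalities.
        self.fusion_rnn = nn.GRU(hidden_dim * len(modality_dims), hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, modality_seqs):
        # modality_seqs: list of tensors, each (batch, time, feature_dim), time-aligned.
        per_modality = [rnn(seq)[0] for rnn, seq in zip(self.modality_rnns, modality_seqs)]
        fused_in = torch.cat(per_modality, dim=-1)   # (batch, time, hidden * num_modalities)
        fused_out, _ = self.fusion_rnn(fused_in)     # (batch, time, hidden)
        return self.classifier(fused_out)            # per-time-step emotion logits

# Usage: 5 modalities with illustrative feature sizes, 10 aligned time steps.
model = TemporalFusionEmotionNet([256, 64, 300, 14, 8])
seqs = [torch.randn(2, 10, d) for d in [256, 64, 300, 14, 8]]
logits = model(seqs)   # (2, 10, 7): one emotion prediction per time step
```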
The time sequence semantic fusion association judgment subsystem of the multi-modal-based emotion recognition system is further characterized in that: the emotion analysis software system further comprises an emotion recognition subsystem based on facial image expressions, an emotion recognition subsystem based on voice signals, an emotion analysis subsystem based on text semantics, an emotion recognition subsystem based on human body gestures, an emotion recognition subsystem based on physiological signals and a multi-round conversation semantic understanding subsystem.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on facial image expression exploits the fact that specific expression patterns are produced in specific emotional states; based on the motion information of dynamic image sequences and expression images, a region-based optical flow model and a reference optical flow algorithm, it effectively obtains motion-field information from complex backgrounds and multi-pose expression sequences;
the emotion analysis subsystem based on the text semantics has the advantages that the text emotion analysis can be divided into three levels of words, sentences and chapters, the word-based method is to analyze emotion characteristic words, and the polarity of the words is judged or the similarity of the word semantics is calculated according to a threshold value; the sentence-based method comprises the steps of sampling emotion labels for each sentence, extracting evaluation words or acquiring evaluation phrases for analysis; the method based on the chapters is to analyze the overall emotional tendency of the chapters on the basis of sentence emotional tendency analysis;
the emotion recognition subsystem based on human body postures extracts typical examples of various emotion states of a human body, discriminates and analyzes subtle differences of similar emotions for each posture, establishes a feature library, and extracts physical motion information from the characteristics as judgment basis according to motion properties such as duration time, frequency and the like of human body actions to recognize;
the emotion analysis subsystem based on text semantics is an emotion recognition method improved based on a deep Convolutional Neural Network (CNN), the subsystem utilizes vocabulary semantic vectors generated in a target field to carry out emotion classification on texts in a problem field, the input of the emotion analysis subsystem is sentences or documents expressed by a matrix, each line of the matrix corresponds to a word segmentation element, each line is a vector expressing a word, and the vectors are all in the form of word templates (expressed by high-dimensional vectors), and are obtained from the last module or are indexed according to the words in a word list;
the second layer of the subsystem is a convolutional neural network layer;
the third layer of the subsystem is a time-based convergence layer, the incidence relation of the characteristic information extracted from the previous convolutional layer on a time axis is found out, and the corresponding change on the time dimension in each characteristic matrix in the previous layer is summarized and induced, so that more concentrated characteristic information is formed;
the fourth layer of the subsystem is the last full-connection prediction layer, and the method comprises the steps of firstly, performing full arrangement and combination on the concentrated characteristic information obtained from the previous layer, and searching all possible corresponding weight combinations so as to find a coaction mode among the concentrated characteristic information and the concentrated characteristic information; the next internal layer is a Dropout layer, which means that weights of some hidden layer nodes of the network are randomly made to be out of work during model training, the nodes which are out of work are temporarily regarded as not part of the network structure, but the weights of the nodes are kept (only temporarily not updated), because the nodes can work again when a sample is input next time, the next internal layer is tanh (hyperbolic function), which is a nonlinear logistic transformation, and the last internal layer is softmax, which is a common activation function in multi-classification and is based on logistic regression, and the probability of each possible class needing to be predicted is sharpened, so that the predicted class is distinguished;
the emotion recognition subsystem based on human body gestures is characterized in that emotion extraction based on motion recognition is performed according to a data input source, firstly, motion data are represented and modeled, and then, emotion modeling is performed to obtain two sets of representation data related to motions and emotions; then, the continuous action is accurately identified by using the existing action identification method based on the motion data to obtain the action information of the data; matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process; the method specifically comprises the following steps:
● human body modeling
First, the joint points of the human body are modeled; the human body is regarded as an intrinsically connected rigid system comprising bones and joint points, whose relative motion produces the changes of human posture that we ordinarily describe as actions. Among the many joints of the human body, according to how strongly they influence emotion, the fingers and toes are ignored and the spine is abstracted into three joints (neck, chest and abdomen), yielding a human body model in which the upper body comprises the head, neck, chest, abdomen, two upper arms and two forearms, and the lower body comprises two thighs and two lower legs;
● emotional state extraction
For each of the selected emotional states, the way a normal human body expresses it is considered and the bodily reaction is analyzed in detail. Because the human body is abstracted as a rigid model, the movement of the body's center of gravity is considered first and is divided into forward, backward and natural states. Besides the movement of the center of gravity, the rotation of the joint points also produces motion changes; the joint points related to emotion include the head, chest, shoulders and elbows, with corresponding motions being the bending of the head, the rotation of the chest, the swing and stretch directions of the upper arms, and the bending of the elbows. Combined with the movement of the center of gravity, these parameters comprise seven degrees of freedom in total and express the motion of a person's upper body.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on facial image expression is an ensemble model built on VGG16 and RESNET50.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: in the emotion recognition subsystem based on voice signals, acoustic parameters such as fundamental frequency, duration, voice quality and clarity serve as emotional-speech feature quantities; establishing an emotional speech database and continually extracting new speech feature quantities is the basic method of speech emotion recognition.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on voice signals is a model that performs emotion recognition on speech signals with a neural-network MLP (multi-layer perceptron). First, the continuous speech signal is segmented into discrete small sound units; these units partially overlap, so that the model can better analyze the current unit while being aware of the preceding and following context units. The model then extracts the speech energy curve information; next, the subsystem extracts fundamental frequency (pitch) curve information, with tonal features characterized and constructed from the fundamental-frequency features, and the fundamental-frequency curve is extracted by an autocorrelation method.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the emotion recognition subsystem based on physiological signals is a non-contact physiological-signal emotion recognition system. The physiological mechanisms of emotion include emotion perception (electroencephalogram) and emotional bodily and physiological reactions (electrocardiogram, heart rate, electromyography, galvanic skin response, respiration, vascular pressure, and so on). Emotion perception is the main mechanism of emotion generation, and the different physiological reactions of the brain are reflected in EEG signals; because of the particularity of these signals, recognition is performed over three kinds of features (time domain, frequency domain and time-frequency domain), with time-frequency mean spectral entropy, fractal dimension and the like used as feature quantities measuring brain activity;
the emotion recognition subsystem based on the physiological signal utilizes the change of light rays when blood flows in a human body in emotion recognition of the physiological signal: when the heart beats, blood can pass through the blood vessel, the more the blood volume passing through the blood vessel is, the more light absorbed by the blood is, the less light is reflected by the surface of human skin, and the heart rate is estimated through time-frequency analysis of the image;
the first step is to carry out spatial filtering on a video sequence to obtain base bands with different spatial frequencies;
secondly, performing band-pass filtering on each baseband in a time domain to extract the interested part of the variation signals;
and thirdly, amplifying and synthesizing, and counting the number of the peak values of the signal change, namely the physiological heart rate of the person is approximated.
The time sequence semantic fusion association judgment subsystem based on the multi-modal emotion recognition system is further characterized in that: the multi-round dialogue semantic understanding subsystem adds an emotion recognition attention mechanism to the traditional seq2seq language generation model for the input utterance of the current round, and adds emotion tracking over the previous rounds of dialogue, on the time axis, in dialogue management. Each current user utterance is fed into a bidirectional LSTM encoder; the current input, discriminated into different emotional states, is then combined with the encoder output of the user utterance just produced and fed into the decoder, so that the decoder has both the user utterance and the current emotion, and the resulting system dialogue response is an output personalized to the current emotional state of the user. An emotion-aware Information State Update (ISU) strategy is used, in which the dialogue state is updated whenever new information arrives; each update of the dialogue state is deterministic, so the same previous system state, the same system action and the same user emotional state necessarily produce the same state at the current moment.
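The following is a minimal, illustrative sketch of conditioning a seq2seq decoder on the recognized emotion state as described above; the vocabulary size, dimensions and the way the emotion vector is concatenated with the encoder summary are assumptions, not the patent's exact model.

```python
# Sketch only: conditions the response decoder on both the encoded user
# utterance and the recognized emotion. Sizes and the concatenation scheme
# are assumptions made for illustration.
import torch
import torch.nn as nn

class EmotionAwareSeq2Seq(nn.Module):
    def __init__(self, vocab_size=5000, emb=128, hidden=256, num_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.emotion_embed = nn.Embedding(num_emotions, hidden)
        # Decoder input per step: previous token embedding + encoder summary + emotion vector.
        self.decoder = nn.LSTM(emb + 2 * hidden + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, user_tokens, emotion_id, response_tokens):
        enc_out, _ = self.encoder(self.embed(user_tokens))          # (B, T, 2H)
        utterance_vec = enc_out.mean(dim=1)                         # simple utterance summary
        emotion_vec = self.emotion_embed(emotion_id)                # (B, H)
        context = torch.cat([utterance_vec, emotion_vec], dim=-1)   # (B, 3H)
        dec_emb = self.embed(response_tokens)                       # (B, T_out, emb)
        context_rep = context.unsqueeze(1).expand(-1, dec_emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([dec_emb, context_rep], dim=-1))
        return self.out(dec_out)                                    # logits over vocabulary

model = EmotionAwareSeq2Seq()
logits = model(torch.randint(0, 5000, (2, 12)),   # user utterance token ids
               torch.tensor([3, 0]),              # recognized emotion per dialogue
               torch.randint(0, 5000, (2, 9)))    # teacher-forced response tokens
```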
Beneficial effects: the invention goes beyond emotion recognition in five individual modalities and innovatively uses a deep neural network to make a comprehensive judgment over the information of several single modalities after neural-network encoding, deep association and understanding, thereby greatly improving accuracy while lowering the requirements on environment and hardware, and finally widening the application range to most common application scenarios, and in particular to special scenarios such as criminal investigation and interrogation.
Drawings
Fig. 1 is a schematic diagram of a multi-modal-based emotion recognition system according to an embodiment of the present invention.
Fig. 2 is a flow chart of a multimodal-based emotion recognition system according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a VGG16 model according to an embodiment of the present invention.
Fig. 4 is a diagram of the core residual architecture in the RESNET50 model according to an embodiment of the present invention.
Fig. 5 is a diagram of an integrated ensemble model architecture according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of segmenting a continuous speech signal into discrete small sound units in the present invention.
Fig. 7 is a schematic diagram of the change of Short Term Energy (STE) in sound waves according to the present invention.
FIG. 8 is a schematic diagram of the fundamental frequency information of a person in the present invention.
Fig. 9 is a diagram of the architecture of the MLP (multi-layer perceptron) neural network deep-learning model used in the present invention.
FIG. 10 is a diagram of textual emotion analysis using a core module based on a deep convolutional neural network as employed in the present invention.
Fig. 11 is a diagram of the application of the convolutional neural network combined with the syntax tree in emotion analysis.
Fig. 12 is a general flowchart of the human body posture detection proposed by the present invention.
FIG. 13 is a diagram of 13 main segments of human body models identified in human body posture detection according to the present invention.
Fig. 14 illustrates the physiological phenomenon on which the present invention relies: the greater the amount of blood in the blood vessels, the more light is absorbed by the blood and the less light is reflected from the skin surface.
FIG. 15 is a diagram showing the process and results of amplifying a cosine wave by a factor of α according to the method of the present invention in the human biometric sensing process.
Fig. 16 is a general flow chart of the present invention in multi-round interactive emotion recognition (a process of a cyclic multi-round interactive understanding).
FIG. 17 is an attention mechanism diagram of the present invention incorporating emotion recognition based on the traditional seq2seq language generation model for the input utterance in the current round.
Fig. 18 is a schematic diagram of the present invention for updating the dialogue state based on the emotional perception of previous rounds in a multi-round dialogue.
Fig. 19 is a body architecture diagram for performing comprehensive judgment of multiple single-mode information by neural network coding, depth association and understanding using a deep neural network according to the present invention.
FIG. 20 is a system diagram of the overall product of the invention.
Detailed Description
The invention is further explained in detail below with reference to the figures and the embodiments.
Any emotion is produced together with certain changes in the body, such as facial expressions, muscle tension and visceral activity. Emotion recognition that directly uses changes in these signals is the so-called basic recognition method, also called single-modality emotion recognition; the main current modalities include facial images, speech, text, posture, physiological signals, and so on. The invention provides a method and a system for more complete and accurate emotion recognition by fusing, corresponding and reasoning over the computer's emotion understanding in each single modality.
The system of the multi-modal-based emotion recognition system proposed in this embodiment is composed of the following parts (fig. 1 is a schematic diagram of the multi-modal-based emotion recognition system according to the embodiment of the present invention):
- Hardware part: the data acquisition equipment includes a camera, a microphone, a heartbeat-detecting wristband, multi-point human-posture sensors, robot sensor acquisition systems, and the like; the output equipment includes a display, speakers, earphones, a printer, a robot interaction system, and the like.
- Software part: the data obtained by the data acquisition equipment are comprehensively analyzed and reasoned over. The system consists of 7 subsystems (the 7 modules shown in Fig. 1): multi-modal emotion recognition based on facial image expression, voice signal, text semantics, human body posture and physiological signal, plus multi-round dialogue semantic understanding and time-sequence-based multi-modal emotion semantic fusion association judgment.
1. Emotion recognition based on the facial expression image.
The facial expression recognition method is based on the fact that people produce specific expression patterns in specific emotional states. Template-based methods and neural-network methods are the most common approaches to expression recognition in static images, but recognition from a single picture does not necessarily achieve a high recognition rate. The invention proposes a brand-new neural network based on dynamic image sequences: the method takes the motion information of expression images into account, and a region-based optical flow model together with a reference optical flow algorithm can effectively obtain motion-field information from complex backgrounds and multi-pose expression sequences.
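By way of illustration only, a dense motion field between two consecutive expression frames can be computed as sketched below with OpenCV's Farneback optical flow; the patent's own region-based optical flow model is not published, so this is a generic stand-in with assumed parameter values.

```python
# Generic stand-in for the optical-flow step: dense Farneback flow between two
# consecutive expression frames. Parameter values are illustrative assumptions.
import cv2
import numpy as np

def expression_motion_field(prev_frame, next_frame):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # flow[y, x] = (dx, dy) displacement of each pixel between the two frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return magnitude, angle   # per-pixel motion strength and direction

# Toy usage with synthetic frames:
frame_a = np.zeros((128, 128, 3), dtype=np.uint8)
frame_b = np.roll(frame_a, 2, axis=1)
mag, ang = expression_motion_field(frame_a, frame_b)
```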
2. Emotion recognition based on speech signals
Speech is an important means by which humans express emotion, and acoustic parameters such as fundamental frequency, duration, voice quality and clarity are the main feature quantities of emotional speech. Establishing an emotional speech database and continually extracting new speech feature quantities is the basic method of speech emotion recognition. Support vector machines and Dempster-Shafer evidence theory can also be used to extract speech emotion features. Speech signals vary markedly between individuals, and the traditional speech-analysis approach requires building a huge speech corpus, which makes recognition difficult. The present invention proposes enhanced emotion recognition of speech signals based on a traditional speech-recognition-style neural network.
3. Text-based emotion recognition
Text emotion analysis can be divided into three levels: word, sentence and document. The word-based method mainly analyzes emotion feature words, judging word polarity or computing word-semantic similarity against a threshold; the sentence-based method attaches an emotion label to each sentence and extracts evaluation words or evaluation phrases for analysis; the document-based method analyzes the overall emotional tendency of a passage on the basis of sentence-level analysis. Text-based emotion recognition depends heavily on the selection of emotion feature words; even when a corpus is built and every word is given an emotion label, many words have multiple senses, which must be considered when the corpus is constructed, and the appearance of new words can also greatly disturb the accuracy of text emotion-tendency recognition. These traditional corpus-based methods are therefore simple and accurate but require a great deal of manpower to construct the corpus in advance and are not suited to cross-domain transfer. With the deep-learning-based method proposed by the invention, one model can automatically learn, in depth, from different data in different domains and scenarios and thereby perform automatic emotion recognition.
4. Emotion recognition based on human body posture
The limb movement characteristics of a human body contain rich emotional information. The emotion recognition based on human body gestures mainly comprises the steps of extracting typical examples of various emotional states of the human body, carrying out discriminant analysis on each gesture to obtain nuances of similar emotions, and establishing a feature library. The emotion recognition based on human motion characteristics mainly takes motion properties such as duration, frequency and the like of human motion as judgment bases, and physical motion information is extracted from the motion properties for recognition. Many gestures or movements do not have obvious emotional characteristics and are often not fully resolved during recognition, thus this approach has great limitations. The invention proposes a deeper level of emotion recognition by fusing human body gestures with other signals.
5. Emotion recognition based on physiological signals
Physiological changes are rarely under a person's subjective control, so emotion recognition based on physiological signals yields more objective results. The physiological mechanisms of emotion include emotion perception (electroencephalogram) and emotional bodily and physiological reactions (electrocardiogram, heart rate, electromyography, galvanic skin response, respiration, vascular pressure, and so on). Emotion perception is the main mechanism of emotion generation; the different physiological reactions of the brain can be reflected in EEG signals, which, because of their particularity, can be recognized through time-domain, frequency-domain and time-frequency-domain features, with time-frequency mean spectral entropy, fractal dimension and the like also usable as feature quantities measuring brain activity. Although physiological signals carry accurate emotion information, their intensity is very weak; collected ECG signals, for example, suffer strong electromyographic interference, so the extraction process is demanding. In practice the sources of interference are so numerous that it is difficult to remove artifacts from physiological signals effectively. The invention provides a method for automatically detecting physiological reactions such as heartbeat and respiration based on changes of blood and skin color in the human face.
Building on the above 5 single-modality emotion recognition subsystems, the invention proposes that the emotion semantics of the single modalities be aligned on the time axis and trained with the time sequence as the common reference, thereby achieving automatic cross-modal association over time and, finally, fused comprehensive emotion recognition, understanding and reasoning. Fig. 2 is a flow chart of the multimodal-based emotion recognition system according to an embodiment of the present invention.
The following is a detailed description of the modules one by one.
1. Emotion recognition based on facial expression images:
the conventional method of recognizing a facial expression image based on computer vision can be roughly classified into the following procedures.
The first image preprocessing is mainly used for eliminating interference factors such as face detection and face graying. The second expression feature extraction is mainly based on the feature extraction of a static image and the image feature extraction of a dynamic sequence, and feature dimension reduction is performed before expression recognition is performed. And finally, the expression recognition is mainly to select a proper classification algorithm to classify the expression characteristics after the dimension reduction.
Conventional classification algorithms include:
● skin color based detection method
Skin-color detection can be based on a Gaussian mixture model or a histogram model; experiments show that the Gaussian mixture model performs better than the histogram model.
● statistical model-based method
Artificial neural networks: and adopting a plurality of neural networks to carry out different-angle face detection.
Based on the probability model: the face is detected by estimating the conditional probabilities of the face image and the non-face image.
A support vector machine: and judging the human face and the non-human face by adopting a hyperplane of a support vector machine.
● detection method based on heuristic model
Deformation model: the deformed template is matched with the head top contour line and the left and right face contour lines.
Mosaic drawing: and dividing the face area into a plurality of mosaic blocks, and verifying by using a group of rules and edge features.
Deep-learning methods using artificial neural networks have improved greatly in recent years, thanks to easier access to large-scale data and to large-scale GPU computation, and have been shown to outperform most of the conventional methods above. This embodiment proposes the following ensemble model based on VGG16 and RESNET50.
First, the VGG16 model architecture of the present embodiment is shown in fig. 3:
next, the core residual architecture in the RESNET50 model of this embodiment is shown in fig. 4:
finally, the comprehensive ensemble model architecture based on the above 2 architectures proposed in this embodiment is shown in fig. 5:
through statistics of results on public experimental data (as shown in the following table), the model provided by the embodiment reaches the current most advanced level, and the operation efficiency is extremely high.
System                                         Accuracy    Precision    Recall
Baseline system based on SVM                   31.8%       43.7%        54.2%
Industry mainstream system based on VGG16      59.2%       70.1%        69.5%
Industry mainstream system based on RESNET50   65.1%       76.5%        74.8%
The algorithm proposed by the invention        67.2%       79.4%        78.2%
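As an illustration of one simple way to combine the two backbones, the sketch below averages the class probabilities of torchvision's VGG16 and ResNet50 adapted to emotion classes; the actual fused architecture of Fig. 5 is not reproduced here, and the averaging scheme and class count are assumptions.

```python
# Sketch only: a probability-averaging ensemble of VGG16 and ResNet50 adapted
# to expression classes. The patent's exact fusion architecture (Fig. 5) is
# not reproduced; torchvision backbones are used as stand-ins.
import torch
import torch.nn as nn
from torchvision import models

class ExpressionEnsemble(nn.Module):
    def __init__(self, num_emotions=7):
        super().__init__()
        self.vgg = models.vgg16()
        self.vgg.classifier[6] = nn.Linear(4096, num_emotions)              # replace final layer
        self.resnet = models.resnet50()
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, num_emotions)

    def forward(self, face_batch):                                          # (B, 3, 224, 224)
        p_vgg = torch.softmax(self.vgg(face_batch), dim=-1)
        p_res = torch.softmax(self.resnet(face_batch), dim=-1)
        return (p_vgg + p_res) / 2                                          # averaged class probabilities

model = ExpressionEnsemble()
probs = model(torch.randn(2, 3, 224, 224))
```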
2. Emotion recognition based on speech signals:
the traditional speech emotion recognition research is developed without leaving the support of emotion speech databases. The high quality of the emotion speech library directly determines the performance of the emotion recognition system obtained by training the emotion speech library. At present, the emotion voice library types existing in the field are various, a unified establishment standard is not provided, and the emotion voice library can be divided into 3 categories of a performance type, a guidance type and a natural type according to the type of the excited emotion; the method can be divided into two categories of identification type and synthesis type according to application targets; it can be divided into English, German and Chinese according to different languages.
In these methods, the acoustic features used for speech emotion recognition can be roughly divided into three types: prosodic features, spectrum-based correlation features, and voice-quality features. They are usually extracted frame by frame but participate in emotion recognition in the form of global feature statistics. The unit of global statistics is usually an acoustically independent sentence or word, and the commonly used statistics include extrema, range of extrema, variance, and so on. The common features are:
● Prosodic features refer to variations of pitch, duration, speed and stress that ride on top of the semantic content; they are a structural arrangement of how speech is delivered. Prosodic features are also called suprasegmental or paralinguistic features; their power to distinguish emotion is widely accepted by researchers in speech emotion recognition and they are very commonly used, the most frequent being duration, fundamental frequency (pitch) and energy.
● Spectrum-based correlation features are regarded as the manifestation of the correlation between changes of vocal-tract shape and the motion of the articulators, and they have been successfully used in speech-signal-processing fields including speech recognition and speaker recognition. By studying the spectral features of emotional speech, Nwe et al. found that the emotional content of speech has a significant effect on the distribution of spectral energy across frequency bands; for example, speech expressing happiness shows high energy in the high-frequency band, while speech expressing sadness shows markedly lower energy in the same band. In recent years more and more researchers have applied spectrum-related features to speech emotion recognition, where they help improve recognition performance; the emotion-discriminating power of spectral features is not negligible. Linear spectral features are among those used in speech emotion recognition tasks.
● Voice-quality features are subjective evaluation measures of speech, used to gauge whether the voice is pure, clear, easy to identify, and so on. Acoustic manifestations affecting voice quality include breathiness, tremor and choking, and they often appear when the speaker is emotionally agitated and finds it hard to suppress the emotion. In speech-emotion-recognition experiments, listeners consistently judged changes of voice quality to be closely related to the expression of emotion in speech. The acoustic features generally used to measure voice quality are formant frequency and bandwidth, frequency and amplitude perturbation (jitter and shimmer), glottal parameters, and the like.
On this basis, the invention provides a model that performs emotion recognition on speech signals with a neural-network MLP (multi-layer perceptron). First, the continuous speech signal is segmented into discrete small sound units (as shown in Fig. 6). These units partially overlap, allowing the model to better analyze the current unit while being aware of the preceding and following context units. The model then extracts the speech energy curve information, since energy plays a very important role both in speech recognition and in emotion recognition; in happy and angry states, for example, speech energy is significantly higher than in sad states. Fig. 7 shows the short-term energy (STE) changes in the sound wave used to capture emotional changes such as joy and anger.
Next, the system extracts fundamental frequency (pitch) curve information. In most languages, tonal features play a very important role in speech recognition, and tonal features can be characterized and constructed from fundamental-frequency features; however, finding a reliable and effective fundamental-frequency extraction method for practical environments is difficult. This embodiment adopts an autocorrelation method to extract the fundamental-frequency curve. Fig. 8 shows the fundamental-frequency information of an angry speaker extracted with the autocorrelation method in this embodiment.
In addition, the proposed system extracts further important information from the speech, such as Mel-frequency cepstral coefficients (MFCC) and formant frequencies. Finally, the system uses an MLP (multi-layer perceptron) neural network for deep learning (the model architecture is shown in Fig. 9: the MLP neural network adopted in this embodiment performs deep learning of voiceprint emotion).
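A minimal numpy sketch of the frame-level processing described above (overlapping segmentation, short-term energy, autocorrelation-based pitch) is given below; the frame sizes and sampling rate are assumptions, and MFCC/formant extraction and the MLP classifier itself are omitted.

```python
# Sketch only: overlapping framing, short-term energy, and autocorrelation
# pitch estimation. Frame length, hop and sampling rate are assumed values.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):            # 25 ms / 10 ms at 16 kHz
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def short_term_energy(frames):
    return (frames ** 2).sum(axis=1)

def pitch_autocorrelation(frame, sr=16000, fmin=60, fmax=400):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)              # plausible pitch-period range
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag                                      # fundamental frequency in Hz

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 150 * t)                     # toy 150 Hz "voiced" signal
frames = frame_signal(speech)
energies = short_term_energy(frames)
print(pitch_autocorrelation(frames[0]))                  # roughly 150 Hz
```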
3. Text-based emotion recognition:
the embodiment provides an emotion recognition method based on deep convolutional neural network CNN improvement. The module performs emotion classification of the text in the problem domain using the lexical semantic vectors generated in the target domain. The core of this module is also a deep convolutional neural network system (as shown in fig. 10).
Its input is a sentence or document represented as a matrix. Each row of the matrix corresponds to a token, typically a word but possibly a character; that is, each row is a vector representing one word. Typically these vectors are word embeddings (high-dimensional vector representations) obtained from the previous module, but they can also be one-hot vectors indexing the word in the vocabulary. For example, a sentence of 10 words represented with 100-dimensional word vectors gives a 10 x 100 matrix as input.
The second layer of the module is a convolutional layer, and this step is a significant improvement of this embodiment. The conventional operation (the first type of convolution window in Fig. 10) is as follows: if the convolution window width is m (the figure uses a window size of 3), take m consecutive words (the example in Fig. 10 is "order Beijing") and concatenate their word vectors into an m x d-dimensional vector x(i:i+m-1), where d is the word-vector dimension. The vector x(i:i+m-1) is then multiplied by a convolution kernel w (also a vector): ci = f(w · x(i:i+m-1) + b). Sliding the window gives c = [c1, c2, ..., c(n-m+1)], and the maximum of c is taken as one value; assuming there are K convolution kernels, a K-dimensional vector is finally obtained. These conventional convolution windows cover only m consecutive words. The max-selection operation is performed so that sentences of different lengths can be handled: no matter how long the sentence is or how wide the convolution kernel is, a fixed-length vector representation is obtained, and taking the maximum also distills the most important feature information, on the assumption that the largest value represents the most salient feature. Extensive experiments have shown that this convolutional network model suits many tasks and is very effective, and unlike traditional methods it needs no elaborate feature engineering or syntactic parse trees; moreover, feeding the model pre-trained word vectors works much better than randomly initialized ones, and pre-trained word vectors are now the standard input in deep learning. Compared with the conventional convolution window, this embodiment proposes also convolving m words that are syntactically continuous. These m words may not be literally adjacent in the text (the example in Fig. 10 is "hotel booking", marked in red), but they form a continuous semantic structure in the syntax. Take the sentence "John hit the ball" shown in Fig. 11: with a convolution window size of 3 there are two complete 3-word windows, "John hit the" and "hit the ball", yet neither embodies the complete core semantics of the sentence. If the words of a "continuous" window are instead determined from the parse tree, the two convolution windows become "John hit ball" and "hit the ball", both of which clearly express more complete and reasonable semantics. These new parse-tree-based convolution windows are combined with the traditional windows, and the maximum is selected over all of them jointly. The feature information obtained in this way makes it easier for the model to grasp the meaning of a piece of text.
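Purely as an illustration of how "syntactically continuous" windows could be gathered from a parse, the snippet below uses spaCy as a stand-in parser (assuming the en_core_web_sm model is installed); the patent does not name a specific parsing tool, and the grouping rule here is a simplification.

```python
# Illustration only: gather 3-word windows that are connected in the
# dependency parse, in addition to ordinary sliding windows. spaCy is a
# stand-in; the patent does not specify a parser.
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_windows(sentence, m=3):
    doc = nlp(sentence)
    windows = []
    for token in doc:
        # A head together with its children forms one connected piece of the tree.
        group = sorted([token] + list(token.children), key=lambda t: t.i)
        if len(group) >= m:
            windows.append([t.text for t in group[:m]])
    return windows

print(syntactic_windows("John hit the ball"))
# -> [['John', 'hit', 'ball']] with a typical parse, matching the example above
```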
The third layer of the module is a pooling layer over time. Words and phrases in text are strongly related through their temporal order, and the main objective of this layer is to find, among the features extracted by the previous convolutional layer, the correlations along the time axis. The main mining process summarizes the corresponding changes along the time dimension within each feature matrix of the previous layer, thereby forming more condensed feature information.
The fourth layer of the module is the final fully connected prediction layer, which actually contains several internal layers. The first fully permutes and combines the condensed feature information from the previous layer and searches over all possible weight combinations to find the patterns of joint action among the features. The next internal layer is a Dropout layer: Dropout means that during model training the weights of some hidden-layer nodes are randomly disabled; the disabled nodes can temporarily be regarded as not being part of the network structure, but their weights are retained (just not updated for the moment), because they may become active again the next time a sample is input. The next internal layer is tanh (hyperbolic tangent), a nonlinear logistic transformation. The last internal layer is softmax, an activation function commonly used in multi-class classification and based on logistic regression; it sharpens the probability of each class to be predicted, making the predicted class stand out.
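A compact, illustrative PyTorch sketch of this four-layer pipeline (embedding, convolution windows, max-over-time pooling, fully connected layer with dropout, tanh and softmax) follows; all dimensions and the set of window widths are assumptions, and the syntax-tree windows discussed above are not included here.

```python
# Sketch only: standard text-CNN layer stack mirroring the four layers
# described above. Vocabulary size, window widths and filter counts are
# illustrative assumptions.
import torch
import torch.nn as nn

class TextEmotionCNN(nn.Module):
    def __init__(self, vocab=20000, emb=100, kernels=(2, 3, 4), filters=64, classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList([nn.Conv1d(emb, filters, k) for k in kernels])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(filters * len(kernels), classes)

    def forward(self, token_ids):                        # (B, T) integer token ids
        x = self.embed(token_ids).transpose(1, 2)        # (B, emb, T) for Conv1d
        # Convolution + max-over-time pooling for each window width.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = self.fc(torch.cat(pooled, dim=1))            # fully connected combination
        z = torch.tanh(self.dropout(z))                  # dropout, then nonlinear tanh
        return torch.softmax(z, dim=-1)                  # probability per emotion class

model = TextEmotionCNN()
probs = model(torch.randint(0, 20000, (4, 32)))          # 4 sentences of 32 tokens
```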
4. Emotion recognition based on human body gestures:
the invention provides a method for extracting emotion based on human posture action and change. The emotion extraction technology based on motion recognition is that according to a data input source, motion data are characterized and modeled firstly, and then emotion modeling is carried out, so that 2 sets of characterization data about motion and emotion are obtained. Then, the continuous motion is accurately recognized by using the existing motion recognition method based on the motion data, and the motion information of the data is obtained. And matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process. The specific flow is shown in fig. 12.
The system mainly comprises the following steps.
Human body modeling
First, the joint points of the human body are modeled; the human body can be regarded as an intrinsically connected rigid system. It contains bones and joint points, and their relative motion produces the changes of human posture that we ordinarily describe as actions. Among the many joints of the human body, the following treatment is applied according to how strongly each joint influences emotion:
1) the fingers and toes are ignored. The hand information only indicates anger when a fist is made, and the ordinary movement data cannot be used for simulating and estimating strength under the condition of no pressure sensor, so that the hand information is considered to be small in quantity, low in importance and required to be properly simplified. For toes, the amount of relevant information is almost zero. Therefore, the present embodiment simplifies the hand and the foot into one point in order to reduce the extraneous interference.
2) The spine of the human body is abstracted into 3 joints of the neck, chest and abdomen. The range of motion available to the spine is large and the composition of bones is complex and cumbersome. These 3 points with distinct position differences were selected on the spine to make a spine simulation.
From the above steps, a manikin can be summarized, wherein the upper body comprises the head, the neck, the chest, the abdomen, 2 big arms and 2 small arms, and the lower body comprises 2 thighs and 2 small legs. This model includes 13 rigid bodies and 9 degrees of freedom, as shown in fig. 13.
Emotional State extraction
For each of the selected emotional states, the way a normal human body expresses it is acted out, and the limb reaction is analyzed in detail.
Because the human body is abstracted as a rigid-body model, the first parameter to consider is the movement of the body's center of gravity. The center of gravity can move in extremely varied ways and can be described in many forms, but the description needed for emotion is more specific and precise than a general description of its movement: the center of gravity is encoded into three cases, namely forward, backward and natural.
In addition to the movement of the center of gravity, the rotation of the joint points that undergo motion changes is considered next. The emotion-related joint points include the head, chest, shoulders and elbows (the lower body expresses emotion only to a very limited extent and is therefore left untreated for now). The corresponding motions are the bending of the head, the rotation of the chest, the swing and stretch directions of the upper arms and the bending of the elbows; combined with the center-of-gravity movement above, these parameters comprise seven degrees of freedom in total and suffice to express the motion of a person's upper body. This parameter set can serve as a simple emotion evaluation criterion. Referring to Ackerman's experiment with a sample of 61 people, each emotion in the emotion set can be represented by these rotation and center-of-gravity parameters. The sign of each value indicates the direction of movement of the part relative to the coordinate system: under the right-hand-rule coordinate system, a positive value indicates that the part moves forward and a negative value indicates that it moves in the negative direction.
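As a rough illustration of how the seven posture parameters could be matched against an emotion template table, the following sketch compares an observed parameter vector with per-emotion templates by Euclidean distance. The template values, the parameter ordering and the emotion set are invented placeholders, not Ackerman's experimental data.

```python
import numpy as np

# order: head bend, chest rotation, L/R upper-arm swing, L/R elbow bend,
#        centre of gravity (-1 backward, 0 natural, +1 forward)
EMOTION_TEMPLATES = {
    "joy":     np.array([ 0.2, 0.1,  0.6,  0.6, 0.3, 0.3,  1.0]),
    "sadness": np.array([-0.5, 0.0, -0.2, -0.2, 0.1, 0.1, -1.0]),
    "anger":   np.array([ 0.1, 0.3,  0.4,  0.4, 0.8, 0.8,  1.0]),
}

def match_emotion(pose_params: np.ndarray) -> str:
    """Return the template emotion closest (Euclidean distance) to the observed pose."""
    return min(EMOTION_TEMPLATES,
               key=lambda e: np.linalg.norm(pose_params - EMOTION_TEMPLATES[e]))

print(match_emotion(np.array([0.15, 0.1, 0.5, 0.5, 0.3, 0.2, 1.0])))  # -> "joy"
```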
5. Emotion recognition based on physiological signals
Emotion recognition from physiological signals exploits the change in light reflection as blood flows through the human body: with every heartbeat blood passes through the vessels, and the larger the volume of blood in the vessels, the more light is absorbed by the blood and the less light is reflected from the skin surface. The heart rate can therefore be estimated by time-frequency analysis of the video images (as shown in fig. 14, which illustrates this phenomenon).
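A minimal sketch of how such a raw pulse trace might be obtained from video is shown below: the green channel is averaged over a fixed face region in every frame. The fixed ROI, the choice of the green channel and the function name are illustrative assumptions rather than the patent's exact implementation.

```python
import cv2
import numpy as np

def raw_pulse_signal(video_path: str, roi=(100, 100, 200, 200)) -> np.ndarray:
    """Average the green channel over a fixed face ROI, one sample per frame."""
    x, y, w, h = roi                              # assumed fixed face region
    cap = cv2.VideoCapture(video_path)
    samples = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        patch = frame[y:y + h, x:x + w]
        samples.append(patch[:, :, 1].mean())     # index 1 = green channel in BGR frames
    cap.release()
    return np.asarray(samples)
```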
The so-called Lagrangian view analyzes the image by tracking the motion trajectories of pixels of interest (particles). In 2005, Liu et al. first proposed a motion magnification technique for images that clusters the feature points of a target, tracks the motion trajectories of these points over time, and finally increases their motion amplitude. The Lagrangian approach, however, has the following disadvantages:
the motion trajectory of each particle must be accurately tracked and estimated, which consumes considerable computing resources;
particles are tracked independently without considering the image as a whole, so the image easily becomes inconsistent, degrading the magnified result;
the target's motion is magnified by modifying particle trajectories, and because the particles move, their original positions must be filled in with background, which further increases the complexity of the algorithm.
Unlike the Lagrangian view, the Eulerian view does not explicitly track or estimate particle motion; instead it fixes the viewpoint in one place, e.g., over the entire image. The whole image is assumed to be changing, but the change signals differ in frequency, amplitude and other characteristics, and the change signal of interest in this embodiment lies among them. Magnifying the "variation" therefore becomes a matter of isolating and enhancing the frequency band of interest. The technical details are set out below.
1) Spatial filtering
The first step of the Eulerian video magnification technique (hereinafter EVM) employed in this embodiment is to spatially filter the video sequence to obtain basebands of different spatial frequencies. This is done because:
It contributes to noise reduction. Images exhibit different signal-to-noise ratios (SNRs) at different spatial frequencies; in general, the lower the spatial frequency, the higher the SNR. To prevent distortion, the basebands should therefore use different magnification factors: the top layer, i.e., the image with the lowest spatial frequency and highest SNR, can use the largest magnification factor, and the factors decrease level by level;
It facilitates approximation of the image signal. Images with high spatial frequencies (such as the original video frames) can be difficult to approximate with a Taylor series expansion; the approximation breaks down and direct magnification is visibly distorted. For this case the present embodiment reduces distortion by introducing a lower limit on the spatial wavelength: if the spatial wavelength of the current baseband falls below this limit, the magnification factor is reduced.
Since the purpose of spatial filtering is simply to pool a number of adjacent pixels, it can be implemented with a low-pass filter, and downsampling can be performed at the same time to speed up the computation. Readers familiar with image processing will recognize that the combination of these two operations is an image pyramid: in practice, linear EVM performs a multi-resolution decomposition using a Laplacian or Gaussian pyramid.
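A minimal sketch of this spatial-filtering step, assuming OpenCV's pyrDown is used to build a Gaussian pyramid (the number of levels is an arbitrary illustrative choice):

```python
import cv2

def gaussian_pyramid(frame, levels=4):
    """Return [frame, down1, down2, ...] from highest to lowest spatial frequency."""
    pyramid = [frame]
    for _ in range(levels - 1):
        frame = cv2.pyrDown(frame)   # Gaussian blur + 2x downsample in one call
        pyramid.append(frame)
    return pyramid
```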
2) Time domain filtering
After the basebands of different spatial frequencies are obtained, each baseband is band-pass filtered in the time domain to extract the part of the varying signal that is of interest. For example, to amplify a heart-rate signal, a 0.4-4 Hz (24-240 bpm) passband may be chosen, covering the range of human heart rates. There are many kinds of band-pass filters; the ideal band-pass filter, the Butterworth band-pass filter and the Gaussian band-pass filter are common. Which one to choose depends on the purpose of the magnification. If the magnified result will undergo subsequent time-frequency analysis (for example, extracting a heart rate or analyzing a vibration frequency), a narrow-passband filter such as the ideal band-pass filter should be selected, because it cuts out exactly the frequency band of interest and avoids amplifying other bands. If no time-frequency analysis is needed, a wide-passband filter such as a Butterworth band-pass filter or a second-order IIR filter can be selected, which better suppresses ringing artifacts.
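The two filter families mentioned above could be sketched as follows for a 0.4-4 Hz heart-rate band; the sampling rate fs is the video frame rate, and the filter order and implementation details are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butterworth_bandpass(signal, fs, low=0.4, high=4.0, order=2):
    """Wide-passband option: Butterworth filter applied along the time axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal, axis=0)

def ideal_bandpass(signal, fs, low=0.4, high=4.0):
    """Narrow-passband option: zero all FFT bins outside the band of interest."""
    spectrum = np.fft.rfft(signal, axis=0)
    freqs = np.fft.rfftfreq(signal.shape[0], d=1.0 / fs)
    spectrum[(freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spectrum, n=signal.shape[0], axis=0)
```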
3) Amplification and Synthesis
The first two steps isolate the part that is "changing", i.e., they answer the question of what the "change" is. The next question is how to magnify it. An important basis is that the result of the band-pass filtering step is an approximation of the change of interest.
Fig. 15 demonstrates the process and result of magnifying a cosine wave by a factor α with the above method. The black curve is the original signal f(x); the blue curve is the shifted signal f(x + δ(t)); the cyan curve is the Taylor-series approximation f(x) + δ(t)·∂f(x)/∂x; and the green curve is the separated change component B(x, t). This component is amplified by α and added back to the original signal, and the red curve in fig. 15 is the magnified signal f(x) + (1 + α)·B(x, t).
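The first-order Taylor argument can be checked numerically on a synthetic cosine; the shift δ and gain α below are arbitrary illustrative values, not parameters from the patent.

```python
import numpy as np

x = np.linspace(0, 4 * np.pi, 1000)
delta, alpha = 0.2, 2.0
f = np.cos(x)                              # original signal f(x)
f_shifted = np.cos(x + delta)              # changed signal f(x + delta)
taylor = f + delta * np.gradient(f, x)     # f(x) + delta * df/dx
change = taylor - f                        # separated change component B(x, t)
amplified = f + (1 + alpha) * change       # magnified signal f(x) + (1 + alpha) * B(x, t)

# Small residual -> the first-order approximation of the shift is good for small delta.
print(float(np.max(np.abs(f_shifted - taylor))))
```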
Finally, deep learning is used to optimize the spatio-temporal filtering. Under the assumption that the frequency of the signal change caused by the heartbeat approximates the heart rate, the RGB information is converted into the YIQ (NTSC) color space, both color spaces are processed, and the signal is extracted with a suitable band-pass filter. Counting the number of peaks in the signal change then approximates the person's physiological heart rate.
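Putting the pieces together, a hedged end-to-end sketch of the heart-rate estimation might look like the following; the Butterworth filter, the peak-spacing constraint and the magnification factor are assumptions standing in for the deep-learning-optimized filtering described above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def estimate_heart_rate(raw_signal, fs, low=0.4, high=4.0, alpha=50.0):
    """Return (bpm, amplified_signal) from a raw per-frame brightness trace."""
    raw_signal = np.asarray(raw_signal, dtype=float)
    b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
    variation = filtfilt(b, a, raw_signal)                     # B(x, t): the change of interest
    amplified = raw_signal + alpha * variation                 # ~ f(x) + (1 + alpha) * B(x, t)
    peaks, _ = find_peaks(variation, distance=int(fs * 0.25))  # peaks >= 0.25 s apart (<= 240 bpm)
    bpm = len(peaks) / (len(raw_signal) / fs / 60.0)
    return bpm, amplified

# Example usage with the hypothetical ROI extractor sketched earlier:
# bpm, _ = estimate_heart_rate(raw_pulse_signal("subject.mp4"), fs=30)
```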
6. Semantic and emotional understanding based on multiple rounds of conversation
Traditional semantic understanding mostly ignores the interactive environment, or at best considers a single round of question and answer. Mainstream research in sentiment analysis still relies on traditional machine learning algorithms such as SVM, information entropy and CRF. Machine-learning-based sentiment analysis has the advantage of being able to model a variety of features, but manually annotated words are used as features, and the scarcity of such corpora is often the performance bottleneck.
Once "interaction" is involved, sentiment and emotion analysis becomes much more difficult. First, interaction is a continuous process rather than a short, fixed snapshot, which fundamentally changes how emotion judgments are evaluated. Without interaction, as with product reviews, judging the sentiment polarity is already valuable and is a well-defined classification task. In a conversation, however, the emotional state changes continuously, so analyzing any single sentence in isolation is not meaningful, and the problem is no longer a simple classification task. For a continuous process a simple solution is to add a gain-and-decay function, but such a function is very hard to get right, has little theoretical grounding, and is difficult even to evaluate. Second, interaction hides most of the state information. What is openly visible is less than 5%, just the tip of the iceberg (much as in a hidden Markov model), and each party assumes by default that the other already knows a great deal: the relationship between the communicating parties, their goals and needs, emotional states, social relations, the environment, what was discussed before, common sense, personality, values, and so on. One then observes that the more information two people share, the harder the analysis becomes, because the hidden state plays a larger role and has higher dimensionality. Different pairs of people communicate in different patterns, and the pattern varies with other environmental information (time, place, relationship status, each other's mood, shared experience, personal chat habits, etc.). Even for the same pair of people the communication pattern changes dynamically; for example, two people in love communicate differently as the relationship warms or cools. Third, interaction involves jumps of information. A person speaking alone is usually logical and coherent, but chatting is quite another matter and can be highly jumpy. These unpredictable leaps of information increase the difficulty of sentiment analysis exponentially.
These three aspects are the main reasons why emotion analysis becomes so much harder once interaction is involved. The first changes the evaluation methodology, which becomes complicated and lacks a reference standard. The second and third mean that the data are far too sparse for machine learning (the observable state consists only of text, expressions and the like, while most of the state is hidden), and with the added jumps, achieving high accuracy by purely statistical means is conceivably difficult.
The invention therefore makes a key improvement to dialog management, strengthening language understanding and the attention mechanism over emotion words, so that the basic semantics and the emotions in multi-turn dialogs can be captured effectively. The overall flow (shown in fig. 16) is a cyclic process of multi-turn interactive understanding.
The innovations of this embodiment lie mainly in two aspects: first, an emotion-recognition attention mechanism is added over the input utterances of the current turn on top of the traditional seq2seq language generation model; second, emotion tracking over the previous dialog turns, ordered in time, is added to dialog management.
The architecture of the first module is shown in fig. 17: an emotion-recognition attention mechanism is added over the current turn's input utterance on top of the traditional seq2seq language generation model.
In this architecture, each utterance spoken by the current user is fed into a bidirectional LSTM encoder which, unlike a conventional language generation model, applies attention to the emotion expressed in the current sentence. The recognized current emotional state is then combined with the encoder output of the user utterance and fed into the decoder, so that the decoder sees both the user's words and the current emotion; the system's dialog response is therefore personalized and specific to the user's current emotional state.
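A schematic PyTorch sketch of this conditioning is given below: a bidirectional LSTM encoder, a simple attention layer over the encoder states, and the recognized emotion embedded and concatenated into every decoder step. All dimensions, layer choices and names are assumptions used for illustration, not the patent's actual network.

```python
import torch
import torch.nn as nn

class EmotionAwareSeq2Seq(nn.Module):
    def __init__(self, vocab, emb=128, hid=256, n_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid, 1)               # attention scores over encoder steps
        self.emotion_embed = nn.Embedding(n_emotions, hid)
        self.decoder = nn.LSTMCell(emb + 2 * hid + hid, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, src, tgt, emotion_id):
        enc, _ = self.encoder(self.embed(src))          # (B, T_src, 2*hid)
        weights = torch.softmax(self.attn(enc), dim=1)  # emotion-word attention weights
        context = (weights * enc).sum(dim=1)            # (B, 2*hid) attended encoder summary
        emo = self.emotion_embed(emotion_id)            # (B, hid) current emotional state
        h = c = enc.new_zeros(src.size(0), self.out.in_features)
        logits = []
        for t in range(tgt.size(1)):                    # teacher-forced decoding
            step_in = torch.cat([self.embed(tgt[:, t]), context, emo], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)               # (B, T_tgt, vocab)
```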
The second innovation of the invention targets emotion recognition in multi-turn dialog and is a simple dialog state update method: the Emotion-Aware Information State Update (ISU) policy. This policy updates the dialog state whenever new information appears; specifically, whenever the user, the system or any other participant in the conversation generates new information, the dialog state is updated, and the update takes the emotions perceived in the previous turns into account. See fig. 18 for details.
FIG. 18 shows that the dialog state s_{t+1} at time t+1 depends on the state s_t and the system behavior a_t at the previous time t, together with the user behavior and emotion o_{t+1} observed at the current time t+1. This can be written as:

s_{t+1} ← s_t + a_t + o_{t+1}
When the dialog state is updated, each update is assumed to be deterministic: the same previous system state, the same system behavior and the same current user emotional state always produce the same current system state.
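A minimal sketch of such a deterministic emotion-aware state update is shown below; the state fields are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DialogState:
    turn: int
    last_system_act: str
    user_emotion: str            # emotion recognized at the current turn
    history: Tuple               # (utterance, emotion) pairs from earlier turns

def update_state(s_t: DialogState, a_t: str, o_t1: Tuple[str, str]) -> DialogState:
    """s_{t+1} <- f(s_t, a_t, o_{t+1}); identical inputs always yield the same output."""
    utterance, emotion = o_t1
    return DialogState(
        turn=s_t.turn + 1,
        last_system_act=a_t,
        user_emotion=emotion,
        history=s_t.history + ((utterance, emotion),),
    )
```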
7. Multimodal emotion semantic fusion based on time sequence:
In recent years, with the development of multi-source heterogeneous information fusion, features drawn from multiple classes of reference emotional states can be fused. By letting different types of signals support one another and fusing their complementary information, the quality of information processing is not merely averaged across the data sources but exceeds that of any single member and can be improved substantially. The concept of multimodal emotion analysis has appeared at recent international conferences on affective computing and intelligent interaction, and researchers have begun to study emotion recognition by exploiting the complementarity between emotional information from multiple channels, such as facial expressions, speech, eye movements, gestures and physiological signals, i.e., multimodal emotion recognition. Compared with recognition from a single signal, multimodal information fusion undoubtedly improves recognition accuracy. To improve both the recognition rate and the robustness of emotion recognition, different data sources must be selected for different application environments, and effective theories and methods must be applied to each data source to develop efficient and stable recognition algorithms; these remain hot topics for future research in the field.
At present, a few systems have begun to integrate one or two single modalities for emotion detection, for example in the following categories:
● emotion recognition based on visual and auditory perception
The most common multimodal recognition methods combine vision and hearing: both kinds of features are easy to acquire, and speech emotion recognition and facial expression recognition complement each other in recognition performance. Cross-cultural multimodal perception research supported in Japan has examined the relationship between facial expressions and emotional vocalizations when emotions are expressed. One such system adaptively adjusts the weights of the speech and facial-action feature parameters in bimodal emotion recognition and achieves an emotion recognition rate above 84%. With vision and hearing as input states and asynchronous constraints applied at the state layer, the fusion method improves the recognition rate by 12.5% and 11.6%, respectively.
● emotion recognition based on multiple physiological signals
Fusion of multiple physiological signals also has many applications. In 2004, Lee et al. used multiple physiological signals, including heart rate, skin temperature change and electrodermal activity, to monitor people's stress status. Other work mainly extracts useful features from ECG and heart-rate signals for emotion category recognition. Wu Fuqiu et al. extracted and classified features from three physiological signals: ECG, respiration and body temperature. Canento et al. combined various emotion-related physiological characteristics such as ECG, blood volume pulse, electrodermal activity and respiration for emotion recognition. Wagner et al. obtained a 92% fusion recognition rate by fusing physiological parameters from four channels: electromyography, electrocardiography, skin resistance and respiration. In the literature, recognition accuracy has been raised from 30% to 97.5% through multi-physiological-signal fusion.
● emotion recognition based on voice electrocardio combination
For the combination of speech and ECG, the literature fuses the speech signal and the ECG signal using weighted fusion and feature-space transformation. The single-modality emotion classifiers based on ECG and on speech achieve average recognition rates of 71% and 80% respectively, while the multimodal classifier reaches over 90%.
This embodiment goes beyond emotion recognition in five separate single modalities: it innovatively uses a deep neural network to jointly judge the information of the single modalities after neural-network encoding and deep association and understanding. This greatly improves accuracy, lowers the requirements on environment and hardware, and ultimately widens the range of application to most common scenarios, in particular special scenarios such as criminal investigation and interrogation.
The main architecture of the model is shown in fig. 19: the deep neural network jointly judges the information of the several single modalities after neural-network encoding and deep association and understanding.
The overall architecture treats emotion recognition as a judgment about the current time point made from all the related expressions, actions, text, speech and physiology before and after it on a continuous time axis. The method is therefore built on the classical seq2seq neural network. Seq2Seq was proposed in 2014, and its main ideas were first described independently in two papers: "Sequence to Sequence Learning with Neural Networks" from the Google Brain team and "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" from Yoshua Bengio's team. The two papers propose similar solutions to the machine translation problem, from which Seq2Seq emerged. The core idea of Seq2Seq is to map an input sequence to an output sequence through a deep neural network model (usually an LSTM, a long short-term memory recurrent network), in a process consisting of two stages: encoding the input and decoding the output. When the basic seq2seq model is applied to emotion recognition on a continuous time axis, it needs specific innovative changes to solve the problem well. Beyond what an ordinary seq2seq model must handle, emotion recognition must attend to several key features: 1. the relationships between different time points within each single modality; 2. the intrinsic effects and relationships among the modalities at the same time point; 3. the integrated emotion recognition across all modalities. None of these is addressed by the prior art.
Specifically, the model first contains five recurrent neural networks (RNNs), implemented in the practical system as long short-term memory (LSTM) networks. Each RNN organizes, in time order, the intermediate neural-network representations of one single-modality emotion understanding subsystem: the neural network unit at each time point (one blue bar in fig. 19) comes from the output, at the corresponding time point, of the intermediate layer of that single-modality subsystem's neural network. The per-time-point output of each RNN (one blue bar in fig. 19) is fed to the multimodal fusion association judgment RNN, so each time point of the multimodal RNN aggregates the outputs of all single-modality RNNs at that time point. After the multimodal synthesis, the output at each time point is the final emotion judgment for that time point (orange arrow in fig. 19).
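The following PyTorch sketch mirrors this description at a schematic level: one LSTM per modality consumes that modality's per-time-step intermediate features, and a fusion LSTM consumes the concatenated per-step outputs to emit an emotion prediction at every time step. Feature dimensions, hidden sizes and the emotion count are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class MultimodalFusionRNN(nn.Module):
    def __init__(self, feat_dims, hid=128, n_emotions=7):
        super().__init__()
        # one LSTM per modality: e.g. face, speech, text, posture, physiology
        self.modality_rnns = nn.ModuleList(
            [nn.LSTM(d, hid, batch_first=True) for d in feat_dims]
        )
        self.fusion_rnn = nn.LSTM(hid * len(feat_dims), hid, batch_first=True)
        self.classifier = nn.Linear(hid, n_emotions)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each (B, T, feat_dims[i]), time-aligned
        per_modality = [rnn(x)[0] for rnn, x in zip(self.modality_rnns, modality_feats)]
        fused_in = torch.cat(per_modality, dim=-1)   # (B, T, hid * n_modalities)
        fused, _ = self.fusion_rnn(fused_in)
        return self.classifier(fused)                # per-time-step emotion logits

# Example: model = MultimodalFusionRNN([512, 128, 300, 7, 16])
```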
The application scenario of the invention's software and hardware design is to provide professional analysts in the field of psychological counseling with a software tool for analyzing and studying a person's expressions and psychological-emotional changes. The whole system comprises four parts: micro-expression analysis software, dedicated analysis equipment, a high-definition camera and a printer.
Fig. 20 is a system architecture diagram of the overall product of the invention.
The face of the person being analyzed is recorded in real time by the high-definition camera, which provides a video stream accessible over the network. The dedicated analysis equipment hosts the product of the invention; the software interface is opened simply by double-clicking the software's shortcut icon, and while the program is running, the video address and the expression alarm threshold can be configured and managed as needed. The invention records, analyzes and evaluates the facial expression and heart-rate data of the person during the psychological counseling session and provides a data analysis report when the session ends. The operator can print the analysis result as a document with the printer for easy archiving.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A time sequence semantic fusion association judgment subsystem based on a multi-modal emotion recognition system comprises data acquisition equipment and output equipment, and is characterized in that: the emotion analysis software system comprehensively analyzes and infers the data obtained by the data acquisition equipment and finally outputs the result to the output equipment; the emotion analysis software system comprises a semantic fusion association judgment subsystem based on time sequence;
each RNN recurrent neural network organizes the representation form of the middle neural network understood by each single-mode emotion according to a time sequence, wherein one neural network unit at each time point is output from the corresponding time point of the middle layer of the neural network of the single-mode subsystem; the output of the neural network passing through the single time point of each RNN recurrent neural network is transmitted to a multi-mode fusion association judgment RNN recurrent neural network, the neural network output of the current time point of each single-mode RNN recurrent neural network is collected at each time point of the multi-mode RNN recurrent neural network, and after the multi-modes are integrated, the output of each time point is the final emotion judgment result of the time point;
and training the emotion semantics under the single mode after aligning the time sequence by taking the time sequence as a reference, thereby realizing cross-modal automatic association correspondence on the time sequence and finally fused comprehensive emotion recognition, understanding and reasoning judgment.
2. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system as recited in claim 1, wherein: the emotion analysis software system further comprises an emotion recognition subsystem based on facial image expressions, an emotion recognition subsystem based on voice signals, an emotion analysis subsystem based on text semantics, an emotion recognition subsystem based on human body gestures, an emotion recognition subsystem based on physiological signals and a multi-round conversation semantic understanding subsystem.
3. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the emotion recognition subsystem based on facial image expression exploits the specific expression patterns generated under specific emotional states and, based on the motion information of a dynamic image sequence and expression images, a regional optical flow model and a reference optical flow algorithm, effectively obtains motion field information from complex backgrounds and multi-pose expression sequences;
the emotion analysis subsystem based on the text semantics has the advantages that the text emotion analysis can be divided into three levels of words, sentences and chapters, the word-based method is to analyze emotion characteristic words, and the polarity of the words is judged or the similarity of the word semantics is calculated according to a threshold value; the sentence-based method comprises the steps of sampling emotion labels for each sentence, extracting evaluation words or acquiring evaluation phrases for analysis; the method based on the chapters is to analyze the overall emotional tendency of the chapters on the basis of sentence emotional tendency analysis;
the emotion recognition subsystem based on human body postures extracts typical examples of various emotion states of a human body, discriminates and analyzes subtle differences of similar emotions for each posture, establishes a feature library, and extracts physical motion information from the subtle differences to recognize according to the duration and frequency motion properties of human body actions as judgment bases;
the emotion analysis subsystem based on text semantics is an emotion recognition method improved based on a deep Convolutional Neural Network (CNN), the subsystem utilizes vocabulary semantic vectors generated in a target field to carry out emotion classification on texts in a problem field, the input of the emotion analysis subsystem is sentences or documents represented by a matrix, each line of the matrix corresponds to a word segmentation element, each line is a vector representing a word, and the vectors are all in the form of word entries and are obtained from the last module or are indexed according to the word in a word list;
the second layer of the subsystem is a convolutional neural network layer;
the third layer of the subsystem is a time-based convergence layer, the incidence relation of the characteristic information extracted from the previous convolutional layer on a time axis is found out, and the corresponding change on the time dimension in each characteristic matrix in the previous layer is summarized and induced, so that more concentrated characteristic information is formed;
the fourth layer of the subsystem is the final fully connected prediction layer, in which the condensed feature information obtained from the previous layer is first fully permuted and combined, searching all possible weight combinations so as to find the patterns of joint action among the condensed features; the next internal layer is a Dropout layer, meaning that during model training the weights of some hidden-layer nodes are randomly deactivated; the deactivated nodes are temporarily treated as not belonging to the network structure, but their weights are retained, since they may become active again when the next sample is input; the following internal layer is tanh, a nonlinear logistic transformation; and the last internal layer is softmax, an activation function commonly used in multi-class classification and based on logistic regression, which sharpens the probability of each candidate class so that the predicted class stands out;
the emotion recognition subsystem based on human body gestures is characterized in that emotion extraction based on motion recognition is performed according to a data input source, firstly, motion data are represented and modeled, and then, emotion modeling is performed to obtain two sets of representation data related to motions and emotions; then, the continuous action is accurately identified by using the existing action identification method based on the motion data to obtain the action information of the data; matching and corresponding the emotion model obtained before with an emotion database, and finally extracting the emotion of the input data by assisting action information in the process; the method specifically comprises the following steps:
human body modeling
Firstly, modeling joint points of a human body, regarding the human body as a rigid system with intrinsic relation, and comprising bones and the joint points, wherein the relative motion of the bones and the joint points forms the change of the posture of the human body, namely describing actions at ordinary times, in a plurality of joint points of the human body, according to the lightness and the heaviness of the influence on the emotion, fingers and toes are ignored, the spine of the human body is abstracted into three joints of a neck, a chest and an abdomen, and a human body model is summarized, wherein the upper half body comprises a head, a neck, a chest, an abdomen, two big arms and two small arms, and the lower half body comprises two thighs and two crus;
emotional state extraction
For the selected multiple emotional states, the expression of each emotional state is carried out under the normal condition of the human body, and the body reaction is analyzed in detail; because the human body is abstracted into a rigid model, the gravity center of the human body moves firstly and is divided into a forward state, a backward state and a natural state; in addition to the movement of the center of gravity, followed by the rotation of the joint points, the human body undergoes motion changes, and the joint points related to emotion include the head, the chest, the shoulders and the elbows, and the corresponding motions are the bending of the head, the rotation of the chest, the swinging and stretching directions of the upper arm, and the bending of the elbows, which parameters, in combination with the movement of the center of gravity, include seven degrees of freedom in total, expressing the motion of the upper half of a person.
4. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system as recited in claim 3, wherein: the emotion recognition subsystem based on facial image expression is an ensemble model based on VGG16 and RESNET 50.
5. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: in the emotion recognition subsystem based on the voice signals, the acoustic parameters of fundamental frequency, duration, tone quality and definition are emotion voice characteristic quantities, an emotion voice database is established, and new voice characteristic quantities are continuously extracted to be a basic method for voice emotion recognition.
6. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 5, wherein: the emotion recognition subsystem based on the voice signals is a model for performing emotion recognition on the voice signals based on a neural network MLP, firstly, continuous voice signals are segmented to obtain discrete sound tiny units, and the tiny units are partially overlapped, so that the model can better analyze the current unit and know the previous and next context voice units; then extracting voice energy curve information by the model; and then, extracting fundamental frequency curve information by the subsystem, describing and constructing the tone characteristic by the fundamental frequency characteristic, and extracting a fundamental frequency curve by adopting an autocorrelation method.
7. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the emotion recognition subsystem based on the physiological signals is a non-contact type physiological signal emotion recognition system, the physiological mechanism of emotion comprises emotion perception and physical physiological reaction of emotion, the emotion perception is a main emotion generation mechanism, different physiological reactions of the brain are reflected through electroencephalogram signals, due to the particularity of the signals, recognition is carried out through three characteristics of time domain, frequency domain and time-frequency domain, and time-frequency average spectrum entropy and fractal dimension are used as characteristic quantities for measuring brain activities;
the emotion recognition subsystem based on the physiological signal utilizes the change of light rays when blood flows in a human body in emotion recognition of the physiological signal: when the heart beats, blood can pass through the blood vessel, the more the blood volume passing through the blood vessel is, the more light absorbed by the blood is, the less light is reflected by the surface of human skin, and the heart rate is estimated through time-frequency analysis of the image;
the first step is to carry out spatial filtering on a video sequence to obtain base bands with different spatial frequencies;
secondly, performing band-pass filtering on each baseband in a time domain to extract the interested part of the variation signals;
and thirdly, amplifying and synthesizing, and counting the number of the peak values of the signal change, namely the physiological heart rate of the person is approximated.
8. The time-series semantic fusion association judgment subsystem based on the multi-modal emotion recognition system of claim 2, wherein: the multi-round conversation semantic understanding subsystem adds an emotion-recognition attention mechanism, on top of the traditional seq2seq language generation model, over the input utterance of the current turn, and adds emotion tracking over the previous dialog turns in time sequence to dialog management; each utterance spoken by the current user is input into a bidirectional LSTM encoder, the inputs discriminating the current emotional state are then combined with the encoder output of the user utterance just generated and input together into a decoder, so that the decoder has both the user's utterance and the current emotion, and the generated system dialog response is an output personalized and specific to the current user's emotional state; the emotion-aware information state update strategy updates the conversation state whenever there is new information; when the conversation state is updated, each update is deterministic, so that the same previous-moment system state, the same system behavior and the same current-moment user emotional state necessarily produce the same current-moment system state.
CN201810612592.0A 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system Active CN108805087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810612592.0A CN108805087B (en) 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system

Publications (2)

Publication Number Publication Date
CN108805087A CN108805087A (en) 2018-11-13
CN108805087B true CN108805087B (en) 2021-06-15

Family

ID=64085933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810612592.0A Active CN108805087B (en) 2018-06-14 2018-06-14 Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system

Country Status (1)

Country Link
CN (1) CN108805087B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109598217A (en) * 2018-11-23 2019-04-09 南京亨视通信息技术有限公司 A kind of system that the micro- Expression analysis of human body face is studied and judged
CN109558935A (en) * 2018-11-28 2019-04-02 黄欢 Emotion recognition and exchange method and system based on deep learning
CN109620205B (en) * 2018-12-26 2022-10-28 上海联影智能医疗科技有限公司 Electrocardiogram data classification method and device, computer equipment and storage medium
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN109903837A (en) * 2019-03-05 2019-06-18 浙江强脑科技有限公司 Psychological detection method, device and computer readable storage medium
CN110008676B (en) * 2019-04-02 2022-09-16 合肥智查数据科技有限公司 System and method for multi-dimensional identity checking and real identity discrimination of personnel
CN109960723B (en) * 2019-04-12 2021-11-16 浙江连信科技有限公司 Interaction system and method for psychological robot
CN110147729A (en) * 2019-04-16 2019-08-20 深圳壹账通智能科技有限公司 User emotion recognition methods, device, computer equipment and storage medium
CN110321781B (en) * 2019-05-06 2021-10-26 苏宁金融服务(上海)有限公司 Signal processing method and device for non-contact measurement
CN110188669B (en) * 2019-05-29 2021-01-19 华南理工大学 Air handwritten character track recovery method based on attention mechanism
CN110232412B (en) * 2019-05-30 2020-12-11 清华大学 Human gait prediction method based on multi-mode deep learning
CN110267052B (en) * 2019-06-19 2021-04-16 云南大学 Intelligent barrage robot based on real-time emotion feedback
CN110334626B (en) * 2019-06-26 2022-03-04 北京科技大学 Online learning system based on emotional state
CN110458201B (en) * 2019-07-17 2021-08-24 北京科技大学 Object-oriented classification method and classification device for remote sensing image
CN110569869A (en) * 2019-07-23 2019-12-13 浙江工业大学 feature level fusion method for multi-modal emotion detection
CN110755092B (en) * 2019-09-02 2022-04-12 中国航天员科研训练中心 Non-contact emotion monitoring method with cross-media information fusion function
CN110598608B (en) * 2019-09-02 2022-01-14 中国航天员科研训练中心 Non-contact and contact cooperative psychological and physiological state intelligent monitoring system
CN110704588B (en) * 2019-09-04 2023-05-30 平安科技(深圳)有限公司 Multi-round dialogue semantic analysis method and system based on long-short-term memory network
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111222009B (en) * 2019-10-25 2022-03-22 汕头大学 Processing method of multi-modal personalized emotion based on long-time memory mechanism
US11915123B2 (en) 2019-11-14 2024-02-27 International Business Machines Corporation Fusing multimodal data using recurrent neural networks
CN110801227B (en) * 2019-12-09 2021-07-20 中国科学院计算技术研究所 Method and system for testing three-dimensional color block obstacle based on wearable equipment
CN111160163B (en) * 2019-12-18 2022-04-01 浙江大学 Expression recognition method based on regional relation modeling and information fusion modeling
CN111325292B (en) * 2020-03-11 2023-05-02 中国电子工程设计院有限公司 Object behavior recognition method and device
CN111401268B (en) * 2020-03-19 2022-11-15 内蒙古工业大学 Multi-mode emotion recognition method and device for open environment
CN111444863B (en) * 2020-03-30 2023-05-23 华南理工大学 Driver emotion recognition method based on camera and adopting 5G vehicle-mounted network cloud assistance
CN111462752B (en) * 2020-04-01 2023-10-13 北京思特奇信息技术股份有限公司 Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN111680550A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Emotion information identification method and device, storage medium and computer equipment
CN111584073B (en) * 2020-05-13 2023-05-09 山东大学 Method for constructing diagnosis models of benign and malignant lung nodules in various pathological types
CN111814609B (en) * 2020-06-24 2023-09-29 厦门大学 Micro-expression recognition method based on deep forest and convolutional neural network
CN112001444A (en) * 2020-08-25 2020-11-27 斑马网络技术有限公司 Multi-scene fusion method for vehicle
CN112466336B (en) * 2020-11-19 2023-05-05 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium based on voice
CN112287893B (en) * 2020-11-25 2023-07-18 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112579745B (en) * 2021-02-22 2021-06-08 中国科学院自动化研究所 Dialogue emotion error correction system based on graph neural network
CN112579762B (en) * 2021-02-24 2021-06-08 之江实验室 Dialogue emotion analysis method based on semantics, emotion inertia and emotion commonality
CN113095428B (en) * 2021-04-23 2023-09-19 西安交通大学 Video emotion classification method and system integrating electroencephalogram and stimulus information
CN113139525B (en) * 2021-05-21 2022-03-01 国家康复辅具研究中心 Multi-source information fusion-based emotion recognition method and man-machine interaction system
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113643724B (en) * 2021-07-06 2023-04-28 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
CN113672731B (en) * 2021-08-02 2024-02-23 北京中科闻歌科技股份有限公司 Emotion analysis method, device, equipment and storage medium based on field information
CN113707275B (en) * 2021-08-27 2023-06-23 郑州铁路职业技术学院 Mental health estimation method and system based on big data analysis
CN114287938B (en) * 2021-12-13 2024-02-13 重庆大学 Method and equipment for obtaining safety interval of human body parameters in building environment
CN114926837B (en) * 2022-05-26 2023-08-04 东南大学 Emotion recognition method based on human-object space-time interaction behavior
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method
CN117473304A (en) * 2023-12-28 2024-01-30 天津大学 Multi-mode image labeling method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170178346A1 (en) * 2015-12-16 2017-06-22 High School Cube, Llc Neural network architecture for analyzing video data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107456218A (en) * 2017-09-05 2017-12-12 清华大学深圳研究生院 A kind of mood sensing system and wearable device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on Multimodal Physiological Signal Fusion and Emotion Recognition Based on SAE and LSTM RNN"; Li Youjun; Huang Jiajin; Wang Haiyuan; Zhong Ning; Journal on Communications; 2017-12-25; Vol. 38, No. 12; pp. 109-120 *

Also Published As

Publication number Publication date
CN108805087A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
Abdullah et al. Multimodal emotion recognition using deep learning
Jiang et al. A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Zadeh et al. Memory fusion network for multi-view sequential learning
Poria et al. A review of affective computing: From unimodal analysis to multimodal fusion
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN112766173B (en) Multi-mode emotion analysis method and system based on AI deep learning
Al Osman et al. Multimodal affect recognition: Current approaches and challenges
CN113591525A (en) Driver road rage recognition method with deep fusion of facial expressions and voice
Schels et al. Multi-modal classifier-fusion for the recognition of emotions
Kim et al. Multimodal affect classification at various temporal lengths
Du et al. A novel emotion-aware method based on the fusion of textual description of speech, body movements, and facial expressions
Gladys et al. Survey on Multimodal Approaches to Emotion Recognition
Zaferani et al. Automatic personality traits perception using asymmetric auto-encoder
CN115145402A (en) Intelligent toy system with network interaction function and control method
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Meghjani et al. Bimodal information analysis for emotion recognition
Kumar et al. Depression detection using stacked autoencoder from facial features and NLP
Shukla et al. Deep ganitrus algorithm for speech emotion recognition
Dineshkumar et al. Deperssion Detection in Naturalistic Environmental Condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant