CN112101097A - Depression and suicide tendency identification method integrating body language, micro expression and language - Google Patents
- Publication number
- CN112101097A (application CN202010764410.9A)
- Authority
- CN
- China
- Prior art keywords
- layer
- language
- information
- voice
- som
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/174—Facial expression recognition
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2411—Classification based on the proximity to a decision surface, e.g. support vector machines
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06F2218/08—Feature extraction (aspects of pattern recognition specially adapted for signal processing)
Abstract
The invention provides a method for identifying depression and suicidal tendency by fusing body language, micro-expression and language. The method comprises the following steps: collecting video and audio with a Kinect equipped with an infrared camera, and converting the video information and audio information into feature text descriptions; fusing the generated feature text descriptions, and classifying the emotion of the fused result using a self-organizing map (SOM) and a compensation layer; and marking the persons identified as possibly having depressed mood or suicidal tendency for observation. The invention takes both static and dynamic body movement into account, achieving higher efficiency, and uses the Kinect for data acquisition, which is non-invasive, high-performance and easy to operate.
Description
Technical Field
The invention belongs to the field of emotion recognition, and particularly relates to a depression and suicide tendency recognition method fusing body language, micro expression and language.
Background
The accelerated pace of life and changes in the social environment leave many people under great stress, which can produce depressed psychology and even self-harming or suicidal behaviour. Detecting people's mood is useful for discovering emotional problems as early as possible and preventing the development of depression or suicidal intention. Human emotions can be recognized by various means, such as electrocardiogram (ECG), electroencephalogram (EEG) (K. Takahashi, "Remarks on emotion recognition from multi-modal bio-potential signals," Proc. IEEE Int. Conf. Ind. Technol. (ICIT), vol. 3, pp. 1138-1143, Jun. 2004), speech, and facial expressions. Among the various emotion signals, physiological signals are widely used for emotion recognition; in recent years, human motion has also become a new feature.
There are two conventional approaches: one measures physiological indices of a subject by contact (J. Kim and E. André, "Emotion recognition based on physiological changes in music listening," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 12, pp. 2067-2083, 2008), and the other observes physiological properties of a subject without contact. Although a non-invasive approach is preferable, subjects can mask their mood. Technically, audio and video (F. Xu, J. Zhang and J. Z. Wang, "Microexpression Identification and Categorization Using a Facial Dynamics Map," IEEE Transactions on Affective Computing, vol. 8, issue 2, 2017) are readily available, but they are susceptible to noise. In principle, detecting static posture alone or dynamic action alone (H. Wallbott, "Bodily Expression of Emotion," European J. Social Psychology, vol. 28, pp. 879-896, 1998; M. Coulson, "Attributing Emotion to Static Body Postures: Recognition Accuracy, Confusions, and Viewpoint Dependence," J. Nonverbal Behavior, vol. 28, no. 2, pp. 117-139, 2004; J. Burgoon, L. Guerrero and K. Floyd, Nonverbal Communication, Allyn and Bacon, 2010) gives lower computational complexity for emotion recognition, but also lower recognition accuracy. It is therefore necessary to fuse these characteristics: through the fusion of multi-modal features, the emotion class of the person under examination can be identified more accurately.
Disclosure of Invention
To solve the above problems, the invention provides a method for identifying depression and suicidal tendency by fusing body language, micro-expression and language. The method effectively fuses characteristic information such as body movement, facial expression and language, and by classifying the emotion expressed in this information it can detect more efficiently and accurately whether a person has depressed mood or an intention of suicidal behaviour. First, voice, body movement, facial expression and other information are collected using a Kinect equipped with an infrared camera. The voice is converted into a text description via prosodic and spectral features extracted from it, covering information such as intonation, tone and speaking rate; static and dynamic body motion are analysed respectively with a convolutional neural network (CNN) and a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF); and feature extraction and dimension reduction are applied to the face images. Finally, the voice, limb movement and facial expression information is fused into the text description, and a self-organizing map (SOM) with a compensation layer is used to understand behaviour and recognize emotion.
The purpose of the invention is realized by at least one of the following technical solutions.
A depression and suicide tendency recognition method fusing body language, micro expression and language comprises the following steps:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions;
s2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer;
and S3, marking the people possibly having depressed mood or suicide tendency obtained in the step S2 for observation.
Further, in step S1, the video information includes limb-movement and facial-expression information extracted from the video, where the limb movement comprises static motion and dynamic motion; the audio information includes spectrum, rhythm and sound-wave information extracted from the voice audio, where the spectrum and rhythm information are used to obtain voice markers and the sound-wave information is used to obtain the voice content.
Further, in step S1, the extracting of the feature text description specifically includes the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
s1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
s1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
and S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating language feature text description.
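As an illustration of how the four channels of steps S1.1 to S1.4 might be combined into one stream, the sketch below merges per-channel text descriptions into a time-ordered transcript; the FeatureText schema, channel names and sample phrases are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class FeatureText:
    """One time-stamped description from a single channel (hypothetical schema)."""
    t: float       # timestamp in seconds
    channel: str   # 'static', 'dynamic', 'face' or 'speech'
    text: str

def merge_descriptions(items):
    """Interleave per-channel descriptions into one time-ordered transcript."""
    return [f"[{ft.t:.1f}s {ft.channel}] {ft.text}"
            for ft in sorted(items, key=lambda ft: ft.t)]

lines = merge_descriptions([
    FeatureText(2.0, "face", "brow lowered, lip corners down"),
    FeatureText(1.5, "speech", "low pitch, slow speaking rate"),
    FeatureText(2.0, "static", "slumped posture"),
])
```

A downstream classifier would then consume this merged transcript rather than the raw signals.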
Further, in step S1.1, single frames are selected from the collected video and input to the convolutional neural network (CNN) for training and testing; after training, all single frames of the video are input to the CNN to obtain static motion with emotional characteristics, which is then input to a Softmax classifier for classification, completing the identification of static motion. The Softmax function is calculated as follows:

p_i = exp(W_i·x + b) / Σ_j exp(W_j·x + b)   (1)

where W_i is the weight matrix of the i-th class of characteristic text and b denotes the bias.
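A minimal numerical sketch of the Softmax classification described above; the weight and bias values are illustrative, and per-class biases are assumed where the patent's formula is not fully legible.

```python
import numpy as np

def softmax_scores(x, W, b):
    """Class probabilities p_i proportional to exp(W_i·x + b_i)."""
    z = W @ x + b
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax_scores(np.array([1.0, 2.0]),
                       np.array([[0.5, -0.2],
                                 [0.1,  0.3]]),
                       np.array([0.0, 0.0]))
```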
Further, the convolutional neural network (CNN) computes convolutions with partial filters, i.e. an inner product is taken between a local sub-matrix of the input and the filter, and the output is a convolution matrix; the hidden layers of the CNN comprise two convolutional layers and two pooling layers.

The formula of the convolutional layer is as follows:

a^l_{i,j} = f(Σ_m w^l_{j,m}·x^{l-1}_{i+m} + b^l_j)   (2)

where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and ranges from 0 to N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function.

The pooling layer uses mean pooling; its input comes from the preceding convolutional layer and its output serves as the input of the next convolutional layer, computed as:

p^l_{i,j} = (1/n²) Σ_{(u,v)∈R_{i,j}} a^{l-1}_{u,v}   (3)

where p^l_{i,j} denotes the local output after pooling, taken as the mean of an n×n local sub-matrix R_{i,j} of the previous layer.
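The n×n mean pooling described above can be sketched as follows, assuming for simplicity that the feature-map dimensions divide evenly by n:

```python
import numpy as np

def mean_pool(x, n):
    """Mean of each non-overlapping n x n block of a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // n, n, w // n, n).mean(axis=(1, 3))

pooled = mean_pool(np.arange(16.0).reshape(4, 4), 2)  # 4x4 map -> 2x2 map
```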
Further, in step S1.2, the human body is first located and tracked with the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Since the skeleton position signals vary over time and become ill-defined under occlusion, a frame sequence is extracted from the video and passed through an interval Kalman filter to improve the precision of the skeleton positions; a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is then used to analyse the motion sequences of the 15 skeleton points and obtain dynamic motion with emotional characteristics.
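The patent specifies interval Kalman filtering for the skeleton positions; as a simplified stand-in, the sketch below applies an ordinary constant-velocity Kalman filter to a single joint coordinate (the interval variant, which propagates interval bounds on the state, is not reproduced here, and the noise parameters are illustrative).

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=0.05):
    """Ordinary Kalman filter (position + velocity state) over measurements z[t]."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([[z[0]], [0.0]])
    P = np.eye(2)
    out = []
    for zt in z:
        x = F @ x
        P = F @ P @ F.T + Q                              # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
        x = x + K @ (np.array([[zt]]) - H @ x)           # measurement update
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out

track = kalman_smooth_1d([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # stationary joint
```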
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:

h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)   (4)

where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h is the activation function. Although an LSTM can capture information from long sequences, it considers only one direction; the Bi-LSTM reinforces the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM.
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
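Eq. (4) and the forward/backward pairing can be sketched as below. Note that this uses only the simplified hidden-layer recurrence of Eq. (4), not the full LSTM gating; the weight shapes and the concatenation of the two directions are illustrative assumptions.

```python
import numpy as np

def hidden_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigma_h(W_xh x_t + W_hh h_{t-1} + b_h), with tanh as sigma_h."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def bi_states(xs, W_xh, W_hh, b_h):
    """Run the recurrence forward and backward, concatenating the states."""
    def run(seq):
        h = np.zeros(W_hh.shape[0])
        out = []
        for x in seq:
            h = hidden_step(x, h, W_xh, W_hh, b_h)
            out.append(h)
        return out
    fwd = run(xs)
    bwd = run(xs[::-1])[::-1]   # backward pass, re-aligned to forward time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(4)]          # 4 time steps, 2-D input
states = bi_states(xs, rng.normal(size=(3, 2)),
                   rng.normal(size=(3, 3)), np.zeros(3))
```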
Further, in step S1.3, the segmented regions of the face are obtained from the face image frames captured by the Kinect. The original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that samples of the same class cluster together while samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class dispersion to within-class dispersion are selected. Finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified with the open-source OpenFace neural network to obtain the facial-expression recognition result.
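A numpy-only sketch of the LDA criterion described above, selecting projection directions that maximize the ratio of between-class to within-class dispersion; the Gabor feature extraction and OpenFace classification steps are omitted, and the data here are synthetic.

```python
import numpy as np

def lda_project(X, y, k):
    """Project X onto the k directions maximizing between/within class scatter."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1][:k]   # largest scatter-ratio directions
    return X @ vecs[:, order].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 4)),             # class 0
               rng.normal(size=(20, 4)) + 5.0])      # class 1, shifted mean
y = np.array([0] * 20 + [1] * 20)
Z = lda_project(X, y, 1)
```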
Further, in step S1.4, voice is first collected directly by the Kinect and its noise is reduced with a Wiener-based noise filter; the denoised voice is then input to a back-propagation neural network (BPNN), a feed-forward network trained with the back-propagation algorithm, to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input to the Softmax classifier for classification, yielding the speech recognition result.
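A minimal frequency-domain sketch of Wiener-style noise reduction. The patent does not disclose its exact filter; this is the classic spectral-subtraction form of the Wiener gain, with the noise power assumed known.

```python
import numpy as np

def wiener_denoise(x, noise_power):
    """Per-bin gain max(S - N, 0)/S, where S is the observed bin power."""
    X = np.fft.rfft(x)
    S = np.abs(X) ** 2 / len(x)                       # periodogram estimate
    gain = np.maximum(S - noise_power, 0.0) / np.maximum(S, 1e-12)
    return np.fft.irfft(gain * X, n=len(x))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 800, endpoint=False)
clean = np.sin(2 * np.pi * 5.0 * t)                   # 5 Hz tone
noisy = clean + 0.3 * rng.normal(size=t.size)         # additive white noise
denoised = wiener_denoise(noisy, noise_power=0.09)    # known noise variance
```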
Further, step S2 specifically includes the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; the LSTM network used here is the forward LSTM of the Bi-LSTM in step S1.2;

S2.2, normalizing the feature vectors of step S2.1 with the self-organizing map (SOM) algorithm;
S2.3, since the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is used to make up for the loss, i.e. for each distinct classification result of the SOM there must be a specific compensation layer bound to it. Each compensation layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the class of the corresponding compensation layer and t indexes the t-th node within layer w_s; the compensation layers are not shared, each corresponding to one class of the classified SOM output. The compensated result is computed as:

u_s = w_s·μ_s + b   (5)

where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is a bias limiting the compensation proportion to between -1 and 1;
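A toy sketch of the SOM competition step followed by the per-class compensation of Eq. (5); the clipping to [-1, 1] reflects the stated limit on the compensation proportion, and all sizes and values are illustrative.

```python
import numpy as np

def som_bmu(x, weights):
    """Index of the best-matching unit on the SOM competition layer."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def compensate(mu_s, w_s, b):
    """Eq. (5): u_s = w_s * mu_s + b, applied per node of the class layer."""
    return np.clip(w_s * mu_s + b, -1.0, 1.0)

weights = np.array([[0.0, 0.0],       # 3-node competition layer (toy sizes)
                    [1.0, 1.0],
                    [0.0, 1.0]])
s = som_bmu(np.array([0.9, 1.1]), weights)                 # winning class
u = compensate(np.array([0.5, -0.5]), np.array([0.8, 0.8]), 0.1)
```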
Since similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:

e = L_G + L_SOM + L_S   (6)

where the first term L_G is:

L_G = ‖y - ŷ‖² + ‖y - μ_k‖² + ‖μ_k - x‖²

The first term ‖y - ŷ‖² minimizes the error between the labels and the predicted results; the second term ‖y - μ_k‖² minimizes the error between the labels and the SOM network results; and the third term ‖μ_k - x‖² minimizes the difference between the input signal of the SOM network and its output signal.
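A worked sketch of the L_G term of the global objective. Its exact form is reconstructed here as the sum of the three squared-error terms the text describes (label vs. prediction, label vs. SOM result, SOM input vs. output), which is an assumption where the published formula is illegible.

```python
import numpy as np

def global_loss_LG(y, y_pred, mu_k, x):
    """L_G = ||y - y_pred||^2 + ||y - mu_k||^2 + ||mu_k - x||^2."""
    return (np.sum((y - y_pred) ** 2)      # label vs. predicted result
            + np.sum((y - mu_k) ** 2)      # label vs. SOM network result
            + np.sum((mu_k - x) ** 2))     # SOM input vs. SOM output

loss = global_loss_LG(np.array([1.0, 0.0]),
                      np.array([0.8, 0.2]),
                      np.array([0.9, 0.1]),
                      np.array([0.7, 0.3]))
```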
Further, in step S3, the possibility of depressed mood and suicidal tendency is obtained from the output of step S2, and high-risk persons are marked and observed so that appropriate psychological counselling can be provided.
Compared with the prior art, the invention has the following advantages:
(1) The invention aligns the multimodal data at the text layer. The intermediate text representation and the proposed fusion method form a framework that fuses speech, limb movement and facial expression, reducing the dimensionality of the three modalities and unifying them into one component.
(2) The depth information enhances the robustness and accuracy of motion detection.
(3) The invention takes static body movement and dynamic body movement into account, achieving higher efficiency.
(4) The invention uses the Kinect for data acquisition, which is non-invasive, high-performance and easy to operate.
Drawings
Fig. 1 is a flowchart of a method for identifying depression and suicidal tendency by fusing body language, micro expression and language in the embodiment of the invention.
Detailed Description
Specific implementations of the present invention will be further described with reference to the following examples and drawings, but the embodiments of the present invention are not limited thereto.
Example (b):
a method for identifying depression and suicidal tendency by fusing body language, micro expression and language, as shown in figure 1, comprises the following steps:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions; the video information comprises limb movement and facial expression information extracted from the video, wherein the limb movement comprises static movement and dynamic movement; the audio information comprises frequency spectrum, rhythm and sound wave information extracted from voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring voice content.
The extraction of the feature text description specifically comprises the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
Single frames are selected from the collected video and input to the convolutional neural network (CNN) for training and testing; after training, all single frames of the video are input to the CNN to obtain static motion with emotional characteristics, which is then input to a Softmax classifier for classification, completing the identification of static motion. The Softmax function is calculated as follows:

p_i = exp(W_i·x + b) / Σ_j exp(W_j·x + b)   (1)

where W_i is the weight matrix of the i-th class of characteristic text and b denotes the bias.
The formula of the convolutional layer is as follows:

a^l_{i,j} = f(Σ_m w^l_{j,m}·x^{l-1}_{i+m} + b^l_j)   (2)

where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and ranges from 0 to N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function.

The pooling layer uses mean pooling; its input comes from the preceding convolutional layer and its output serves as the input of the next convolutional layer, computed as:

p^l_{i,j} = (1/n²) Σ_{(u,v)∈R_{i,j}} a^{l-1}_{u,v}   (3)

where p^l_{i,j} denotes the local output after pooling, taken as the mean of an n×n local sub-matrix R_{i,j} of the previous layer.
S1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
Firstly, the human body is located and tracked with the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Since the skeleton position signals vary over time and become ill-defined under occlusion, a frame sequence is extracted from the video and passed through an interval Kalman filter to improve the precision of the skeleton positions; a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is then used to analyse the motion sequences of the 15 skeleton points and obtain dynamic motion with emotional characteristics.
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:

h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)   (4)

where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h is the activation function. Although an LSTM can capture information from long sequences, it considers only one direction; the Bi-LSTM reinforces the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM.
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
S1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
The segmented regions of the face are obtained from the face image frames captured by the Kinect. The original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that samples of the same class cluster together while samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class dispersion to within-class dispersion are selected. Finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified with the open-source OpenFace neural network to obtain the facial-expression recognition result.
S1.4, completing recognition of the voice mark and recognition of voice content, and generating language feature text description;
Firstly, voice is collected directly by the Kinect and its noise is reduced with a Wiener-based noise filter; the denoised voice is then input to a back-propagation neural network (BPNN), a feed-forward network trained with the back-propagation algorithm, to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input to the Softmax classifier for classification, yielding the speech recognition result.
S2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer; the method specifically comprises the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; the LSTM network used here is the forward LSTM of the Bi-LSTM in step S1.2;

S2.2, normalizing the feature vectors of step S2.1 with the self-organizing map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the information loss; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it; each such layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within the w_s layer; the compensation layers are not shared, and each layer corresponds to a specific class of the SOM output; the multiplication result is computed as:
u_s = w_s · μ_s + b  (5)
where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used to limit the compensation proportion between -1 and 1;
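A minimal NumPy sketch of the compensation step u_s = w_s · μ_s + b in formula (5): each SOM class s has its own, unshared weight layer w_s applied to the SOM-layer activations, plus a bias clipped to [-1, 1]. The array shapes, the random initialization, and the explicit clipping are assumptions for illustration, since the text does not fully specify them.

```python
import numpy as np

rng = np.random.default_rng(2)

n_classes, n_nodes = 4, 25           # 25 nodes: size of the SOM competition layer
w = rng.normal(0, 0.1, (n_classes, n_nodes))          # w[s, t]: t-th node, s-th layer
b = np.clip(rng.normal(0, 1, n_classes), -1.0, 1.0)   # bias limited to [-1, 1]

def compensate(mu, s):
    """u_s = w_s · mu_s + b  (formula 5): compensation output for SOM class s."""
    return float(w[s] @ mu + b[s])

mu = rng.random(n_nodes)             # SOM-layer activations for one input
u = [compensate(mu, s) for s in range(n_classes)]
```

Because the layers are unshared, each class s produces its own scalar u_s from the same SOM activations.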
S2.4, because similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:
e = L_G + L_SOM + L_S  (6)
where the first term L_G consists of three terms: the first minimizes the error between the labels and the predicted results; the second, ‖y − μ_k‖², minimizes the error between the label and the SOM network result; and the third, ‖μ_k − x‖², minimizes the difference between the input signal and the output signal of the SOM network.
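The three described components of L_G can be sketched as squared-error terms. The exact form of the prediction term is not given in the text (nor are L_SOM and L_S), so the stand-in below, with a squared error between label y and prediction y_hat plus the two stated SOM terms, is an assumption for illustration only.

```python
import numpy as np

def l_g(y, y_hat, mu_k, x):
    """Sketch of L_G: label-vs-prediction, label-vs-SOM, SOM input-vs-output terms."""
    term1 = np.sum((y - y_hat) ** 2)   # minimize error of labels and predictions (assumed form)
    term2 = np.sum((y - mu_k) ** 2)    # ||y - mu_k||^2: label vs SOM result
    term3 = np.sum((mu_k - x) ** 2)    # ||mu_k - x||^2: SOM output vs input signal
    return term1 + term2 + term3

y = np.array([1.0, 0.0]); y_hat = np.array([0.9, 0.1])
mu_k = np.array([0.8, 0.1]); x = np.array([0.7, 0.2])
loss = l_g(y, y_hat, mu_k, x)
```

Each term is zero exactly when the corresponding pair of signals agrees, so minimizing their sum pushes prediction, SOM output, and input toward mutual consistency.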
S3, according to the output of step S2, the likelihood of depressed mood and suicidal tendency is determined, and high-risk persons are marked and observed.
Claims (10)
1. The method for identifying depression and suicide tendency by fusing body language, micro expression and language is characterized by comprising the following steps of:
S1, collecting video and audio with a Kinect equipped with an infrared camera, and converting the video information and the audio information into feature text descriptions respectively;
S2, performing information fusion on the feature text descriptions generated in step S1, and performing emotion classification on the processing result using a self-organizing map (SOM) and a compensation layer;
S3, marking the persons identified in step S2 as possibly having depressed mood or suicidal tendency, for observation.
2. The method for recognizing depression and suicidality according to claim 1, wherein in step S1, the video information includes information of body movements and facial expressions extracted from the video, the body movements including static movements and dynamic movements; the audio information comprises frequency spectrum, rhythm and sound wave information extracted from voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring voice content.
3. The method for identifying depression and suicidality combined with body language, microexpression and language according to claim 2, wherein the step S1 of extracting the feature text description specifically comprises the following steps:
S1.1, using a Convolutional Neural Network (CNN) to complete the identification of static motion and generate a static-motion feature text description;
S1.2, detecting human skeleton data in real time with the Kinect, calculating the behavior features of the human body, completing the identification of dynamic motion, and generating a dynamic-motion feature text description;
S1.3, completing the information identification of facial expressions with a local identification method, and generating a facial-activity feature text description;
S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating a language feature text description.
4. The method of claim 3, wherein in step S1.1, single frames are selected from the collected video and input into a Convolutional Neural Network (CNN) for training and testing; after training, all single frames in the video are input into the CNN to obtain static motion with emotional features, and the static motion with emotional features is input into a Softmax classifier for classification to complete the identification of static motion; the Softmax function is calculated as:
p_i = exp(W_i · x + b) / Σ_j exp(W_j · x + b)
where W_i is the weight matrix of the i-th class of feature text and b denotes the bias.
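The Softmax classification step can be sketched directly in NumPy, with W holding one weight row per class and b the bias; the input dimensionality and weight values below are illustrative assumptions.

```python
import numpy as np

def softmax_classify(x, W, b):
    """P(class i | x) proportional to exp(W_i · x + b_i); max subtracted for stability."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

W = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # one weight row per class
b = np.zeros(3)
probs = softmax_classify(np.array([1.0, 0.0]), W, b)
predicted = int(probs.argmax())    # class with the highest probability wins
```

The outputs form a probability distribution over classes, and the class whose weight row best matches the feature vector receives the largest probability.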
5. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 4, wherein the Convolutional Neural Network (CNN) calculates the convolution with a local filter, i.e. the inner product of a local submatrix of the input and the local filter is taken, and the output is the convolution matrix; the hidden layers of the Convolutional Neural Network (CNN) comprise two convolutional layers and two pooling layers;
The convolutional layer is computed as:
x_{i,j}^l = f( (w_j^l * x^{l-1})_i + b_j^l )
where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j denotes the index of the corresponding output matrix, ranging from 0 to N, where N is the number of convolution output matrices; and f is a nonlinear sigmoid-type function;
The pooling layer uses mean pooling; its input comes from the convolutional layer above and its output serves as the input of the next convolutional layer; each pooling output is computed as the mean of the values in the corresponding pooling window of the convolution output.
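The "inner product of a local submatrix with the filter" and the mean-pooling step can be sketched as follows in NumPy; the filter values and window sizes are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide filter k over x; each output entry is the inner product of a submatrix with k."""
    n = x.shape[0] - k.shape[0] + 1
    m = x.shape[1] - k.shape[1] + 1
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def mean_pool(x, size=2):
    """Non-overlapping mean pooling: average each size x size block."""
    n, m = x.shape[0] // size, x.shape[1] // size
    return x[:n * size, :m * size].reshape(n, size, m, size).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2)) / 4.0          # 2x2 averaging filter
c = conv2d_valid(x, k)             # 3x3 convolution output
p = mean_pool(c)                   # 1x1 after 2x2 mean pooling
```

Chaining the two functions mirrors the claim's layout: the pooling layer consumes the convolution output and its result would feed the next convolutional layer.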
6. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 4, wherein in step S1.2, the positioning and tracking of the human body is first completed by the Kinect to obtain the skeleton joint points; 15 skeleton joint points are numbered from top to bottom and from left to right; a frame sequence is extracted from the video and input into an interval Kalman filter to improve the accuracy of the skeleton positions; then a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) analyzes the motion sequences of the 15 skeleton points separately to obtain dynamic motion with emotional features;
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T denotes the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:
h_t = σ_h(W_xh x_t + W_hh h_{t-1} + b_h)  (4)
where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h denotes the activation function; the Bi-LSTM strengthens the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM;
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
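Formula (4) can be sketched as a simple recurrent hidden-state update run forward and then backward over the sequence, with the two outputs concatenated per time step. A full LSTM adds gates and a cell state, which are omitted here; all shapes, weights, and the tanh choice for σ_h are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_rnn(xs, Wxh, Whh, bh):
    """h_t = sigma_h(Wxh x_t + Whh h_{t-1} + bh), with tanh as the activation."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)
        hs.append(h)
    return np.array(hs)

T, d_in, d_h = 15, 3, 6              # e.g. 15 skeleton joints with (x, y, z) coordinates
xs = rng.normal(size=(T, d_in))
Wxh = rng.normal(0, 0.3, (d_h, d_in))
Whh = rng.normal(0, 0.3, (d_h, d_h))
bh = np.zeros(d_h)

fwd = run_rnn(xs, Wxh, Whh, bh)              # first layer: forward pass
bwd = run_rnn(xs[::-1], Wxh, Whh, bh)[::-1]  # second layer: backward pass, re-aligned
bi = np.concatenate([fwd, bwd], axis=1)      # bidirectional features per time step
```

Concatenating the forward and backward passes is what lets each time step see both past and future context, the "bilateral relationship" the claim refers to.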
7. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 3, wherein in step S1.3, each segmented region of the face is obtained from the face image frames captured by the Kinect; the original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and the dimensionality is reduced with the Linear Discriminant Analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space: samples of the same class are gathered together and samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class scatter to within-class scatter are selected; finally, the face image frames processed by Gabor wavelet feature extraction and LDA dimension reduction are classified by the open-source OpenFace neural network to obtain the facial expression recognition result.
8. The method for recognizing depression and suicidal tendency fusing body language, micro-expression and language according to claim 3, wherein in step S1.4, the Kinect first collects the voice directly and a Wiener-based noise filter reduces the noise present in it; the denoised voice is then input into a Back Propagation Neural Network (BPNN) for training to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input into a Softmax classifier for classification to obtain the speech recognition result.
9. The method for recognizing depression and suicidality according to claim 4, wherein step S2 comprises the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; this LSTM network is the forward LSTM of the Bi-LSTM in step S1.2;
S2.2, normalizing the feature vectors from step S2.1 with the Self-Organizing Map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the information loss; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it; each such layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within the w_s layer; the compensation layers are not shared, and each layer corresponds to a specific class of the SOM output; the multiplication result is computed as:
u_s = w_s · μ_s + b  (5)
where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used to limit the compensation proportion between -1 and 1;
S2.4, globally optimizing the target result, where the global optimization objective is:
e = L_G + L_SOM + L_S  (6)
where the first term L_G consists of three terms: the first minimizes the error between the labels and the predicted results; the second, ‖y − μ_k‖², minimizes the error between the label and the SOM network result; and the third, ‖μ_k − x‖², minimizes the difference between the input signal and the output signal of the SOM network.
10. The method for recognizing depression and suicidality according to claim 1, wherein in step S3, based on the output of step S2, the probability of depressed mood and suicidality is derived, and high-risk persons are marked and observed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010764410.9A CN112101097A (en) | 2020-08-02 | 2020-08-02 | Depression and suicide tendency identification method integrating body language, micro expression and language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112101097A true CN112101097A (en) | 2020-12-18 |
Family
ID=73750123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010764410.9A Pending CN112101097A (en) | 2020-08-02 | 2020-08-02 | Depression and suicide tendency identification method integrating body language, micro expression and language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112101097A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112768070A (en) * | 2021-01-06 | 2021-05-07 | 万佳安智慧生活技术(深圳)有限公司 | Mental health evaluation method and system based on dialogue communication |
CN112884146A (en) * | 2021-02-25 | 2021-06-01 | 香港理工大学深圳研究院 | Method and system for training model based on data quantization and hardware acceleration |
CN112884146B (en) * | 2021-02-25 | 2024-02-13 | 香港理工大学深圳研究院 | Method and system for training model based on data quantization and hardware acceleration |
CN113469153A (en) * | 2021-09-03 | 2021-10-01 | 中国科学院自动化研究所 | Multi-modal emotion recognition method based on micro-expressions, limb actions and voice |
CN113469153B (en) * | 2021-09-03 | 2022-01-11 | 中国科学院自动化研究所 | Multi-modal emotion recognition method based on micro-expressions, limb actions and voice |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Avots et al. | Audiovisual emotion recognition in wild | |
Zhang et al. | Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface | |
Hsu et al. | Deep learning with time-frequency representation for pulse estimation from facial videos | |
Qian et al. | Artificial intelligence internet of things for the elderly: From assisted living to health-care monitoring | |
Cohn et al. | Feature-point tracking by optical flow discriminates subtle differences in facial expression | |
CN112101097A (en) | Depression and suicide tendency identification method integrating body language, micro expression and language | |
CN109993068B (en) | Non-contact human emotion recognition method based on heart rate and facial features | |
CN112766173B (en) | Multi-mode emotion analysis method and system based on AI deep learning | |
CN111967354B (en) | Depression tendency identification method based on multi-mode characteristics of limbs and micro-expressions | |
Chang et al. | Emotion recognition with consideration of facial expression and physiological signals | |
Benalcázar et al. | Real-time hand gesture recognition based on artificial feed-forward neural networks and EMG | |
CN112101096A (en) | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression | |
Jadhav et al. | Survey on human behavior recognition using affective computing | |
Rayatdoost et al. | Subject-invariant EEG representation learning for emotion recognition | |
CN112380924A (en) | Depression tendency detection method based on facial micro-expression dynamic recognition | |
Du et al. | A novel emotion-aware method based on the fusion of textual description of speech, body movements, and facial expressions | |
Hamid et al. | Integration of deep learning for improved diagnosis of depression using eeg and facial features | |
Singh et al. | Detection of stress, anxiety and depression (SAD) in video surveillance using ResNet-101 | |
Turaev et al. | Review and analysis of patients’ body language from an artificial intelligence perspective | |
Gilanie et al. | An Automated and Real-time Approach of Depression Detection from Facial Micro-expressions. | |
Krishna et al. | Different approaches in depression analysis: A review | |
Vaijayanthi et al. | Human Emotion Recognition from Body Posture with Machine Learning Techniques | |
Hou | Deep Learning-Based Human Emotion Detection Framework Using Facial Expressions | |
Sekar et al. | Semantic-based visual emotion recognition in videos-a transfer learning approach | |
Dhanapal et al. | Pervasive computational model and wearable devices for prediction of respiratory symptoms in progression of COVID-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||