CN112101097A - Depression and suicide tendency identification method integrating body language, micro expression and language - Google Patents
- Publication number
- CN112101097A (application CN202010764410.9A)
- Authority
- CN
- China
- Prior art keywords
- layer
- language
- information
- voice
- som
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/174—Facial expression recognition
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2411—Classification based on the proximity to a decision surface, e.g. support vector machines
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06F2218/08—Feature extraction (aspects of pattern recognition specially adapted for signal processing)
Abstract
The invention provides a method for identifying depression and suicidal tendency by fusing body language, micro-expression and language. The method comprises the following steps: collecting video and audio with a Kinect equipped with an infrared camera, and converting the video information and audio information into feature text descriptions; fusing the generated feature text descriptions, and classifying the emotion of the fused result using a self-organizing map (SOM) and a compensation layer; and marking the persons identified as possibly having depressed mood or suicidal tendency for observation. The invention takes both static and dynamic body movement into account, achieving higher efficiency, and uses the Kinect for data acquisition, which is non-invasive, high-performance and easy to operate.
Description
Technical Field
The invention belongs to the field of emotion recognition, and particularly relates to a depression and suicide tendency recognition method fusing body language, micro expression and language.
Background
The accelerated pace of life and changes in the social environment leave many people under great stress, which can produce depressed psychology and even self-harming or suicidal behaviour. Detecting people's mood is useful for discovering emotional problems as early as possible and preventing the development of depression or suicidal intention. Human emotions can be recognized by various means, such as electrocardiogram (ECG), electroencephalogram (EEG) (K. Takahashi, "Remarks on emotion recognition from multi-modal bio-potential signals," Proc. IEEE Int. Conf. Ind. Technol. (ICIT), vol. 3, pp. 1138-1143, Jun. 2004), speech, and facial expressions. Among the various emotion signals, physiological signals are widely used for emotion recognition; in recent years, human motion has also become a new feature.
There are two conventional approaches: one measures physiological indices of a subject by contact (J. Kim and E. André, "Emotion recognition based on physiological changes in music listening," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 12, pp. 2067-2083, 2008), and the other observes physiological properties of a subject without contact. Although a non-invasive approach is preferable, subjects can mask their mood. Technically, audio and video (F. Xu, J. Zhang and J. Z. Wang, "Microexpression Identification and Categorization Using a Facial Dynamics Map," IEEE Transactions on Affective Computing, vol. 8, issue 2, 2017) are readily available, but they are susceptible to noise. In principle, detecting static posture alone or dynamic action alone (H. Wallbott, "Bodily Expression of Emotion," European J. Social Psychology, vol. 28, pp. 879-896, 1998; M. Coulson, "Attributing Emotion to Static Body Postures: Recognition Accuracy, Confusions, and Viewpoint Dependence," J. Nonverbal Behavior, vol. 28, no. 2, pp. 117-139, 2004; J. Burgoon, L. Guerrero and K. Floyd, Nonverbal Communication, Allyn and Bacon, 2010) gives lower computational complexity for emotion recognition, but also lower recognition accuracy. It is therefore necessary to fuse these characteristics: through the fusion of multi-modal features, the emotion class of the person under examination can be identified more accurately.
Disclosure of Invention
To solve the above problems, the invention provides a method for identifying depression and suicidal tendency by fusing body language, micro-expression and language. The method effectively fuses characteristic information such as body movement, facial expression and language, and by classifying the emotion expressed in this information it can detect more efficiently and accurately whether a person has depressed mood or an intention of suicidal behaviour. First, voice, body movement, facial expression and other information are collected using a Kinect equipped with an infrared camera. The voice is converted into a text description via prosodic and spectral features extracted from it, covering information such as intonation, tone and speaking rate; static and dynamic body motion are analysed respectively with a convolutional neural network (CNN) and a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF); and feature extraction and dimension reduction are applied to the face images. Finally, the voice, limb movement and facial expression information is fused into the text description, and a self-organizing map (SOM) with a compensation layer is used to understand behaviour and recognize emotion.
The purpose of the invention is realized by at least one of the following technical solutions.
A depression and suicide tendency recognition method fusing body language, micro expression and language comprises the following steps:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions;
s2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer;
and S3, marking the people possibly having depressed mood or suicide tendency obtained in the step S2 for observation.
Further, in step S1, the video information includes limb-movement and facial-expression information extracted from the video, where the limb movement comprises static motion and dynamic motion; the audio information includes spectrum, rhythm and sound-wave information extracted from the voice audio, where the spectrum and rhythm information are used to obtain voice markers and the sound-wave information is used to obtain the voice content.
Further, in step S1, the extracting of the feature text description specifically includes the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
s1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
s1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
and S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating language feature text description.
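As an illustration of how the four channels of steps S1.1 to S1.4 might be combined into one stream, the sketch below merges per-channel text descriptions into a time-ordered transcript; the FeatureText schema, channel names and sample phrases are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class FeatureText:
    """One time-stamped description from a single channel (hypothetical schema)."""
    t: float       # timestamp in seconds
    channel: str   # 'static', 'dynamic', 'face' or 'speech'
    text: str

def merge_descriptions(items):
    """Interleave per-channel descriptions into one time-ordered transcript."""
    return [f"[{ft.t:.1f}s {ft.channel}] {ft.text}"
            for ft in sorted(items, key=lambda ft: ft.t)]

lines = merge_descriptions([
    FeatureText(2.0, "face", "brow lowered, lip corners down"),
    FeatureText(1.5, "speech", "low pitch, slow speaking rate"),
    FeatureText(2.0, "static", "slumped posture"),
])
```

A downstream classifier would then consume this merged transcript rather than the raw signals.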
Further, in step S1.1, single frames are selected from the collected video and input to the convolutional neural network (CNN) for training and testing; after training, all single frames of the video are input to the CNN to obtain static motion with emotional characteristics, which is then input to a Softmax classifier for classification, completing the identification of static motion. The Softmax function is calculated as follows:

p_i = exp(W_i·x + b) / Σ_j exp(W_j·x + b)   (1)

where W_i is the weight matrix of the i-th class of characteristic text and b denotes the bias.
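A minimal numerical sketch of the Softmax classification described above; the weight and bias values are illustrative, and per-class biases are assumed where the patent's formula is not fully legible.

```python
import numpy as np

def softmax_scores(x, W, b):
    """Class probabilities p_i proportional to exp(W_i·x + b_i)."""
    z = W @ x + b
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax_scores(np.array([1.0, 2.0]),
                       np.array([[0.5, -0.2],
                                 [0.1,  0.3]]),
                       np.array([0.0, 0.0]))
```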
Further, the convolutional neural network (CNN) computes convolutions with partial filters, i.e. an inner product is taken between a local sub-matrix of the input and the filter, and the output is a convolution matrix; the hidden layers of the CNN comprise two convolutional layers and two pooling layers.

The formula of the convolutional layer is as follows:

a^l_{i,j} = f(Σ_m w^l_{j,m}·x^{l-1}_{i+m} + b^l_j)   (2)

where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and ranges from 0 to N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function.

The pooling layer uses mean pooling; its input comes from the preceding convolutional layer and its output serves as the input of the next convolutional layer, computed as:

p^l_{i,j} = (1/n²) Σ_{(u,v)∈R_{i,j}} a^{l-1}_{u,v}   (3)

where p^l_{i,j} denotes the local output after pooling, taken as the mean of an n×n local sub-matrix R_{i,j} of the previous layer.
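The n×n mean pooling described above can be sketched as follows, assuming for simplicity that the feature-map dimensions divide evenly by n:

```python
import numpy as np

def mean_pool(x, n):
    """Mean of each non-overlapping n x n block of a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // n, n, w // n, n).mean(axis=(1, 3))

pooled = mean_pool(np.arange(16.0).reshape(4, 4), 2)  # 4x4 map -> 2x2 map
```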
Further, in step S1.2, the human body is first located and tracked with the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Since the skeleton position signals vary over time and become ill-defined under occlusion, a frame sequence is extracted from the video and passed through an interval Kalman filter to improve the precision of the skeleton positions; a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is then used to analyse the motion sequences of the 15 skeleton points and obtain dynamic motion with emotional characteristics.
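The patent specifies interval Kalman filtering for the skeleton positions; as a simplified stand-in, the sketch below applies an ordinary constant-velocity Kalman filter to a single joint coordinate (the interval variant, which propagates interval bounds on the state, is not reproduced here, and the noise parameters are illustrative).

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=0.05):
    """Ordinary Kalman filter (position + velocity state) over measurements z[t]."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([[z[0]], [0.0]])
    P = np.eye(2)
    out = []
    for zt in z:
        x = F @ x
        P = F @ P @ F.T + Q                              # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
        x = x + K @ (np.array([[zt]]) - H @ x)           # measurement update
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out

track = kalman_smooth_1d([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # stationary joint
```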
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:

h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)   (4)

where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h is the activation function. Although an LSTM can capture information from long sequences, it considers only one direction; the Bi-LSTM reinforces the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM.
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
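Eq. (4) and the forward/backward pairing can be sketched as below. Note that this uses only the simplified hidden-layer recurrence of Eq. (4), not the full LSTM gating; the weight shapes and the concatenation of the two directions are illustrative assumptions.

```python
import numpy as np

def hidden_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigma_h(W_xh x_t + W_hh h_{t-1} + b_h), with tanh as sigma_h."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def bi_states(xs, W_xh, W_hh, b_h):
    """Run the recurrence forward and backward, concatenating the states."""
    def run(seq):
        h = np.zeros(W_hh.shape[0])
        out = []
        for x in seq:
            h = hidden_step(x, h, W_xh, W_hh, b_h)
            out.append(h)
        return out
    fwd = run(xs)
    bwd = run(xs[::-1])[::-1]   # backward pass, re-aligned to forward time
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(4)]          # 4 time steps, 2-D input
states = bi_states(xs, rng.normal(size=(3, 2)),
                   rng.normal(size=(3, 3)), np.zeros(3))
```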
Further, in step S1.3, the segmented regions of the face are obtained from the face image frames captured by the Kinect. The original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that samples of the same class cluster together while samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class dispersion to within-class dispersion are selected. Finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified with the open-source OpenFace neural network to obtain the facial-expression recognition result.
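A numpy-only sketch of the LDA criterion described above, selecting projection directions that maximize the ratio of between-class to within-class dispersion; the Gabor feature extraction and OpenFace classification steps are omitted, and the data here are synthetic.

```python
import numpy as np

def lda_project(X, y, k):
    """Project X onto the k directions maximizing between/within class scatter."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1][:k]   # largest scatter-ratio directions
    return X @ vecs[:, order].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 4)),             # class 0
               rng.normal(size=(20, 4)) + 5.0])      # class 1, shifted mean
y = np.array([0] * 20 + [1] * 20)
Z = lda_project(X, y, 1)
```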
Further, in step S1.4, voice is first collected directly by the Kinect and its noise is reduced with a Wiener-based noise filter; the denoised voice is then input to a back-propagation neural network (BPNN), a feed-forward network trained with the back-propagation algorithm, to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input to the Softmax classifier for classification, yielding the speech recognition result.
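A minimal frequency-domain sketch of Wiener-style noise reduction. The patent does not disclose its exact filter; this is the classic spectral-subtraction form of the Wiener gain, with the noise power assumed known.

```python
import numpy as np

def wiener_denoise(x, noise_power):
    """Per-bin gain max(S - N, 0)/S, where S is the observed bin power."""
    X = np.fft.rfft(x)
    S = np.abs(X) ** 2 / len(x)                       # periodogram estimate
    gain = np.maximum(S - noise_power, 0.0) / np.maximum(S, 1e-12)
    return np.fft.irfft(gain * X, n=len(x))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 800, endpoint=False)
clean = np.sin(2 * np.pi * 5.0 * t)                   # 5 Hz tone
noisy = clean + 0.3 * rng.normal(size=t.size)         # additive white noise
denoised = wiener_denoise(noisy, noise_power=0.09)    # known noise variance
```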
Further, step S2 specifically includes the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; the LSTM network used here is the forward LSTM of the Bi-LSTM in step S1.2;

S2.2, normalizing the feature vectors of step S2.1 with the self-organizing map (SOM) algorithm;
S2.3, since the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is used to make up for the loss, i.e. for each distinct classification result of the SOM there must be a specific compensation layer bound to it. Each compensation layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the class of the corresponding compensation layer and t indexes the t-th node within layer w_s; the compensation layers are not shared, each corresponding to one class of the classified SOM output. The compensated result is computed as:

u_s = w_s·μ_s + b   (5)

where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is a bias limiting the compensation proportion to between -1 and 1;
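A toy sketch of the SOM competition step followed by the per-class compensation of Eq. (5); the clipping to [-1, 1] reflects the stated limit on the compensation proportion, and all sizes and values are illustrative.

```python
import numpy as np

def som_bmu(x, weights):
    """Index of the best-matching unit on the SOM competition layer."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def compensate(mu_s, w_s, b):
    """Eq. (5): u_s = w_s * mu_s + b, applied per node of the class layer."""
    return np.clip(w_s * mu_s + b, -1.0, 1.0)

weights = np.array([[0.0, 0.0],       # 3-node competition layer (toy sizes)
                    [1.0, 1.0],
                    [0.0, 1.0]])
s = som_bmu(np.array([0.9, 1.1]), weights)                 # winning class
u = compensate(np.array([0.5, -0.5]), np.array([0.8, 0.8]), 0.1)
```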
Since similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:

e = L_G + L_SOM + L_S   (6)

where the first term L_G is:

L_G = ‖y - ŷ‖² + ‖y - μ_k‖² + ‖μ_k - x‖²

The first term ‖y - ŷ‖² minimizes the error between the labels and the predicted results; the second term ‖y - μ_k‖² minimizes the error between the labels and the SOM network results; and the third term ‖μ_k - x‖² minimizes the difference between the input signal of the SOM network and its output signal.
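A worked sketch of the L_G term of the global objective. Its exact form is reconstructed here as the sum of the three squared-error terms the text describes (label vs. prediction, label vs. SOM result, SOM input vs. output), which is an assumption where the published formula is illegible.

```python
import numpy as np

def global_loss_LG(y, y_pred, mu_k, x):
    """L_G = ||y - y_pred||^2 + ||y - mu_k||^2 + ||mu_k - x||^2."""
    return (np.sum((y - y_pred) ** 2)      # label vs. predicted result
            + np.sum((y - mu_k) ** 2)      # label vs. SOM network result
            + np.sum((mu_k - x) ** 2))     # SOM input vs. SOM output

loss = global_loss_LG(np.array([1.0, 0.0]),
                      np.array([0.8, 0.2]),
                      np.array([0.9, 0.1]),
                      np.array([0.7, 0.3]))
```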
Further, in step S3, the possibility of depressed mood and suicidal tendency is obtained from the output of step S2, and high-risk persons are marked and observed so that appropriate psychological counselling can be provided.
Compared with the prior art, the invention has the following advantages:
(1) The invention aligns the multimodal data at the text layer. The intermediate text representation and the proposed fusion method form a framework that fuses speech, limb movement and facial expression, reducing the dimensionality of the three modalities and unifying them into one component.
(2) The depth information enhances the robustness and accuracy of motion detection.
(3) The invention takes static body movement and dynamic body movement into account, achieving higher efficiency.
(4) The invention uses the Kinect for data acquisition, which is non-invasive, high-performance and easy to operate.
Drawings
Fig. 1 is a flowchart of a method for identifying depression and suicidal tendency by fusing body language, micro expression and language in the embodiment of the invention.
Detailed Description
Specific implementations of the present invention will be further described with reference to the following examples and drawings, but the embodiments of the present invention are not limited thereto.
Example (b):
a method for identifying depression and suicidal tendency by fusing body language, micro expression and language, as shown in figure 1, comprises the following steps:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions; the video information comprises limb movement and facial expression information extracted from the video, wherein the limb movement comprises static movement and dynamic movement; the audio information comprises frequency spectrum, rhythm and sound wave information extracted from voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring voice content.
The extraction of the feature text description specifically comprises the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
Single frames are selected from the collected video and input to the convolutional neural network (CNN) for training and testing; after training, all single frames of the video are input to the CNN to obtain static motion with emotional characteristics, which is then input to a Softmax classifier for classification, completing the identification of static motion. The Softmax function is calculated as follows:

p_i = exp(W_i·x + b) / Σ_j exp(W_j·x + b)   (1)

where W_i is the weight matrix of the i-th class of characteristic text and b denotes the bias.
The formula of the convolutional layer is as follows:

a^l_{i,j} = f(Σ_m w^l_{j,m}·x^{l-1}_{i+m} + b^l_j)   (2)

where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and ranges from 0 to N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function.

The pooling layer uses mean pooling; its input comes from the preceding convolutional layer and its output serves as the input of the next convolutional layer, computed as:

p^l_{i,j} = (1/n²) Σ_{(u,v)∈R_{i,j}} a^{l-1}_{u,v}   (3)

where p^l_{i,j} denotes the local output after pooling, taken as the mean of an n×n local sub-matrix R_{i,j} of the previous layer.
S1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
Firstly, the human body is located and tracked with the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Since the skeleton position signals vary over time and become ill-defined under occlusion, a frame sequence is extracted from the video and passed through an interval Kalman filter to improve the precision of the skeleton positions; a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is then used to analyse the motion sequences of the 15 skeleton points and obtain dynamic motion with emotional characteristics.
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:

h_t = σ_h(W_xh·x_t + W_hh·h_{t-1} + b_h)   (4)

where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h is the activation function. Although an LSTM can capture information from long sequences, it considers only one direction; the Bi-LSTM reinforces the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM.
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
S1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
The segmented regions of the face are obtained from the face image frames captured by the Kinect. The original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that samples of the same class cluster together while samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class dispersion to within-class dispersion are selected. Finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified with the open-source OpenFace neural network to obtain the facial-expression recognition result.
S1.4, completing recognition of the voice mark and recognition of voice content, and generating language feature text description;
Firstly, voice is collected directly by the Kinect and its noise is reduced with a Wiener-based noise filter; the denoised voice is then input to a back-propagation neural network (BPNN), a feed-forward network trained with the back-propagation algorithm, to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input to the Softmax classifier for classification, yielding the speech recognition result.
S2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer; the method specifically comprises the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; the LSTM network used here is the forward LSTM of the Bi-LSTM in step S1.2;

S2.2, normalizing the feature vectors of step S2.1 with the self-organizing map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the information loss; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it; each such layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within the w_s layer; the compensation layers are not shared, and each layer corresponds to a specific class of the SOM output; the multiplication result is computed as:
u_s = w_s · μ_s + b  (5)
where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used to limit the compensation proportion between -1 and 1;
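A minimal NumPy sketch of the compensation step u_s = w_s · μ_s + b in formula (5): each SOM class s has its own, unshared weight layer w_s applied to the SOM-layer activations, plus a bias clipped to [-1, 1]. The array shapes, the random initialization, and the explicit clipping are assumptions for illustration, since the text does not fully specify them.

```python
import numpy as np

rng = np.random.default_rng(2)

n_classes, n_nodes = 4, 25           # 25 nodes: size of the SOM competition layer
w = rng.normal(0, 0.1, (n_classes, n_nodes))          # w[s, t]: t-th node, s-th layer
b = np.clip(rng.normal(0, 1, n_classes), -1.0, 1.0)   # bias limited to [-1, 1]

def compensate(mu, s):
    """u_s = w_s · mu_s + b  (formula 5): compensation output for SOM class s."""
    return float(w[s] @ mu + b[s])

mu = rng.random(n_nodes)             # SOM-layer activations for one input
u = [compensate(mu, s) for s in range(n_classes)]
```

Because the layers are unshared, each class s produces its own scalar u_s from the same SOM activations.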
S2.4, because similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:
e = L_G + L_SOM + L_S  (6)
where the first term L_G consists of three terms: the first minimizes the error between the labels and the predicted results; the second, ‖y − μ_k‖², minimizes the error between the label and the SOM network result; and the third, ‖μ_k − x‖², minimizes the difference between the input signal and the output signal of the SOM network.
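The three described components of L_G can be sketched as squared-error terms. The exact form of the prediction term is not given in the text (nor are L_SOM and L_S), so the stand-in below, with a squared error between label y and prediction y_hat plus the two stated SOM terms, is an assumption for illustration only.

```python
import numpy as np

def l_g(y, y_hat, mu_k, x):
    """Sketch of L_G: label-vs-prediction, label-vs-SOM, SOM input-vs-output terms."""
    term1 = np.sum((y - y_hat) ** 2)   # minimize error of labels and predictions (assumed form)
    term2 = np.sum((y - mu_k) ** 2)    # ||y - mu_k||^2: label vs SOM result
    term3 = np.sum((mu_k - x) ** 2)    # ||mu_k - x||^2: SOM output vs input signal
    return term1 + term2 + term3

y = np.array([1.0, 0.0]); y_hat = np.array([0.9, 0.1])
mu_k = np.array([0.8, 0.1]); x = np.array([0.7, 0.2])
loss = l_g(y, y_hat, mu_k, x)
```

Each term is zero exactly when the corresponding pair of signals agrees, so minimizing their sum pushes prediction, SOM output, and input toward mutual consistency.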
S3, according to the output of step S2, the likelihood of depressed mood and suicidal tendency is determined, and high-risk persons are marked and observed.
Claims (10)
1. The method for identifying depression and suicide tendency by fusing body language, micro expression and language is characterized by comprising the following steps of:
S1, collecting video and audio with a Kinect equipped with an infrared camera, and converting the video information and the audio information into feature text descriptions respectively;
S2, performing information fusion on the feature text descriptions generated in step S1, and performing emotion classification on the processing result using a self-organizing map (SOM) and a compensation layer;
S3, marking the persons identified in step S2 as possibly having depressed mood or suicidal tendency, for observation.
2. The method for recognizing depression and suicidality according to claim 1, wherein in step S1, the video information includes information of body movements and facial expressions extracted from the video, the body movements including static movements and dynamic movements; the audio information comprises frequency spectrum, rhythm and sound wave information extracted from voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring voice content.
3. The method for identifying depression and suicidality combined with body language, microexpression and language according to claim 2, wherein the step S1 of extracting the feature text description specifically comprises the following steps:
S1.1, using a Convolutional Neural Network (CNN) to complete the identification of static motion and generate a static-motion feature text description;
S1.2, detecting human skeleton data in real time with the Kinect, calculating the behavior features of the human body, completing the identification of dynamic motion, and generating a dynamic-motion feature text description;
S1.3, completing the information identification of facial expressions with a local identification method, and generating a facial-activity feature text description;
S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating a language feature text description.
4. The method of claim 3, wherein in step S1.1, single frames are selected from the collected video and input into a Convolutional Neural Network (CNN) for training and testing; after training, all single frames in the video are input into the CNN to obtain static motion with emotional features, and the static motion with emotional features is input into a Softmax classifier for classification to complete the identification of static motion; the Softmax function is calculated as:
p_i = exp(W_i · x + b) / Σ_j exp(W_j · x + b)
where W_i is the weight matrix of the i-th class of feature text and b denotes the bias.
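The Softmax classification step can be sketched directly in NumPy, with W holding one weight row per class and b the bias; the input dimensionality and weight values below are illustrative assumptions.

```python
import numpy as np

def softmax_classify(x, W, b):
    """P(class i | x) proportional to exp(W_i · x + b_i); max subtracted for stability."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

W = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # one weight row per class
b = np.zeros(3)
probs = softmax_classify(np.array([1.0, 0.0]), W, b)
predicted = int(probs.argmax())    # class with the highest probability wins
```

The outputs form a probability distribution over classes, and the class whose weight row best matches the feature vector receives the largest probability.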
5. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 4, wherein the Convolutional Neural Network (CNN) calculates the convolution with a local filter, i.e. the inner product of a local submatrix of the input and the local filter is taken, and the output is the convolution matrix; the hidden layers of the Convolutional Neural Network (CNN) comprise two convolutional layers and two pooling layers;
The convolutional layer is computed as:
x_{i,j}^l = f( (w_j^l * x^{l-1})_i + b_j^l )
where l denotes the l-th convolutional layer and i denotes the i-th component of the convolution output matrix; j denotes the index of the corresponding output matrix, ranging from 0 to N, where N is the number of convolution output matrices; and f is a nonlinear sigmoid-type function;
The pooling layer uses mean pooling; its input comes from the convolutional layer above and its output serves as the input of the next convolutional layer; each pooling output is computed as the mean of the values in the corresponding pooling window of the convolution output.
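The "inner product of a local submatrix with the filter" and the mean-pooling step can be sketched as follows in NumPy; the filter values and window sizes are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide filter k over x; each output entry is the inner product of a submatrix with k."""
    n = x.shape[0] - k.shape[0] + 1
    m = x.shape[1] - k.shape[1] + 1
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def mean_pool(x, size=2):
    """Non-overlapping mean pooling: average each size x size block."""
    n, m = x.shape[0] // size, x.shape[1] // size
    return x[:n * size, :m * size].reshape(n, size, m, size).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2)) / 4.0          # 2x2 averaging filter
c = conv2d_valid(x, k)             # 3x3 convolution output
p = mean_pool(c)                   # 1x1 after 2x2 mean pooling
```

Chaining the two functions mirrors the claim's layout: the pooling layer consumes the convolution output and its result would feed the next convolutional layer.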
6. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 4, wherein in step S1.2, the positioning and tracking of the human body is first completed by the Kinect to obtain the skeleton joint points; 15 skeleton joint points are numbered from top to bottom and from left to right; a frame sequence is extracted from the video and input into an interval Kalman filter to improve the accuracy of the skeleton positions; then a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) analyzes the motion sequences of the 15 skeleton points separately to obtain dynamic motion with emotional features;
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, where t denotes the t-th coordinate and T denotes the total number of coordinates, the output of the hidden layer of the long short-term memory network is computed as:
h_t = σ_h(W_xh x_t + W_hh h_{t-1} + b_h)  (4)
where h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to itself, b_h is the bias of the hidden layer, and σ_h denotes the activation function; the Bi-LSTM strengthens the bilateral relationship by making the first layer a forward LSTM and the second layer a backward LSTM;
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
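Formula (4) can be sketched as a simple recurrent hidden-state update run forward and then backward over the sequence, with the two outputs concatenated per time step. A full LSTM adds gates and a cell state, which are omitted here; all shapes, weights, and the tanh choice for σ_h are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_rnn(xs, Wxh, Whh, bh):
    """h_t = sigma_h(Wxh x_t + Whh h_{t-1} + bh), with tanh as the activation."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)
        hs.append(h)
    return np.array(hs)

T, d_in, d_h = 15, 3, 6              # e.g. 15 skeleton joints with (x, y, z) coordinates
xs = rng.normal(size=(T, d_in))
Wxh = rng.normal(0, 0.3, (d_h, d_in))
Whh = rng.normal(0, 0.3, (d_h, d_h))
bh = np.zeros(d_h)

fwd = run_rnn(xs, Wxh, Whh, bh)              # first layer: forward pass
bwd = run_rnn(xs[::-1], Wxh, Whh, bh)[::-1]  # second layer: backward pass, re-aligned
bi = np.concatenate([fwd, bwd], axis=1)      # bidirectional features per time step
```

Concatenating the forward and backward passes is what lets each time step see both past and future context, the "bilateral relationship" the claim refers to.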
7. The method for identifying depression and suicidal tendency fusing body language, micro-expression and language according to claim 3, wherein in step S1.3, each segmented region of the face is obtained from the face image frames captured by the Kinect; the original images of the segmented face parts are normalized into standard images, features are extracted with two-dimensional Gabor wavelets, and the dimensionality is reduced with the Linear Discriminant Analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space: samples of the same class are gathered together and samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class scatter to within-class scatter are selected; finally, the face image frames processed by Gabor wavelet feature extraction and LDA dimension reduction are classified by the open-source OpenFace neural network to obtain the facial expression recognition result.
8. The method for recognizing depression and suicidal tendency fusing body language, micro-expression and language according to claim 3, wherein in step S1.4, the Kinect first collects the voice directly and a Wiener-based noise filter reduces the noise present in it; the denoised voice is then input into a Back Propagation Neural Network (BPNN) for training to obtain voice with prosodic and spectral features; finally, the voice with prosodic and spectral features is input into a Softmax classifier for classification to obtain the speech recognition result.
9. The method for recognizing depression and suicidality according to claim 4, wherein step S2 comprises the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order using an LSTM neural network; this LSTM network is the forward LSTM of the Bi-LSTM in step S1.2;
S2.2, normalizing the feature vectors from step S2.1 with the Self-Organizing Map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the information loss; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it; each such layer has the same size as the SOM competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within the w_s layer; the compensation layers are not shared, and each layer corresponds to a specific class of the SOM output; the multiplication result is computed as:
u_s = w_s · μ_s + b  (5)
where μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used to limit the compensation proportion between -1 and 1;
S2.4, globally optimizing the target result, where the global optimization objective is:
e = L_G + L_SOM + L_S  (6)
where the first term L_G consists of three terms: the first minimizes the error between the labels and the predicted results; the second, ‖y − μ_k‖², minimizes the error between the label and the SOM network result; and the third, ‖μ_k − x‖², minimizes the difference between the input signal and the output signal of the SOM network.
10. The method for recognizing depression and suicidality according to claim 1, wherein in step S3, based on the output of step S2, the probability of depressed mood and suicidality is derived, and high-risk persons are marked and observed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010764410.9A CN112101097A (en) | 2020-08-02 | 2020-08-02 | Depression and suicide tendency identification method integrating body language, micro expression and language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112101097A true CN112101097A (en) | 2020-12-18 |
Family
ID=73750123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010764410.9A Pending CN112101097A (en) | 2020-08-02 | 2020-08-02 | Depression and suicide tendency identification method integrating body language, micro expression and language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112101097A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112768070A (en) * | 2021-01-06 | 2021-05-07 | 万佳安智慧生活技术(深圳)有限公司 | Mental health evaluation method and system based on dialogue communication |
CN112884146A (en) * | 2021-02-25 | 2021-06-01 | 香港理工大学深圳研究院 | Method and system for training model based on data quantization and hardware acceleration |
CN112884146B (en) * | 2021-02-25 | 2024-02-13 | 香港理工大学深圳研究院 | Method and system for training model based on data quantization and hardware acceleration |
CN113469153A (en) * | 2021-09-03 | 2021-10-01 | 中国科学院自动化研究所 | Multi-modal emotion recognition method based on micro-expressions, limb actions and voice |
CN113469153B (en) * | 2021-09-03 | 2022-01-11 | 中国科学院自动化研究所 | Multi-modal emotion recognition method based on micro-expressions, limb actions and voice |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Avots et al. | Audiovisual emotion recognition in wild | |
Zhang et al. | Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface | |
Hsu et al. | Deep learning with time-frequency representation for pulse estimation from facial videos | |
Qian et al. | Artificial intelligence internet of things for the elderly: From assisted living to health-care monitoring | |
Cohn et al. | Feature-point tracking by optical flow discriminates subtle differences in facial expression | |
CN112101097A (en) | Depression and suicide tendency identification method integrating body language, micro expression and language | |
CN109993068B (en) | Non-contact human emotion recognition method based on heart rate and facial features | |
CN112766173B (en) | Multi-mode emotion analysis method and system based on AI deep learning | |
CN111967354B (en) | Depression tendency identification method based on multi-mode characteristics of limbs and micro-expressions | |
Chang et al. | Emotion recognition with consideration of facial expression and physiological signals | |
Benalcázar et al. | Real-time hand gesture recognition based on artificial feed-forward neural networks and EMG | |
CN112101096A (en) | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression | |
Jadhav et al. | Survey on human behavior recognition using affective computing | |
Rayatdoost et al. | Subject-invariant EEG representation learning for emotion recognition | |
CN112380924A (en) | Depression tendency detection method based on facial micro-expression dynamic recognition | |
Du et al. | A novel emotion-aware method based on the fusion of textual description of speech, body movements, and facial expressions | |
Hamid et al. | Integration of deep learning for improved diagnosis of depression using eeg and facial features | |
Singh et al. | Detection of stress, anxiety and depression (SAD) in video surveillance using ResNet-101 | |
Turaev et al. | Review and analysis of patients’ body language from an artificial intelligence perspective | |
Gilanie et al. | An Automated and Real-time Approach of Depression Detection from Facial Micro-expressions. | |
Krishna et al. | Different approaches in depression analysis: A review | |
Vaijayanthi et al. | Human Emotion Recognition from Body Posture with Machine Learning Techniques | |
Hou | Deep Learning-Based Human Emotion Detection Framework Using Facial Expressions | |
Sekar et al. | Semantic-based visual emotion recognition in videos-a transfer learning approach | |
Dhanapal et al. | Pervasive computational model and wearable devices for prediction of respiratory symptoms in progression of COVID-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||