CN112101097A - Depression and suicide tendency identification method integrating body language, micro expression and language - Google Patents

Depression and suicide tendency identification method integrating body language, micro expression and language

Info

Publication number
CN112101097A
CN112101097A (application CN202010764410.9A)
Authority
CN
China
Prior art keywords
layer
language
information
voice
som
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010764410.9A
Other languages
Chinese (zh)
Inventor
杜广龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010764410.9A
Publication of CN112101097A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08 Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for identifying depression and suicide tendency by fusing body language, micro expression and language. The method comprises the following steps: collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions; performing information fusion on the generated feature text descriptions, and performing emotion classification on the processing result by using a self-organizing map (SOM) and a compensation layer; and marking the persons found to possibly have a depressed mood or suicide tendency for observation. The invention takes both static body movement and dynamic body movement into account, achieving higher efficiency. The invention uses the Kinect for data acquisition, which is non-invasive, high-performing and easy to operate.

Description

Depression and suicide tendency identification method integrating body language, micro expression and language
Technical Field
The invention belongs to the field of emotion recognition, and particularly relates to a depression and suicide tendency recognition method fusing body language, micro expression and language.
Background
The accelerating pace of life and changes in the social environment leave many people under great stress, which can develop into depressive psychology and even self-harm or suicidal behavior. Detecting people's emotions is therefore useful for discovering emotional problems as early as possible and preventing the development of depression or suicidal intent. Human emotions can be recognized by various means, such as electrocardiogram (ECG), electroencephalogram (EEG) (K. Takahashi, "Remarks on emotion recognition from multi-modal bio-potential signals," Proc. IEEE Int. Conf. Ind. Technol. (ICIT), vol. 3, pp. 1138-1143, Jun. 2004), speech, facial expressions, and the like. Among the various emotion signals, physiological signals are widely used for emotion recognition. In recent years, human motion has also emerged as a new feature.
There are two conventional approaches: one measures physiological indices of the subject by contact (J. Kim and E. Andre, "Emotion recognition based on physiological changes in music listening," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 12, pp. 2067-2083, 2008), and the other observes physiological properties of the subject without contact. Although a non-invasive approach is preferable, subjects can mask their mood. Technically, audio and video (F. Xu, J. Zhang and J. Z. Wang, "Microexpression Identification and Categorization Using a Facial Dynamics Map," IEEE Transactions on Affective Computing, vol. 8, issue 2, 2017) are readily available but susceptible to noise. In principle, detecting static posture alone or dynamic action alone (H. Wallbott, "Bodily Expression of Emotion," European J. Social Psychology, vol. 28, pp. 879-896, 1998; M. Coulson, "Attributing Emotion to Static Body Postures: Recognition Accuracy, Confusions, and Viewpoint Dependence," J. Nonverbal Behavior, vol. 28, no. 2, pp. 117-139, 2004; J. Burgoon, L. Guerrero, and K. Floyd, Nonverbal Communication, Allyn and Bacon, 2010) keeps the computational complexity of emotion recognition low, but it also lowers recognition accuracy. It is therefore necessary to fuse these characteristics: through the fusion of multi-modal features, the emotion category of the person being examined can be better identified.
Disclosure of Invention
In order to solve the above problems, the invention provides a method for identifying depression and suicide tendency by fusing body language, micro expression and language. The method effectively fuses characteristic information such as body movements, facial expressions and language of the human body, and by classifying the emotion expressed in this information it can detect more efficiently and accurately whether a person has a depressed mood and whether the person has an intention of suicidal behavior. First, speech, body movement, facial expression and other information are collected using a Kinect with an infrared camera. The speech is converted into a text description through prosodic and spectral features extracted from it, including information such as intonation, tone and speaking rate; the static and dynamic motion of the human body are analyzed with a convolutional neural network (CNN) and a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF), respectively; and feature extraction and dimension reduction are applied to the face images. Finally, the speech, limb movement and facial expression information is fused into the text description, and a self-organizing map (SOM) with a compensation layer is used to understand the behavior and recognize the emotion.
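For illustration only, the following Python sketch shows how the overall data flow described above could be organized; all helper functions are stubs invented for readability, not the patent's implementation, and the returned strings are placeholder feature text descriptions.

```python
# Illustrative-only skeleton of the pipeline: each modality yields a feature
# text description, the descriptions are fused, and the fused result is
# classified. All functions below are stand-ins, not the patented models.

def extract_static_motion_text(frames):  return "posture: slumped shoulders"
def extract_dynamic_motion_text(joints): return "movement: slow, low amplitude"
def extract_facial_activity_text(faces): return "expression: brow lowered, lip pressed"
def extract_language_text(audio):        return "speech: flat intonation, slow rate"

def recognize_emotion(frames, joints, faces, audio):
    texts = [
        extract_static_motion_text(frames),    # step S1.1 (CNN + Softmax)
        extract_dynamic_motion_text(joints),   # step S1.2 (interval Kalman + Bi-LSTM-CRF)
        extract_facial_activity_text(faces),   # step S1.3 (Gabor + LDA + OpenFace)
        extract_language_text(audio),          # step S1.4 (Wiener filter + BPNN)
    ]
    fused = " | ".join(texts)                  # stand-in for the LSTM embedding and fusion (S2.1)
    # stand-in for the SOM + compensation-layer emotion classifier (S2.2-S2.4)
    return "possible depressed mood" if "flat intonation" in fused else "neutral"

print(recognize_emotion(None, None, None, None))
```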
The purpose of the invention is realized by at least one of the following technical solutions.
A depression and suicide tendency recognition method fusing body language, micro expression and language comprises the following steps:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions;
s2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer;
and S3, marking the people possibly having depressed mood or suicide tendency obtained in the step S2 for observation.
Further, in step S1, the video information includes the limb movement and facial expression information extracted from the video, the limb movement including static movement and dynamic movement; the audio information includes the frequency spectrum, rhythm and sound wave information extracted from the voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring the voice content.
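As a reading aid only, the dataclass sketch below (hypothetical, not part of the disclosure) summarizes how the extracted information splits across the two streams just described; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class VideoFeatures:
    static_motion: str       # posture recognized by the CNN (step S1.1)
    dynamic_motion: str      # motion sequence recognized by the Bi-LSTM-CRF (step S1.2)
    facial_expression: str   # micro-expression from Gabor + LDA + OpenFace (step S1.3)

@dataclass
class AudioFeatures:
    voice_marks: str         # intonation cues from spectrum and rhythm (step S1.4)
    voice_content: str       # speech content recovered from the sound wave (step S1.4)

@dataclass
class FeatureTextDescription:
    video: VideoFeatures
    audio: AudioFeatures
```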
Further, in step S1, the extracting of the feature text description specifically includes the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
s1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
s1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
and S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating language feature text description.
Further, in step S1.1, a single frame is selected from the collected video and input to a convolutional neural network (CNN) for training and testing; after training is finished, all single frames in the video are input to the CNN to obtain static motion with emotional characteristics, and the static motion with emotional characteristics is input to a Softmax classifier for classification to complete the identification of static motion; the Softmax function is calculated as follows:

P(y = i | x) = exp(W_i x + b) / Σ_j exp(W_j x + b)    (1)

wherein W_i is the weight matrix of the i-th class of feature text and b represents the bias.
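A minimal NumPy sketch of this Softmax classification step, assuming the standard form of the function given above; the class count, feature dimension and random values are illustrative only.

```python
import numpy as np

def softmax_classify(x, W, b=0.0):
    """Class probabilities P(y = i | x) = exp(W_i x + b) / sum_j exp(W_j x + b)."""
    logits = W @ x + b                 # one score per feature-text class
    logits = logits - logits.max()     # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# toy example: 4 emotion classes, 8-dimensional CNN feature vector
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(size=(4, 8))
print(softmax_classify(x, W))          # probabilities summing to 1
```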
Further, the convolutional neural network (CNN) computes the convolution with a local filter, that is, an inner product operation is performed between a local sub-matrix of the input and the filter, and the output is a convolution matrix; the hidden layers in the CNN comprise two convolutional layers and two pooling layers;
the formula for the convolutional layer is as follows:
x^l_{i,j} = f( Σ_m w^l_{j,m} x^{l-1}_{i+m} + b^l_j )    (2)

wherein l denotes the l-th convolutional layer, i denotes the i-th component of the convolution output matrix, and j indexes the corresponding output matrix; the value of j varies from 0 to N, where N represents the number of convolution output matrices; w^l_j and b^l_j denote the weights and bias of the corresponding filter; f is a non-linear sigmoid-type function;
The pooling layer uses mean pooling; its input comes from the upper convolutional layer and its output is used as the input of the next convolutional layer, and the calculation formula is as follows:

p^l_{i,j} = (1/n²) Σ_{m=1..n} Σ_{k=1..n} x^{l-1}_{(i-1)n+m, (j-1)n+k}    (3)

wherein p^l_{i,j} denotes the local output after the pooling process is finished, derived as the mean value of the local n × n sub-matrix of the previous layer.
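A minimal PyTorch sketch of a CNN matching the stated hidden-layer structure (two convolutional layers, two mean-pooling layers, a sigmoid-type nonlinearity and a Softmax output); the channel counts, kernel sizes, input resolution and number of emotion classes are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class StaticMotionCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5),   # first convolutional layer
            nn.Sigmoid(),                      # non-linear sigmoid-type function f
            nn.AvgPool2d(kernel_size=2),       # first mean-pooling layer
            nn.Conv2d(16, 32, kernel_size=5),  # second convolutional layer
            nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2),       # second mean-pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_classes),        # logits fed to the Softmax classifier
        )

    def forward(self, x):
        return torch.softmax(self.classifier(self.features(x)), dim=1)

frames = torch.randn(2, 3, 64, 64)             # two RGB frames, 64x64 (assumed size)
print(StaticMotionCNN()(frames).shape)         # torch.Size([2, 7])
```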
Further, in step S1.2, the human body is first located and tracked with the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Because the position signals of the skeleton vary over time and become ambiguous under occlusion, a frame sequence is extracted from the video and input to an interval Kalman filter to improve the precision of the skeleton positions; then a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is used to analyze the motion sequence of each of the 15 skeleton points to obtain dynamic motion with emotional characteristics;
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, wherein t represents the t-th coordinate and T represents the total number of coordinates, the output of the hidden layer of the long short-term memory neural network is calculated as follows:

h_t = σ_h(W_xh x_t + W_hh h_{t-1} + b_h)    (4)

wherein h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to the hidden layer, b_h is the bias of the hidden layer, and σ_h represents the activation function; although the LSTM can capture information from long sequences, it considers only one direction, so the Bi-LSTM is used to reinforce the bidirectional relationship by making the first layer a forward LSTM and the second layer a backward LSTM;
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
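A minimal PyTorch sketch of the Bi-LSTM part of step S1.2, treating each frame as a 45-dimensional vector of 15 joints × (x, y, z); the interval Kalman filtering and the CRF layer are omitted here, and the hidden size and number of emotion classes are assumptions.

```python
import torch
import torch.nn as nn

class SkeletonBiLSTM(nn.Module):
    """Bi-LSTM over a sequence of skeleton frames; the CRF layer is omitted in this sketch."""
    def __init__(self, num_joints=15, coords=3, hidden=64, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * coords, hidden,
                            batch_first=True, bidirectional=True)  # forward + backward LSTM
        self.head = nn.Linear(2 * hidden, num_classes)             # emotion logits

    def forward(self, joints):                 # joints: (batch, T, 45), e.g. after Kalman filtering
        h, _ = self.lstm(joints)               # h: (batch, T, 2*hidden)
        return torch.softmax(self.head(h[:, -1]), dim=1)  # classify from the last time step

seq = torch.randn(4, 30, 45)                   # 4 sequences, 30 frames, 15 joints x 3 coordinates
print(SkeletonBiLSTM()(seq).shape)             # torch.Size([4, 7])
```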
Further, in step S1.3, each segmented region of the face is obtained from the face image frames captured by the Kinect; the original image of each segmented facial part is processed into a normalized standard image, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that, according to the extracted low-dimensional features, samples of the same class are gathered together and samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class scatter to within-class scatter are selected; finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified through the open-source OpenFace neural network to obtain the recognition result of the facial expression.
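A minimal sketch of the Gabor-wavelet feature extraction and LDA dimension reduction described in step S1.3, using OpenCV Gabor kernels and scikit-learn LDA; the filter-bank parameters, region size and labels are illustrative assumptions, and the subsequent OpenFace classification stage is not shown.

```python
import numpy as np
import cv2
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def gabor_features(face_region, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter a normalized grayscale face region with a small 2-D Gabor filter bank."""
    feats = []
    for theta in thetas:
        # ksize, sigma, theta, lambd, gamma (assumed parameter values)
        kernel = cv2.getGaborKernel((9, 9), 2.0, theta, 8.0, 0.5)
        response = cv2.filter2D(face_region, cv2.CV_32F, kernel)
        feats.append(response.ravel())
    return np.concatenate(feats)                # high-dimensional Gabor feature vector

# toy data: 60 normalized 32x32 face regions with 3 expression labels
rng = np.random.default_rng(0)
regions = rng.random((60, 32, 32)).astype(np.float32)
labels = rng.integers(0, 3, size=60)

X = np.stack([gabor_features(r) for r in regions])
lda = LinearDiscriminantAnalysis(n_components=2)   # keep the most discriminative directions
X_low = lda.fit_transform(X, labels)               # low-dimensional features for classification
print(X_low.shape)                                 # (60, 2)
```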
Further, in step S1.4, speech is first collected directly by the Kinect, and a Wiener-based noise filter reduces the noise present in the collected speech; the denoised speech is then input to a back propagation neural network (BPNN), which is a feed-forward neural network trained with the back propagation algorithm, to obtain speech with prosodic features and spectral features; finally, the speech with prosodic and spectral features is input to a Softmax classifier for classification to obtain the speech recognition result.
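A minimal sketch of the speech branch in step S1.4: Wiener filtering for noise reduction, simple frame-level spectral features, and a feed-forward network trained with back-propagation (here scikit-learn's MLPClassifier stands in for the BPNN); the frame length, feature choice and labels are illustrative assumptions rather than the patent's design.

```python
import numpy as np
from scipy.signal import wiener
from sklearn.neural_network import MLPClassifier

def frame_spectral_features(signal, frame_len=256):
    """Split a waveform into frames and take the log-magnitude spectrum of each frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spectrum)                      # crude stand-in for prosodic/spectral features

rng = np.random.default_rng(0)
raw = rng.normal(size=16000)                       # 1 s of toy "speech" at 16 kHz
denoised = wiener(raw, mysize=29)                  # Wiener-based noise reduction

X = frame_spectral_features(denoised)
y = rng.integers(0, 2, size=len(X))                # toy frame labels (e.g. voice mark present or not)

bpnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)  # feed-forward net, back-prop training
bpnn.fit(X, y)
print(bpnn.predict(X[:5]))                         # per-frame predictions
```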
Further, step S2 specifically includes the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order by using an LSTM neural network; the LSTM neural network is the forward LSTM of the Bi-LSTM in step S1.2;
S2.2, normalizing the feature vectors of step S2.1 by using the self-organizing map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the loss of information; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it. Each layer has the same size as the SOM network competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within layer w_s; the compensation layers are not shared, and each layer corresponds to a specific class of the classified SOM output (a numerical sketch of the SOM, the compensation layer and the global objective follows these steps); the formula of the multiplication result is as follows:

u_s = w_s · μ_s + b    (5)

in the formula, μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used for limiting the compensation proportion between -1 and 1;
Since similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:

e = L_G + L_SOM + L_S    (6)

wherein the first term L_G is:

L_G = ‖y - ŷ‖² + ‖y - μ_k‖² + ‖μ_k - x‖²

The first item of L_G, ‖y - ŷ‖², is used for minimizing the error between the labels and the predicted results; the second item, ‖y - μ_k‖², is used for minimizing the error between the labels and the SOM network results; the third item, ‖μ_k - x‖², is used for minimizing the difference between the input signal of the SOM network and the output signal of the SOM network.
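The following NumPy sketch illustrates, under simplifying assumptions, the three pieces described in step S2: a small SOM competition layer, a per-class compensation layer computing u_s = w_s · μ_s + b, and the global objective e = L_G + L_SOM + L_S. The SOM update rule, grid size, node-to-class mapping and the exact form of the first item of L_G (taken here as ‖y - ŷ‖²) are assumptions rather than details given in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, n_classes = 16, 32, 4        # assumed SOM competition-layer size and input dimension

# --- toy SOM: competitive learning on fused, normalized feature vectors -----
som_weights = rng.normal(size=(n_nodes, dim))
def som_step(x, lr=0.1):
    k = np.argmin(np.linalg.norm(som_weights - x, axis=1))   # best-matching node
    som_weights[k] += lr * (x - som_weights[k])               # pull the winner toward the input
    return k, som_weights[k]                                  # winning index and its prototype mu_k

# --- compensation layer: one weight layer per SOM class (not shared) --------
comp_weights = rng.normal(size=(n_classes, dim))              # w_s for each class s
b = 0.5                                                       # assumed to lie in [-1, 1]
def compensate(s, mu_s):
    return comp_weights[s] @ mu_s + b                         # u_s = w_s . mu_s + b

# --- global objective e = L_G + L_SOM + L_S ----------------------------------
def global_objective(y, y_pred, mu_k, x, L_SOM=0.0, L_S=0.0):
    L_G = (np.linalg.norm(y - y_pred) ** 2     # assumed first item: labels vs predictions
           + np.linalg.norm(y - mu_k) ** 2     # labels vs SOM result
           + np.linalg.norm(mu_k - x) ** 2)    # SOM input vs SOM output
    return L_G + L_SOM + L_S

x = rng.normal(size=dim)                       # fused feature vector (toy)
y = rng.normal(size=dim)                       # toy target representation of the label
k, mu_k = som_step(x)
u = compensate(k % n_classes, mu_k)            # illustrative mapping from winning node to class
print(k, round(float(u), 3), round(float(global_objective(y, x, mu_k, x)), 3))
```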
Further, in step S3, based on the output result of step S2, the possibility of a depressed mood and of suicidal tendency is obtained, and high-risk persons are marked and observed so that appropriate psychological counseling can be given.
Compared with the prior art, the invention has the following advantages:
(1) The present invention aligns multimodal data with a text layer. The intermediate text representation and the proposed fusion method form a framework that fuses speech, limb movements and facial expressions; the method reduces the dimensions of speech, limb movement and facial expression and unifies the three types of information into one representation.
(2) The depth information enhances the robustness and accuracy of motion detection.
(3) The invention takes static body movement and dynamic body movement into account, achieving higher efficiency.
(4) The invention uses Kinect for data acquisition, and has non-invasion, high performance and easy operation.
Drawings
Fig. 1 is a flowchart of a method for identifying depression and suicidal tendency by fusing body language, micro expression and language in the embodiment of the invention.
Detailed Description
Specific implementations of the present invention will be further described with reference to the following examples and drawings, but the embodiments of the present invention are not limited thereto.
Example:
a method for identifying depression and suicidal tendency by fusing body language, micro expression and language, as shown in figure 1, comprises the following steps:
S1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions; the video information includes the limb movement and facial expression information extracted from the video, the limb movement including static movement and dynamic movement; the audio information includes the frequency spectrum, rhythm and sound wave information extracted from the voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring the voice content.
The extraction of the feature text description specifically comprises the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
A single frame is selected from the collected video and input to a convolutional neural network (CNN) for training and testing; after training is finished, all single frames in the video are input to the CNN to obtain static motion with emotional characteristics, and the static motion with emotional characteristics is input to a Softmax classifier for classification to complete the identification of static motion; the Softmax function is calculated as follows:

P(y = i | x) = exp(W_i x + b) / Σ_j exp(W_j x + b)    (1)

wherein W_i is the weight matrix of the i-th class of feature text and b represents the bias.
The formula for the convolutional layer is as follows:
x^l_{i,j} = f( Σ_m w^l_{j,m} x^{l-1}_{i+m} + b^l_j )    (2)

wherein l denotes the l-th convolutional layer, i denotes the i-th component of the convolution output matrix, and j indexes the corresponding output matrix; the value of j varies from 0 to N, where N represents the number of convolution output matrices; w^l_j and b^l_j denote the weights and bias of the corresponding filter; f is a non-linear sigmoid-type function;
The pooling layer uses mean pooling; its input comes from the upper convolutional layer and its output is used as the input of the next convolutional layer, and the calculation formula is as follows:

p^l_{i,j} = (1/n²) Σ_{m=1..n} Σ_{k=1..n} x^{l-1}_{(i-1)n+m, (j-1)n+k}    (3)

wherein p^l_{i,j} denotes the local output after the pooling process is finished, derived as the mean value of the local n × n sub-matrix of the previous layer.
S1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
Firstly, the human body is located and tracked through the Kinect to obtain the joint points of the skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right. Because the position signals of the skeleton vary over time and become ambiguous under occlusion, a frame sequence is extracted from the video and input to an interval Kalman filter to improve the precision of the skeleton positions; then a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF) is used to analyze the motion sequence of each of the 15 skeleton points to obtain dynamic motion with emotional characteristics;
For the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, wherein t represents the t-th coordinate and T represents the total number of coordinates, the output of the hidden layer of the long short-term memory neural network is calculated as follows:

h_t = σ_h(W_xh x_t + W_hh h_{t-1} + b_h)    (4)

wherein h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to the hidden layer, b_h is the bias of the hidden layer, and σ_h represents the activation function; although the LSTM can capture information from long sequences, it considers only one direction, so the Bi-LSTM is used to reinforce the bidirectional relationship by making the first layer a forward LSTM and the second layer a backward LSTM;
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
S1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
Each segmented region of the face is obtained from the face image frames captured by the Kinect; the original image of each segmented facial part is processed into a normalized standard image, features are extracted with two-dimensional Gabor wavelets, and dimension reduction is performed with the linear discriminant analysis (LDA) algorithm, which extracts the most discriminative low-dimensional features from the high-dimensional feature space so that samples of the same class are gathered together and samples of other classes are separated as far as possible, i.e. the features with the largest ratio of between-class scatter to within-class scatter are selected; finally, the face image frames after Gabor wavelet feature extraction and LDA dimension reduction are classified through the open-source OpenFace neural network to obtain the recognition result of the facial expression.
S1.4, completing recognition of the voice mark and recognition of voice content, and generating language feature text description;
Firstly, speech is collected directly by the Kinect, and a Wiener-based noise filter reduces the noise present in the collected speech; the denoised speech is then input to a back propagation neural network (BPNN), which is a feed-forward neural network trained with the back propagation algorithm, to obtain speech with prosodic features and spectral features; finally, the speech with prosodic and spectral features is input to a Softmax classifier for classification to obtain the speech recognition result.
S2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer; the method specifically comprises the following steps:
S2.1, embedding the feature text descriptions collected in step S1 into fixed-size feature vectors arranged in time order by using an LSTM neural network; the LSTM neural network is the forward LSTM of the Bi-LSTM in step S1.2;
S2.2, normalizing the feature vectors of step S2.1 by using the self-organizing map (SOM) algorithm;
S2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the loss of information; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it. Each layer has the same size as the SOM network competition layer, and every node has its own weight w_{s,t}, where s is the s-th class of the corresponding compensation layer and t is the t-th node within layer w_s; the compensation layers are not shared, and each layer corresponds to a specific class of the classified SOM output; the formula of the multiplication result is as follows:

u_s = w_s · μ_s + b    (5)

in the formula, μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used for limiting the compensation proportion between -1 and 1;
S2.4, since similar features may belong to the same class, the target result is globally optimized; the global optimization objective is:

e = L_G + L_SOM + L_S    (6)

wherein the first term L_G is:

L_G = ‖y - ŷ‖² + ‖y - μ_k‖² + ‖μ_k - x‖²

The first item of L_G, ‖y - ŷ‖², is used for minimizing the error between the labels and the predicted results; the second item, ‖y - μ_k‖², is used for minimizing the error between the labels and the SOM network results; the third item, ‖μ_k - x‖², is used for minimizing the difference between the input signal of the SOM network and the output signal of the SOM network.
And S3, according to the output result of the step S2, the possibility of depressed mood and suicide tendency is obtained, and high-risk persons are marked and observed.
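As an illustration of step S3 only, the small Python sketch below flags and records people whose classified emotion indicates a depressed mood or suicidal tendency; the label names, data layout and probability threshold are hypothetical, not part of the disclosure.

```python
# Hypothetical post-processing for step S3: mark high-risk persons for observation.
HIGH_RISK_LABELS = {"depressed_mood", "suicidal_tendency"}   # assumed label names

def mark_high_risk(results, threshold=0.6):
    """results: list of (person_id, label, probability) tuples from the S2 classifier."""
    flagged = [(pid, label, p) for pid, label, p in results
               if label in HIGH_RISK_LABELS and p >= threshold]
    for pid, label, p in flagged:
        print(f"observe person {pid}: {label} (p={p:.2f})")   # mark for observation / counseling
    return flagged

mark_high_risk([("A01", "neutral", 0.90),
                ("A02", "depressed_mood", 0.72),
                ("A03", "suicidal_tendency", 0.65)])
```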

Claims (10)

1. The method for identifying depression and suicide tendency by fusing body language, micro expression and language is characterized by comprising the following steps of:
s1, collecting video and audio by using a Kinect with an infrared camera, and respectively converting the video information and the audio information into feature text descriptions;
s2, carrying out information fusion on the feature text description generated in the step S1, and carrying out emotion classification on the processing result by utilizing self-organizing map (SOM) and a compensation layer;
and S3, marking the people possibly having depressed mood or suicide tendency obtained in the step S2 for observation.
2. The method for recognizing depression and suicidality according to claim 1, wherein in step S1, the video information includes information of body movements and facial expressions extracted from the video, the body movements including static movements and dynamic movements; the audio information comprises frequency spectrum, rhythm and sound wave information extracted from voice audio, wherein the frequency spectrum information and the rhythm information are used for acquiring voice marks, and the sound wave information is used for acquiring voice content.
3. The method for identifying depression and suicidality combined with body language, microexpression and language according to claim 2, wherein the step S1 of extracting the feature text description specifically comprises the following steps:
s1.1, adopting a Convolutional Neural Network (CNN) to finish the identification of static motion and generating a static motion characteristic text description;
s1.2, detecting human skeleton data in real time by using Kinect, calculating the behavior characteristics of a human body, completing the identification of dynamic motion, and generating a text description of the dynamic motion characteristics;
s1.3, completing information identification of facial expressions by using a local identification method, and generating facial activity characteristic text description;
and S1.4, completing the recognition of the voice mark and the recognition of the voice content, and generating language feature text description.
4. The method of claim 3 wherein in step S1.1, a single frame is selected from the collected video and input to a Convolutional Neural Network (CNN) for training and testing; inputting all the single frames in the video to a Convolutional Neural Network (CNN) after training is finished to obtain static motion with emotional characteristics, and inputting the static motion with the emotional characteristics into a Softmax classifier for classification to finish the identification of the static motion; softmax function, the calculation formula is as follows:
P(y = i | x) = exp(W_i x + b) / Σ_j exp(W_j x + b)    (1)

wherein W_i is the weight matrix of the i-th class of feature text and b represents the bias.
5. The method for identifying depression and suicidal ideation tendencies fusing body language, microexpression and language according to claim 4, wherein the Convolutional Neural Network (CNN) calculates convolution by using partial filter, i.e. inner product operation is performed by using partial submatrix of input item and partial filter, and output is convolution matrix; the hidden layers in the Convolutional Neural Network (CNN) comprise two convolutional layers and two pooling layers;
the formula for the convolutional layer is as follows:
x^l_{i,j} = f( Σ_m w^l_{j,m} x^{l-1}_{i+m} + b^l_j )    (2)

wherein l denotes the l-th convolutional layer, i denotes the i-th component of the convolution output matrix, and j indexes the corresponding output matrix; the value of j varies from 0 to N, where N represents the number of convolution output matrices; w^l_j and b^l_j denote the weights and bias of the corresponding filter; f is a non-linear sigmoid-type function;
the pooling layer uses mean pooling; its input comes from the upper convolutional layer and its output is used as the input of the next convolutional layer, and the calculation formula is as follows:

p^l_{i,j} = (1/n²) Σ_{m=1..n} Σ_{k=1..n} x^{l-1}_{(i-1)n+m, (j-1)n+k}    (3)

wherein p^l_{i,j} denotes the local output after the pooling process is finished, derived as the mean value of the local n × n sub-matrix of the previous layer.
6. The method for identifying depression and suicidal tendency fusing body language, micro expression and language according to claim 4, wherein in step S1.2, firstly, positioning and tracking of human body are completed through Kinect to obtain joint points of skeleton; 15 skeleton joint points are numbered from top to bottom and from left to right; extracting a frame sequence from a video and inputting interval Kalman filtering to improve the precision of a skeleton position; then, a bidirectional long-short term memory network (Bi-LSTM-CRF) with a conditional random field layer is used for analyzing motion sequences of 15 skeleton points respectively to obtain dynamic motion with emotional characteristics;
for the bidirectional long short-term memory neural network, given an input sequence x_1, x_2, …, x_t, …, x_T, wherein t represents the t-th coordinate and T represents the total number of coordinates, the output of the hidden layer of the long short-term memory neural network is calculated as follows:

h_t = σ_h(W_xh x_t + W_hh h_{t-1} + b_h)    (4)

wherein h_t is the output of the hidden layer at time t, W_xh is the weight matrix from the input layer to the hidden layer, W_hh is the weight matrix from the hidden layer to the hidden layer, b_h is the bias of the hidden layer, and σ_h represents the activation function; the Bi-LSTM is used to reinforce the bidirectional relationship by making the first layer a forward LSTM and the second layer a backward LSTM;
and finally, inputting the dynamic motion with the emotional characteristics into the Softmax classifier in the step S1.1 for classification.
7. The method for identifying depression and suicidal tendency fusing body language, microexpression and language according to claim 3, wherein in step S1.3, each segmented region of the face is obtained according to the information of the image frame of the face captured by kinect; processing original images of all parts of the segmented human face into normalized standard images, performing feature extraction by adopting two-dimensional Gabor wavelets, performing dimension reduction by utilizing a Linear Discriminant Analysis (LDA) algorithm, extracting the most distinctive low-dimensional features from a high-dimensional feature space, collecting all samples of the same class according to the extracted low-dimensional features, and separating other samples as far as possible, namely selecting the features with the largest ratio of the dispersion between the sample classes to the dispersion in the sample classes; and finally, classifying the face image frames subjected to the Gabor wavelet feature extraction and LDA dimension reduction through an open-source OpenFace neural network to obtain the recognition result of the facial expression.
8. The method for recognizing the depression and the suicide tendency fusing the body language, the micro expression and the language according to claim 3, wherein in the step S1.4, firstly, the Kinect directly collects the voice, the collected voice reduces the noise existing in the voice by using a wiener-based noise filter, then the voice with the reduced noise is input into a Back Propagation Neural Network (BPNN) to be trained to obtain the voice with the prosodic feature and the spectral feature, and finally the voice with the prosodic feature and the spectral feature is input into a Softmax classifier to be classified to obtain the recognition result of the voice.
9. The method for recognizing depression and suicidality according to claim 4, wherein step S2 comprises the following steps:
s2.1, embedding the feature text description collected in the step S1 into a feature vector with a fixed size and arranged according to a time sequence by using an LSTM neural network; the LSTM neural network is a Bi-LSTM forward LSTM network in the step 1.2;
s2.2, carrying out normalization processing on the feature vectors in the step S2.1 by adopting a Self-Organization Mapping (SOM) algorithm;
s2.3, because the self-organizing map (SOM) layer is a fuzzy layer that loses information, a compensation layer is adopted to make up for the loss of information; that is, for each distinct classification result of the SOM, the compensation layer has a specific layer bound to it; each layer has the same size as the SOM network competition layer, and every node has its own weight w_{s,t}, wherein s is the s-th class of the corresponding compensation layer and t is the t-th node within layer w_s; the compensation layers are not shared, and each layer corresponds to a specific class of the classified SOM output; the formula of the multiplication result is as follows:

u_s = w_s · μ_s + b    (5)

in the formula, μ_s is the input of the s-th layer node, w_s is the weight of the s-th layer node, and b is used for limiting the compensation proportion between -1 and 1;
s2.4, carrying out global optimization on the target result, wherein the global optimization target is as follows:
e = L_G + L_SOM + L_S    (6)

wherein the first term L_G is:

L_G = ‖y - ŷ‖² + ‖y - μ_k‖² + ‖μ_k - x‖²

the first item of L_G, ‖y - ŷ‖², is used for minimizing the error between the labels and the predicted results; the second item, ‖y - μ_k‖², is used for minimizing the error between the labels and the SOM network results; the third item, ‖μ_k - x‖², is used for minimizing the difference between the input signal of the SOM network and the output signal of the SOM network.
10. The method for recognizing depression and suicidality according to claim 1, wherein in step S3, based on the output of step S2, the probability of depressed mood and suicidality is derived, and high-risk persons are marked and observed.
CN202010764410.9A 2020-08-02 2020-08-02 Depression and suicide tendency identification method integrating body language, micro expression and language Pending CN112101097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764410.9A CN112101097A (en) 2020-08-02 2020-08-02 Depression and suicide tendency identification method integrating body language, micro expression and language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010764410.9A CN112101097A (en) 2020-08-02 2020-08-02 Depression and suicide tendency identification method integrating body language, micro expression and language

Publications (1)

Publication Number Publication Date
CN112101097A (en) 2020-12-18

Family

ID=73750123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764410.9A Pending CN112101097A (en) 2020-08-02 2020-08-02 Depression and suicide tendency identification method integrating body language, micro expression and language

Country Status (1)

Country Link
CN (1) CN112101097A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112884146A (en) * 2021-02-25 2021-06-01 香港理工大学深圳研究院 Method and system for training model based on data quantization and hardware acceleration
CN112884146B (en) * 2021-02-25 2024-02-13 香港理工大学深圳研究院 Method and system for training model based on data quantization and hardware acceleration
CN113469153A (en) * 2021-09-03 2021-10-01 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice
CN113469153B (en) * 2021-09-03 2022-01-11 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice

Similar Documents

Publication Publication Date Title
Avots et al. Audiovisual emotion recognition in wild
Zhang et al. Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface
Hsu et al. Deep learning with time-frequency representation for pulse estimation from facial videos
Qian et al. Artificial intelligence internet of things for the elderly: From assisted living to health-care monitoring
Cohn et al. Feature-point tracking by optical flow discriminates subtle differences in facial expression
CN112101097A (en) Depression and suicide tendency identification method integrating body language, micro expression and language
CN109993068B (en) Non-contact human emotion recognition method based on heart rate and facial features
CN112766173B (en) Multi-mode emotion analysis method and system based on AI deep learning
CN111967354B (en) Depression tendency identification method based on multi-mode characteristics of limbs and micro-expressions
Chang et al. Emotion recognition with consideration of facial expression and physiological signals
Benalcázar et al. Real-time hand gesture recognition based on artificial feed-forward neural networks and EMG
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
Jadhav et al. Survey on human behavior recognition using affective computing
Rayatdoost et al. Subject-invariant EEG representation learning for emotion recognition
CN112380924A (en) Depression tendency detection method based on facial micro-expression dynamic recognition
Du et al. A novel emotion-aware method based on the fusion of textual description of speech, body movements, and facial expressions
Hamid et al. Integration of deep learning for improved diagnosis of depression using eeg and facial features
Singh et al. Detection of stress, anxiety and depression (SAD) in video surveillance using ResNet-101
Turaev et al. Review and analysis of patients’ body language from an artificial intelligence perspective
Gilanie et al. An Automated and Real-time Approach of Depression Detection from Facial Micro-expressions.
Krishna et al. Different approaches in depression analysis: A review
Vaijayanthi et al. Human Emotion Recognition from Body Posture with Machine Learning Techniques
Hou Deep Learning-Based Human Emotion Detection Framework Using Facial Expressions
Sekar et al. Semantic-based visual emotion recognition in videos-a transfer learning approach
Dhanapal et al. Pervasive computational model and wearable devices for prediction of respiratory symptoms in progression of COVID-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination