CN112101095B - Suicide and violence tendency emotion recognition method based on language and limb characteristics
- Publication number: CN112101095B
- Application number: CN202010764407.7A
- Authority: CN (China)
- Prior art keywords: layer, vector, text description, neural network, LSTM
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20 Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition
- G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/253 Fusion techniques of extracted features
- G06N3/044 Recurrent networks, e.g. Hopfield networks
- G06N3/045 Combinations of networks
- G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084 Backpropagation, e.g. using gradient descent
- G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F2218/08 Aspects of pattern recognition specially adapted for signal processing; Feature extraction
- Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a suicide and violence tendency emotion recognition method based on language and limb characteristics. The method comprises the following steps: collecting video and audio with Kinect and converting the voice features and visual features extracted from them into text descriptions; fusing the text descriptions through a neural network with a self-organizing map layer to obtain a text description embedded vector; and analyzing suicide and violence tendencies from the text description embedded vector using a Softmax function. The invention considers both static and dynamic body movements, resulting in higher recognition efficiency.
Description
Technical Field
The invention belongs to the field of emotion recognition, and particularly relates to a suicide and violence tendency emotion recognition method based on language and limb characteristics.
Background
Detecting a person's emotional state is useful for preventing self-harm and violent behavior. Human emotion can be recognized in various ways, for example from the electrocardiogram, the electroencephalogram, speech, or facial expression. Among these signals, physiological signals are widely used for emotion recognition, and in recent years human motion has emerged as a new feature. There are two conventional approaches: one measures the subject's physiological indices by contact, while the other observes the subject's physiological characteristics by non-contact means. Although a non-invasive approach is preferable, subjects can disguise their mood. Audio and video (Journal of Beijing University, 2006, 5(1): 165-182) are readily available but susceptible to noise. Fusing multiple features is therefore necessary. Although existing methods achieve significant results, improvements are still needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a suicide and violence tendency emotion recognition method based on language and limb characteristics. A Kinect with an infrared (IR) camera keeps face images from being affected by illumination, so Kinect is used to collect voice, limb movements, and related information. The method considers the spectral and prosodic features of speech to help identify the emotion in the speech content: by extracting prosodic and spectral features, the speech can be converted into textual descriptions covering intonation and speech rate. To describe motion accurately, body movement is divided into static and dynamic body movement; a convolutional neural network (CNN) and a bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) are adopted to analyze the static and dynamic motion of the human body, respectively. Multi-sensor data fusion requires a reliable fusion method, and fusing such information into text is effective for emotion recognition. Finally, the voice, limb-action, and related information is fused into the text description.
The object of the invention is achieved by at least one of the following technical solutions.
The suicide and violence tendency emotion recognition method based on language and limb characteristics comprises the following steps:
S1, collecting video and audio by using Kinect, and converting the voice features and the visual features extracted from the video and the audio into text descriptions, respectively;
s2, fusing the text description through a neural network with a self-organizing map layer to obtain a text description embedded vector;
s3, analyzing suicide and violence tendency by using a Softmax function according to the text description embedded vector.
Further, in step S1, the speech features include speech content, prosody and spectrum; the visual characteristics are limb movements of a human body, and the limb movements are divided into static movements and dynamic movements.
Further, step S1 includes the steps of:
S1.1, directly converting the voice content into a content text description through the Kinect for Windows SDK v2.0 public preview; converting the prosody and spectrum features into a voice-state text description through a back propagation neural network (BPNN) of classical structure;
S1.2, converting a single frame selected from the captured video into a static motion text description through convolutional neural network (CNN) processing; acquiring and representing bone joint points from Kinect, recording the joint positions at every moment, and finally forming sequential skeleton data; encoding the skeleton-point sequences corresponding to continuous actions, namely N predefined actions, into vectors, processing the vectors with a bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) to obtain the action sequence, and finally classifying the action sequence into the corresponding dynamic motion text description with a Softmax classifier.
Further, the back propagation neural network (BPNN) has the following structure: the training sample space Ω contains n training samples $\{(x_k, \hat{y}_k)\}_{k=1}^{n}$. The output value (i.e., the predicted value) of sample k after passing through the network is $y_k = \{y_{k1}, \dots, y_{kl}\}$; the feature vector $x_k$ of the k-th training sample has dimension m, and the predicted-value vector $y_k$ and the true-value vector $\hat{y}_k$ both have dimension l. The network has a three-layer structure: layer 1 is the input layer, layer 2 is the hidden layer, and layer 3 is the output layer. The BP algorithm updates every weight in the network by gradient descent. With the batch size set to p, the mean of the sum of squared errors is taken as the objective function:

$$E = \frac{1}{2p}\sum_{k=1}^{p}\sum_{q=1}^{l}\left(y_{kq} - \hat{y}_{kq}\right)^{2},$$

where k indexes the samples in the batch and q indexes the nodes of the output layer.
Further, the convolutional neural network (CNN) comprises an input layer, a hidden layer, and a fully connected layer; the hidden layer comprises two convolutional layers and two pooling layers.

The convolutional layer is computed as

$$x_j^{l}(i) = f\left(\sum_{a} w_j^{a}\, x^{l-1}(i+a) + b_j\right),$$

where l denotes the l-th convolutional layer and i the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and varies between 0 and N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function; $x_j^{l}(i)$ is the i-th component of the j-th output matrix of the l-th convolutional layer; $b_j$ is the bias of the j-th output matrix; and $w_j^{a}$ is the weight of the a-th convolution kernel of the j-th output matrix.

The pooling layers are built with mean pooling; the input of a mean-pooling layer comes from the convolutional layer above it, and its output serves as the input of the next convolutional layer. The computation is

$$p_j^{l}(i) = \frac{1}{S}\sum_{s=1}^{S} x_j^{l}\big((i-1)S + s\big),$$

where $p_j^{l}(i)$ denotes the local output after pooling, $x_j^{l}$ denotes the output matrix of the convolutional layer, and S is the size of the pooling window.
Further, in the bidirectional long short-term memory conditional random field (Bi-LSTM-CRF), the bidirectional long short-term memory network (Bi-LSTM) is given an input sequence $\{x_1, x_2, \dots, x_t, \dots, x_T\}$, where t denotes the t-th time step and T the total number of time steps. The output of the hidden layer is computed as

$$h_t = \sigma_h\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right),$$

where $h_t$ is the output of the hidden layer at time t, $W_{xh}$ are the input-to-hidden weights, $W_{hh}$ the hidden-to-hidden weights, $b_h$ the bias of the hidden layer, and $\sigma_h$ the activation function. A bidirectional hidden layer is used to strengthen the bilateral (past and future) context: the first layer is a forward LSTM and the second layer is a backward LSTM.
Further, step S2 includes the steps of:
S2.1, connecting the fixed-size static motion text description, dynamic motion text description, and voice-state text description into a vector A using a long short-term memory (LSTM) neural network; converting the content text description into a fixed-length space vector with the word2vec method, and embedding that space vector into a fixed-size vector B using an LSTM neural network that serves as the forward LSTM of a bidirectional long short-term memory network (Bi-LSTM); vector A and vector B are kept the same size, and they are combined by element-wise multiplication to obtain a cross effect, yielding the text description embedded vector x, which is then normalized.
Further, in step S3, suicide and violence tendencies are analyzed from the text description embedded vector using the Softmax function, computed as

$$P(j \mid x) = \frac{\exp\left(W_j x + b\right)}{\sum_{j'} \exp\left(W_{j'} x + b\right)},$$

where $W_j$ is the weight matrix of the j-th emotion tendency and b is the bias; the two emotion-tendency categories are with and without suicide and violence tendencies.
Compared with the prior art, the invention has the following advantages:
(1) The present invention aligns multimodal data at the text level. The intermediate text representation and the proposed fusion method form a framework for fusing limb movements and facial expressions; the invention reduces the dimensionality of limb actions and facial expressions and unifies the two kinds of information in a single representation.
(2) The invention considers both static and dynamic body movements, resulting in higher recognition efficiency.
(3) The invention adopts Kinect for data acquisition, which offers high performance and convenient operation.
Drawings
FIG. 1 is a flow chart of the suicide and violence tendency emotion recognition method based on language and limb characteristics of the present invention.
Detailed Description
Specific embodiments of the present invention will be described further below with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.
Examples:
the suicide and violence tendency emotion recognition method based on language and limb characteristics, as shown in fig. 1, comprises the following steps:
S1, collecting video and audio by using Kinect, and converting the voice features and the visual features extracted from the video and the audio into text descriptions, respectively;
the speech features include speech content, prosody and spectrum; the visual characteristics are limb movements of a human body, and the limb movements are divided into static movements and dynamic movements.
Step S1 comprises the steps of:
S1.1, directly converting the voice content into a content text description through the Kinect for Windows SDK v2.0 public preview; converting the prosody and spectrum features into a voice-state text description through a back propagation neural network (BPNN) of classical structure;
The back propagation neural network (BPNN) has the following structure: the training sample space Ω contains n training samples $\{(x_k, \hat{y}_k)\}_{k=1}^{n}$. The output value (i.e., the predicted value) of sample k after passing through the network is $y_k = \{y_{k1}, \dots, y_{kl}\}$; the feature vector $x_k$ of the k-th training sample has dimension m, and the predicted-value vector $y_k$ and the true-value vector $\hat{y}_k$ both have dimension l. The network has a three-layer structure: layer 1 is the input layer, layer 2 is the hidden layer, and layer 3 is the output layer. The BP algorithm updates every weight in the network by gradient descent. With the batch size set to p, the mean of the sum of squared errors is taken as the objective function:

$$E = \frac{1}{2p}\sum_{k=1}^{p}\sum_{q=1}^{l}\left(y_{kq} - \hat{y}_{kq}\right)^{2},$$

where k indexes the samples in the batch and q indexes the nodes of the output layer.
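To make the training step concrete, the following is a minimal PyTorch sketch of such a three-layer BPNN; the feature dimension m, hidden width, output dimension l, batch size p, and learning rate are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

# Three-layer structure: input layer -> hidden layer -> output layer.
# All sizes here are placeholders, not values from the patent.
m, hidden, l, p = 24, 64, 8, 32   # feature dim, hidden nodes, output dim, batch size

bpnn = nn.Sequential(
    nn.Linear(m, hidden),
    nn.Sigmoid(),                  # classical BPNN hidden-layer activation
    nn.Linear(hidden, l),
)
optimizer = torch.optim.SGD(bpnn.parameters(), lr=0.1)  # plain gradient descent

x = torch.randn(p, m)              # batch of prosody/spectrum feature vectors
y_true = torch.randn(p, l)         # corresponding true-value vectors

optimizer.zero_grad()
y_pred = bpnn(x)
# Objective: E = 1/(2p) * sum_k sum_q (y_kq - y_hat_kq)^2
loss = ((y_pred - y_true) ** 2).sum() / (2 * p)
loss.backward()                    # backpropagation of the error
optimizer.step()                   # weight update
```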
S1.2, converting a single frame selected from the captured video into a static motion text description through convolutional neural network (CNN) processing. The CNN comprises an input layer, a hidden layer, and a fully connected layer; the hidden layer comprises two convolutional layers and two pooling layers.

The convolutional layer is computed as

$$x_j^{l}(i) = f\left(\sum_{a} w_j^{a}\, x^{l-1}(i+a) + b_j\right),$$

where l denotes the l-th convolutional layer and i the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and varies between 0 and N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function; $x_j^{l}(i)$ is the i-th component of the j-th output matrix of the l-th convolutional layer; $b_j$ is the bias of the j-th output matrix; and $w_j^{a}$ is the weight of the a-th convolution kernel of the j-th output matrix.

The pooling layers are built with mean pooling; the input of a mean-pooling layer comes from the convolutional layer above it, and its output serves as the input of the next convolutional layer. The computation is

$$p_j^{l}(i) = \frac{1}{S}\sum_{s=1}^{S} x_j^{l}\big((i-1)S + s\big),$$

where $p_j^{l}(i)$ denotes the local output after pooling, $x_j^{l}$ denotes the output matrix of the convolutional layer, and S is the size of the pooling window.
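As an illustration of this static-motion branch, here is a small sketch of a CNN with two convolution plus mean-pooling stages followed by a fully connected layer; the channel counts, kernel sizes, frame resolution, and number of static-motion classes are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hidden layer: two convolutional layers, each followed by mean (average) pooling;
# Sigmoid plays the role of the nonlinear function f in the convolution formula.
# All sizes below are illustrative, not taken from the patent.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(2),                   # mean pooling feeds the next convolutional layer
    nn.Conv2d(16, 32, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 13 * 13, 10),       # fully connected layer -> static-motion classes
)

frame = torch.randn(1, 3, 64, 64)      # a single frame selected from the captured video
static_motion_logits = cnn(frame)      # mapped to a static motion text description
```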
Bone joint points are acquired and represented from Kinect, their positions are recorded at every moment, and the result forms sequential skeleton data. The skeleton-point sequences corresponding to continuous actions, namely N predefined actions, are encoded into vectors, a bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) processes the vectors to obtain the action sequence, and a Softmax classifier finally maps the action sequence to the corresponding dynamic motion text description.
In the bidirectional long short-term memory conditional random field (Bi-LSTM-CRF), the bidirectional long short-term memory network (Bi-LSTM) is given an input sequence $\{x_1, x_2, \dots, x_t, \dots, x_T\}$, where t denotes the t-th time step and T the total number of time steps. The output of the hidden layer is computed as

$$h_t = \sigma_h\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right),$$

where $h_t$ is the output of the hidden layer at time t, $W_{xh}$ are the input-to-hidden weights, $W_{hh}$ the hidden-to-hidden weights, $b_h$ the bias of the hidden layer, and $\sigma_h$ the activation function. A bidirectional hidden layer is used to strengthen the bilateral (past and future) context: the first layer is a forward LSTM and the second layer is a backward LSTM.
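The sketch below shows the Bi-LSTM part of this dynamic-motion branch running over a skeleton sequence; the joint encoding (25 Kinect v2 joints times 3 coordinates), sequence length, and number of action classes are assumptions, and a per-step argmax stands in for the CRF decoding (e.g., Viterbi) that the full Bi-LSTM-CRF would perform.

```python
import torch
import torch.nn as nn

# A skeleton sequence: T time steps, each a flattened vector of joint coordinates.
# 25 joints x 3 coordinates matches Kinect v2, but the exact encoding is assumed here.
T, joint_dim, hidden, n_actions = 30, 25 * 3, 128, 10

bilstm = nn.LSTM(joint_dim, hidden, batch_first=True, bidirectional=True)
emission = nn.Linear(2 * hidden, n_actions)   # forward and backward states concatenated

skeleton_seq = torch.randn(1, T, joint_dim)
h, _ = bilstm(skeleton_seq)                   # h_t from both directions at every step
scores = emission(h)                          # per-step action scores
# A full Bi-LSTM-CRF would decode with CRF transition scores; argmax is a stand-in.
action_sequence = scores.argmax(dim=-1)       # sequence of predicted action labels
```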
S2, fusing the text description through a neural network with a self-organizing map layer to obtain a text description embedded vector, wherein the method comprises the following steps of:
S2.1, connecting the fixed-size static motion text description, dynamic motion text description, and voice-state text description into a vector A using a long short-term memory (LSTM) neural network; converting the content text description into a fixed-length space vector with the word2vec method, and embedding that space vector into a fixed-size vector B using an LSTM neural network that serves as the forward LSTM of a bidirectional long short-term memory network (Bi-LSTM); vector A and vector B are kept the same size, and they are combined by element-wise multiplication to obtain a cross effect, yielding the text description embedded vector x, which is then normalized.
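A sketch of step S2.1 under simplifying assumptions: the three state descriptions are represented as pre-embedded token vectors, a randomly initialized embedding table stands in for a trained word2vec model, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, hid = 100, 100                          # illustrative dimensions

# Vector A: one LSTM runs over the concatenated static motion, dynamic motion,
# and voice-state text descriptions (here already embedded as token vectors).
lstm_a = nn.LSTM(emb_dim, hid, batch_first=True)
state_tokens = torch.randn(1, 12, emb_dim)       # embedded state descriptions
_, (vec_a, _) = lstm_a(state_tokens)             # final hidden state is vector A

# Vector B: word2vec-style embeddings of the content text fed to a second LSTM
# (the forward LSTM of the Bi-LSTM in the patent's formulation).
embedding = nn.Embedding(5000, emb_dim)          # stand-in for a trained word2vec table
lstm_b = nn.LSTM(emb_dim, hid, batch_first=True)
content_ids = torch.randint(0, 5000, (1, 20))    # token ids of the speech content
_, (vec_b, _) = lstm_b(embedding(content_ids))

# Element-wise multiplication gives the cross effect; the result is normalized.
x = F.normalize((vec_a * vec_b).squeeze(0), dim=-1)  # text description embedded vector
```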
S3, analyzing suicide and violence tendencies from the text description embedded vector using the Softmax function, computed as

$$P(j \mid x) = \frac{\exp\left(W_j x + b\right)}{\sum_{j'} \exp\left(W_{j'} x + b\right)},$$

where $W_j$ is the weight matrix of the j-th emotion tendency and b is the bias; the two emotion-tendency categories are with and without suicide and violence tendencies.
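Step S3 then reduces to a linear layer followed by Softmax over the two tendency categories, as sketched below; the embedding dimension and the class ordering are assumed conventions.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(100, 2)              # computes W_j x + b for each category j
x = torch.randn(100)                        # text description embedded vector from S2
probs = torch.softmax(classifier(x), dim=-1)
has_tendency = probs.argmax().item() == 0   # class 0 = "with tendency" is an assumed convention
```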
Claims (5)
1. The suicide and violence tendency emotion recognition method based on language and limb characteristics is characterized by comprising the following steps of:
S1, collecting video and audio by using Kinect, and converting the voice features and the visual features extracted from the video and the audio into text descriptions, respectively; specifically:
S1.1, directly converting the voice content into a content text description through the Kinect for Windows SDK v2.0 public preview; converting the prosody and spectrum features into a voice-state text description through a back propagation neural network (BPNN) of classical structure;

S1.2, converting a single frame selected from the captured video into a static motion text description through convolutional neural network (CNN) processing; acquiring and representing bone joint points from Kinect, recording the joint positions at every moment, and finally forming sequential skeleton data; encoding the skeleton-point sequences corresponding to continuous actions, namely N predefined actions, into vectors, processing the vectors with a bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) to obtain the action sequence, and finally classifying the action sequence into the corresponding dynamic motion text description with a Softmax classifier;

wherein the back propagation neural network (BPNN) has the following structure: the training sample space Ω contains n training samples $\{(x_k, \hat{y}_k)\}_{k=1}^{n}$; the output value (i.e., the predicted value) of sample k after passing through the network is $y_k = \{y_{k1}, \dots, y_{kl}\}$; the feature vector $x_k$ of the k-th training sample has dimension m, and the predicted-value vector $y_k$ and the true-value vector $\hat{y}_k$ both have dimension l; the network has a three-layer structure: layer 1 is the input layer, layer 2 is the hidden layer, and layer 3 is the output layer; the BP algorithm updates every weight in the network by gradient descent; with the batch size set to p, the mean of the sum of squared errors is taken as the objective function:

$$E = \frac{1}{2p}\sum_{k=1}^{p}\sum_{q=1}^{l}\left(y_{kq} - \hat{y}_{kq}\right)^{2},$$

where k indexes the samples in the batch and q indexes the nodes of the output layer;
S2, fusing the text description through a neural network with a self-organizing map layer to obtain a text description embedded vector;

S3, analyzing suicide and violence tendencies from the text description embedded vector using the Softmax function, computed as

$$P(j \mid x) = \frac{\exp\left(W_j x + b\right)}{\sum_{j'} \exp\left(W_{j'} x + b\right)},$$

where $W_j$ is the weight matrix of the j-th emotion tendency and b is the bias; the two emotion-tendency categories are with and without suicide and violence tendencies.
2. The method for identifying suicidal and violent tendencies based on language and limb characteristics as in claim 1 wherein in step S1 the speech characteristics comprise speech content, prosody and spectrum; the visual characteristics are limb movements of a human body, and the limb movements are divided into static movements and dynamic movements.
3. The language and limb feature based suicide and violence tendency emotion recognition method of claim 1, wherein the convolutional neural network (CNN) comprises an input layer, a hidden layer, and a fully connected layer, the hidden layer comprising two convolutional layers and two pooling layers;

the convolutional layer is computed as

$$x_j^{l}(i) = f\left(\sum_{a} w_j^{a}\, x^{l-1}(i+a) + b_j\right),$$

where l denotes the l-th convolutional layer and i the i-th component of the convolution output matrix; j is the index of the corresponding output matrix and varies between 0 and N, where N is the number of convolution output matrices; f is a nonlinear sigmoid-type function; $x_j^{l}(i)$ is the i-th component of the j-th output matrix of the l-th convolutional layer; $b_j$ is the bias of the j-th output matrix; and $w_j^{a}$ is the weight of the a-th convolution kernel of the j-th output matrix;

the pooling layers are built with mean pooling; the input of a mean-pooling layer comes from the convolutional layer above it, and its output serves as the input of the next convolutional layer, computed as

$$p_j^{l}(i) = \frac{1}{S}\sum_{s=1}^{S} x_j^{l}\big((i-1)S + s\big),$$

where $p_j^{l}(i)$ denotes the local output after pooling, $x_j^{l}$ denotes the output matrix of the convolutional layer, and S is the size of the pooling window.
4. The method for identifying suicidal and violent tendencies based on language and limb characteristics according to claim 1, wherein in the bidirectional long short-term memory conditional random field (Bi-LSTM-CRF), the bidirectional long short-term memory network (Bi-LSTM) is given an input sequence $\{x_1, x_2, \dots, x_t, \dots, x_T\}$, where t denotes the t-th time step and T the total number of time steps, and the output of the hidden layer is computed as

$$h_t = \sigma_h\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right),$$

where $h_t$ is the output of the hidden layer at time t, $W_{xh}$ are the input-to-hidden weights, $W_{hh}$ the hidden-to-hidden weights, $b_h$ the bias of the hidden layer, and $\sigma_h$ the activation function; a bidirectional hidden layer is used to strengthen the bilateral (past and future) context, the first layer being a forward LSTM and the second layer a backward LSTM.
5. The language and limb feature based suicide and violence tendency emotion recognition method as recited in claim 4, wherein step S2 comprises the following step:

S2.1, connecting the fixed-size static motion text description, dynamic motion text description, and voice-state text description into a vector A using a long short-term memory (LSTM) neural network; converting the content text description into a fixed-length space vector with the word2vec method, and embedding that space vector into a fixed-size vector B using an LSTM neural network that serves as the forward LSTM of a bidirectional long short-term memory network (Bi-LSTM); vector A and vector B are kept the same size, and they are combined by element-wise multiplication to obtain a cross effect, yielding the text description embedded vector x, which is then normalized.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010764407.7A (granted as CN112101095B) | 2020-08-02 | 2020-08-02 | Suicide and violence tendency emotion recognition method based on language and limb characteristics
Publications (2)

Publication Number | Publication Date
---|---
CN112101095A (en) | 2020-12-18
CN112101095B (en) | 2023-08-29
Family ID: 73750550

Family Applications (1)

Application Number | Priority Date | Filing Date | Status
---|---|---|---
CN202010764407.7A (CN112101095B) | 2020-08-02 | 2020-08-02 | Active

Country Status (1)

Country | Link
---|---
CN | CN112101095B (en)
Families Citing this family (1)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN117414135A * | 2023-10-20 | 2024-01-19 | 郑州师范学院 | Behavioral and psychological abnormality detection method, system and storage medium
Patent Citations (6)

Publication Number | Priority Date | Publication Date | Title
---|---|---|---
CN103049751A * | 2013-01-24 | 2013-04-17 | Improved weighted region matching method for recognizing pedestrians in high-altitude video
CN103279768A * | 2013-05-31 | 2013-09-04 | Method for identifying faces in videos based on incremental learning of face partitioning visual representations
CN103473801A * | 2013-09-27 | 2013-12-25 | Facial expression editing method based on single camera and motion capture data
CN106782602A * | 2016-12-01 | 2017-05-31 | Speech emotion recognition method based on long short-term memory network and convolutional neural networks
CN108363978A * | 2018-02-12 | 2018-08-03 | Body-language-based emotion perception method using deep learning and UKF
CN109597891A * | 2018-11-26 | 2019-04-09 | Text emotion analysis method based on bidirectional long short-term memory neural network
Also Published As

Publication Number | Publication Date
---|---
CN112101095A (en) | 2020-12-18
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant