CN110956953A - Quarrel identification method based on audio analysis and deep learning - Google Patents

Quarrel identification method based on audio analysis and deep learning

Info

Publication number
CN110956953A
CN110956953A
Authority
CN
China
Prior art keywords
voice
correlation
layer
quarrel
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911206603.6A
Other languages
Chinese (zh)
Other versions
CN110956953B (en)
Inventor
熊会元
陈彩婷
刘晟
马雯菲
张秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911206603.6A priority Critical patent/CN110956953B/en
Publication of CN110956953A publication Critical patent/CN110956953A/en
Application granted granted Critical
Publication of CN110956953B publication Critical patent/CN110956953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech recognition; speech classification or search using artificial neural networks
    • G10L25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters; the extracted parameters being correlation coefficients
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters; the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique; using neural networks
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use; for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination; for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a quarrel identification method based on audio analysis and deep learning, which comprises the following steps: collecting voice samples matched with the application scene; making a quarreling voice data set for emotion recognition from the voice samples and constructing an emotion recognition model; making a dialogue voice data set from the voice samples to obtain a correlation coefficient data table and constructing a conversation correlation classification model; collecting the audio to be detected and extracting voice features and correlation coefficients; and inputting the voice features into the emotion recognition model and the correlation coefficients into the conversation correlation classification model to judge quarreling voice. By combining quarrel identification with emotion recognition and conversation correlation analysis, the method can effectively detect quarrels in real situations. It is suitable for deployment in public transportation venues to assist management personnel in monitoring, so that quarrels can be discovered promptly and their consequences avoided in time.

Description

Quarrel identification method based on audio analysis and deep learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a quarrel identification method based on audio analysis and deep learning.
Background
Public transportation venues are crowded with mixed and dense flows of people, and friction and verbal disputes easily arise between individuals, leading to a series of serious consequences such as disrupting public transportation order, hindering the normal operation of public transportation vehicles, and even endangering personal safety. However, public places currently do not employ any recognition technology dedicated to quarrels, making it difficult to issue timely and effective early warnings and to intervene manually in violent quarrels and similar incidents.
At present, research on abnormal interpersonal communication such as quarrels remains largely blank in China. Human perception of a quarrel can be divided into "seeing" and "hearing". Since quarreling, unlike simple trackable actions such as "raising a hand" or "playing ball", is very difficult to recognize from vision alone, vision-based quarrel behavior recognition is still an open problem both at home and abroad. In the field of audio recognition, most domestic and foreign research on speech emotion focuses on recognizing the basic emotions of a single speaker, leaving a gap in the study of abnormal interpersonal-communication situations such as quarrels.
As the basis of human-machine emotional interaction, emotion recognition research has attracted increasing attention, and different algorithmic models for emotion recognition have been proposed at home and abroad. In general, the emotion recognition process comprises four parts: data acquisition, data preprocessing, emotion feature extraction, and emotion classification, and each part can be implemented in different ways. Common methods for emotion classification currently include: support vector machines (SVM), decision trees, random forests, hidden Markov models, Bayesian networks, K-means, linear regression analysis, multi-layer perceptrons, multi-classifier fusion, and deep neural networks, among others. Seppänen et al. [1] extracted more than 40 dimensions of emotional acoustic features related to fundamental frequency, energy and duration, and obtained a recognition rate of 60% with a KNN classifier. Li et al. [2] extracted two timbre features, amplitude perturbation (shimmer) and frequency perturbation (jitter), as speech emotion recognition features on the SUSAS emotion database, and achieved 69.1% recognition accuracy by combining MFCC features with the shimmer and jitter features. Hu et al. [3] proposed a support vector machine method based on GMM supervectors for the speech emotion classification task and found that an SVM classifier based on GMM supervectors outperforms a plain SVM classifier. The authors of [4] proposed a speech emotion recognition method based on local feature optimization, which removes speech frames with insignificant emotional features through cluster analysis before classification; its average accuracy on the corpus improved by 5% to 17%.
At present, speech emotion recognition technology is relatively mature and the classification of common emotions has been largely accomplished, but the following shortcomings remain: (1) domestic and foreign research on speech emotion recognition mainly focuses on how to combine mainstream algorithmic models to improve the recognition rate; (2) much of the related research targets emotion recognition of a single speaker in a single sentence and does not consider the characteristics of spoken dialogue, making it difficult to recognize dialogue anomalies closely related to emotion, such as quarrels.
Disclosure of Invention
The invention provides a quarreling recognition method based on audio analysis and deep learning, aiming to overcome the technical defects that existing speech emotion recognition technology lacks research on abnormal interpersonal situations such as quarrels in real scenarios and is limited to single-speaker, single-sentence recognition.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a quarrel identification method based on audio analysis and deep learning comprises the following steps:
s1: collecting a voice sample matched with an application scene, wherein the voice sample comprises two types of data of quarrel and non-quarrel;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
Wherein, the step S2 specifically includes the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample;
s22: extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group;
s23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
In step S24, the data tag in the fixed-length feature sequence group is subjected to One-Hot Encoding conversion.
In step S24, the deep neural network specifically includes:
an input layer: the fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains neurons in one-to-one correspondence with the dimensions of the feature sequence, performs a preliminary extraction of the temporal information, and returns all output feature sequences;
bidirectional LSTM layer: a new unidirectional LSTM layer is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains a plurality of neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains a plurality of neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
The LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units; each unit contains a three-gate structure consisting of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained.
In the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases. First, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t.
The backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b. Fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
The LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter. The similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
Wherein, the step S25 specifically includes the following steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, finally yielding the emotion recognition model.
Wherein, the step S3 specifically includes the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample;
s32: extracting the Mel-frequency cepstral coefficients (MFCC) of the dialogue voice data set; in order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence;
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
s37: and importing the correlation coefficient data table into a support vector machine classifier to obtain a conversation correlation classification model.
Wherein, the step S4 specifically includes the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed lengths, extracting an MFCC coefficient and carrying out normalization processing to be used as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
Wherein, the step S5 specifically includes: inputting the voice features obtained in the step S43 into the emotion recognition model; the sigmoid output layer outputs the judgment of the voice emotion, namely angry/not angry, and the data are labeled accordingly; if the result is angry, step S6 is executed; otherwise, the voice is judged to be non-quarreling.
Wherein, the step S6 specifically includes: inputting the three correlation coefficients obtained in the step S44 into the conversation correlation classification model; the support vector machine model performs the prediction and judges the conversation correlation, where an output of 1 indicates that the scene is predicted to be conversation-correlated and an output of 0 indicates that it is predicted to be conversation-uncorrelated.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a quarrel recognition method based on audio analysis and deep learning, which is based on a deep learning model of a bidirectional LSTM and a self-attention mechanism, realizes the preliminary recognition of the emotion related to quarrel, extracts the voice characteristics of conversation correlation on the basis of emotion recognition, and finally realizes quarrel recognition through a conversation correlation classification model based on a support vector machine.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a deep neural network architecture;
FIG. 3 is a schematic diagram of the structure of an LSTM cell.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, the quarreling identification method based on audio analysis and deep learning includes the following steps:
s1: collecting voice samples matched with an application scene, wherein the voice samples comprise quarrel and non-quarrel data, and data balance is kept;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
In the specific implementation process, the quarreling identification method based on audio analysis and deep learning provided by the invention is based on a deep learning model of a bidirectional LSTM and a self-attention mechanism, realizes the initial identification of emotion related to quarreling, extracts the voice characteristics of conversation correlation on the basis of emotion identification, and finally realizes quarreling identification through a conversation correlation classification model based on a support vector machine.
Example 2
More specifically, on the basis of embodiment 1, the step S2 specifically includes the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample; the quarreling voice data set adopted in this embodiment is collected from quarrel segments of film and television works, with audio durations of about 10 s, which is long enough to reflect the voice characteristics of a quarrel while not being so long as to be difficult to process and generalize. After noise reduction and screening by listening, 430 audio clips with full emotional expression, low ambiguity and good sound quality were retained, sampled at 16 kHz with 16-bit quantization; 215 clips are quarreling and 215 are non-quarreling, meeting the requirement of data balance.
S22: and extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group, namely a Mel frequency cepstrum coefficient MFCC. MFCC is a commonly used speech feature that combines the auditory perception features of a human ear with the mechanism of speech signal generation. The present embodiment extracts 48-dimensional MFCC coefficients as audio features from the two data respectively by using the librosa toolkit in python, and performs normalization processing.
S23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
More specifically, in step S24, the data labels in the fixed-length feature sequence group are converted by One-Hot Encoding.
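As a rough illustration of the data preparation in steps S22 to S24 (MFCC extraction with librosa, per-frame z-score standardization, zero-padding to a fixed length, and one-hot encoding of the labels), the following Python sketch shows one possible implementation; the file names and label list are hypothetical, the per-frame standardization is one reading of step S22, and this is a minimal sketch under these assumptions, not the exact code of the embodiment.

```python
# Minimal sketch of steps S22-S24 data preparation: 48-dim MFCC extraction,
# per-frame z-score standardization, zero-padding to the maximum length,
# and one-hot encoding of the quarrel / non-quarrel labels.
import numpy as np
import librosa
from tensorflow.keras.utils import to_categorical

def extract_mfcc(path, sr=16000, n_mfcc=48):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (48, n_frames)
    return mfcc.T                                            # shape (n_frames, 48)

def zscore_per_frame(seq, eps=1e-8):
    mean = seq.mean(axis=1, keepdims=True)                   # standardize each frame (row)
    std = seq.std(axis=1, keepdims=True)
    return (seq - mean) / (std + eps)

def pad_to_max(sequences):
    max_len = max(s.shape[0] for s in sequences)
    padded = np.zeros((len(sequences), max_len, sequences[0].shape[1]))
    for i, s in enumerate(sequences):
        padded[i, :s.shape[0], :] = s                        # zero-fill the tail
    return padded

files = ["quarrel_001.wav", "normal_001.wav"]                # hypothetical file names
labels = [1, 0]                                              # 1 = quarrel, 0 = non-quarrel
feats = [zscore_per_frame(extract_mfcc(f)) for f in files]
X = pad_to_max(feats)                                        # (n_samples, max_len, 48)
y = to_categorical(labels, num_classes=2)                    # one-hot labels for step S24
```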
More specifically, as shown in fig. 2, in step S24, the deep neural network specifically includes:
an input layer: the 48-dimensional fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains 48 neurons, performs a preliminary extraction of the temporal information, and returns all output feature sequences, which remain 48-dimensional;
bidirectional LSTM layer: a new unidirectional LSTM layer containing 48 neurons is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation, with dimension 96;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains 64 neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains 32 neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
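To make the architecture concrete, the following sketch shows one possible realization in TensorFlow/Keras, a framework assumed here for illustration since the patent does not specify one. Layer widths, dropout rates and the masking of zero-padded frames follow this embodiment, while the attention mechanism layer is written as a simple additive self-attention following the weight formulas described just below; it is an illustrative sketch, not the inventors' exact implementation, and the activation of the second fully-connected layer is an assumption.

```python
# Illustrative Keras sketch of the emotion recognition network of this embodiment.
# Assumes TensorFlow 2.x; the attention layer follows the additive form
# e_j = u·tanh(w·h_j + b), a = softmax(e), c = sum_j a_j·h_j described below.
import tensorflow as tf
from tensorflow.keras import layers, models

class AdditiveAttention(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True          # accept the mask coming from Masking/LSTM

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim, 1), initializer="glorot_uniform")

    def call(self, h):                                       # h: (batch, time, dim)
        scores = tf.tanh(tf.tensordot(h, self.w, axes=1) + self.b)
        scores = tf.tensordot(scores, self.u, axes=1)        # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)              # attention distribution over time
        # (for simplicity, the padding mask is not applied to the attention weights)
        return tf.reduce_sum(weights * h, axis=1)            # context vector (batch, dim)

    def compute_mask(self, inputs, mask=None):
        return None                                          # output is a single vector per sample

def build_emotion_model(max_len, n_features=48):
    inputs = layers.Input(shape=(max_len, n_features))
    x = layers.Masking(mask_value=0.0)(inputs)               # filter the zero-padded '0' values
    x = layers.LSTM(48, return_sequences=True)(x)            # unidirectional LSTM layer
    x = layers.Bidirectional(layers.LSTM(48, return_sequences=True))(x)   # 96-dim outputs
    x = AdditiveAttention()(x)                               # attention mechanism layer
    x = layers.Dense(64, activation="relu")(x)               # first fully-connected layer
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(32, activation="relu")(x)               # second fully-connected layer (activation assumed)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(2, activation="sigmoid")(x)       # angry / not angry
    return models.Model(inputs, outputs)
```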
More specifically, as shown in fig. 3, the LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units; each unit contains a three-gate structure consisting of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained.
In the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases. First, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t.
The backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b. Fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
The LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter. The similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
More specifically, the step S25 includes the following specific steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, with the learning rate set to 0.0001, the batch size to 32 and the maximum number of epochs to 20, finally yielding the emotion recognition model.
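A corresponding training sketch, again assuming TensorFlow/Keras and scikit-learn and reusing `X`, `y` and `build_emotion_model` from the sketches above, might look as follows.

```python
# Training sketch for the emotion recognition model with the hyperparameters of
# this embodiment: 80/20 split, cross-entropy loss, Adam with learning rate
# 0.0001, batch size 32, at most 20 epochs.
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = build_emotion_model(max_len=X.shape[1])
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",        # cross-entropy over the two sigmoid outputs
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          batch_size=32, epochs=20)
```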
More specifically, the step S3 specifically includes the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample; the dialogue voice data set adopted in this embodiment is collected from dialogue segments of film and television works. For simplicity, this embodiment only considers conversations between two people. Letting A denote a turn in which speaker A talks and B a turn in which speaker B talks, every clipped segment follows at least an ABAB pattern to ensure correlation, the numbers of A and B turns in a segment are even, and the total number of turns ranges from 4 to 12. The sampling frequency is set to 44100 Hz with 16-bit quantization. The voice data of each dialogue segment are then split in temporal order so that each resulting short clip contains the voice of only one speaker. This embodiment finally obtains 10 quarreling dialogue scenes and 11 normal dialogue scenes, totaling 180 single-speaker audio clips.
S32: in this embodiment, the librosa toolkit in python is used to extract 20-dimensional MFCC coefficients from the dialogue voice data set as audio features, followed by normalization, so as to obtain each speaker's speech time series; the 20 MFCC coefficients of each column (frame) are averaged to represent the speech feature of that frame. In order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence (a sketch of this construction is given below);
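The construction of the two aligned sequences might be sketched as follows, assuming librosa; `segments` (the time-ordered single-speaker clips obtained in S31) and the file names are hypothetical, and the normalization step is omitted for brevity.

```python
# Sketch of step S32: build the two aligned scalar sequences alist and blist
# for one conversation scene. Each frame is represented by the mean of its 20
# MFCC coefficients; frames in which the other speaker talks are set to 0.
import numpy as np
import librosa

def frame_features(path, sr=44100, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (20, n_frames)
    return mfcc.mean(axis=0)                                 # one value per frame

def build_aligned_sequences(segments):
    alist, blist = [], []
    for speaker, path in segments:                           # e.g. A, B, A, B, ...
        feats = frame_features(path)
        if speaker == "A":
            alist.extend(feats)
            blist.extend([0.0] * len(feats))                 # B is silent in these frames
        else:
            alist.extend([0.0] * len(feats))
            blist.extend(feats)
    return np.array(alist), np.array(blist)

segments = [("A", "scene1_turn1.wav"), ("B", "scene1_turn2.wav"),
            ("A", "scene1_turn3.wav"), ("B", "scene1_turn4.wav")]   # hypothetical clips
alist, blist = build_aligned_sequences(segments)
```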
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
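Given alist and blist, the three coefficients of steps S33 to S35 can be obtained directly with scipy.stats (scipy's kendalltau defaults to the tie-corrected tau-b variant, which matches the formula above); a minimal sketch:

```python
# Sketch of steps S33-S35: the three correlation coefficients between the two
# aligned sequences. Each scene contributes one row [pearson, spearman, kendall]
# of the correlation coefficient data table.
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlation_features(alist, blist):
    pearson, _ = pearsonr(alist, blist)
    spearman, _ = spearmanr(alist, blist)
    kendall, _ = kendalltau(alist, blist)
    return [pearson, spearman, kendall]

row = correlation_features(alist, blist)
```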
s37: importing the correlation coefficient data table into a support vector machine classifier to obtain the conversation correlation classification model; in this embodiment, the support vector machine classifier is trained with a training ratio train_size of 0.6, the kernel function kernel is set to 'rbf' (a Gaussian kernel) with a gamma value of 20 and a penalty coefficient C of 0.8, and the decision function decision_function_shape is 'ovr', i.e. each class is separated from the other classes.
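A corresponding scikit-learn sketch of step S37, using the parameters of this embodiment, is shown below; `X_corr` and `y_corr` (the rows of the correlation coefficient data table and their labels) are assumed to have been prepared as above.

```python
# Sketch of step S37: training the conversation correlation classifier with
# scikit-learn. X_corr holds one row of Pearson/Spearman/Kendall coefficients
# per scene; y_corr is 1 for conversation-correlated scenes and 0 otherwise.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_tr, X_te, y_tr, y_te = train_test_split(X_corr, y_corr, train_size=0.6, random_state=42)

svm = SVC(kernel="rbf", gamma=20, C=0.8, decision_function_shape="ovr")
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```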
More specifically, the step S4 specifically includes the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed length of 10s, extracting 48-dimensional MFCC coefficients and performing normalization processing to serve as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
More specifically, the step S5 specifically includes: inputting the voice features obtained in the step S43 into the emotion recognition model, with input shape time_steps × 48, where time_steps is the number of audio frames and 48 is the number of features; the sigmoid output layer outputs the judgment of the voice emotion, namely angry/not angry, and the data are labeled accordingly; if the result is angry, step S6 is executed; otherwise, the voice is judged to be non-quarreling.
More specifically, the step S6 specifically includes: inputting the three correlation coefficients obtained in the step S44 into the conversation correlation classification model; the support vector machine model performs the prediction and judges the conversation correlation, where an output of 1 indicates that the scene is predicted to be conversation-correlated and an output of 0 indicates that it is predicted to be conversation-uncorrelated.
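Putting steps S5 and S6 together, the two-stage decision could be sketched as follows; `model`, `svm`, `segment_features` and `correlation_row` are the hypothetical objects and inputs from the sketches above, and the index of the "angry" class depends on how the one-hot labels were built.

```python
# Sketch of the two-stage decision of steps S5-S6: emotion screening first,
# then the conversation-correlation check. segment_features is a time_steps x 48
# MFCC array for one 10 s segment; correlation_row holds the three coefficients
# from step S44.
import numpy as np

def is_quarrel(segment_features, correlation_row, model, svm):
    # Step S5: emotion recognition on the segment
    probs = model.predict(segment_features[np.newaxis, ...])[0]
    angry = np.argmax(probs) == 1        # label convention of the earlier sketch: 1 = quarrel/angry
    if not angry:
        return False                     # non-quarreling voice
    # Step S6: conversation-correlation check on the three correlation coefficients
    return bool(svm.predict([correlation_row])[0] == 1)
```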
In the specific implementation process, the embodiment of the invention successfully constructs a quarreling speech recognition framework and obtains ideal classification accuracy by applying a bidirectional long-short term memory network, an attention mechanism and correlation analysis based on emotion recognition analysis and conversation correlation analysis.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
[1] Eero Väyrynen, Juhani Toivanen, Tapio Seppänen. Classification of emotion in spoken Finnish using vowel-length segments: Increasing reliability with a fusion technique [J]. Speech Communication, 2010, 53(3).
[2] Xi Li, Jidong Tao, Johnson M.T., Soltis J., Savage A., Leong K.M., Newman J.D. Stress and Emotion Classification using Jitter and Shimmer Features [P]. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007.
[3] Hao Hu, Ming-Xing Xu, Wei Wu. GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition [P]. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007.
[4] Speech emotion recognition based on local feature optimization [J]. Journal of University of Chinese Academy of Sciences, 2017, 34(04): 431-438.

Claims (10)

1. The quarreling identification method based on audio analysis and deep learning is characterized by comprising the following steps of:
s1: collecting a voice sample matched with an application scene, wherein the voice sample comprises two types of data of quarrel and non-quarrel;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
2. The quarreling identification method based on audio analysis and deep learning of claim 1, wherein the step S2 specifically comprises the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample;
s22: extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group;
s23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
3. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein in said step S24, the data tags of said fixed-length feature sequence group are transformed by One-Hot Encoding.
4. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein in the step S24, the deep neural network specifically comprises:
an input layer: the fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains neurons in one-to-one correspondence with the dimensions of the feature sequence, performs a preliminary extraction of the temporal information, and returns all output feature sequences;
bidirectional LSTM layer: a new unidirectional LSTM layer is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains a plurality of neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains a plurality of neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
5. The quarreling identification method based on audio analysis and deep learning of claim 4, wherein the LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units, each unit containing a three-gate structure of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained;
in the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases; first, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t;
the backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b; fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
the LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter; the similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
6. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein said step S25 comprises the following steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, finally yielding the emotion recognition model.
7. The quarreling identification method according to claim 4, wherein the step S3 specifically comprises the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample;
s32: extracting the Mel-frequency cepstral coefficients (MFCC) of the dialogue voice data set; in order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence;
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
s37: and importing the correlation coefficient data table into a support vector machine classifier to obtain a conversation correlation classification model.
8. The method for identifying quarrel based on audio analysis and deep learning of claim 7, wherein said step S4 comprises the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed lengths, extracting an MFCC coefficient and carrying out normalization processing to be used as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
9. The method for identifying quarrel based on audio analysis and deep learning of claim 8, wherein the step S5 specifically comprises: inputting the voice features obtained in step S43 into the emotion recognition model; the sigmoid output layer outputs the judgment of the voice emotion, namely angry or not angry, and the data are labelled accordingly; if the judgment is angry, step S6 is executed; otherwise the voice is judged to be non-quarrel voice.
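A minimal sketch of the S5 decision, assuming `emotion_model` is a trained network (for example a tf.keras model) whose sigmoid output layer returns a per-segment probability of anger; the 0.5 threshold and the "any angry segment" rule are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def is_angry(emotion_model, segment_features, threshold=0.5):
    """S5: sigmoid outputs lie in [0, 1]; flag the audio as angry if any
    fixed-length segment exceeds the threshold, otherwise treat it as
    non-quarrel voice."""
    probs = np.asarray(emotion_model.predict(segment_features)).ravel()
    return bool(np.any(probs >= threshold))
```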
10. The method for identifying quarrel based on audio analysis and deep learning of claim 9, wherein the step S6 specifically comprises: inputting the three correlation coefficients obtained in step S44 into the conversation correlation classification model, i.e. the support vector machine model, for prediction, and judging the conversation correlation: an output of 1 indicates that the prediction is conversation-correlated, and an output of 0 indicates that the prediction is not conversation-correlated.
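Combining the two decisions of claims 9 and 10, the following sketch shows the overall judgment: audio is flagged as a quarrel only when the emotion model reports anger (S5) and the SVM conversation-correlation model predicts 1 (S6). `is_angry` and `train_correlation_model` refer to the illustrative sketches above, not to the patented implementation.

```python
def detect_quarrel(emotion_model, correlation_model, segment_features, corr_features):
    """S5 + S6: anger alone is not enough; the speakers' sequences must also
    be predicted as dialogue-correlated (output 1) for a quarrel verdict."""
    if not is_angry(emotion_model, segment_features):          # S5 gate
        return False
    label = correlation_model.predict([corr_features])[0]      # S6: 1 = correlated
    return bool(label == 1)
```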
CN201911206603.6A 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning Active CN110956953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911206603.6A CN110956953B (en) 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning

Publications (2)

Publication Number Publication Date
CN110956953A (en) 2020-04-03
CN110956953B CN110956953B (en) 2023-03-10

Family

ID=69979200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911206603.6A Active CN110956953B (en) 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning

Country Status (1)

Country Link
CN (1) CN110956953B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197330A1 (en) * 2010-06-07 2019-06-27 Affectiva, Inc. Cognitive state based vehicle manipulation using near-infrared image processing
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109256150A (en) * 2018-10-12 2019-01-22 北京创景咨询有限公司 Speech emotion recognition system and method based on machine learning
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950275A (en) * 2020-08-06 2020-11-17 平安科技(深圳)有限公司 Emotion recognition method and device based on recurrent neural network and storage medium
CN113158720B (en) * 2020-12-15 2024-06-18 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism
CN113158720A (en) * 2020-12-15 2021-07-23 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism
CN112634947B (en) * 2020-12-18 2023-03-14 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN112634947A (en) * 2020-12-18 2021-04-09 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113037311A (en) * 2021-02-24 2021-06-25 重庆工程职业技术学院 Bracelet for indoor positioning and navigation
CN113033336A (en) * 2021-03-08 2021-06-25 北京金山云网络技术有限公司 Home device control method, apparatus, device and computer readable storage medium
CN113096649B (en) * 2021-03-31 2023-12-22 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113096649A (en) * 2021-03-31 2021-07-09 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113241060A (en) * 2021-07-09 2021-08-10 明品云(北京)数据科技有限公司 Security early warning method and system
WO2023019612A1 (en) * 2021-08-16 2023-02-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states
US11810596B2 (en) 2021-08-16 2023-11-07 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states
CN113744724A (en) * 2021-08-30 2021-12-03 苏州浪潮智能科技有限公司 Voice conversion method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110956953B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
CN110956953B (en) Quarrel recognition method based on audio analysis and deep learning
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Wang et al. Human emotion recognition by optimally fusing facial expression and speech feature
Bhangale et al. Survey of deep learning paradigms for speech processing
Oord et al. Parallel wavenet: Fast high-fidelity speech synthesis
CN109472031B (en) Aspect level emotion classification model and method based on double memory attention
Oord et al. Representation learning with contrastive predictive coding
Hsu et al. Unsupervised learning of disentangled and interpretable representations from sequential data
Cardona et al. Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs
Boukabous et al. Multimodal sentiment analysis using audio and text for crime detection
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Hamsa et al. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
Punithavathi et al. [Retracted] Empirical Investigation for Predicting Depression from Different Machine Learning Based Voice Recognition Techniques
CN112466284B (en) Mask voice identification method
Palo et al. Comparison of neural network models for speech emotion recognition
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Rammohan et al. Speech signal-based modelling of basic emotions to analyse compound emotion: Anxiety
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
Jain et al. Investigation Using MLP-SVM-PCA Classifiers on Speech Emotion Recognition
Karthikeyan et al. A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition
Wei et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Osman Emo-tts: Parallel transformer-based text-to-speech model with emotional awareness
Yuenyong et al. Real-Time Thai Speech Emotion Recognition With Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant