CN110956953A - Quarrel identification method based on audio analysis and deep learning - Google Patents

Quarrel identification method based on audio analysis and deep learning

Info

Publication number
CN110956953A
CN110956953A
Authority
CN
China
Prior art keywords
voice
correlation
layer
quarrel
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911206603.6A
Other languages
Chinese (zh)
Other versions
CN110956953B (en)
Inventor
熊会元
陈彩婷
刘晟
马雯菲
张秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911206603.6A priority Critical patent/CN110956953B/en
Publication of CN110956953A publication Critical patent/CN110956953A/en
Application granted granted Critical
Publication of CN110956953B publication Critical patent/CN110956953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech recognition; speech classification or search using artificial neural networks
    • G10L25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters; the extracted parameters being correlation coefficients
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters; the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique; using neural networks
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use; for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination; for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a quarrel identification method based on audio analysis and deep learning, which comprises the following steps: collecting voice samples matched with the application scene; making a quarreling voice data set for emotion recognition from the voice samples and constructing an emotion recognition model; making a dialogue voice data set from the voice samples to obtain a correlation coefficient data table and constructing a conversation correlation classification model; collecting the audio to be detected and extracting voice features and correlation coefficients; and inputting the voice features into the emotion recognition model and the correlation coefficients into the conversation correlation classification model to judge quarreling voice. By combining quarrel identification with emotion recognition and conversation correlation analysis, the method can effectively detect quarrels in real situations. It is suitable for deployment in public transportation venues to assist management personnel in monitoring, so that quarrels can be discovered promptly and their consequences avoided in time.

Description

Quarrel identification method based on audio analysis and deep learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a quarrel identification method based on audio analysis and deep learning.
Background
Public transportation venues are crowded with mixed and dense flows of people, and friction and verbal disputes easily arise between individuals, leading to a series of serious consequences such as disrupting public transportation order, hindering the normal operation of public transportation vehicles, and even endangering personal safety. However, public places currently do not employ any recognition technology dedicated to quarrels, making it difficult to issue timely and effective early warnings and to intervene manually in violent quarrels and similar incidents.
At present, research on abnormal interpersonal communication such as quarrels remains largely blank in China. Human perception of a quarrel can be divided into "seeing" and "hearing". Since quarreling, unlike simple trackable actions such as "raising a hand" or "playing ball", is very difficult to recognize from vision alone, vision-based quarrel behavior recognition is still an open problem both at home and abroad. In the field of audio recognition, most domestic and foreign research on speech emotion focuses on recognizing the basic emotions of a single speaker, leaving a gap in the study of abnormal interpersonal-communication situations such as quarrels.
As the basis of human-machine emotional interaction, emotion recognition research has attracted increasing attention, and different algorithmic models for emotion recognition have been proposed at home and abroad. In general, the emotion recognition process comprises four parts: data acquisition, data preprocessing, emotion feature extraction, and emotion classification, and each part can be implemented in different ways. Common methods for emotion classification currently include: support vector machines (SVM), decision trees, random forests, hidden Markov models, Bayesian networks, K-means, linear regression analysis, multi-layer perceptrons, multi-classifier fusion, and deep neural networks, among others. Seppänen et al. [1] extracted more than 40 dimensions of emotional acoustic features related to fundamental frequency, energy and duration, and obtained a recognition rate of 60% with a KNN classifier. Li et al. [2] extracted two timbre features, amplitude perturbation (shimmer) and frequency perturbation (jitter), as speech emotion recognition features on the SUSAS emotion database, and achieved 69.1% recognition accuracy by combining MFCC features with the shimmer and jitter features. Hu et al. [3] proposed a support vector machine method based on GMM supervectors for the speech emotion classification task and found that an SVM classifier based on GMM supervectors outperforms a plain SVM classifier. The authors of [4] proposed a speech emotion recognition method based on local feature optimization, which removes speech frames with insignificant emotional features through cluster analysis before classification; its average accuracy on the corpus improved by 5% to 17%.
At present, speech emotion recognition technology is relatively mature and the classification of common emotions has been largely accomplished, but the following shortcomings remain: (1) domestic and foreign research on speech emotion recognition mainly focuses on how to combine mainstream algorithmic models to improve the recognition rate; (2) much of the related research targets emotion recognition of a single speaker in a single sentence and does not consider the characteristics of spoken dialogue, making it difficult to recognize dialogue anomalies closely related to emotion, such as quarrels.
Disclosure of Invention
The invention provides a quarreling recognition method based on audio analysis and deep learning, aiming to overcome the technical defects that existing speech emotion recognition technology lacks research on abnormal interpersonal situations such as quarrels in real scenarios and is limited to single-speaker, single-sentence recognition.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a quarrel identification method based on audio analysis and deep learning comprises the following steps:
s1: collecting a voice sample matched with an application scene, wherein the voice sample comprises two types of data of quarrel and non-quarrel;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
Wherein, the step S2 specifically includes the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample;
s22: extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group;
s23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
In step S24, the data tag in the fixed-length feature sequence group is subjected to One-Hot Encoding conversion.
In step S24, the deep neural network specifically includes:
an input layer: the fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains neurons in one-to-one correspondence with the dimensions of the feature sequence, performs a preliminary extraction of the temporal information, and returns all output feature sequences;
bidirectional LSTM layer: a new unidirectional LSTM layer is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains a plurality of neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains a plurality of neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
The LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units; each unit contains a three-gate structure consisting of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained.
In the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases. First, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t.
The backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b. Fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
The LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter. The similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
Wherein, the step S25 specifically includes the following steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, finally yielding the emotion recognition model.
Wherein, the step S3 specifically includes the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample;
s32: extracting the Mel-frequency cepstral coefficients (MFCC) of the dialogue voice data set; in order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence;
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
s37: and importing the correlation coefficient data table into a support vector machine classifier to obtain a conversation correlation classification model.
Wherein, the step S4 specifically includes the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed lengths, extracting an MFCC coefficient and carrying out normalization processing to be used as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
Wherein, the step S5 specifically includes: inputting the voice features obtained in the step S43 into the emotion recognition model; the sigmoid output layer outputs the judgment of the voice emotion, namely angry/not angry, and the data are labeled accordingly; if the result is angry, step S6 is executed; otherwise, the voice is judged to be non-quarreling.
Wherein, the step S6 specifically includes: inputting the three correlation coefficients obtained in the step S44 into the conversation correlation classification model; the support vector machine model performs the prediction and judges the conversation correlation, where an output of 1 indicates that the scene is predicted to be conversation-correlated and an output of 0 indicates that it is predicted to be conversation-uncorrelated.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a quarrel recognition method based on audio analysis and deep learning, which is based on a deep learning model of a bidirectional LSTM and a self-attention mechanism, realizes the preliminary recognition of the emotion related to quarrel, extracts the voice characteristics of conversation correlation on the basis of emotion recognition, and finally realizes quarrel recognition through a conversation correlation classification model based on a support vector machine.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a deep neural network architecture;
FIG. 3 is a schematic diagram of the structure of an LSTM cell.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, the quarreling identification method based on audio analysis and deep learning includes the following steps:
s1: collecting voice samples matched with an application scene, wherein the voice samples comprise quarrel and non-quarrel data, and data balance is kept;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
In the specific implementation process, the quarreling identification method based on audio analysis and deep learning provided by the invention is based on a deep learning model of a bidirectional LSTM and a self-attention mechanism, realizes the initial identification of emotion related to quarreling, extracts the voice characteristics of conversation correlation on the basis of emotion identification, and finally realizes quarreling identification through a conversation correlation classification model based on a support vector machine.
Example 2
More specifically, on the basis of embodiment 1, the step S2 specifically includes the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample; the quarreling voice data set adopted in this embodiment is collected from quarrel segments of film and television works, with audio durations of about 10 s, which is long enough to reflect the voice characteristics of a quarrel while not being so long as to be difficult to process and generalize. After noise reduction and screening by listening, 430 audio clips with full emotional expression, low ambiguity and good sound quality were retained, sampled at 16 kHz with 16-bit quantization; 215 clips are quarreling and 215 are non-quarreling, meeting the requirement of data balance.
S22: and extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group, namely a Mel frequency cepstrum coefficient MFCC. MFCC is a commonly used speech feature that combines the auditory perception features of a human ear with the mechanism of speech signal generation. The present embodiment extracts 48-dimensional MFCC coefficients as audio features from the two data respectively by using the librosa toolkit in python, and performs normalization processing.
S23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
More specifically, in step S24, the data labels in the fixed-length feature sequence group are converted by One-Hot Encoding.
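As a rough illustration of the data preparation in steps S22 to S24 (MFCC extraction with librosa, per-frame z-score standardization, zero-padding to a fixed length, and one-hot encoding of the labels), the following Python sketch shows one possible implementation; the file names and label list are hypothetical, the per-frame standardization is one reading of step S22, and this is a minimal sketch under these assumptions, not the exact code of the embodiment.

```python
# Minimal sketch of steps S22-S24 data preparation: 48-dim MFCC extraction,
# per-frame z-score standardization, zero-padding to the maximum length,
# and one-hot encoding of the quarrel / non-quarrel labels.
import numpy as np
import librosa
from tensorflow.keras.utils import to_categorical

def extract_mfcc(path, sr=16000, n_mfcc=48):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (48, n_frames)
    return mfcc.T                                            # shape (n_frames, 48)

def zscore_per_frame(seq, eps=1e-8):
    mean = seq.mean(axis=1, keepdims=True)                   # standardize each frame (row)
    std = seq.std(axis=1, keepdims=True)
    return (seq - mean) / (std + eps)

def pad_to_max(sequences):
    max_len = max(s.shape[0] for s in sequences)
    padded = np.zeros((len(sequences), max_len, sequences[0].shape[1]))
    for i, s in enumerate(sequences):
        padded[i, :s.shape[0], :] = s                        # zero-fill the tail
    return padded

files = ["quarrel_001.wav", "normal_001.wav"]                # hypothetical file names
labels = [1, 0]                                              # 1 = quarrel, 0 = non-quarrel
feats = [zscore_per_frame(extract_mfcc(f)) for f in files]
X = pad_to_max(feats)                                        # (n_samples, max_len, 48)
y = to_categorical(labels, num_classes=2)                    # one-hot labels for step S24
```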
More specifically, as shown in fig. 2, in step S24, the deep neural network specifically includes:
an input layer: the 48-dimensional fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains 48 neurons, performs a preliminary extraction of the temporal information, and returns all output feature sequences, which remain 48-dimensional;
bidirectional LSTM layer: a new unidirectional LSTM layer containing 48 neurons is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation, with dimension 96;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains 64 neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains 32 neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
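To make the architecture concrete, the following sketch shows one possible realization in TensorFlow/Keras, a framework assumed here for illustration since the patent does not specify one. Layer widths, dropout rates and the masking of zero-padded frames follow this embodiment, while the attention mechanism layer is written as a simple additive self-attention following the weight formulas described just below; it is an illustrative sketch, not the inventors' exact implementation, and the activation of the second fully-connected layer is an assumption.

```python
# Illustrative Keras sketch of the emotion recognition network of this embodiment.
# Assumes TensorFlow 2.x; the attention layer follows the additive form
# e_j = u·tanh(w·h_j + b), a = softmax(e), c = sum_j a_j·h_j described below.
import tensorflow as tf
from tensorflow.keras import layers, models

class AdditiveAttention(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True          # accept the mask coming from Masking/LSTM

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim, 1), initializer="glorot_uniform")

    def call(self, h):                                       # h: (batch, time, dim)
        scores = tf.tanh(tf.tensordot(h, self.w, axes=1) + self.b)
        scores = tf.tensordot(scores, self.u, axes=1)        # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)              # attention distribution over time
        # (for simplicity, the padding mask is not applied to the attention weights)
        return tf.reduce_sum(weights * h, axis=1)            # context vector (batch, dim)

    def compute_mask(self, inputs, mask=None):
        return None                                          # output is a single vector per sample

def build_emotion_model(max_len, n_features=48):
    inputs = layers.Input(shape=(max_len, n_features))
    x = layers.Masking(mask_value=0.0)(inputs)               # filter the zero-padded '0' values
    x = layers.LSTM(48, return_sequences=True)(x)            # unidirectional LSTM layer
    x = layers.Bidirectional(layers.LSTM(48, return_sequences=True))(x)   # 96-dim outputs
    x = AdditiveAttention()(x)                               # attention mechanism layer
    x = layers.Dense(64, activation="relu")(x)               # first fully-connected layer
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(32, activation="relu")(x)               # second fully-connected layer (activation assumed)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(2, activation="sigmoid")(x)       # angry / not angry
    return models.Model(inputs, outputs)
```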
More specifically, as shown in fig. 3, the LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units; each unit contains a three-gate structure consisting of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained.
In the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases. First, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t.
The backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b. Fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
The LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter. The similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
More specifically, the step S25 includes the following specific steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, with the learning rate set to 0.0001, the batch size to 32 and the maximum number of epochs to 20, finally yielding the emotion recognition model.
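A corresponding training sketch, again assuming TensorFlow/Keras and scikit-learn and reusing `X`, `y` and `build_emotion_model` from the sketches above, might look as follows.

```python
# Training sketch for the emotion recognition model with the hyperparameters of
# this embodiment: 80/20 split, cross-entropy loss, Adam with learning rate
# 0.0001, batch size 32, at most 20 epochs.
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = build_emotion_model(max_len=X.shape[1])
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",        # cross-entropy over the two sigmoid outputs
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          batch_size=32, epochs=20)
```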
More specifically, the step S3 specifically includes the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample; the dialogue voice data set adopted in this embodiment is collected from dialogue segments of film and television works. For simplicity, this embodiment only considers conversations between two people. Letting A denote a turn in which speaker A talks and B a turn in which speaker B talks, every clipped segment follows at least an ABAB pattern to ensure correlation, the numbers of A and B turns in a segment are even, and the total number of turns ranges from 4 to 12. The sampling frequency is set to 44100 Hz with 16-bit quantization. The voice data of each dialogue segment are then split in temporal order so that each resulting short clip contains the voice of only one speaker. This embodiment finally obtains 10 quarreling dialogue scenes and 11 normal dialogue scenes, totaling 180 single-speaker audio clips.
S32: in this embodiment, the librosa toolkit in python is used to extract 20-dimensional MFCC coefficients from the dialogue voice data set as audio features, followed by normalization, so as to obtain each speaker's speech time series; the 20 MFCC coefficients of each column (frame) are averaged to represent the speech feature of that frame. In order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence (a sketch of this construction is given below);
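The construction of the two aligned sequences might be sketched as follows, assuming librosa; `segments` (the time-ordered single-speaker clips obtained in S31) and the file names are hypothetical, and the normalization step is omitted for brevity.

```python
# Sketch of step S32: build the two aligned scalar sequences alist and blist
# for one conversation scene. Each frame is represented by the mean of its 20
# MFCC coefficients; frames in which the other speaker talks are set to 0.
import numpy as np
import librosa

def frame_features(path, sr=44100, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (20, n_frames)
    return mfcc.mean(axis=0)                                 # one value per frame

def build_aligned_sequences(segments):
    alist, blist = [], []
    for speaker, path in segments:                           # e.g. A, B, A, B, ...
        feats = frame_features(path)
        if speaker == "A":
            alist.extend(feats)
            blist.extend([0.0] * len(feats))                 # B is silent in these frames
        else:
            alist.extend([0.0] * len(feats))
            blist.extend(feats)
    return np.array(alist), np.array(blist)

segments = [("A", "scene1_turn1.wav"), ("B", "scene1_turn2.wav"),
            ("A", "scene1_turn3.wav"), ("B", "scene1_turn4.wav")]   # hypothetical clips
alist, blist = build_aligned_sequences(segments)
```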
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
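Given alist and blist, the three coefficients of steps S33 to S35 can be obtained directly with scipy.stats (scipy's kendalltau defaults to the tie-corrected tau-b variant, which matches the formula above); a minimal sketch:

```python
# Sketch of steps S33-S35: the three correlation coefficients between the two
# aligned sequences. Each scene contributes one row [pearson, spearman, kendall]
# of the correlation coefficient data table.
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlation_features(alist, blist):
    pearson, _ = pearsonr(alist, blist)
    spearman, _ = spearmanr(alist, blist)
    kendall, _ = kendalltau(alist, blist)
    return [pearson, spearman, kendall]

row = correlation_features(alist, blist)
```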
s37: importing the correlation coefficient data table into a support vector machine classifier to obtain the conversation correlation classification model; in this embodiment, the support vector machine classifier is trained with a training ratio train_size of 0.6, the kernel function kernel is set to 'rbf' (a Gaussian kernel) with a gamma value of 20 and a penalty coefficient C of 0.8, and the decision function decision_function_shape is 'ovr', i.e. each class is separated from the other classes.
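A corresponding scikit-learn sketch of step S37, using the parameters of this embodiment, is shown below; `X_corr` and `y_corr` (the rows of the correlation coefficient data table and their labels) are assumed to have been prepared as above.

```python
# Sketch of step S37: training the conversation correlation classifier with
# scikit-learn. X_corr holds one row of Pearson/Spearman/Kendall coefficients
# per scene; y_corr is 1 for conversation-correlated scenes and 0 otherwise.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_tr, X_te, y_tr, y_te = train_test_split(X_corr, y_corr, train_size=0.6, random_state=42)

svm = SVC(kernel="rbf", gamma=20, C=0.8, decision_function_shape="ovr")
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```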
More specifically, the step S4 specifically includes the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed length of 10s, extracting 48-dimensional MFCC coefficients and performing normalization processing to serve as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
More specifically, the step S5 specifically includes: inputting the voice features obtained in the step S43 into the emotion recognition model, with input shape time_steps × 48, where time_steps is the number of audio frames and 48 is the number of features; the sigmoid output layer outputs the judgment of the voice emotion, namely angry/not angry, and the data are labeled accordingly; if the result is angry, step S6 is executed; otherwise, the voice is judged to be non-quarreling.
More specifically, the step S6 specifically includes: inputting the three correlation coefficients obtained in the step S44 into the conversation correlation classification model; the support vector machine model performs the prediction and judges the conversation correlation, where an output of 1 indicates that the scene is predicted to be conversation-correlated and an output of 0 indicates that it is predicted to be conversation-uncorrelated.
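Putting steps S5 and S6 together, the two-stage decision could be sketched as follows; `model`, `svm`, `segment_features` and `correlation_row` are the hypothetical objects and inputs from the sketches above, and the index of the "angry" class depends on how the one-hot labels were built.

```python
# Sketch of the two-stage decision of steps S5-S6: emotion screening first,
# then the conversation-correlation check. segment_features is a time_steps x 48
# MFCC array for one 10 s segment; correlation_row holds the three coefficients
# from step S44.
import numpy as np

def is_quarrel(segment_features, correlation_row, model, svm):
    # Step S5: emotion recognition on the segment
    probs = model.predict(segment_features[np.newaxis, ...])[0]
    angry = np.argmax(probs) == 1        # label convention of the earlier sketch: 1 = quarrel/angry
    if not angry:
        return False                     # non-quarreling voice
    # Step S6: conversation-correlation check on the three correlation coefficients
    return bool(svm.predict([correlation_row])[0] == 1)
```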
In the specific implementation process, the embodiment of the invention successfully constructs a quarreling speech recognition framework and obtains ideal classification accuracy by applying a bidirectional long-short term memory network, an attention mechanism and correlation analysis based on emotion recognition analysis and conversation correlation analysis.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
[1] Eero Väyrynen, Juhani Toivanen, Tapio Seppänen. Classification of emotion in spoken Finnish using vowel-length segments: Increasing reliability with a fusion technique [J]. Speech Communication, 2010, 53(3).
[2] Xi Li, Jidong Tao, Johnson M.T., Soltis J., Savage A., Leong K.M., Newman J.D. Stress and Emotion Classification using Jitter and Shimmer Features [P]. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007.
[3] Hao Hu, Ming-Xing Xu, Wei Wu. GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition [P]. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007.
[4] Speech emotion recognition based on local feature optimization [J]. Journal of University of Chinese Academy of Sciences, 2017, 34(04): 431-438.

Claims (10)

1. The quarreling identification method based on audio analysis and deep learning is characterized by comprising the following steps of:
s1: collecting a voice sample matched with an application scene, wherein the voice sample comprises two types of data of quarrel and non-quarrel;
s2: making a quarreling voice data set for emotion recognition according to a voice sample, and constructing a deep neural network for training to obtain an emotion recognition model;
s3: making a dialogue voice data set for correlation analysis according to the voice sample to obtain a correlation coefficient data table, and constructing a support vector machine classifier for training to obtain a dialogue correlation classification model;
s4: collecting audio to be detected, preprocessing the audio to be detected aiming at an emotion recognition model and a conversation correlation classification model, and extracting voice characteristics and correlation coefficients;
s5: inputting the voice characteristics into an emotion recognition model, and judging whether the voice emotion accords with the characteristics of quarrel; if yes, go to step S6; otherwise, judging the voice is not quarreling;
s6: inputting the correlation coefficient into a conversation correlation classification model, and judging whether the voice is in a conversation state; if yes, the voice is judged to be quarreling, otherwise, the voice is judged to be not quarreling.
2. The quarreling identification method based on audio analysis and deep learning of claim 1, wherein the step S2 specifically comprises the following steps:
s21: making a quarreling voice data set for emotion recognition according to the voice sample;
s22: extracting voice features of the quarreling voice data set, and respectively carrying out z-score standardization on each frame to obtain a voice feature sequence group;
s23: zero filling is carried out on other sequences according to the maximum sequence length of the voice characteristic sequence group, the non-fixed length sequence is converted into the fixed length sequence, and the fixed length characteristic sequence group is obtained;
s24: constructing a deep neural network based on a long-time and short-time memory network LSTM and an attention mechanism according to the fixed-length feature sequence group;
s25: and training the deep neural network to obtain an emotion recognition model.
3. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein in said step S24, the data tags of said fixed-length feature sequence group are transformed by One-Hot Encoding.
4. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein in the step S24, the deep neural network specifically comprises:
an input layer: the fixed-length feature sequence group obtained in step S23 is input, and the '0' values are filtered out (masked);
unidirectional LSTM layer: contains neurons in one-to-one correspondence with the dimensions of the feature sequence, performs a preliminary extraction of the temporal information, and returns all output feature sequences;
bidirectional LSTM layer: a new unidirectional LSTM layer is connected, corresponding one-to-one to the outputs of the previous unidirectional LSTM layer, for further temporal modeling; all output feature sequences are returned through bidirectional propagation;
attention mechanism layer: finds the dependency relationships of the sequence across the time steps output by the bidirectional LSTM layer;
first fully-connected layer: contains a plurality of neurons and performs a fully-connected operation, with output:
F = f(W·t + b)
where W is the weight of the fully-connected layer, t is the input of the fully-connected layer, b is the bias of the fully-connected layer, and f(·) is the ReLU activation function, defined as f(x) = max(0, x), with x the argument of the function;
first Dropout layer: connected after the first fully-connected layer with a dropout rate of 0.3, in order to avoid overfitting;
second fully-connected layer: contains a plurality of neurons;
second Dropout layer: connected after the second fully-connected layer with a dropout rate of 0.2;
an output layer: the number of neurons in this layer is 2, and the features are mapped to the two target categories, namely angry/not angry, through a sigmoid regression; the sigmoid function takes values between 0 and 1 and has the form:
f(x) = 1/(1 + e^(-x))
where x is the argument of the function.
5. The quarreling identification method based on audio analysis and deep learning of claim 4, wherein the LSTM model, comprising the unidirectional LSTM layer and the bidirectional LSTM layer, is composed of a plurality of LSTM units, each unit containing a three-gate structure of a forget gate, an input gate and an output gate, so that the updating and bidirectional propagation of the state information are maintained;
in the bidirectional LSTM layer, the bidirectional propagation process includes a forward propagation process and a backward propagation process, where the forward propagation process is:
f_t = σ(W_f·[h_(t-1), e_t] + b_f)
i_t = σ(W_i·[h_(t-1), e_t] + b_i)
C̃_t = tanh(W_c·[h_(t-1), e_t] + b_c)
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o·[h_(t-1), e_t] + b_o)
h_t^f = o_t * tanh(C_t)
where h_(t-1) is the state information output by the previous forward LSTM unit; σ is the sigmoid function; W_f, W_i, W_c and W_o are connection weights; b_f, b_i, b_c and b_o are biases; first, the forget gate computes from the memory information C_(t-1) passed by the previous forward unit the probability f_t with which information is discarded; then the sigmoid in the input gate determines the information i_t that needs to be updated, and the tanh layer produces the candidate memory C̃_t to be added, from which the new memory C_t is obtained by updating; finally, the output gate determines the part of the state information o_t to be output, which is combined with tanh(C_t) to obtain the forward-propagation feature information h_t^f of the t-th frame e_t;
the backward propagation process is the same as the forward propagation process, except that the sequence information is propagated from the end of the sentence toward the beginning, yielding the backward-propagation feature information h_t^b; fusing the forward-propagation feature information h_t^f and the backward-propagation feature information h_t^b gives the complete feature information h_t of the t-th frame e_t, calculated as:
h_t = [h_t^f, h_t^b]
the LSTM model output is then fed into the attention mechanism, whose weights are calculated as:
e_tj = u·tanh(w·h_j + b)
a_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j a_tj·h_j
where h_j represents the hidden state information in the attention-mechanism Encoder, u and w are weight parameters, and b is a bias parameter; the similarity e_tj, obtained via a dot product with h_j, represents the contribution of each Encoder hidden state to the current output s_t, where s_t is the output of each Encoder step in the attention mechanism layer; the softmax function then converts the similarities e_tj into the probability distribution a_tj; finally, the attention vector c_t is obtained as the weighted sum of the Encoder hidden states h_j with the probabilities a_tj.
6. The method for identifying quarrel based on audio analysis and deep learning of claim 2, wherein said step S25 comprises the following steps:
randomly extracting 80% of the voice sample data to form a training set, with the remaining 20% serving as a test set, and training the deep neural network; during training, the loss is calculated with a cross-entropy loss function and the parameters of the deep neural network are dynamically updated with the adaptive moment estimation (Adam) algorithm, finally yielding the emotion recognition model.
7. The quarreling identification method according to claim 4, wherein the step S3 specifically comprises the following steps:
s31: preparing a dialogue voice data set for correlation analysis according to the voice sample;
s32: extracting the Mel-frequency cepstral coefficients (MFCC) of the dialogue voice data set; in order to make the time series of the two speakers correspond one to one, for each frame of audio in a dialogue scene, when A speaks, the value of A's sequence alist at that frame is represented by the MFCC of that frame, and when A's counterpart B speaks, the value of alist at that moment is set to 0; B's sequence blist is built in the same way from the frames in which B speaks, yielding two sequences alist and blist of the same dimension in one-to-one correspondence;
s33: extracting the Pearson correlation coefficient, specifically:
the Pearson correlation coefficient between the two sequences alist and blist is calculated as:
ρ_(X,Y) = cov(X, Y) / (σ_X·σ_Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X·σ_Y)
where X and Y denote the two variables, cov(X, Y) denotes the covariance, σ_X and σ_Y denote the standard deviations, μ_X denotes the expectation of X and μ_Y denotes the expectation of Y;
s34: extracting the Spearman correlation coefficient, specifically:
the Spearman correlation coefficient between the two sequences alist and blist is calculated as:
ρ_s = 1 - (6·Σ_(i=1..n) d_i²) / (n·(n² - 1))
where n is the number of elements; letting X_i and Y_i (1 ≤ i ≤ n) denote the i-th values of the two random variables, X and Y are sorted to obtain their rank sets x and y, in which the elements x_i and y_i are the ranks of X_i and Y_i in x and y, and d_i = x_i - y_i (1 ≤ i ≤ n);
S35: extracting the Kendall correlation coefficient, specifically:
the Kendall correlation coefficient between the two sequences alist and blist is calculated as:
τ = (C - D) / sqrt((N_3 - N_1)(N_3 - N_2))
N_1 = Σ_(i=1..s) U_i(U_i - 1)/2, N_2 = Σ_(i=1..t) V_i(V_i - 1)/2, N_3 = n(n - 1)/2
where C denotes the number of concordant element pairs between the variables X and Y; D denotes the number of discordant element pairs between X and Y; s denotes the number of tied groups (small sets of equal values) in X and U_i the number of elements contained in the i-th such group; t denotes the number of tied groups in Y and V_i the number of elements contained in the i-th such group;
s36: the Pearson correlation coefficient, the Spearman correlation coefficient and the Kendall correlation coefficient between the two sequences alist and blist in a scene are used as feature values to represent whether the scene is conversation-correlated or not, and the three correlation coefficients are summarized to obtain the correlation coefficient data table;
s37: and importing the correlation coefficient data table into a support vector machine classifier to obtain a conversation correlation classification model.
8. The method for identifying quarrel based on audio analysis and deep learning of claim 7, wherein said step S4 comprises the following steps:
s41: collecting audio to be detected;
s42: the noise reduction processing is carried out on the audio sequence to be detected, so that the noise influence of the recording equipment and the environmental noise on the voice is reduced;
s43: dividing the audio sequence subjected to noise reduction into voice sections with fixed lengths, extracting an MFCC coefficient and carrying out normalization processing to be used as input of an emotion recognition model;
s44: using endpoint detection and speaker recognition techniques, the noise-reduced audio sequence is cut at each position where the speaker changes, yielding in sequence a number of single-speaker audio segments; the MFCC coefficients are extracted in the same way, reduced in dimension and averaged, and zero-filling is applied to obtain the sequences of the different speakers; the three correlation coefficients (Pearson, Spearman and Kendall) between these sequences are then obtained as the input of the conversation correlation classification model.
9. The method for identifying quarrel based on audio analysis and deep learning of claim 8, wherein the step S5 specifically comprises: inputting the voice features obtained in step S43 into the emotion recognition model; the sigmoid output layer outputs the judgment of the voice emotion, namely angry or not angry, and the data are labelled accordingly; if the judgment is angry, step S6 is executed; otherwise the voice is judged to be non-quarrel voice.
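A minimal sketch of the S5 decision, assuming `emotion_model` is a trained network (for example a tf.keras model) whose sigmoid output layer returns a per-segment probability of anger; the 0.5 threshold and the "any angry segment" rule are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def is_angry(emotion_model, segment_features, threshold=0.5):
    """S5: sigmoid outputs lie in [0, 1]; flag the audio as angry if any
    fixed-length segment exceeds the threshold, otherwise treat it as
    non-quarrel voice."""
    probs = np.asarray(emotion_model.predict(segment_features)).ravel()
    return bool(np.any(probs >= threshold))
```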
10. The method for identifying quarrel based on audio analysis and deep learning of claim 9, wherein the step S6 specifically comprises: inputting the three correlation coefficients obtained in step S44 into the conversation correlation classification model, i.e. the support vector machine model, for prediction, and judging the conversation correlation: an output of 1 indicates that the prediction is conversation-correlated, and an output of 0 indicates that the prediction is not conversation-correlated.
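Combining the two decisions of claims 9 and 10, the following sketch shows the overall judgment: audio is flagged as a quarrel only when the emotion model reports anger (S5) and the SVM conversation-correlation model predicts 1 (S6). `is_angry` and `train_correlation_model` refer to the illustrative sketches above, not to the patented implementation.

```python
def detect_quarrel(emotion_model, correlation_model, segment_features, corr_features):
    """S5 + S6: anger alone is not enough; the speakers' sequences must also
    be predicted as dialogue-correlated (output 1) for a quarrel verdict."""
    if not is_angry(emotion_model, segment_features):          # S5 gate
        return False
    label = correlation_model.predict([corr_features])[0]      # S6: 1 = correlated
    return bool(label == 1)
```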
CN201911206603.6A 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning Active CN110956953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911206603.6A CN110956953B (en) 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning

Publications (2)

Publication Number Publication Date
CN110956953A (en) 2020-04-03
CN110956953B CN110956953B (en) 2023-03-10

Family

ID=69979200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911206603.6A Active CN110956953B (en) 2019-11-29 2019-11-29 Quarrel recognition method based on audio analysis and deep learning

Country Status (1)

Country Link
CN (1) CN110956953B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197330A1 (en) * 2010-06-07 2019-06-27 Affectiva, Inc. Cognitive state based vehicle manipulation using near-infrared image processing
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109256150A (en) * 2018-10-12 2019-01-22 北京创景咨询有限公司 Speech emotion recognition system and method based on machine learning
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950275A (en) * 2020-08-06 2020-11-17 平安科技(深圳)有限公司 Emotion recognition method and device based on recurrent neural network and storage medium
CN113158720B (en) * 2020-12-15 2024-06-18 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism
CN113158720A (en) * 2020-12-15 2021-07-23 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism
CN112634947B (en) * 2020-12-18 2023-03-14 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN112634947A (en) * 2020-12-18 2021-04-09 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113037311A (en) * 2021-02-24 2021-06-25 重庆工程职业技术学院 Bracelet for indoor positioning and navigation
CN113033336A (en) * 2021-03-08 2021-06-25 北京金山云网络技术有限公司 Home device control method, apparatus, device and computer readable storage medium
CN113096649B (en) * 2021-03-31 2023-12-22 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113096649A (en) * 2021-03-31 2021-07-09 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113241060A (en) * 2021-07-09 2021-08-10 明品云(北京)数据科技有限公司 Security early warning method and system
WO2023019612A1 (en) * 2021-08-16 2023-02-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states
US11810596B2 (en) 2021-08-16 2023-11-07 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states
CN113744724A (en) * 2021-08-30 2021-12-03 苏州浪潮智能科技有限公司 Voice conversion method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110956953B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
CN110956953B (en) Quarrel recognition method based on audio analysis and deep learning
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Wang et al. Human emotion recognition by optimally fusing facial expression and speech feature
Bhangale et al. Survey of deep learning paradigms for speech processing
Oord et al. Parallel wavenet: Fast high-fidelity speech synthesis
CN109472031B (en) Aspect level emotion classification model and method based on double memory attention
Oord et al. Representation learning with contrastive predictive coding
Hsu et al. Unsupervised learning of disentangled and interpretable representations from sequential data
Cardona et al. Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs
Boukabous et al. Multimodal sentiment analysis using audio and text for crime detection
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Hamsa et al. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
Punithavathi et al. [Retracted] Empirical Investigation for Predicting Depression from Different Machine Learning Based Voice Recognition Techniques
CN112466284B (en) Mask voice identification method
Palo et al. Comparison of neural network models for speech emotion recognition
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Rammohan et al. Speech signal-based modelling of basic emotions to analyse compound emotion: Anxiety
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
Jain et al. Investigation Using MLP-SVM-PCA Classifiers on Speech Emotion Recognition
Karthikeyan et al. A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition
Wei et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Osman Emo-tts: Parallel transformer-based text-to-speech model with emotional awareness
Yuenyong et al. Real-Time Thai Speech Emotion Recognition With Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant