CN108682431B - Voice emotion recognition method in PAD three-dimensional emotion space - Google Patents

Voice emotion recognition method in PAD three-dimensional emotion space

Info

Publication number
CN108682431B
Authority
CN
China
Prior art keywords
emotion
value
pad
speech
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810438464.9A
Other languages
Chinese (zh)
Other versions
CN108682431A (en)
Inventor
程艳芬
陈逸灵
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201810438464.9A priority Critical patent/CN108682431B/en
Publication of CN108682431A publication Critical patent/CN108682431A/en
Application granted granted Critical
Publication of CN108682431B publication Critical patent/CN108682431B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a speech emotion recognition method in a PAD three-dimensional emotion space. The method selects the PAD three-dimensional emotion model, based on dimensional emotion theory, as the form of the recognition result. The Mel frequency cepstrum coefficient, the time ignition sequence and the ignition position information features are each used independently to predict the PAD value of the speech emotion; correlation analysis is then performed separately in the three dimensions P (pleasure degree), A (activation degree) and D (dominance degree), the weight coefficients of the three features are calculated, and the final predicted value of the speech emotion in the PAD three-dimensional emotion space is obtained by weighted fusion. Experiments show that the method locates the emotional state of speech more finely in the emotion space, pays more attention to the expression and embodiment of the internal components of emotion, and reflects the polarity and degree of emotional expression more appropriately, thereby revealing the mixed emotional content in emotional speech.

Description

Voice emotion recognition method in PAD three-dimensional emotion space
Technical Field
The invention belongs to the field of speech emotion recognition, relates to a speech emotion recognition method, and particularly relates to a speech emotion recognition method in a PAD three-dimensional emotion space.
Background
In the field of speech emotion recognition, the commonly used cepstral features include the Mel-Frequency Cepstral Coefficients (MFCC). The MFCC is a spectral feature computed from the nonlinear mapping between the Mel frequency scale and the Hz frequency scale; because the Mel scale emphasizes the low-frequency details of a speech signal, the MFCC highlights the useful information in the signal and reduces the interference of environmental noise, so it can identify speech emotion effectively. The spectrogram is popular with speech researchers because it visually presents the frequency distribution of a speech signal in each time period. Feeding the spectrogram into a Pulse Coupled Neural Network (PCNN) to extract a time ignition sequence and an entropy sequence for speech emotion recognition has been shown experimentally to be effective for recognizing neutral and happy emotions.
The above work, which recognizes speech emotion with the MFCC and distinguishes emotion with features obtained by processing the spectrogram with a PCNN, focuses on identifying emotion categories: the recognition result assigns the speech to one of a few discrete classes such as happy, sad or angry. In practice, however, it is more convenient for human-computer interaction to represent emotion as numerical values in a multidimensional space, because computers are much better at processing numbers.
Disclosure of Invention
In order to solve the above technical problem, the present invention proposes a method that fuses the PAD values (P: pleasure degree, A: activation degree, D: dominance degree) predicted separately from the MFCC, the neuron firing sequence and the neuron firing position information.
The technical scheme adopted by the invention is as follows: a speech emotion recognition method in a PAD three-dimensional emotion space is characterized by comprising the following steps:
step 1: extracting features of emotion voice data, including Mel frequency cepstrum coefficient MFCC, time ignition sequence and ignition position information;
step 2: applying the Mel frequency cepstrum coefficient MFCC, the time ignition sequence and the ignition position information each independently to an SVR algorithm to establish a speech emotion recognition model, and predicting a pleasure value P, an activation value A and a dominance value D of the emotional speech;
and step 3: calculating correlation coefficients of predicted values obtained by the three features in three dimensions of P, A, D by using a Pearson correlation analysis method, and determining feature weight;
and 4, step 4: and according to the weights of different characteristics, weighting and fusing to obtain a final PAD value of the emotion voice in the three-dimensional emotion space.
According to the method, the three speech emotion features, namely the Mel frequency cepstrum coefficient (MFCC), the time ignition sequence and the ignition position information, are first used independently to predict the PAD value of the speech emotion. Correlation analysis is then carried out on the prediction results in the three dimensions P (pleasure degree), A (activation degree) and D (dominance degree), the weight coefficients of the three features are calculated, and the final predicted value of the speech emotion in the PAD three-dimensional emotion space is obtained by weighted fusion.
As a new attempt at speech emotion recognition, the method provides a reference for future research in this field. Instead of using the discrete word labels (happy, sad, ordinary, etc.) adopted in previous research as the final recognition result, it predicts emotion as a coordinate value and maps it into the PAD three-dimensional emotion space. By calculating the distance between this coordinate value and the PAD values of the basic emotions, the constituent elements and their proportions in the emotional speech state can be further analysed, so that mixed emotion types such as "mild sadness" or "mild happiness" can be recognized. This breaks through the limitation of describing emotion types with discrete word labels, reflects the polarity and degree of emotional expression more appropriately, and is more convenient when a single emotion dimension has to be processed. Experimental results show that the proposed method distinguishes the basic speech emotion types well and focuses on the expression and embodiment of the internal components of emotion; the computation time is short, it would be faster still if implemented on a parallel machine or in hardware, and it is therefore suitable for real-time processing scenarios.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the three-dimensional emotional space distribution of the PAD according to the embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The CASIA Chinese emotion corpus selected in this embodiment is a discrete speech emotion database developed by the Institute of Automation, Chinese Academy of Sciences. In a completed research project, 346 college students rated the PAD values of 14 specific emotion categories with a revised, simplified Chinese version of the PAD emotion scale, yielding the values of these 14 emotions in P (pleasure degree), A (activation degree) and D (dominance degree); they include the 6 emotion types of the CASIA corpus. The speech emotion data in the CASIA corpus, whose P, A, D values are therefore known, can be used as the dimensional emotion speech data required by the experiment to verify the classification effect of the three speech emotion features and the effectiveness of their fusion in speech emotion recognition.
The operating environment of the embodiment is Matlab (R2014b), the operating system is Win10, and the computer is configured with an Intel Core i3-3217U CPU (1.8 GHz) and 8 GB of memory.
Referring to fig. 1, the method for recognizing speech emotion in PAD three-dimensional emotion space provided by the present invention includes the following steps:
step 1: extracting features of emotion voice data, including Mel frequency cepstrum coefficient MFCC, time ignition sequence and ignition position information;
step 1.1: and extracting the Mel frequency cepstrum coefficient MFCC.
To extract the MFCC feature parameters, the speech signal is first preprocessed by pre-emphasis and windowed framing, with a frame length of 256, a frame shift of 128 and a Hamming window as the window function. Pre-emphasis compensates for the attenuation of the high-frequency part caused by radiation of the speech signal through the lips and nostrils. Framing the original signal yields the speech sequence s(n) of each frame; a fast Fourier transform (FFT) of s(n) gives the spectrum S(k) of each frame, and taking the squared modulus gives the energy spectrum |S(k)|². The energy spectrum is passed through the Mel filter bank H_m(k) to output the parameters P_m (m = 0, 1, 2, ..., M-1), calculated as follows:
P_m = \sum_{k} |S(k)|^2 \, H_m(k), \quad m = 0, 1, 2, \ldots, M-1
where M is the number of filters, taken as 26 in this embodiment, and f_m denotes the center frequency of the m-th triangular filter.
Finally, the logarithm of the parameters P_m is taken and a discrete cosine transform (DCT) is applied, converting to the cepstral domain to obtain the Mel frequency cepstrum coefficients C_mel(k):
L_m = \ln(P_m), \quad m = 0, 1, 2, \ldots, M-1
C_{mel}(k) = \sum_{m=0}^{M-1} L_m \cos\!\left(\frac{\pi k (m + 0.5)}{M}\right), \quad k = 1, 2, \ldots, N
In the formula, N denotes the order of the Mel frequency cepstrum coefficients; in this embodiment, 12th-order MFCCs with their first-order difference are extracted as the feature parameters.
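As an illustration of this front end, the following Python sketch extracts 12th-order MFCCs plus first-order differences with a 26-filter Mel bank, 256-sample Hamming frames and a 128-sample shift. It is a minimal stand-in for the patent's Matlab implementation: librosa is assumed as the signal-processing library, its internal pre-emphasis and DCT conventions differ slightly from the formulas above, and the function name `extract_mfcc` is illustrative.

```python
import numpy as np
import librosa


def extract_mfcc(wav_path, n_mfcc=12, n_mels=26, frame_len=256, hop=128):
    """Minimal MFCC front end: pre-emphasis, 256-sample Hamming frames with
    128-sample shift, 26-filter Mel bank, 12 cepstral coefficients plus
    first-order differences."""
    y, sr = librosa.load(wav_path, sr=None)
    # Pre-emphasis compensates the high-frequency attenuation from lip/nostril radiation.
    y = librosa.effects.preemphasis(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=frame_len, hop_length=hop, window="hamming")
    # First-order difference (delta) coefficients, as used in the embodiment.
    delta = librosa.feature.delta(mfcc, order=1)
    return np.vstack([mfcc, delta])        # shape: (2 * n_mfcc, n_frames)
```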
Step 1.2: a temporal firing sequence is extracted.
First, the spectrogram is obtained: the speech signal is divided into mutually overlapping frames and windowed, a Hamming window being selected in this embodiment. The short-time spectrum of the signal is then estimated by FFT; with time n as the abscissa and frequency w as the ordinate, the intensity of any given frequency component at a given moment is represented by the gray level of the corresponding point, forming the spectrogram. In this embodiment a simplified PCNN neuron model is selected, with parameters set as follows:
Table 1. Parameter settings of the simplified PCNN neuron model
where α_F, α_L and α_θ denote the decay time constants of the feedback input F_ij, the linking input L_ij and the dynamic threshold θ_ij, respectively; V_F, V_L and V_θ denote the feedback, linking and threshold amplification factors of the PCNN, and β is the linking strength factor. These parameters are set from empirical values. The initial values of the linking input L, the internal activity U and the pulse output Y are set to 0, and the input is the normalized gray value, lying in [0, 1]. The radius of the linking field is r = 1.5, and the internal connection matrix W is a 3 × 3 square matrix in which each element is the inverse of the squared Euclidean distance from the central pixel to the corresponding surrounding pixel (r⁻²).
The spectrogram is input into a PCNN whose number of neurons equals the number of pixels of the spectrogram, and the network is iterated 50 times. The total ignition count of each iteration equals the number of neurons emitting a pulse in that iteration, and the time ignition sequence feature is extracted by exploiting the image segmentation capability of the PCNN.
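A compact Python sketch of this step follows: one neuron is assigned per spectrogram pixel, the network is iterated 50 times, and the per-iteration total ignition (firing) count forms the time ignition sequence. The parameter values follow Table 1 and claim 1 and are passed in as arguments; the discrete update rule shown here and the initial threshold value of 1 are assumptions (the commonly used simplified PCNN form), not quoted from the patent.

```python
import numpy as np
from scipy.signal import convolve2d


def connection_matrix():
    """3x3 internal connection matrix W: inverse squared Euclidean distance
    to the central pixel (the centre itself is set to 0)."""
    W = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            d2 = (i - 1) ** 2 + (j - 1) ** 2
            W[i, j] = 1.0 / d2 if d2 else 0.0
    return W


def pcnn_ignition_sequence(S, W, alpha_f, alpha_l, alpha_theta,
                           v_f, v_l, v_theta, beta, n_iter=50):
    """Run a simplified PCNN over the spectrogram S and return the time
    ignition sequence (total firings per iteration) together with the binary
    firing map of every iteration."""
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)      # normalized gray values in [0, 1]
    F = np.zeros_like(S)                                  # feedback input F_ij
    L = np.zeros_like(S)                                  # linking input L_ij
    Y = np.zeros_like(S)                                  # pulse output Y_ij
    theta = np.ones_like(S)                               # dynamic threshold (initial value assumed)
    fire_seq, fire_maps = [], []
    for _ in range(n_iter):
        K = convolve2d(Y, W, mode="same", boundary="fill")
        F = np.exp(-alpha_f) * F + v_f * K + S
        L = np.exp(-alpha_l) * L + v_l * K
        U = F * (1.0 + beta * L)                          # internal activity U_ij
        Y = (U > theta).astype(float)
        theta = np.exp(-alpha_theta) * theta + v_theta * Y
        fire_seq.append(int(Y.sum()))                     # neurons firing in this iteration
        fire_maps.append(Y.copy())
    return np.array(fire_seq), fire_maps
```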
Step 1.3: and extracting neuron ignition position information.
The ignition-neuron position distribution map obtained at each iteration is projected onto the time axis and the frequency axis respectively, and the two projected vectors are combined into one vector. Finally, the vectors obtained from the ignition position distribution maps at all instants are arranged column by column in time order to form a matrix, namely the speech emotion recognition feature matrix.
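Continuing the sketch above, the position-information feature can be assembled as follows; `fire_maps` is assumed to be the list of binary firing maps returned by `pcnn_ignition_sequence`, and summation along each axis is assumed as the projection operation.

```python
import numpy as np


def ignition_position_features(fire_maps):
    """Project each binary ignition map onto the time and frequency axes,
    concatenate the two projections into one vector, and stack the vectors
    column-wise (in time order) into the feature matrix."""
    columns = []
    for Y in fire_maps:                       # Y: (freq_bins, time_frames) binary map
        proj_time = Y.sum(axis=0)             # projection onto the time axis
        proj_freq = Y.sum(axis=1)             # projection onto the frequency axis
        columns.append(np.concatenate([proj_time, proj_freq]))
    return np.stack(columns, axis=1)          # one column per iteration / instant
```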
Step 2: applying the Mel frequency cepstrum coefficient MFCC, the time ignition sequence and the ignition position information each independently to an SVR algorithm (support vector machine regression) to establish a speech emotion recognition model, and predicting a pleasure value P, an activation value A and a dominance value D of the emotional speech (a sketch of this step is given below);
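A minimal sketch of this step with scikit-learn's SVR: for each feature type, one regressor is fitted per PAD dimension. The kernel and hyper-parameters are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np
from sklearn.svm import SVR


def train_pad_regressors(X, pad_targets):
    """Fit one SVR per PAD dimension for a single feature type.
    X: (n_samples, n_feature_dims); pad_targets: (n_samples, 3) with columns P, A, D."""
    models = []
    for d in range(3):                         # P, A, D in turn
        reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
        reg.fit(X, pad_targets[:, d])
        models.append(reg)
    return models


def predict_pad(models, X):
    """Predicted (n_samples, 3) P, A, D values for one feature type."""
    return np.column_stack([m.predict(X) for m in models])
```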
and step 3: calculating correlation coefficients of predicted values obtained by the three features in three dimensions of P, A, D by using a Pearson correlation analysis method, and determining feature weight;
the correlation coefficient calculation formula is as follows:
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \sigma_Y}
where X denotes the P, A or D values predicted by one of the features in each calculation, Y denotes the corresponding P, A or D values of the emotional speech in the emotion scale, μ_X and μ_Y denote the means of the variables X and Y, and σ_X and σ_Y denote their standard deviations; ρ_{X,Y} is taken as the feature weight (a sketch of this step is given below);
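The following sketch computes the weights for one dimension (for example P): the Pearson correlation of each feature's predictions with the PAD-scale reference values is evaluated with the formula above and then normalized so that the three weights sum to 1. Sum-normalization of the raw coefficients is the reading assumed here.

```python
import numpy as np


def feature_weights(pred_by_feature, scale_values):
    """Pearson correlation of each feature's predictions with the scale
    reference values, normalized so that the weights sum to 1."""
    scale_values = np.asarray(scale_values, dtype=float)
    rho = []
    for X in pred_by_feature:                             # three arrays of predicted values
        X = np.asarray(X, dtype=float)
        cov = np.mean((X - X.mean()) * (scale_values - scale_values.mean()))
        rho.append(cov / (X.std() * scale_values.std() + 1e-12))
    rho = np.array(rho)
    return rho / rho.sum()                                # lambda_1 + lambda_2 + lambda_3 = 1
```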
and 4, step 4: and according to the weights of different characteristics, weighting and fusing to obtain a final PAD value of the emotion voice in the three-dimensional emotion space.
The final PAD value of the emotional speech in the three-dimensional emotional space is as follows:
P = P_1 λ_1 + P_2 λ_2 + P_3 λ_3
where P_1, P_2 and P_3 denote, in turn, the predicted values of the speech in the P (pleasure degree) dimension obtained with the Mel frequency cepstrum coefficient, the ignition time sequence and the ignition position information; λ_1, λ_2 and λ_3 denote, in turn, the normalized correlation coefficients of the three speech emotion features with the speech emotion category in the P (pleasure degree) dimension, satisfying λ_1 + λ_2 + λ_3 = 1. The A and D values are fused in the same way with their respective weights.
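Putting the previous two sketches together, the fusion of one dimension looks like the following; `P1`, `P2`, `P3` are the per-feature predictions and `scale_P` the reference values used to derive the weights (all names illustrative).

```python
def fuse_dimension(p1, p2, p3, lam):
    """Weighted fusion of one PAD dimension: P = P1*l1 + P2*l2 + P3*l3."""
    return p1 * lam[0] + p2 * lam[1] + p3 * lam[2]


# Usage sketch (names illustrative):
# lam = feature_weights([P1, P2, P3], scale_P)
# P_final = fuse_dimension(P1, P2, P3, lam)
```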
In this embodiment, root-mean-square error (RMSE) is used as an evaluation index for identifying a basic emotion type, and the calculation method is as follows:
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(X_{\mathrm{obs},i} - X_{\mathrm{model},i}\big)^2}
where X_obs,i denotes the predicted value of the experiment, X_model,i denotes the PAD scale reference value, and n is the number of samples.
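For completeness, the evaluation metric can be computed as below; the function name is illustrative.

```python
import numpy as np


def rmse(predicted, reference):
    """Root-mean-square error between the experimental predictions X_obs
    and the PAD-scale reference values X_model."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))
```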
The RMSE of the P, A, D values predicted by the invention is calculated for each dimension and compared with the RMSE of the P, A, D values predicted using the MFCC alone. The results are shown in Table 2, where the RMSE of each dimension is normalized to between 0 and 1; the smaller the RMSE, the better the performance of the corresponding method.
Table 2. RMSE comparison between the proposed method and MFCC-only prediction of the PAD values
The experiments show that samples of the same emotion type are concentrated near the coordinate point of the PAD reference value of the test sample, while samples of different emotion types are scattered; together with the RMSE comparison in Table 2, this demonstrates that the method can effectively distinguish speech emotion types. The PAD reference values of the 6 emotional speech signals used in the experiment are shown in Table 3, and Table 4 gives the normalized correlation coefficients of the three features of one speech sample in each dimension during recognition; these are used for the weighted fusion that yields the final predicted PAD value.
Table 3. PAD scale reference values of the 6 emotion categories
Table 4. Correlation coefficient results of the three speech emotion features
Fig. 2 shows the distribution of the experimental samples mapped into the PAD three-dimensional emotion space after recognition by the method, and Table 5 summarizes the corresponding numerical ranges of the distribution.
Table 5. Distribution ranges of the final predicted PAD values of the 6 emotions
Because the recognition result is not represented by traditional discrete emotion label words, the method can calculate the distance between the recognition result and the PAD value of each basic emotion in the emotion coordinate system and further analyse the constituent elements and their proportions in the emotional speech state, so that mixed emotion types such as "mild sadness" or "mild happiness" can be recognized.
The invention displays the recognition result in the PAD three-dimensional emotion space based on the continuous dimension theory. The intuitive mapping clearly presents the differences and connections among the various emotional states, describes the psychological activity behind several basic emotion types, and reflects the subtle and changeable emotional states of human beings. Experiments show that the method can reveal the mixed emotional content in emotional speech through accurate emotion coordinate values and thus complete the speech emotion recognition task more finely.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A speech emotion recognition method in a PAD three-dimensional emotion space is characterized by comprising the following steps:
step 1: extracting features of emotion voice data, including Mel frequency cepstrum coefficient MFCC, time ignition sequence and ignition position information;
the neuron ignition position information is extracted by projecting the ignition neuron position distribution diagram obtained each time to a time axis and a frequency axis respectively and then combining two projected vectors into one vector; finally, arranging vectors obtained by the ignition position distribution diagram at each moment into a plurality of columns according to time to form a matrix, namely a speech emotion recognition characteristic matrix;
wherein a simplified PCNN neuron model is adopted for extracting the time ignition sequence and the ignition position information, and the parameters α_F, α_L, α_θ, V_F, V_L, V_θ and β take the values 0.1, 1.0, 0.5, 0.2, 20 and 0.1 respectively; α_F, α_L and α_θ denote the decay time constants of the feedback input F_ij, the linking input L_ij and the dynamic threshold θ_ij respectively; V_F, V_L and V_θ denote the feedback, linking and threshold amplification factors of the PCNN respectively, and β is the linking strength factor; the initial values of the linking input L, the internal activity U and the pulse output Y are set to 0, and the input is the normalized gray value, lying in [0, 1]; the radius of the linking field is r = 1.5, and the internal connection matrix W is a 3 × 3 square matrix in which each element is the inverse of the squared Euclidean distance from the central pixel to the corresponding surrounding pixel, r⁻²;
Step 2: applying the Mel frequency cepstrum coefficient MFCC, the time ignition sequence and the ignition position information each independently to an SVR algorithm to establish a speech emotion recognition model, and predicting a pleasure value P, an activation value A and a dominance value D of the emotional speech;
and step 3: calculating correlation coefficients of predicted values obtained by the three features in three dimensions of P, A, D by using a Pearson correlation analysis method, and determining feature weight;
and 4, step 4: and according to the weights of different characteristics, weighting and fusing to obtain a final PAD value of the emotion voice in the three-dimensional emotion space.
2. The method for speech emotion recognition in the PAD three-dimensional emotion space of claim 1, wherein: in the extraction of the Mel frequency cepstrum coefficient MFCC in step 1, the preprocessing uses a frame length of 256, a frame shift of 128 and a Hamming window as the window function; in the calculation, the preprocessed speech data is subjected to a fast Fourier transform and a squared modulus to obtain its energy spectrum, which is passed through a Mel filter bank to output the parameters P_m (m = 0, 1, 2, ..., M-1), calculated by:
P_m = \sum_{k} |S(k)|^2 \, H_m(k), \quad m = 0, 1, 2, \ldots, M-1
where m = 0, 1, 2, ..., M-1, the number of filters M is taken as 26, f_m denotes the center frequency of the triangular filter, H_m(k) denotes the frequency response of the triangular filter, and S(k) denotes the spectrum obtained by the fast Fourier transform of the speech signal; finally, 12th-order first-order-difference MFCC feature parameters are extracted.
3. The method for speech emotion recognition in PAD three-dimensional emotion space of claim 1, wherein: in step 3, the correlation coefficients of the predicted values obtained by the three features in the three dimensions of P, A, D are:
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \sigma_Y}
where X denotes the P, A or D values predicted by one of the features in each calculation, Y denotes the corresponding P, A or D values of the emotional speech in the emotion scale, μ_X and μ_Y denote the means of the variables X and Y, and σ_X and σ_Y denote their standard deviations; ρ_{X,Y} is the calculated feature weight.
4. The method for speech emotion recognition in PAD three-dimensional emotion space according to any of claims 1-3, wherein: in step 4, the final PAD value of the emotion voice in the three-dimensional emotion space is as follows:
P = P_1 λ_1 + P_2 λ_2 + P_3 λ_3
where P_1, P_2 and P_3 denote, in turn, the predicted values of the speech in the P dimension obtained with the Mel frequency cepstrum coefficient, the ignition time sequence and the ignition position information; λ_1, λ_2 and λ_3 denote, in turn, the normalized correlation coefficients of the three speech emotion features with the speech emotion category in the P dimension, satisfying λ_1 + λ_2 + λ_3 = 1.
CN201810438464.9A 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space Expired - Fee Related CN108682431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810438464.9A CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810438464.9A CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Publications (2)

Publication Number Publication Date
CN108682431A CN108682431A (en) 2018-10-19
CN108682431B (en) 2021-08-03

Family

ID=63805990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810438464.9A Expired - Fee Related CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Country Status (1)

Country Link
CN (1) CN108682431B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296B (en) * 2018-10-30 2020-12-01 河北工业大学 Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN110555084B (en) * 2019-08-26 2023-01-24 电子科技大学 Remote supervision relation classification method based on PCNN and multi-layer attention
CN111402928B (en) * 2020-03-04 2022-06-14 华南理工大学 Attention-based speech emotion state evaluation method, device, medium and equipment
CN113749656B (en) * 2021-08-20 2023-12-26 杭州回车电子科技有限公司 Emotion recognition method and device based on multidimensional physiological signals
CN117933269B (en) * 2024-03-22 2024-06-18 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012168740A1 (en) * 2011-06-10 2012-12-13 X-System Limited Method and system for analysing sound
CN107633851A (en) * 2017-07-31 2018-01-26 中国科学院自动化研究所 Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012168740A1 (en) * 2011-06-10 2012-12-13 X-System Limited Method and system for analysing sound
CN107633851A (en) * 2017-07-31 2018-01-26 中国科学院自动化研究所 Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on the Application of the PAD Emotion Model in Emotional Speech Recognition; Song Jing; China Master's Theses Full-text Database, Information Science and Technology; 2016-08-15; page 36 paragraph 2, page 43 paragraph 5, page 46 paragraph 2, page 50 paragraph 2, Fig. 4-2 *
A New Speech Emotion Recognition Method Based on Pulse Coupled Neural Networks; Liang Ze, Ma Yide, Zhang Ensu, Zhu Wangfei, Tang Shusen; Journal of Computer Applications; 2008-03-31; Vol. 28, No. 3; page 1 column 2 paragraph 2, page 3 column 1 paragraph 5 *
Emotional Speech Conversion and Recognition Based on the PAD Three-dimensional Emotion Model; Zhou Hui; China Master's Theses Full-text Database, Information Science and Technology; 2010-06-15; page 10 paragraph 3, page 44 paragraph 3 *

Also Published As

Publication number Publication date
CN108682431A (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN108682431B (en) Voice emotion recognition method in PAD three-dimensional emotion space
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN113947127A (en) Multi-mode emotion recognition method and system for accompanying robot
Jayanthi et al. An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach
Zhou et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features
CN113807249A (en) Multi-mode feature fusion based emotion recognition method, system, device and medium
Chen et al. Mandarin emotion recognition combining acoustic and emotional point information
Lei et al. BAT: Block and token self-attention for speech emotion recognition
Huijuan et al. Coarse-to-fine speech emotion recognition based on multi-task learning
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN115273904A (en) Angry emotion recognition method and device based on multi-feature fusion
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Oo et al. Fusion of Log-Mel Spectrogram and GLCM feature in acoustic scene classification
CN117058597B (en) Dimension emotion recognition method, system, equipment and medium based on audio and video
Garg et al. RETRACTED: Urban Sound Classification Using Convolutional Neural Network Model
CN113160823B (en) Voice awakening method and device based on impulse neural network and electronic equipment
Tailleur et al. Spectral trancoder: using pretrained urban sound classifiers on undersampled spectral representations
CN112861949B (en) Emotion prediction method and system based on face and sound
Jagadeeshwar et al. ASERNet: Automatic speech emotion recognition system using MFCC-based LPC approach with deep learning CNN
CN112735477B (en) Voice emotion analysis method and device
Mamutova et al. DEVELOPING A SPEECH EMOTION RECOGNITION SYSTEM USING CNN ENCODERS WITH ATTENTION FOCUS
Pathonsuwan et al. RS-MSConvNet: A novel end-to-end pathological voice detection model
Li et al. Attention based convolutional neural network with multi-frequency resolution feature for environment sound classification
Sushma et al. Emotion analysis using signal and image processing approach by implementing deep neural network

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210803)