CN108682431A - Speech emotion recognition method in a PAD three-dimensional emotion space - Google Patents

Speech emotion recognition method in a PAD three-dimensional emotion space

Info

Publication number
CN108682431A
CN108682431A
Authority
CN
China
Prior art keywords
emotional
speech
pad
value
dimensionals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810438464.9A
Other languages
Chinese (zh)
Other versions
CN108682431B (en)
Inventor
程艳芬
陈逸灵
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201810438464.9A priority Critical patent/CN108682431B/en
Publication of CN108682431A publication Critical patent/CN108682431A/en
Application granted granted Critical
Publication of CN108682431B publication Critical patent/CN108682431B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a speech emotion recognition method in a PAD three-dimensional emotion space. A PAD three-dimensional emotion model based on dimensional emotion theory is selected as the presentation form of the recognition result. Mel-frequency cepstral coefficients (MFCC), the time firing sequence, and the firing location information are each used separately to predict the PAD values of the speech emotion; correlation analysis is then carried out in the three dimensions P (pleasure), A (arousal), and D (dominance), the weight coefficients of the three features are computed, and weighted fusion yields the final predicted value of the speech emotion in the PAD three-dimensional emotion space. Experiments show that the method locates the emotional state of speech in the emotion space more finely, pays more attention to the expression and embodiment of the constituent components of the emotion, and reflects the polarity and intensity of the emotional expression more appropriately, thereby revealing the mixed emotional content in emotional speech.

Description

Speech emotion recognition method in a PAD three-dimensional emotion space
Technical field
The invention belongs to the field of speech emotion recognition and relates to a speech emotion recognition method, in particular to a speech emotion recognition method in a PAD three-dimensional emotion space.
Background technology
In the field of speech emotion recognition, the most common cepstral feature is the Mel-frequency cepstral coefficient (MFCC). MFCCs are spectral features computed from the Hz spectrum using the nonlinear correspondence between the Mel scale and the Hz scale. Because the Mel frequency scale emphasizes the low-frequency details of the speech signal, MFCCs can highlight the useful information in the signal, reduce the interference of background noise, and identify speech emotion effectively. However, the non-stationarity of emotional speech signals is particularly pronounced, and a direct FFT of the signal cannot capture this non-stationarity, so using MFCC alone for speech emotion recognition gives a relatively high false recognition rate. The spectrogram, which intuitively shows the frequency distribution of the speech signal in each time interval, is favored by speech researchers; feeding the spectrogram into a pulse coupled neural network (Pulse Coupled Neural Network, PCNN) and extracting the time firing sequence and entropy sequence for speech emotion recognition has been shown experimentally to be effective for recognizing the two emotion types calm and happy.
The above results of recognizing speech emotion with MFCC and of discriminating emotion with speech features obtained by processing the spectrogram with a PCNN show that the two approaches have different strengths in identifying emotion types, and that speech emotion recognition results classify emotion into only a few categories, such as happy, sad, and angry. In practice, however, expressing emotion as numerical values in a multidimensional space is more convenient for human-computer interaction, because computers are better at processing numbers.
Summary of the invention
To solve the above technical problem, the present invention proposes a method that fuses the PAD values (P: pleasure, A: arousal, D: dominance) predicted from MFCC, from the neuron time firing sequence, and from the neuron firing location information.
The technical solution adopted by the present invention is a speech emotion recognition method in a PAD three-dimensional emotion space, characterized by comprising the following steps:
Step 1: extract features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information;
Step 2: apply the SVR algorithm separately to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information to build speech emotion recognition models and predict the pleasure value P, arousal value A, and dominance value D of the emotional speech;
Step 3: use Pearson correlation analysis to compute the correlation coefficients of the predicted values obtained from the three features in the three dimensions P, A, and D, and determine the feature weights;
Step 4: according to the weights of the different features, obtain the final PAD values of the emotional speech in the three-dimensional emotion space by weighted fusion.
The present invention first uses each of the three speech emotion features, MFCC (Mel-frequency cepstral coefficients), the time firing sequence, and the firing location information, separately to predict the PAD values of the speech emotion; it then performs correlation analysis in the three dimensions P (pleasure), A (arousal), and D (dominance) on the prediction results, computes the weight coefficients of the three features, and fuses them to obtain the final predicted value of the speech emotion in the PAD three-dimensional emotion space.
As a new attempt at speech emotion recognition, the present invention provides a reference for future research in this field. The final recognition result is not expressed with discrete word labels (happy, sad, calm, and so on) as in the studies above; instead, the emotion is predicted as coordinate values mapped into the PAD three-dimensional emotion space. By computing its distance from the PAD values of the basic emotions, the constituent elements of the emotional speech state and their proportions can be analyzed further, so that mixed emotion types such as "mild, leaning sad" or "mild, leaning happy" can be identified. This breaks the limitation of describing emotion types with discrete adjective labels, reflects the polarity and intensity of the emotional expression more appropriately, and makes it easier to handle problems concerning a single dimension of emotion. Experimental results show that, in addition to good discrimination of basic speech emotion types, the proposed method focuses on the expression and embodiment of the constituent components of the emotion; its computation time is short, and with a parallel machine or hardware implementation it would be faster and suited to real-time applications.
Description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the PAD three-dimensional emotion space distribution of the embodiment of the present invention.
Detailed description of the embodiments
To facilitate understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are intended only to illustrate and explain the present invention and are not intended to limit it.
The CASIA Chinese emotional corpus selected in this embodiment is a discrete speech emotion database developed by the Institute of Automation, Chinese Academy of Sciences. In a completed research project, the Institute of Psychology of the Chinese Academy of Sciences recruited 346 university students, who used a revised simplified Chinese version of the PAD emotion scale to rate 14 specific emotion categories, yielding the values of these 14 emotions on P (pleasure), A (arousal), and D (dominance). Six of these emotion types are contained in the CASIA corpus, so the speech emotion data in the CASIA corpus whose P, A, and D values are known can serve as the dimensional emotional speech data required for the experiments, and are used to verify the effectiveness of the three speech emotion features, individually and after fusion, in the speech emotion recognition process.
The running environment of this embodiment is Matlab (R2014b), the system environment is Win10, and the computer configuration is an Intel Core i3-3217U CPU (1.8 GHz) with 8 GB of memory.
Referring to Fig. 1, the speech emotion recognition method in a PAD three-dimensional emotion space provided by the present invention comprises the following steps:
Step 1: extract features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information.
Step 1.1: extract the Mel-frequency cepstral coefficients MFCC.
To extract the MFCC feature parameters, the speech signal is first pre-processed; pre-processing consists of pre-emphasis and windowed framing, with a frame length of 256 samples, a frame shift of 128 samples, and a Hamming window as the window function. Pre-emphasis compensates for the attenuation of the high-frequency components of the speech signal caused by lip and nostril radiation. Framing the original signal yields the sample sequence s(n) of each frame; a fast Fourier transform (FFT) is then applied to s(n) to obtain the spectrum S(n) of each frame, and its squared magnitude |S(n)|² gives the energy spectrum of the speech signal. |S(n)|² is passed through the Mel filter bank Hm(k) to obtain the output parameters Pm (m = 0, 1, 2, ..., M-1), computed as
Pm = Σk |S(k)|² Hm(k), m = 0, 1, 2, ..., M-1.
In the formula, M is the number of filters, taken as 26 in this embodiment, and fm denotes the center frequency of the m-th triangular filter Hm(k).
Finally, the logarithm of Pm is taken and a discrete cosine transform (DCT) is applied to transform the result into the cepstral domain, yielding the Mel-frequency cepstral coefficients Cmel(k):
Lm = ln(Pm), m = 0, 1, 2, ..., M-1,
Cmel(k) = Σm Lm cos(πk(m + 1/2)/M), k = 1, 2, ..., N.
In the formula, N denotes the order of the Mel cepstral coefficients; this embodiment extracts first-order difference MFCC feature parameters of order 12.
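For illustration only, the following is a minimal Python/NumPy sketch of this MFCC extraction step (the embodiment itself was implemented in Matlab); the use of librosa for the Mel filter bank, the pre-emphasis coefficient of 0.97, and the function name are assumptions rather than part of the original disclosure.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(y, sr, frame_len=256, hop=128, n_mels=26, n_ceps=12, pre_emph=0.97):
    # Pre-emphasis compensates high-frequency attenuation from lip and nostril radiation.
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])
    # Windowed framing: frame length 256, frame shift 128, Hamming window.
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T
    frames = frames * np.hamming(frame_len)
    # FFT and squared magnitude give the energy spectrum |S(k)|^2 of each frame.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    # Mel filter bank H_m(k) with M = 26 triangular filters; P_m = sum_k |S(k)|^2 H_m(k).
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    P = power @ mel_fb.T
    # Logarithm and DCT into the cepstral domain; keep the first 12 coefficients.
    L = np.log(P + 1e-10)
    C = dct(L, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # First-order difference (delta) MFCC parameters, as used in the embodiment.
    delta = np.vstack([np.zeros((1, n_ceps)), np.diff(C, axis=0)])
    return np.hstack([C, delta])
```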
Step 1.2: extract the time firing sequence.
The spectrogram is obtained first: the speech signal is divided into several overlapping frames and each frame is windowed, a Hamming window being used in this embodiment. The short-time spectrum of the signal is then estimated by FFT; with time n as the horizontal axis and frequency w as the vertical axis, the strength of any given frequency component at a given time is represented by the gray level of the corresponding point, which constitutes the spectrogram. This embodiment uses a simplified PCNN neuron model; the parameter settings and values are listed in Table 1 below.
Table 1 Parameter settings and values of the simplified PCNN neuron model
Parameter: αF   αL   αθ   VF   VL   Vθ   β
Value:     0.1  1.0  1.0  0.5  0.2  20   0.1
Here αF, αL, and αθ denote the decay time constants of the feedback input Fij, the linking input Lij, and the dynamic threshold θij respectively; VF, VL, and Vθ denote the feedback amplification coefficient, linking amplification coefficient, and threshold amplification coefficient of the PCNN respectively; β is the linking strength coefficient. These parameters are set based on empirical values. The initial values of the linking input L, the internal activity U, and the pulse output Y are set to 0; the input is the normalized gray value, lying in [0, 1]. The linking field radius is r = 1.5, and the internal linking matrix W is a 3 × 3 matrix in which each element is the reciprocal (r⁻²) of the squared Euclidean distance from the central pixel to the corresponding surrounding pixel.
The spectrogram is fed into a PCNN whose number of neurons equals the number of spectrogram pixels, and the network is iterated 50 times. The total number of firings in each iteration equals the number of neurons emitting a pulse; the time firing sequence feature is extracted through the image segmentation capability of the PCNN.
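The following is a minimal sketch of the simplified PCNN iteration described above, assuming the common form of the simplified PCNN update equations and a spectrogram already normalized to [0, 1]; the zero value at the centre of W and the zero initial threshold are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def pcnn_time_firing(S, n_iter=50,
                     aF=0.1, aL=1.0, aT=1.0, VF=0.5, VL=0.2, VT=20.0, beta=0.1):
    # 3x3 linking matrix W: reciprocal of the squared Euclidean distance to the centre.
    W = np.array([[0.5, 1.0, 0.5],
                  [1.0, 0.0, 1.0],
                  [0.5, 1.0, 0.5]])
    F = np.zeros_like(S, dtype=float); L = np.zeros_like(S, dtype=float)
    Y = np.zeros_like(S, dtype=float); theta = np.zeros_like(S, dtype=float)
    firing_seq, firing_maps = [], []
    for _ in range(n_iter):
        conv = convolve2d(Y, W, mode='same')
        F = np.exp(-aF) * F + VF * conv + S        # feedback input F_ij
        L = np.exp(-aL) * L + VL * conv            # linking input L_ij
        U = F * (1.0 + beta * L)                   # internal activity U_ij
        Y = (U > theta).astype(float)              # pulse output Y_ij
        theta = np.exp(-aT) * theta + VT * Y       # dynamic threshold theta_ij
        firing_seq.append(Y.sum())                 # total firings at this iteration
        firing_maps.append(Y.copy())               # firing location map (used in Step 1.3)
    return np.array(firing_seq), firing_maps
```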
Step 1.3: extract the neuron firing location information.
The firing neuron location map obtained at each iteration is projected onto the time axis and the frequency axis respectively, and the two projected vectors are merged into one vector. Finally, the vectors obtained from the firing location distribution maps of all iterations are arranged in time order as columns, and the resulting matrix is the speech emotion recognition feature matrix.
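A short sketch of this projection step, reusing the per-iteration firing maps returned by the PCNN sketch above; the function name and data layout are illustrative.

```python
import numpy as np

def firing_location_features(firing_maps):
    cols = []
    for Y in firing_maps:
        time_proj = Y.sum(axis=0)   # projection onto the time axis
        freq_proj = Y.sum(axis=1)   # projection onto the frequency axis
        cols.append(np.concatenate([time_proj, freq_proj]))
    # One column per iteration, in time order: the speech emotion feature matrix.
    return np.stack(cols, axis=1)
```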
Step 2: apply the support vector regression (SVR) algorithm separately to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information to build speech emotion recognition models, and predict the pleasure value P, arousal value A, and dominance value D of the emotional speech;
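A hedged sketch of step 2 using scikit-learn's SVR, training one regressor per PAD dimension for a given feature type; the RBF kernel and default hyperparameters are assumptions rather than settings stated in the embodiment.

```python
import numpy as np
from sklearn.svm import SVR

def train_pad_models(X_train, y_train_pad):
    """X_train: (n_samples, n_feat); y_train_pad: (n_samples, 3) with columns P, A, D."""
    return [SVR(kernel='rbf').fit(X_train, y_train_pad[:, d]) for d in range(3)]

def predict_pad(models, X_test):
    # Returns an (n_samples, 3) array of predicted P, A, D values.
    return np.column_stack([m.predict(X_test) for m in models])
```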
Step 3: use Pearson correlation analysis to compute the correlation coefficients of the predicted values obtained from the three features in the three dimensions P, A, and D, and determine the feature weights;
The correlation coefficient is computed as
ρX,Y = E[(X - μX)(Y - μY)] / (σX σY).
In the formula, in each computation X represents the P, A, or D values predicted from one feature and Y represents the corresponding P, A, or D values of the emotional speech in the emotion scale; μX and μY denote the means of the variables X and Y, and σX and σY denote their standard deviations. ρX,Y serves as the feature weight.
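A sketch of this correlation computation using scipy.stats.pearsonr, one coefficient per feature and per dimension; the array shapes are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_weights(preds_per_feature, pad_ref):
    """preds_per_feature: list of 3 arrays, each (n_samples, 3); pad_ref: (n_samples, 3)."""
    # rho[f, d] = Pearson correlation of feature f's predictions with the scale values in dimension d.
    rho = np.array([[pearsonr(pred[:, d], pad_ref[:, d])[0] for d in range(3)]
                    for pred in preds_per_feature])
    return rho  # shape (3 features, 3 dimensions)
```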
Step 4: according to the weights of the different features, obtain the final PAD values of the emotional speech in the three-dimensional emotion space by weighted fusion.
The final PAD values of the emotional speech in the three-dimensional emotion space are:
P = P1λ1 + P2λ2 + P3λ3,
with A and D obtained analogously. In the formula, P1, P2, and P3 denote, in order, the predicted values of the speech in the P (pleasure) dimension obtained from the Mel-frequency cepstral coefficients, the time firing sequence, and the firing location information. λ1, λ2, and λ3 denote, in order, the normalized correlation coefficients of the three speech emotion features with this speech emotion type in the P (pleasure) dimension, satisfying λ1 + λ2 + λ3 = 1.
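A sketch of the weighted fusion, assuming the weights are obtained by normalizing the (positive) correlation coefficients of each dimension so that they sum to 1, as stated above.

```python
import numpy as np

def fuse_pad(preds_per_feature, rho):
    # Normalize per dimension so that lambda_1 + lambda_2 + lambda_3 = 1 (assumes positive rho).
    lam = rho / rho.sum(axis=0, keepdims=True)        # (3 features, 3 dims)
    stacked = np.stack(preds_per_feature)             # (3 features, n_samples, 3 dims)
    # Final PAD[n, d] = sum_f stacked[f, n, d] * lam[f, d].
    return np.einsum('fnd,fd->nd', stacked, lam)
```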
This embodiment uses the root-mean-square error (RMSE) as the evaluation index for basic emotion type recognition, computed as
RMSE = sqrt((1/n) Σi (Xobs,i - Xmodel,i)²),
where Xobs,i denotes the experimental predicted value and Xmodel,i denotes the PAD scale reference value.
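A one-line sketch of this RMSE evaluation, computed per dimension.

```python
import numpy as np

def rmse(pred, ref):
    # One RMSE value for each of the P, A, D dimensions.
    return np.sqrt(np.mean((pred - ref) ** 2, axis=0))
```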
The RMSE of the P, A, and D values predicted by the present invention is computed and compared with that of the P, A, and D values predicted by MFCC alone; the results are shown in Table 2, where the RMSE of each dimension is normalized to between 0 and 1, and a smaller RMSE indicates better performance of the corresponding method.
Table 2 Comparison of the RMSE values of PAD prediction by the method herein and by MFCC alone
The experiments show that, within each set of samples of the same emotion type, the test samples are distributed near the coordinate point of the corresponding PAD reference value, whereas samples of different emotion types are distributed more dispersedly; the comparison of the computed RMSE values in Table 2 proves that the present invention can effectively classify speech emotion types. The PAD reference values of the six emotional speech signals used in the test are shown in Table 3, and Table 4 gives, for one of the speech samples in the recognition process of this method, the normalized correlation coefficients of the three features in each dimension, which are then used in the weighted fusion to obtain its final PAD prediction.
Table 3 PAD scale reference values of the 6 emotion types
Table 4 Correlation coefficient results of the three speech emotion features
Fig. 2 shows the distribution of the experimental samples mapped into the PAD three-dimensional emotion space after recognition by this method; Table 5 gives the corresponding numerical range statistics of this distribution.
Table 5 Ranges of the final predicted PAD values of the 6 emotion types
Since the recognition result of the present invention is not expressed with conventional discrete emotion label words, its advantage lies in that, by computing the distance between the result and the PAD values of the basic emotions in the emotion coordinate system, the constituent elements of the emotional speech state and their proportions can be analyzed further, so that mixed emotion types such as "mild, leaning sad" or "mild, leaning happy" can be identified.
The present invention presents the recognition result in a PAD three-dimensional emotion space based on continuous dimensional theory. The intuitive mapping clearly shows the differences and connections between emotional states, describes psychological activity in which several basic emotion types are mixed, and reflects the subtle and changeable emotional states of humans. Experiments show that, through precise emotion coordinate values, the present invention can reveal the mixed emotional content in emotional speech and accomplish the speech emotion recognition task in a finer-grained way.
It should be understood that the parts of this specification that are not elaborated belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and therefore should not be regarded as limiting the scope of patent protection of the present invention. Those skilled in the art, under the inspiration of the present invention and without departing from the scope protected by the claims of the present invention, may also make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the present invention shall be determined by the appended claims.

Claims (5)

1. A speech emotion recognition method in a PAD three-dimensional emotion space, characterized by comprising the following steps:
Step 1: extract features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information;
Step 2: apply the SVR algorithm separately to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information to build speech emotion recognition models and predict the pleasure value P, arousal value A, and dominance value D of the emotional speech;
Step 3: use Pearson correlation analysis to compute the correlation coefficients of the predicted values obtained from the three features in the three dimensions P, A, and D, and determine the feature weights;
Step 4: according to the weights of the different features, obtain the final PAD values of the emotional speech in the three-dimensional emotion space by weighted fusion.
2. The speech emotion recognition method in a PAD three-dimensional emotion space according to claim 1, characterized in that: in step 1, during extraction of the Mel-frequency cepstral coefficients MFCC, the frame length in pre-processing is 256 samples, the frame shift is 128 samples, and the window function is a Hamming window; in the computation, the pre-processed speech data is subjected to a fast Fourier transform and the squared magnitude is taken to obtain its energy spectrum, the energy spectrum is passed through the Mel filter bank, and the output parameters Pm (m = 0, 1, 2, ..., M-1) are computed as
Pm = Σk |S(k)|² Hm(k), m = 0, 1, 2, ..., M-1,
where the number of filters M is 26, fm denotes the center frequency of the triangular filter, Hm(k) denotes the frequency response of the triangular filter, and S(n) denotes the fast Fourier transform result of the speech signal; finally, first-order difference MFCC feature parameters of order 12 are extracted.
3. The speech emotion recognition method in a PAD three-dimensional emotion space according to claim 1, characterized in that: in step 1, a simplified PCNN neuron model is used to extract the time firing sequence and the firing location information, with the parameters αF, αL, αθ, VF, VL, Vθ, and β set to 0.1, 1.0, 1.0, 0.5, 0.2, 20, and 0.1 respectively; αF, αL, and αθ denote the decay time constants of the feedback input Fij, the linking input Lij, and the dynamic threshold θij respectively; VF, VL, and Vθ denote the feedback amplification coefficient, linking amplification coefficient, and threshold amplification coefficient of the PCNN respectively; β is the linking strength coefficient; the initial values of the linking input L, the internal activity U, and the pulse output Y are set to 0; the input is the normalized gray value, lying in [0, 1]; the linking field radius is r = 1.5, and the internal linking matrix W is a 3 × 3 matrix in which each element is the reciprocal r⁻² of the squared Euclidean distance from the central pixel to the corresponding surrounding pixel.
4. The speech emotion recognition method in a PAD three-dimensional emotion space according to claim 1, characterized in that: in step 3, the correlation coefficient of the predicted values obtained from each of the three features in the P, A, and D dimensions is
ρX,Y = E[(X - μX)(Y - μY)] / (σX σY),
where, in each computation, X represents the P, A, or D values predicted from one feature and Y represents the corresponding P, A, or D values of the emotional speech in the emotion scale; μX and μY denote the means of X and Y, and σX and σY denote the standard deviations of X and Y; the value of ρX,Y is the computed feature weight.
5. The speech emotion recognition method in a PAD three-dimensional emotion space according to any one of claims 1-4, characterized in that: in step 4, the final PAD values of the emotional speech in the three-dimensional emotion space are:
P = P1λ1 + P2λ2 + P3λ3,
where P1, P2, and P3 respectively denote the predicted values of the speech in the P dimension obtained from the Mel-frequency cepstral coefficients, the time firing sequence, and the firing location information; λ1, λ2, and λ3 denote, in order, the normalized correlation coefficients of the three speech emotion features with this speech emotion type in the P dimension, satisfying λ1 + λ2 + λ3 = 1.
CN201810438464.9A 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space Expired - Fee Related CN108682431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810438464.9A CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810438464.9A CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Publications (2)

Publication Number Publication Date
CN108682431A true CN108682431A (en) 2018-10-19
CN108682431B CN108682431B (en) 2021-08-03

Family

ID=63805990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810438464.9A Expired - Fee Related CN108682431B (en) 2018-05-09 2018-05-09 Voice emotion recognition method in PAD three-dimensional emotion space

Country Status (1)

Country Link
CN (1) CN108682431B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296A (en) * 2018-10-30 2019-03-01 河北工业大学 Video emotion recognition method fusing facial expression recognition and speech emotion recognition
CN110555084A (en) * 2019-08-26 2019-12-10 电子科技大学 remote supervision relation classification method based on PCNN and multi-layer attention
CN111402928A (en) * 2020-03-04 2020-07-10 华南理工大学 Attention-based speech emotion state evaluation method, device, medium and equipment
CN113749656A (en) * 2021-08-20 2021-12-07 杭州回车电子科技有限公司 Emotion identification method and device based on multi-dimensional physiological signals
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012168740A1 (en) * 2011-06-10 2012-12-13 X-System Limited Method and system for analysing sound
CN107633851A (en) * 2017-07-31 2018-01-26 中国科学院自动化研究所 Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012168740A1 (en) * 2011-06-10 2012-12-13 X-System Limited Method and system for analysing sound
CN107633851A (en) * 2017-07-31 2018-01-26 中国科学院自动化研究所 Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周慧: "Emotional speech conversion and recognition based on the PAD three-dimensional emotion model", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
宋静: "Research on the application of the PAD emotion model in emotional speech recognition", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
梁泽, 马义德, 张恩溯, 朱望飞, 汤书森: "A new speech emotion recognition method based on pulse coupled neural networks", 《计算机应用》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296A (en) * 2018-10-30 2019-03-01 河北工业大学 Video emotion recognition method fusing facial expression recognition and speech emotion recognition
CN109409296B (en) * 2018-10-30 2020-12-01 河北工业大学 Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN110555084A (en) * 2019-08-26 2019-12-10 电子科技大学 remote supervision relation classification method based on PCNN and multi-layer attention
CN110555084B (en) * 2019-08-26 2023-01-24 电子科技大学 Remote supervision relation classification method based on PCNN and multi-layer attention
CN111402928A (en) * 2020-03-04 2020-07-10 华南理工大学 Attention-based speech emotion state evaluation method, device, medium and equipment
CN113749656A (en) * 2021-08-20 2021-12-07 杭州回车电子科技有限公司 Emotion identification method and device based on multi-dimensional physiological signals
CN113749656B (en) * 2021-08-20 2023-12-26 杭州回车电子科技有限公司 Emotion recognition method and device based on multidimensional physiological signals
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Also Published As

Publication number Publication date
CN108682431B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN108682431A (en) A kind of speech-emotion recognition method in PAD three-dimensionals emotional space
Badshah et al. Deep features-based speech emotion recognition for smart affective services
Demircan et al. Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech
Hao et al. Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features
Kaya et al. Robust acoustic emotion recognition based on cascaded normalization and extreme learning machines
Zhou et al. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features
Xiao et al. Multi-sensor data fusion for sign language recognition based on dynamic Bayesian network and convolutional neural network
Ye et al. GM-TCNet: Gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition
Wu et al. Environmental sound classification via time–frequency attention and framewise self-attention-based deep neural networks
Huang et al. Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering
Dang et al. Dynamic multi-rater gaussian mixture regression incorporating temporal dependencies of emotion uncertainty using kalman filters
Sharma et al. Framework for gender recognition using voice
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Zhao et al. A survey on automatic emotion recognition using audio big data and deep learning architectures
Xue et al. Physiological-physical feature fusion for automatic voice spoofing detection
Gorrostieta et al. Attention-based Sequence Classification for Affect Detection.
Taran A nonlinear feature extraction approach for speech emotion recognition using VMD and TKEO
US20130262097A1 (en) Systems and methods for automated speech and speaker characterization
Fonnegra et al. Speech emotion recognition based on a recurrent neural network classification model
Nelus et al. Privacy-preserving audio classification using variational information feature extraction
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
Mallikarjunan et al. Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm
Sukvichai et al. Automatic speech recognition for Thai sentence based on MFCC and CNNs
Lei et al. Robust scream sound detection via sound event partitioning
Trivikram et al. EVALUATION OF HYBRID FACE AND VOICE RECOGNITION SYSTEMS FOR BIOMETRIC IDENTIFICATION IN AREAS REQUIRING HIGH SECURITY.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210803