CN102592593B - Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech - Google Patents

Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Info

Publication number: CN102592593B
Application number: CN201210091525.1A
Authority: CN (China)
Other versions: CN102592593A (en)
Inventors: 吴强 (Wu Qiang), 刘琚 (Liu Ju), 孙建德 (Sun Jiande)
Original assignee: Shandong University
Current assignee: Shandong University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Shandong University; priority to CN201210091525.1A
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Landscapes

  • Complex Calculations (AREA)
  • Machine Translation (AREA)
Abstract

The invention discloses an emotional-feature extraction method for speech implemented by considering the sparsity of multilinear groups. The method comprises the following steps: considering the multiple factors carried by a speech signal, such as time, frequency, scale, and direction information; extracting features with a group sparse multilinear decomposition method; giving the energy spectrum of the speech signal a multilinear representation through Gabor functions of different scales and directions; solving the feature projection matrix with a group sparse tensor decomposition method; computing the feature projection on the frequency mode; decorrelating the features through the discrete cosine transform; and finally computing the first- and second-order difference coefficients to obtain the emotional features of the speech. According to the invention, the factors of time, frequency, scale, and direction in the speech signal are all taken into account for emotional feature extraction, and the feature projection is performed with a group sparse tensor decomposition method, ultimately improving the accuracy of multi-class speech emotion recognition.

Description

An emotional feature extraction method considering multilinear group sparsity in speech
Technical field
The present invention relates to a speech emotion feature extraction method for improving the performance of speech emotion recognition, and belongs to the field of speech processing technology.
Background art
Speech is one of the most natural ways people communicate in daily life, which has led researchers to explore how speech can serve as a medium of exchange between humans and machines. Beyond traditional interaction modes such as speech recognition, the speaker's emotion is also an important form of interactive information, and a machine's ability to automatically understand the speaker's emotion is one of the hallmarks of intelligent human-computer interaction.
Speech emotion recognition is of significant value in signal processing and intelligent human-computer interaction, with many potential applications. In human-computer interaction, recognizing the speaker's emotion allows a computer to respond more warmly and accurately; for example, a distance-education system can adjust its course content in time by recognizing a student's emotion, thereby improving teaching effectiveness. In telephone call centers and mobile communications, the user's emotional information can be obtained promptly to improve the quality of service. An in-vehicle system can detect through emotion recognition whether the driver is attentive and issue an appropriate auxiliary warning. In medicine, speech-based emotion recognition can serve as a tool to help doctors diagnose a patient's condition.
For speech emotion recognition, a key problem is how to extract effective features to represent different emotions. In traditional feature extraction, an utterance is usually divided into multiple frames so that each frame is approximately stationary. Features obtained frame by frame, such as pitch and energy, are called local features; their advantage is that existing classifiers can estimate the parameters of different emotional states fairly accurately from them, while their disadvantage is that the feature dimensionality and sample count are large, which slows feature extraction and classification. Features obtained by computing statistics over a whole sentence are called global features; they yield better classification accuracy and speed, but lose the temporal information of the speech signal and easily suffer from a shortage of training samples. In general, the features commonly used in speech emotion recognition fall into a few classes: continuous acoustic features, spectral features, features based on the Teager energy operator, and so on.
According to studies in psychology and prosody, the most intuitive cues to a speaker's emotion in speech are the continuous prosodic features, such as pitch, energy, and speaking rate. The corresponding global features include the mean, median, standard deviation, maximum, and minimum of pitch or energy, as well as the first and second formants.
Spectral features provide the useful frequency information in the speech signal and are also an important feature extraction approach in speech emotion recognition. Commonly used spectral features include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP).
Speech is produced by nonlinear airflow in the vocal system. The Teager energy operator (TEO), proposed by Teager et al., is an operation that can rapidly track changes in signal energy within a glottal cycle and is used to analyze the fine structure of speech. Under different emotional states, muscle tension affects the motion of airflow in the vocal system; according to the findings of Bou-Ghazale et al., TEO-based features can be used to detect stress in speech.
Numerous experimental evaluations show that, for speech emotion recognition, a suitable feature representation should be selected for each classification task: features based on Teager energy are suited to detecting stress in the speech signal; continuous acoustic features are suited to distinguishing high-arousal emotions from low-arousal emotions; and for multi-class emotion classification tasks, spectral features are the best-suited speech representation. Combining spectral features with continuous acoustic features, or jointly analyzing multiple factors, can also improve classification accuracy.
Another important stage, after speech emotion features have been extracted and selected, is classification. In pattern recognition, a variety of classifiers have been applied to speech emotion features, including hidden Markov models (HMM), Gaussian mixture models (GMM), support vector machines (SVM), linear discriminant analysis (LDA), and ensemble classifiers. The hidden Markov model is one of the most widely used recognition algorithms in speech emotion recognition, owing to its broad application to speech signals; it is particularly suited to data with a temporal structure, and current results show that emotion recognition systems based on HMMs can deliver relatively high classification accuracy. A Gaussian mixture model can be regarded as an HMM with only one state and is well suited to modeling multivariate distributions; Breazeal et al. applied a GMM classifier to the KISMET speech database to classify five kinds of emotion. Support vector machines are widely used in pattern recognition; their basic principle is to project features into a higher-dimensional space through a kernel function so that the features become linearly separable. Compared with HMM and GMM, SVMs have the advantages of a globally optimal training algorithm and generalization bounds that do not depend on the data distribution, and many studies have used support vector machines as the classifier for speech emotion recognition and obtained good classification results.
As shown in Figure 1, a traditional speech emotion recognition method based on spectral features usually adopts the following steps (a sketch of the front end follows the list):
1) Preprocess the input speech signal, including windowing, filtering, and pre-emphasis;
2) Apply the short-time Fourier transform to the signal, filter through a Mel-scale triangular filter bank, and take the logarithm to obtain the log spectrum;
3) Compute the cepstrum with the discrete cosine transform, then apply weighting and cepstral mean subtraction, and compute difference coefficients;
4) Train Gaussian mixture models (GMM) to obtain a model for each emotion;
5) Use the trained emotion models to recognize the test data and obtain the recognition accuracy.
At present, relatively good accuracy has been reached for two-class emotion classification, such as negative versus neutral emotion. For multi-class emotion classification, however, the features discriminate poorly because of the imbalance of the data and because only a single factor (frequency or time) is considered, so the emotion classification accuracy is relatively low, which limits the application of speech-based emotion recognition systems.
Summary of the invention
Aimed at the problem that feature extraction in traditional speech emotion recognition considers only a single factor, such as frequency or time, and therefore yields poorly discriminative features, the present invention proposes a speech emotion feature extraction method that considers the multilinear group sparsity of speech, which can improve the accuracy of multi-class emotion recognition.
The emotion feature extraction method of the present invention, which considers multilinear group sparsity in speech, is as follows:
Considering that a speech signal carries the multiple factors of time, frequency, scale, and direction information, features are extracted by the method of group sparse multilinear decomposition: the energy spectrum of the speech signal is given a multilinear representation by Gabor functions of different scales and directions; the feature projection matrix is solved by the group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first- and second-order difference coefficients of the features are obtained by differencing. The method specifically comprises the following steps:
(1) Acquire the speech signal s(t) (collected by a device such as a microphone), transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions. The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π.

The result of convolving the energy spectrum P(f, t) with the Gabor functions is the multilinear representation of the speech signal (denoted $\bar{G}$ here), a 5th-order tensor whose modes are time, frequency, direction, scale, and class. Filtering the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank then yields a new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
(3) Apply the group sparse tensor decomposition to the obtained multilinear representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection. The following decomposition model is set up:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A;
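For illustration, a toy numpy check of the mode-i product just defined (not part of the patent): contracting mode i of a tensor with an $N_i \times K$ matrix replaces that mode's length $N_i$ with K.

```python
import numpy as np

X = np.random.rand(4, 5, 6)     # 3rd-order tensor of size N1 x N2 x N3
A = np.random.rand(5, 2)        # N2 x K matrix applied along mode 2
# (X x_2 A)_{n1,k,n3} = sum over n2 of X_{n1,n2,n3} * A_{n2,k}
Y = np.moveaxis(np.tensordot(X, A, axes=(1, 0)), -1, 1)
print(Y.shape)                  # (4, 2, 6): mode-2 length 5 replaced by K = 2
```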
The specific procedure for computing the projection matrices $U^{(i)}$, i = 1, ..., I, is as follows, where i indexes the modes (corresponding to the different factors) and I = 5:

1. Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

2. Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

3. While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

4. When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained;
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;

(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$;

(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; computing the first- and second-order difference coefficients of the feature yields the final emotional features.
The present invention takes the factors of time, frequency, scale, and direction in the speech signal into account for emotion feature extraction and performs feature projection by the group sparse tensor decomposition method, ultimately improving the accuracy of multi-class speech emotion recognition.
Brief description of the drawings
Fig. 1 is a schematic block diagram of a traditional speech emotion recognition process;
Fig. 2 is a schematic diagram of the feature extraction method of the present invention;
Fig. 3 is a schematic block diagram of a speech emotion recognition process that adopts the present invention;
Fig. 4 compares experimental results on four-class speech emotion recognition.
Embodiment
As shown in Figure 2, the speech emotion recognition method of the present invention based on multilinear group sparse features specifically comprises the following steps:
(1) Collect the speech signal s(t) with a device such as a microphone, transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
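A minimal sketch of step (1), assuming scipy and the analysis parameters given in the experiments below (8 kHz sampling, 23 ms Hamming window, 10 ms shift); the function name is illustrative:

```python
import numpy as np
from scipy.signal import stft

def energy_spectrum(s, sr=8000):
    # Time-frequency representation S(f, t) via the short-time Fourier transform.
    nperseg = int(0.023 * sr)                      # 23 ms Hamming window
    hop = int(0.010 * sr)                          # 10 ms window shift
    f, t, S = stft(s, fs=sr, window='hamming',
                   nperseg=nperseg, noverlap=nperseg - hop)
    P = np.abs(S) ** 2                             # energy spectrum P(f, t)
    return f, t, P
```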
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions to obtain the multilinear representation of the speech signal (denoted $\bar{G}$ here), then filter the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank to obtain the representation $\bar{P}$.

The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π.

The result of the Gabor convolutional filtering of the energy spectrum P(f, t) is the multilinear representation $\bar{G}$ of the speech signal, a 5th-order tensor whose modes are time, frequency, direction, scale, and class; filtering the frequency mode of $\bar{G}$ with the Mel-scale triangular filter bank yields the new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
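A sketch of the Gabor filtering in step (2), assuming scipy and the 4 scales and 4 directions used in the experiments; the kernel support, the use of response magnitudes, and the stacking order are illustrative assumptions, and the class mode arises when utterances of different emotions are stacked:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, K=4, sigma=2 * np.pi, half=8):
    # Discretized 2-D Gabor function g_k(x) with direction u and scale v.
    k_v = 2.0 ** (-(v + 2) / 2) * np.pi
    phi = u * np.pi / K
    kx, ky = k_v * np.cos(phi), k_v * np.sin(phi)
    t, f = np.mgrid[-half:half + 1, -half:half + 1]
    ksq = kx ** 2 + ky ** 2
    gauss = (ksq / sigma ** 2) * np.exp(-ksq * (t ** 2 + f ** 2) / (2 * sigma ** 2))
    return gauss * (np.exp(1j * (kx * t + ky * f)) - np.exp(-sigma ** 2 / 2))

def gabor_representation(P, n_dirs=4, n_scales=4):
    # Stack the filter response magnitudes into a 4th-order array of modes
    # (freq, time, direction, scale); stacking such arrays over utterances of
    # different emotions adds the class mode and gives the 5th-order tensor G.
    return np.stack([np.stack([np.abs(fftconvolve(P, gabor_kernel(u, v),
                                                  mode='same'))
                               for v in range(n_scales)], axis=-1)
                     for u in range(n_dirs)], axis=-2)
```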
(3) Apply the group sparse tensor decomposition to the representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection. The following decomposition model is set up:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A.
To compute the projection matrices $U^{(i)}$, i = 1, ..., I, with I = 5 here, the specific decomposition procedure is as follows (a sketch follows the procedure):

A) Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

B) Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

C) While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

D) When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained;
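Because the exact column updates above survive only as images, the following sketch fills the loop with a standard ALS-style nonnegative CP update carrying an l1 sparsity penalty weighted by lam * q[i]; this substitute update, the stopping rule, and all parameter values are assumptions, not the patent's exact formulas:

```python
import numpy as np

def unfold(T, mode):
    # Mode-i matrix unfolding: bring mode i to the front, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    # Column-wise Kronecker product of a list of matrices.
    K = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ik,jk->ijk', out, M).reshape(-1, K)
    return out

def group_sparse_ntf(P, K, lam=0.1, q=None, n_iter=200, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    I = P.ndim
    q = q if q is not None else [0.5] * I
    U = [rng.random((n, K)) for n in P.shape]            # nonnegative init
    U = [u / np.linalg.norm(u, axis=0) for u in U]       # column normalization
    prev = np.inf
    for _ in range(n_iter):
        for i in range(I):
            W = khatri_rao([U[m] for m in range(I) if m != i])
            num = unfold(P, i) @ W - lam * q[i]          # l1-shrunk correlation
            den = (W * W).sum(axis=0) + 1e-12            # gamma-like normalizers
            U[i] = np.maximum(num / den, 0)              # nonnegative projection
        E = np.linalg.norm(unfold(P, 0) - U[0] @ khatri_rao(U[1:]).T)
        if abs(prev - E) < tol:                          # objective has converged
            break
        prev = E
    return U
```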
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;
(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$ (the frequency mode has length K after the projection in step (4));
(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; computing the first- and second-order difference coefficients of the feature yields the final emotional features.
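A sketch of steps (5)-(6), assuming scipy; the row layout of the unfolding and the simple two-point difference used for the delta coefficients are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def emotion_features(S):
    # S: (N1, K, N3, N4, N5) with the time mode first; one feature row per frame.
    Sf = S.reshape(S.shape[0], -1)               # fix time, unfold the rest
    F = dct(Sf, type=2, norm='ortho', axis=1)    # DCT decorrelation per frame
    d1 = np.gradient(F, axis=0)                  # first-order differences
    d2 = np.gradient(d1, axis=0)                 # second-order differences
    return np.hstack([F, d1, d2])                # final emotional features
```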
As shown in Figure 3, speech emotion recognition with the above feature extraction method comprises the following steps (a sketch follows the list):
1) Obtain speech signal data $s_l(t)$, l = 1, ..., L, carrying different emotion labels, with L emotion classes in total;
2) Extract the feature F of each emotion with the feature extraction method shown in Fig. 2;
3) Model the different emotional features with Gaussian mixture models (GMM); through learning and training, obtain the emotion model $M_l$ corresponding to the l-th emotion class;
4) To test a speech signal of unknown emotion type, compute the posterior probability under each GMM emotion model $M_l$, l = 1, ..., L, in turn; the emotion class with the maximum posterior probability is the emotion recognition result for that signal.
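A minimal sketch of the training and maximum a posteriori decision, assuming scikit-learn; the number of mixture components, the covariance type, and the class priors are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_emotion, n_components=16, seed=0):
    # One GMM M_l per emotion class l, fit on all feature frames of that class.
    return {emo: GaussianMixture(n_components, covariance_type='diag',
                                 random_state=seed).fit(F)
            for emo, F in features_by_emotion.items()}

def recognize(models, F, priors=None):
    # Maximum a posteriori decision: average log-likelihood plus log prior.
    priors = priors or {e: 1.0 / len(models) for e in models}
    scores = {e: m.score(F) + np.log(priors[e]) for e, m in models.items()}
    return max(scores, key=scores.get)
```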
The effect of the present invention can be further illustrated by experiment.
The recognition performance of the feature extraction method proposed by the present invention was tested on the FAU Aibo dataset, recognizing 4 emotion classes (Anger, Emphatic, Neutral, Rest). In this experiment the speech signal was sampled at 8 kHz and windowed with a Hamming window, with a 23 ms window length and a 10 ms window shift; the energy spectrum of the signal was computed with the short-time Fourier transform; Gabor functions with 4 different scales and 4 different directions performed time-frequency convolutional filtering of the energy spectrum; a Mel filter bank of size 36 was used to compute the Mel power spectrum; the projection matrix performed feature projection on the frequency mode; and the DCT was used to decorrelate the features.
Fig. 4 compares the recognition performance of the method proposed by the present invention with existing feature extraction techniques (MFCC and LFPC features). In terms of final recognition accuracy, the present invention effectively improves multi-class speech emotion recognition: the accuracy is 6.1% higher than the traditional MFCC method and 5.8% higher than the LFPC method.

Claims (2)

1. A speech emotion feature extraction method considering multilinear group sparse features in speech, characterized in that:
Considering that the speech signal carries the multiple factors of time, frequency, scale, and direction information, feature extraction is carried out by the method of group sparse multilinear decomposition; the energy spectrum of the speech signal is given a multilinear representation by Gabor functions of different scales and directions; the feature projection matrix is solved by the group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first- and second-order difference coefficients of the features are computed; the method specifically comprising the following steps:
(1) Acquire the speech signal s(t), transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions. The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π;

The result of the Gabor convolutional filtering of the energy spectrum P(f, t) is the multilinear representation of the speech signal (denoted $\bar{G}$ here), a 5th-order tensor whose modes are time, frequency, direction, scale, and class; filtering the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank yields a new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
(3) Apply the group sparse tensor decomposition to the obtained multilinear representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection, setting up the following decomposition model:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition, $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K, and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A;
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;
(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$;
(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; the first- and second-order difference coefficients of the feature are computed to obtain the final emotional features.
2. The speech emotion feature extraction method considering multilinear group sparse features in speech according to claim 1, characterized in that the specific decomposition procedure for computing the projection matrices $U^{(i)}$, i = 1, ..., I, is as follows, where i indexes the modes and I = 5:

1. Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

2. Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

3. While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

4. When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained.
CN201210091525.1A 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech Expired - Fee Related CN102592593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Publications (2)

Publication Number Publication Date
CN102592593A CN102592593A (en) 2012-07-18
CN102592593B 2014-01-01

Family

ID=46481134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210091525.1A Expired - Fee Related CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Country Status (1)

Country Link
CN (1) CN102592593B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833918B (en) * 2012-08-30 2015-07-15 四川长虹电器股份有限公司 Emotional recognition-based intelligent illumination interactive method
CN103245376B (en) * 2013-04-10 2016-01-20 中国科学院上海微系统与信息技术研究所 A kind of weak signal target detection method
CN103531206B (en) * 2013-09-30 2017-09-29 华南理工大学 A kind of local speech emotional characteristic extraction method with global information of combination
CN103531199B (en) * 2013-10-11 2016-03-09 福州大学 Based on the ecological that rapid sparse decomposition and the degree of depth learn
CN103825678B (en) * 2014-03-06 2017-03-08 重庆邮电大学 A kind of method for precoding amassing 3D MU MIMO based on Khatri Rao
CN105047194B (en) * 2015-07-28 2018-08-28 东南大学 A kind of self study sound spectrograph feature extracting method for speech emotion recognition
CN107886942B (en) * 2017-10-31 2021-09-28 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN109060371A (en) * 2018-07-04 2018-12-21 深圳万发创新进出口贸易有限公司 A kind of auto parts and components abnormal sound detection device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6964023B2 (en) * 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US8886206B2 (en) * 2009-05-01 2014-11-11 Digimarc Corporation Methods and systems for content processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bimodal Emotion Recognition Based on Speech Signals and Facial Expression;Tu, Binbin; Yu, Fengqin;《6th International Conference on Intelligent Systems and Knowledge Engineering》;20111231;全文 *
Continuous Emotion Recognition Using Gabor Energy Filters;Dahmane, Mohamed; Meunier, Jean;《4th Bi-Annual International Conference of the Humaine Association on Affective Computing and Intelligent Interaction》;20111231;全文 *
Feature extraction of speech signals in emotion identification;Morales-Perez,M. et al;《30th Annual International Conference of the IEEE-Engineering-in-Medicine-and-Biology-Society》;20081231;全文 *

Also Published As

Publication number Publication date
CN102592593A (en) 2012-07-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140101

Termination date: 20170331

CF01 Termination of patent right due to non-payment of annual fee