CN109767790A - Speech emotion recognition method and system - Google Patents

Speech emotion recognition method and system

Info

Publication number
CN109767790A
CN109767790A
Authority
CN
China
Prior art keywords
voice signal
obtains
preprocessing
spectrogram
speech
Prior art date
Legal status
Pending
Application number
CN201910173689.0A
Other languages
Chinese (zh)
Inventor
巩微
范文庆
金连婧
伏文龙
黄玮
Current Assignee
Communication University of China
Original Assignee
Communication University of China
Priority date
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201910173689.0A
Publication of CN109767790A
Legal status: Pending (current)


Landscapes

  • Image Analysis (AREA)

Abstract

The present invention discloses a speech emotion recognition method and system. The recognition method includes: acquiring a voice signal; preprocessing the voice signal to obtain a preprocessed voice signal; computing the spectrogram corresponding to the preprocessed voice signal; computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length; extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length; and classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network. This speech emotion recognition method, based on spectrograms and convolutional neural networks, improves the speech emotion recognition rate.

Description

Speech emotion recognition method and system
Technical field
The present invention relates to the field of speech recognition, and more particularly to a speech emotion recognition method and system.
Background technique
Speech emotion recognition is an emerging field at the intersection of artificial intelligence, psychology, computational science, and other disciplines. Since the beginning of the 21st century, with the rapid development of artificial intelligence, the demand for speech emotion recognition has kept growing, so analyzing and studying the affective features contained in speech in order to judge whether a speaker is happy, angry, sad, or joyful is of great importance.
Traditional research on speech emotion recognition has concentrated on analyzing statistical acoustic features of speech, typically using emotional speech databases with relatively few entries and relatively simple semantics. In the prior art, the acoustic features used for emotion recognition can be divided into prosodic features, spectrum-based features, and voice-quality features. For feature extraction, the earliest heuristic algorithms include sequential backward selection and sequential forward selection; linear feature-extraction algorithms such as principal component analysis and Fisher linear discriminant analysis have also been applied. Because the accuracy of these analysis methods is low, a method of automatically extracting features with a deep belief network was proposed, combined in the prior art with classification methods such as linear discriminant classification, the k-nearest-neighbor method, and support vector machines; with the three classifiers of maximum-likelihood Bayes, kernel regression, and k-nearest neighbors, recognition rates of only 60%-65% have been achieved.
The classification and analysis methods used in the prior art therefore yield a relatively low speech emotion recognition rate.
Summary of the invention
The object of the present invention is to provide a speech emotion recognition method and system capable of improving the recognition rate of speech emotion recognition.
To achieve the above object, the present invention provides the following solutions:
A speech emotion recognition method, the recognition method comprising:
acquiring a voice signal;
preprocessing the voice signal to obtain a preprocessed voice signal;
computing the spectrogram corresponding to the preprocessed voice signal;
computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
Optionally, preprocessing the voice signal to obtain a preprocessed voice signal specifically includes:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
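As an illustration of the sampling and quantization steps above, the following NumPy sketch decimates a signal to a target rate and applies uniform quantization; the 16 kHz sampling rate and 16-bit depth are assumptions chosen for the example only, not values fixed by the patent.

import numpy as np

def sample_and_quantize(signal, orig_rate, target_rate=16000, n_bits=16):
    """Resample to a discrete-time signal, then quantize to discrete amplitudes."""
    # Sampling: keep every (orig_rate / target_rate)-th value (assumes an integer ratio).
    step = int(orig_rate // target_rate)
    discrete_time = signal[::step]                   # discrete time, continuous amplitude
    # Uniform quantization to 2^n_bits levels over [-1, 1].
    levels = 2 ** n_bits
    quantized = np.round((discrete_time + 1.0) / 2.0 * (levels - 1))
    discrete_amplitude = quantized / (levels - 1) * 2.0 - 1.0   # discrete time and amplitude
    return discrete_amplitude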
Optionally, computing the spectrogram corresponding to the preprocessed voice signal specifically includes:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
Optionally, classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network specifically includes:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
A speech emotion recognition system, the recognition system comprising:
a voice signal acquisition module, configured to acquire a voice signal;
a preprocessing module, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
Optionally, the preprocessing module specifically includes:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
Optionally, the spectrogram computation module specifically includes:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
Optionally, the convolutional neural network module specifically includes:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects: the present invention discloses a speech emotion recognition method and system. The recognition method acquires a voice signal; preprocesses the voice signal to obtain a preprocessed voice signal; computes the spectrogram corresponding to the preprocessed voice signal; computes the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and determines the segment length with the highest emotion recognition rate as the optimal segment length; extracts acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length; and classifies and recognizes the emotion of the voice signal from the acoustic features with a convolutional neural network. The speech emotion recognition method based on spectrograms and convolutional neural networks improves the speech emotion recognition rate, and recognizing features of the spectrogram at the optimal segment length with a convolutional neural network further improves the recognition rate of speech emotion.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a flow chart of the speech emotion recognition method provided by the present invention;
Fig. 2 is a block diagram of the speech emotion recognition system provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The object of the present invention is to provide a speech emotion recognition method and system capable of improving the recognition rate of speech emotion recognition.
In order to make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, a speech emotion recognition method includes:
Step 100: acquiring a voice signal;
Step 200: preprocessing the voice signal to obtain a preprocessed voice signal;
Step 300: computing the spectrogram corresponding to the preprocessed voice signal;
Step 400: computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
Step 500: extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
Step 600: classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
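For concreteness, the segment-length search of Step 400 can be read as a simple loop over candidate lengths. The sketch below assumes a helper recognition_rate(signal, seg_len) that runs the spectrogram and CNN pipeline for one segment length and returns its emotion recognition rate; the helper name and the idea of iterating over an explicit candidate list are illustrative assumptions, not details fixed by the patent.

def select_optimal_segment_length(signal, candidate_lengths, recognition_rate):
    """Return the candidate segment length whose emotion recognition rate is highest (Step 400)."""
    best_length, best_rate = None, float("-inf")
    for seg_len in candidate_lengths:
        rate = recognition_rate(signal, seg_len)  # e.g. validation accuracy of the CNN classifier
        if rate > best_rate:
            best_length, best_rate = seg_len, rate
    return best_length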
Step 200, preprocessing the voice signal to obtain a preprocessed voice signal, specifically includes:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
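A minimal NumPy sketch of the pre-emphasis and framing/windowing portion of Step 200 follows. The pre-emphasis coefficient of 0.97 and the Hanning window are common choices assumed here for illustration; the patent does not fix their values.

import numpy as np

def preemphasize_and_frame(signal, frame_len, frame_shift, alpha=0.97):
    """Pre-emphasize the signal, then split it into overlapping Hanning-windowed frames."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    window = np.hanning(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = emphasized[start:start + frame_len] * window  # S'_i = S_i x hanning(N_new)
    return frames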
Step 300, computing the spectrogram corresponding to the preprocessed voice signal, specifically includes:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal. Sharing filter weights reduces the number of coefficients that need to be trained.
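The following sketch shows how the ([N_new/2]+1) × N energy-density matrix R of Step 300 and its grayscale mapping can be computed from the windowed frames produced above. The log compression and 0-255 scaling used for the grayscale image are illustrative assumptions; the patent only states that R is mapped to a grayscale image.

import numpy as np

def spectrogram_from_frames(frames):
    """Build the energy-density matrix R (shape [N_new/2]+1 x N) and a grayscale image of it."""
    Z = np.fft.rfft(frames, axis=1)                  # Fourier transform Z_i of each windowed frame
    R = (np.abs(Z) ** 2).T                           # energy density |Z_i|^2 as the columns of R
    log_R = 10.0 * np.log10(R + 1e-10)               # log compression (illustrative)
    gray = np.uint8(255 * (log_R - log_R.min()) / (log_R.max() - log_R.min() + 1e-10))  # 0-255 grayscale
    return R, gray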
Step 600, classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network, specifically includes:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
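A minimal PyTorch sketch of a spectrogram-segment classifier with convolutional, pooling, and fully connected layers is given below. The layer sizes, ReLU activation, input resolution, and number of emotion classes are illustrative assumptions; note also that, for simplicity, the sketch uses the conventional convolution, then pooling, then fully connected ordering rather than the fully connected layer between the convolutional and pooling layers described above.

import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Illustrative CNN over 1-channel spectrogram segments (assumed 65 x 65 pixels)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # trainable kernels k_ij and biases b_j
        self.act = nn.ReLU()                                     # activation function
        self.pool = nn.MaxPool2d(2)                              # pooling lowers the resolution (y'_i)
        self.fc = nn.Linear(16 * 32 * 32, n_classes)             # fully connected emotion classifier

    def forward(self, x):
        y = self.act(self.conv(x))      # two-dimensional feature maps y_i
        y = self.pool(y)                # low-resolution acoustic features y'_i
        return self.fc(y.flatten(1))    # emotion class scores

# Example usage: scores = EmotionCNN()(torch.randn(8, 1, 65, 65))  # batch of 8 spectrogram segments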
As shown in Fig. 2, a speech emotion recognition system includes:
a voice signal acquisition module 1, configured to acquire a voice signal;
a preprocessing module 2, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module 3, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module 4, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module 5, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module 6, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
The preprocessing module 2 specifically includes:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
The spectrogram computation module 3 specifically includes:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
The convolutional neural network module 6 specifically includes:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant points can be found in the description of the method.
Specific examples are used herein to illustrate the principles and implementation of the present invention. The above description of the embodiments is only intended to help understand the method of the present invention and its core concept. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A speech emotion recognition method, characterized in that the recognition method comprises:
acquiring a voice signal;
preprocessing the voice signal to obtain a preprocessed voice signal;
computing the spectrogram corresponding to the preprocessed voice signal;
computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
2. The speech emotion recognition method according to claim 1, characterized in that preprocessing the voice signal to obtain a preprocessed voice signal specifically comprises:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
3. The speech emotion recognition method according to claim 1, characterized in that computing the spectrogram corresponding to the preprocessed voice signal specifically comprises:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
4. The speech emotion recognition method according to claim 1, characterized in that classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network specifically comprises:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
wherein a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
5. A speech emotion recognition system, characterized in that the recognition system comprises:
a voice signal acquisition module, configured to acquire a voice signal;
a preprocessing module, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
6. The speech emotion recognition system according to claim 5, characterized in that the preprocessing module specifically comprises:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
7. The speech emotion recognition system according to claim 5, characterized in that the spectrogram computation module specifically comprises:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
8. The speech emotion recognition system according to claim 5, characterized in that the convolutional neural network module specifically comprises:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
CN201910173689.0A 2019-02-28 2019-02-28 Speech emotion recognition method and system Pending CN109767790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910173689.0A CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910173689.0A CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Publications (1)

Publication Number Publication Date
CN109767790A true CN109767790A (en) 2019-05-17

Family

ID=66457882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910173689.0A Pending CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN109767790A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 Method and apparatus for recognizing emotional speech
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 Speech emotion recognition system and speech emotion recognition method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method and apparatus for providing emotion speech recognition
US20130297297A1 * 2012-05-07 2013-11-07 Erhan Guven System and method for classification of emotion in human speech
CN104021373A * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN107705806A * 2017-08-22 2018-02-16 北京联合大学 Method for speech emotion recognition using spectrograms and deep convolutional neural networks
CN108010514A * 2017-11-20 2018-05-08 四川大学 Speech classification method based on deep neural networks
CN108597539A * 2018-02-09 2018-09-28 桂林电子科技大学 Speech emotion recognition method based on parameter transfer and spectrograms
CN108899049A * 2018-05-31 2018-11-27 中国地质大学(武汉) Speech emotion recognition method and system based on convolutional neural networks
CN109036465A * 2018-06-28 2018-12-18 南京邮电大学 Speech emotion recognition method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SATHIT PRASOMPHAN: "Improvement of speech emotion recognition with neural network classifier by using speech spectrogram", 2015 International Conference on Systems, Signals and Image Processing (IWSSIP) *
张若凡 et al.: "Speech emotion recognition method for the elderly based on spectrograms" (基于语谱图的老年人语音情感识别方法), Software Guide (软件导刊) *
王建伟: "Research and design of an emotion perception system based on deep learning" (基于深度学习的情绪感知系统的研究与设计), China Masters' Theses Full-text Database, Information Science and Technology (中国优秀硕士论文全文数据库 信息科技辑) *
田熙燕 et al.: "Speech emotion recognition based on spectrograms and convolutional neural networks" (基于语谱图和卷积神经网络的语音情感识别), Journal of Henan Institute of Science and Technology (河南科技学院学报) *
黄晨晨 et al.: "Research on speech emotion recognition based on deep belief networks" (基于深度信念网络的语音情感识别的研究), Journal of Computer Research and Development (计算机研究与发展) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 Method and apparatus for recognizing emotional speech
CN110415728B (en) * 2019-07-29 2022-04-01 内蒙古工业大学 Method and device for recognizing emotion voice
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 Speech emotion recognition system and speech emotion recognition method
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method

Similar Documents

Publication Publication Date Title
Alim et al. Some commonly used speech feature extraction algorithms
Stanton et al. Predicting expressive speaking style from text in end-to-end speech synthesis
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN115641543B (en) Multi-modal depression emotion recognition method and device
CN109767756A (en) Speech feature extraction algorithm based on dynamic-partition inverse discrete cosine transform cepstrum coefficients
CN108986798B (en) Processing method, device and the equipment of voice data
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN109065073A (en) Speech emotion recognition method based on a deep SVM network model
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN114722812A (en) Method and system for analyzing vulnerability of multi-mode deep learning model
Jie et al. Speech emotion recognition of teachers in classroom teaching
CN109767790A (en) Speech emotion recognition method and system
CN116612541A (en) Multi-mode emotion recognition method, device and storage medium
CN111724809A (en) Vocoder implementation method and device based on a variational autoencoder
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN113111786B (en) Underwater target identification method based on small sample training diagram convolutional network
Wang et al. Speech signal feature parameters extraction algorithm based on PCNN for isolated word recognition
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
Peng et al. Multi-scale model for mandarin tone recognition
CN116312617A (en) Voice conversion method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190517