CN109767790A - Speech emotion recognition method and system - Google Patents

Speech emotion recognition method and system

Info

Publication number
CN109767790A
CN109767790A
Authority
CN
China
Prior art keywords
voice signal
obtains
preprocessing
spectrogram
speech
Prior art date
Legal status
Pending
Application number
CN201910173689.0A
Other languages
Chinese (zh)
Inventor
巩微
范文庆
金连婧
伏文龙
黄玮
Current Assignee
Communication University of China
Original Assignee
Communication University of China
Priority date
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201910173689.0A
Publication of CN109767790A
Legal status: Pending (current)


Landscapes

  • Image Analysis (AREA)

Abstract

The present invention discloses a speech emotion recognition method and system. The recognition method includes: acquiring a voice signal; preprocessing the voice signal to obtain a preprocessed voice signal; computing the spectrogram corresponding to the preprocessed voice signal; computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length; extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length; and classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network. This speech emotion recognition method, based on spectrograms and convolutional neural networks, improves the speech emotion recognition rate.

Description

Speech emotion recognition method and system
Technical field
The present invention relates to the field of speech recognition, and more particularly to a speech emotion recognition method and system.
Background technique
Speech emotion recognition is an emerging field at the intersection of artificial intelligence, psychology, computational science, and other disciplines. Since the beginning of the 21st century, with the rapid development of artificial intelligence, the demand for speech emotion recognition has kept growing, so analyzing and studying the affective features contained in speech in order to judge whether a speaker is happy, angry, sad, or joyful is of great importance.
Traditional research on speech emotion recognition has concentrated on analyzing statistical acoustic features of speech, typically using emotional speech databases with relatively few entries and relatively simple semantics. In the prior art, the acoustic features used for emotion recognition can be divided into prosodic features, spectrum-based features, and voice-quality features. For feature extraction, the earliest heuristic algorithms include sequential backward selection and sequential forward selection; linear feature-extraction algorithms such as principal component analysis and Fisher linear discriminant analysis have also been applied. Because the accuracy of these analysis methods is low, a method of automatically extracting features with a deep belief network was proposed, combined in the prior art with classification methods such as linear discriminant classification, the k-nearest-neighbor method, and support vector machines; with the three classifiers of maximum-likelihood Bayes, kernel regression, and k-nearest neighbors, recognition rates of only 60%-65% have been achieved.
The classification and analysis methods used in the prior art therefore yield a relatively low speech emotion recognition rate.
Summary of the invention
The object of the present invention is to provide a speech emotion recognition method and system capable of improving the recognition rate of speech emotion recognition.
To achieve the above object, the present invention provides the following solutions:
A speech emotion recognition method, the recognition method comprising:
acquiring a voice signal;
preprocessing the voice signal to obtain a preprocessed voice signal;
computing the spectrogram corresponding to the preprocessed voice signal;
computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
Optionally, preprocessing the voice signal to obtain a preprocessed voice signal specifically includes:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
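As an illustration of the sampling and quantization steps above, the following NumPy sketch decimates a signal to a target rate and applies uniform quantization; the 16 kHz sampling rate and 16-bit depth are assumptions chosen for the example only, not values fixed by the patent.

import numpy as np

def sample_and_quantize(signal, orig_rate, target_rate=16000, n_bits=16):
    """Resample to a discrete-time signal, then quantize to discrete amplitudes."""
    # Sampling: keep every (orig_rate / target_rate)-th value (assumes an integer ratio).
    step = int(orig_rate // target_rate)
    discrete_time = signal[::step]                   # discrete time, continuous amplitude
    # Uniform quantization to 2^n_bits levels over [-1, 1].
    levels = 2 ** n_bits
    quantized = np.round((discrete_time + 1.0) / 2.0 * (levels - 1))
    discrete_amplitude = quantized / (levels - 1) * 2.0 - 1.0   # discrete time and amplitude
    return discrete_amplitude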
Optionally, computing the spectrogram corresponding to the preprocessed voice signal specifically includes:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
Optionally, classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network specifically includes:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
A speech emotion recognition system, the recognition system comprising:
a voice signal acquisition module, configured to acquire a voice signal;
a preprocessing module, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
Optionally, the preprocessing module specifically includes:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
Optionally, the spectrogram computation module specifically includes:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
Optionally, the convolutional neural network module specifically includes:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects: the present invention discloses a speech emotion recognition method and system. The recognition method acquires a voice signal; preprocesses the voice signal to obtain a preprocessed voice signal; computes the spectrogram corresponding to the preprocessed voice signal; computes the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and determines the segment length with the highest emotion recognition rate as the optimal segment length; extracts acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length; and classifies and recognizes the emotion of the voice signal from the acoustic features with a convolutional neural network. The speech emotion recognition method based on spectrograms and convolutional neural networks improves the speech emotion recognition rate, and recognizing features of the spectrogram at the optimal segment length with a convolutional neural network further improves the recognition rate of speech emotion.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a flow chart of the speech emotion recognition method provided by the present invention;
Fig. 2 is a block diagram of the speech emotion recognition system provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The object of the present invention is to provide a speech emotion recognition method and system capable of improving the recognition rate of speech emotion recognition.
In order to make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, a speech emotion recognition method includes:
Step 100: acquiring a voice signal;
Step 200: preprocessing the voice signal to obtain a preprocessed voice signal;
Step 300: computing the spectrogram corresponding to the preprocessed voice signal;
Step 400: computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
Step 500: extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
Step 600: classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
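For concreteness, the segment-length search of Step 400 can be read as a simple loop over candidate lengths. The sketch below assumes a helper recognition_rate(signal, seg_len) that runs the spectrogram and CNN pipeline for one segment length and returns its emotion recognition rate; the helper name and the idea of iterating over an explicit candidate list are illustrative assumptions, not details fixed by the patent.

def select_optimal_segment_length(signal, candidate_lengths, recognition_rate):
    """Return the candidate segment length whose emotion recognition rate is highest (Step 400)."""
    best_length, best_rate = None, float("-inf")
    for seg_len in candidate_lengths:
        rate = recognition_rate(signal, seg_len)  # e.g. validation accuracy of the CNN classifier
        if rate > best_rate:
            best_length, best_rate = seg_len, rate
    return best_length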
Step 200, preprocessing the voice signal to obtain a preprocessed voice signal, specifically includes:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
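A minimal NumPy sketch of the pre-emphasis and framing/windowing portion of Step 200 follows. The pre-emphasis coefficient of 0.97 and the Hanning window are common choices assumed here for illustration; the patent does not fix their values.

import numpy as np

def preemphasize_and_frame(signal, frame_len, frame_shift, alpha=0.97):
    """Pre-emphasize the signal, then split it into overlapping Hanning-windowed frames."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    window = np.hanning(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = emphasized[start:start + frame_len] * window  # S'_i = S_i x hanning(N_new)
    return frames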
Step 300, computing the spectrogram corresponding to the preprocessed voice signal, specifically includes:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal. Sharing filter weights reduces the number of coefficients that need to be trained.
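The following sketch shows how the ([N_new/2]+1) × N energy-density matrix R of Step 300 and its grayscale mapping can be computed from the windowed frames produced above. The log compression and 0-255 scaling used for the grayscale image are illustrative assumptions; the patent only states that R is mapped to a grayscale image.

import numpy as np

def spectrogram_from_frames(frames):
    """Build the energy-density matrix R (shape [N_new/2]+1 x N) and a grayscale image of it."""
    Z = np.fft.rfft(frames, axis=1)                  # Fourier transform Z_i of each windowed frame
    R = (np.abs(Z) ** 2).T                           # energy density |Z_i|^2 as the columns of R
    log_R = 10.0 * np.log10(R + 1e-10)               # log compression (illustrative)
    gray = np.uint8(255 * (log_R - log_R.min()) / (log_R.max() - log_R.min() + 1e-10))  # 0-255 grayscale
    return R, gray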
Step 600, classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network, specifically includes:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
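A minimal PyTorch sketch of a spectrogram-segment classifier with convolutional, pooling, and fully connected layers is given below. The layer sizes, ReLU activation, input resolution, and number of emotion classes are illustrative assumptions; note also that, for simplicity, the sketch uses the conventional convolution, then pooling, then fully connected ordering rather than the fully connected layer between the convolutional and pooling layers described above.

import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Illustrative CNN over 1-channel spectrogram segments (assumed 65 x 65 pixels)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # trainable kernels k_ij and biases b_j
        self.act = nn.ReLU()                                     # activation function
        self.pool = nn.MaxPool2d(2)                              # pooling lowers the resolution (y'_i)
        self.fc = nn.Linear(16 * 32 * 32, n_classes)             # fully connected emotion classifier

    def forward(self, x):
        y = self.act(self.conv(x))      # two-dimensional feature maps y_i
        y = self.pool(y)                # low-resolution acoustic features y'_i
        return self.fc(y.flatten(1))    # emotion class scores

# Example usage: scores = EmotionCNN()(torch.randn(8, 1, 65, 65))  # batch of 8 spectrogram segments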
As shown in Fig. 2, a speech emotion recognition system includes:
a voice signal acquisition module 1, configured to acquire a voice signal;
a preprocessing module 2, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module 3, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module 4, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module 5, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module 6, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
The preprocessing module 2 specifically includes:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
The spectrogram computation module 3 specifically includes:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
The convolutional neural network module 6 specifically includes:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant points can be found in the description of the method.
Specific examples are used herein to illustrate the principles and implementation of the present invention. The above description of the embodiments is only intended to help understand the method of the present invention and its core concept. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A speech emotion recognition method, characterized in that the recognition method comprises:
acquiring a voice signal;
preprocessing the voice signal to obtain a preprocessed voice signal;
computing the spectrogram corresponding to the preprocessed voice signal;
computing the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths, and determining the segment length with the highest emotion recognition rate as the optimal segment length;
extracting acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network.
2. The speech emotion recognition method according to claim 1, characterized in that preprocessing the voice signal to obtain a preprocessed voice signal specifically comprises:
digitizing the voice signal to obtain a pulse voice signal;
sampling the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
quantizing the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
pre-emphasizing the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
framing and windowing the pre-emphasized voice signal to obtain the preprocessed voice signal.
3. The speech emotion recognition method according to claim 1, characterized in that computing the spectrogram corresponding to the preprocessed voice signal specifically comprises:
obtaining the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
dividing the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
calculating the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
windowing the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
applying a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
calculating the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i; shifting the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1};
obtaining a matrix R with [N_new/2]+1 rows and N columns;
mapping the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
4. The speech emotion recognition method according to claim 1, characterized in that classifying and recognizing the emotion of the voice signal from the acoustic features with a convolutional neural network specifically comprises:
processing the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
where b_j is a trainable bias term, k_ij is a convolution kernel, x_i denotes the i-th input spectrogram segment, and y_i denotes the two-dimensional feature corresponding to the i-th output spectrogram segment;
passing the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
wherein a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
5. A speech emotion recognition system, characterized in that the recognition system comprises:
a voice signal acquisition module, configured to acquire a voice signal;
a preprocessing module, configured to preprocess the voice signal to obtain a preprocessed voice signal;
a spectrogram computation module, configured to compute the spectrogram corresponding to the preprocessed voice signal;
an optimal segment length determination module, configured to compute the emotion recognition rate of the preprocessed voice signal for multiple different segment lengths and to determine the segment length with the highest emotion recognition rate as the optimal segment length;
an acoustic feature extraction module, configured to extract acoustic features of the voice signal from the spectrogram corresponding to the optimal segment length;
a convolutional neural network module, configured to classify and recognize the emotion of the voice signal from the acoustic features with a convolutional neural network.
6. The speech emotion recognition system according to claim 5, characterized in that the preprocessing module specifically comprises:
a digitization unit, configured to digitize the voice signal to obtain a pulse voice signal;
a sampling unit, configured to sample the pulse voice signal to obtain a pulse voice signal that is discrete in time and continuous in amplitude;
a quantization unit, configured to quantize the discrete-time, continuous-amplitude pulse voice signal to obtain a pulse voice signal that is discrete in both time and amplitude;
a pre-emphasis unit, configured to pre-emphasize the discrete-time, discrete-amplitude pulse voice signal to obtain a pre-emphasized voice signal;
a framing and windowing unit, configured to frame and window the pre-emphasized voice signal to obtain the preprocessed voice signal.
7. The speech emotion recognition system according to claim 5, characterized in that the spectrogram computation module specifically comprises:
a preprocessed voice signal information acquisition unit, configured to obtain the sampling frequency F_s, the sample data sequence S_g, and the segment length of the preprocessed voice signal;
a preprocessed voice signal segmentation unit, configured to divide the preprocessed voice signal into N segments according to the segment length and the window length N_new of the window function, to obtain N segments of voice signal;
a frame shift calculation unit, configured to calculate the frame shift N_sfgtft according to the segment length and the N segments of voice signal;
a windowing unit, configured to window the i-th frame of voice signal S_i to obtain the windowed voice signal S'_i,
S'_i = S_i × hanning(N_new), where i takes the values 1, 2, ..., N;
a Fourier transform unit, configured to apply a Fourier transform to the windowed voice signal S'_i to obtain the Fourier-transformed voice signal Z_i;
a spectrogram acquisition unit, configured to calculate the energy density function |Z_i|^2 of the i-th frame of voice signal S_i according to the phase θ_i of the Fourier-transformed voice signal Z_i, to shift the window function by the frame shift N_sfgtft to obtain the energy density function |Z_{i+1}|^2 of the (i+1)-th frame of voice signal S_{i+1},
to obtain a matrix R with [N_new/2]+1 rows and N columns,
and to map the matrix R to a grayscale image to obtain the spectrogram corresponding to the preprocessed voice signal.
8. The speech emotion recognition system according to claim 5, characterized in that the convolutional neural network module specifically comprises:
a convolutional layer unit, configured to process the spectrogram with the convolutional layer of the convolutional neural network, converting the three-dimensional spectrogram into N two-dimensional features;
a pooling layer unit, configured to pass the two-dimensional feature y_i corresponding to the i-th output spectrogram segment through a pooling layer to obtain a low-resolution acoustic feature y'_i;
a fully connected layer unit, in which a fully connected layer is arranged between the convolutional layer and the pooling layer, the fully connected layer contains an activation function, and the fully connected layer is used for data transfer between the convolutional layer and the pooling layer.
CN201910173689.0A 2019-02-28 2019-02-28 Speech emotion recognition method and system Pending CN109767790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910173689.0A CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910173689.0A CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Publications (1)

Publication Number Publication Date
CN109767790A true CN109767790A (en) 2019-05-17

Family

ID=66457882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910173689.0A Pending CN109767790A (en) 2019-02-28 2019-02-28 Speech emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN109767790A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 Method and apparatus for recognizing emotional speech
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 Speech emotion recognition system and speech emotion recognition method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method and apparatus for providing emotion speech recognition
US20130297297A1 * 2012-05-07 2013-11-07 Erhan Guven System and method for classification of emotion in human speech
CN104021373A * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN107705806A * 2017-08-22 2018-02-16 北京联合大学 Method for speech emotion recognition using spectrograms and deep convolutional neural networks
CN108010514A * 2017-11-20 2018-05-08 四川大学 Speech classification method based on deep neural networks
CN108597539A * 2018-02-09 2018-09-28 桂林电子科技大学 Speech emotion recognition method based on parameter transfer and spectrograms
CN108899049A * 2018-05-31 2018-11-27 中国地质大学(武汉) Speech emotion recognition method and system based on convolutional neural networks
CN109036465A * 2018-06-28 2018-12-18 南京邮电大学 Speech emotion recognition method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SATHIT PRASOMPHAN: "Improvement of speech emotion recognition with neural network classifier by using speech spectrogram", 2015 International Conference on Systems, Signals and Image Processing (IWSSIP) *
张若凡 et al.: "Speech emotion recognition method for the elderly based on spectrograms" (基于语谱图的老年人语音情感识别方法), Software Guide (软件导刊) *
王建伟: "Research and design of an emotion perception system based on deep learning" (基于深度学习的情绪感知系统的研究与设计), China Masters' Theses Full-text Database, Information Science and Technology (中国优秀硕士论文全文数据库 信息科技辑) *
田熙燕 et al.: "Speech emotion recognition based on spectrograms and convolutional neural networks" (基于语谱图和卷积神经网络的语音情感识别), Journal of Henan Institute of Science and Technology (河南科技学院学报) *
黄晨晨 et al.: "Research on speech emotion recognition based on deep belief networks" (基于深度信念网络的语音情感识别的研究), Journal of Computer Research and Development (计算机研究与发展) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 Method and apparatus for recognizing emotional speech
CN110415728B (en) * 2019-07-29 2022-04-01 内蒙古工业大学 Method and device for recognizing emotion voice
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 Speech emotion recognition system and speech emotion recognition method
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method

Similar Documents

Publication Publication Date Title
Alim et al. Some commonly used speech feature extraction algorithms
Stanton et al. Predicting expressive speaking style from text in end-to-end speech synthesis
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN115641543B (en) Multi-modal depression emotion recognition method and device
CN109767756A (en) Speech feature extraction algorithm based on dynamic-partition inverse discrete cosine transform cepstrum coefficients
CN108986798B (en) Processing method, device and the equipment of voice data
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN109065073A (en) Speech emotion recognition method based on a deep SVM network model
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN114722812A (en) Method and system for analyzing vulnerability of multi-mode deep learning model
Jie et al. Speech emotion recognition of teachers in classroom teaching
CN109767790A (en) Speech emotion recognition method and system
CN116612541A (en) Multi-mode emotion recognition method, device and storage medium
CN111724809A (en) Vocoder implementation method and device based on a variational autoencoder
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN113111786B (en) Underwater target identification method based on small sample training diagram convolutional network
Wang et al. Speech signal feature parameters extraction algorithm based on PCNN for isolated word recognition
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
Peng et al. Multi-scale model for mandarin tone recognition
CN116312617A (en) Voice conversion method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190517