CN113611326B - Real-time voice emotion recognition method and device - Google Patents

Real-time voice emotion recognition method and device

Info

Publication number
CN113611326B
CN113611326B CN202110987593.5A
Authority
CN
China
Prior art keywords
formants
amplitude
formant
syllable
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110987593.5A
Other languages
Chinese (zh)
Other versions
CN113611326A (en)
Inventor
刘振焘
韩梦婷
曹卫华
黄海
彭志昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110987593.5A priority Critical patent/CN113611326B/en
Publication of CN113611326A publication Critical patent/CN113611326A/en
Application granted granted Critical
Publication of CN113611326B publication Critical patent/CN113611326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/15 - the extracted parameters being formant information
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state

Abstract

The invention provides a real-time speech emotion recognition method and device. Formants are extracted from the Mel spectrum of a speech signal: the three highest-amplitude main formants are detected by comparing local amplitude maxima of the Mel filter bank, formants whose effect is insignificant are filtered out with a real-time noise gate, and the remaining formants are matched to well-fitting formants in adjacent frames. Syllables are then separated using the maxima and minima of the formant amplitude, silent pauses during speech are identified from the composite energy of the first three formants within a frame, syllable segments are detected, and a syllable-level emotion recognition method based on 15 hand-crafted features is applied. Real-time, accurate recognition of speech emotion is thereby achieved.

Description

Real-time voice emotion recognition method and device
Technical Field
The invention relates to the technical field of signal processing, in particular to a real-time voice emotion recognition method and device.
Background
Currently, the signals used in human emotion recognition research include speech, facial expression, physiological signals, body language, etc. Speech is the fastest and most natural means of communication between people, and research on speech emotion recognition is significant for promoting harmonious human-computer interaction.
Speech emotion recognition technology can be applied in many fields, such as medical care, education and business assistance. In medicine, speech emotion recognition is often used to recognize the mental state of patients and to help disabled people speak. In education, it can be used to analyze the key segments that interest students and to detect their emotional state and degree of fatigue in class, helping teachers grasp how well students understand and learn; it can also monitor the emotional state of remote-classroom users during learning so that the teaching emphasis or pace can be adjusted in time. In business assistance, a customer-service system can quickly identify the user's emotion and generate a call-center service-quality report, helping the customer-service center comprehensively improve service quality. In automobile driving, speech emotion recognition can infer the driver's emotional state from information such as voice and speaking rate and then give appropriate prompts, helping to prevent traffic accidents.
Speech emotion recognition is widely applied in many scenarios, and there is an urgent demand at home and abroad for emotionally intelligent human-computer interaction systems, which calls for breakthroughs in related fields such as affective computing and human-computer interaction.
The current mainstream approach to speech emotion recognition is based on deep neural networks: a Mel spectrogram is fed as a feature into a designed deep neural network for learning. This improves recognition accuracy but increases processing time, so the overall model suffers from excessive latency and has limited practicality for real-time recognition. Most speech emotion recognition methods focus on optimizing the extraction of the Mel spectrum and neglect the fundamental preprocessing problem of feature selection; moreover, the methods proposed in some studies rely mainly on textual meaning, understanding emotion through the meanings of sentences and words in the text, which further reduces the generalization ability of the system.
Disclosure of Invention
The invention aims to solve the technical problems that traditional speech emotion recognition methods focus on optimizing feature classification, which makes the computation heavy and real-time operation impossible, and that some methods understand emotion through the meaning of sentences and words in text, which lowers the generalization ability of the system.
In order to achieve the above purpose, the invention provides a real-time speech emotion recognition method based on syllable-level feature extraction and performs classification with a multi-layer perceptron, thereby simplifying computation and meeting the low-latency requirement of real-time speech emotion recognition. Segmenting formant-based syllable features helps to recognize speech better in cross-language and cross-corpus scenarios; the study also found that the influence of vowels on the system is far greater than that of consonants.
According to one aspect of the present invention, there is provided a real-time voice emotion recognition method including the steps of:
preprocessing an original voice signal, and extracting a Mel frequency spectrum;
extracting formants of each sampling frame from the mel frequency spectrum;
obtaining the first three formants of the amplitude value in each sampling frame as the first main formants by comparing the maximum value of the local amplitude value in the formants of each sampling frame;
denoising the first main formants according to the silencing threshold value of the real-time noise gate to obtain denoised formants;
calculating a matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants according to the matching index to obtain formants with original frame lengths;
obtaining the maximum value and the minimum value of the amplitude of the reconstructed resonance peak;
acquiring the first three formants of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
calculating the composite energy of the second main formant;
taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
counting the characteristics in each syllable;
according to the characteristics in each syllable, the emotion type probability of each syllable is obtained through a multi-layer perceptron;
and carrying out statement-level confidence aggregation on the emotion category probability of each syllable to obtain a statement-level emotion recognition result.
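For illustration, the claimed steps can be read as a processing pipeline; the Python sketch below mirrors only that control flow. Every helper name (extract_mel_spectrum, extract_formants, top3_formants_per_frame, noise_gate, match_and_reconstruct, segment_syllables, syllable_features, mlp_classify, aggregate_sentence_confidence) is a hypothetical placeholder for the corresponding step, not a function defined by the patent.

def recognize_emotion(raw_signal, sample_rate):
    # Hypothetical end-to-end flow mirroring the claimed steps; every helper is a placeholder.
    mel_spec = extract_mel_spectrum(raw_signal, sample_rate)      # preprocessing + Mel spectrum
    formants = extract_formants(mel_spec)                         # formants of each sampling frame
    primary = top3_formants_per_frame(formants)                   # first main formants
    gated = noise_gate(primary)                                   # real-time noise gate
    tracks = match_and_reconstruct(gated)                         # matching index + reconstruction
    syllables = segment_syllables(tracks)                         # amplitude extrema + composite energy
    probs = [mlp_classify(syllable_features(s)) for s in syllables]
    return aggregate_sentence_confidence(probs, syllables)        # sentence-level emotion result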
Further, the preprocessing step specifically includes:
pre-emphasis is carried out on the original voice signal to obtain a pre-emphasized signal;
carrying out framing windowing and Fourier transformation processing on the pre-emphasized signal to obtain a transformed signal;
processing the transformed signal by a Mel filter bank to obtain Mel frequency of each sampling frame;
and connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain Mel frequency spectrums of the voice signals.
Further, in the step of obtaining the first three formants of the amplitude values in each sampling frame as the first main formants by comparing the maximum values of the local amplitude values in the formants of each sampling frame, the calculation formula of the relevant parameters of the first main formants is as follows:
the calculation formula of the power amplitude of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the calculation formula of the Mel-scale frequency of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the bandwidth of formant h is calculated as:
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p(l) is the amplitude of the Mel filter bank, f_h is the Mel-scale frequency of the h-th highest-amplitude formant, and w_h is the bandwidth of formant h.
Further, the calculation formula of the silencing threshold of the real-time noise gate is as follows:
[formula shown as an image in the original publication]
wherein A_min is the silencing threshold of the real-time noise gate, and A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value.
Further, the specific calculation formula of the matching index between any two formants h_a, h_b of any two frames t_a, t_b is as follows:
[formula shown as an image in the original publication]
wherein t_b - t_a represents the time difference between the two frames, f_b - f_a represents the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the maximum power amplitude to the minimum power amplitude in the two frames, L_a represents the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants.
Further, the specific calculation formula for calculating the composite energy of the second main resonance peak is as follows:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant used to increase the energy weight of high-frequency formants; the composite energy is used to distinguish silent pauses during speech.
Further, the features within the syllable include at least 15.
Further, the original frame length is 25ms.
According to another aspect of the present invention, the present invention also provides a real-time voice emotion recognition device, including the following modules:
the Mel frequency spectrum extraction module is used for extracting Mel frequency spectrum after preprocessing the original voice signal;
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising the first main formants through a silencing threshold value of a real-time noise gate to obtain denoised formants;
the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index;
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
the syllable characteristic statistics module is used for counting the characteristics in each syllable;
the syllable emotion classification module is used for obtaining emotion type probability of each syllable through the multi-layer perceptron according to the characteristics in the syllables;
and the sentence-level confidence aggregation module is used for carrying out sentence-level confidence aggregation on the emotion category probability of each syllable to obtain a sentence-level emotion recognition result.
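As a rough sketch of how the listed modules might be composed, the class below wires five injected callables in the order described; the class name, argument names and signatures are illustrative assumptions, not part of the patented device.

class RealTimeSERDevice:
    """Illustrative wiring of the five modules listed above. Each module is injected as a
    callable; none of these names or signatures come from the patent itself."""

    def __init__(self, mel_module, formant_module, syllable_module,
                 classify_module, aggregate_module):
        self.mel_module = mel_module              # Mel spectrum extraction module
        self.formant_module = formant_module      # formant extraction, noise gate, matching/reconstruction
        self.syllable_module = syllable_module    # syllable segmentation and feature statistics
        self.classify_module = classify_module    # syllable emotion classification (multi-layer perceptron)
        self.aggregate_module = aggregate_module  # sentence-level confidence aggregation

    def recognize(self, signal, sample_rate):
        mel = self.mel_module(signal, sample_rate)
        tracks = self.formant_module(mel)
        syllables, features = self.syllable_module(tracks)
        probs = [self.classify_module(f) for f in features]
        return self.aggregate_module(probs, syllables)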
Further, the mel spectrum extraction module includes:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
The invention has the beneficial effects that:
(1) Compared with traditional speech emotion recognition methods, formant-based syllable-level features are used, and recognition does not depend on the semantics or order of words and sentences, so cross-corpus emotion recognition becomes possible and the overfitting problem of the system is alleviated to a certain extent.
(2) Traditional speech emotion recognition methods use the Mel spectrogram as the input feature and can obtain good results in an experimental environment, but cannot achieve real-time recognition because the computation is too heavy; the syllable-level statistical features used here greatly reduce the computation and make real-time recognition feasible.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a real-time speech emotion recognition method according to an embodiment of the present invention;
fig. 2 is a block diagram of a real-time speech emotion recognition device according to an embodiment of the present invention.
Detailed Description
For a clearer understanding of technical features, objects and effects of the present invention, a detailed description of embodiments of the present invention will be made with reference to the accompanying drawings.
Embodiment one:
referring to fig. 1, an embodiment of the present invention provides a real-time voice emotion recognition method, including the following steps:
step one: extraction of mel spectrum
After pre-emphasis is performed on the original speech signal, it is processed with a sliding Hamming window of 25 ms (recommended value) and a step of 15 ms; each sampling frame is then processed by a fast Fourier transform (FFT) and a Mel filter bank. The Mel frequency of each sampling frame is obtained with the following conversion between Mel frequency and Hertz-scale frequency:
m = 2595 · log10(1 + f / 700)
where m represents the mel frequency and f represents the hertz scale frequency.
The center frequency of the mel-filter bank can be expressed as:
[formula shown as an image in the original publication]
wherein f(l) denotes the center frequency of Mel filter l on the Hertz scale, m_l is the lower limit of Mel filter l on the Mel scale, and m_{l+1} is the lower limit of the adjacent Mel filter l+1 on the Mel scale.
The Mel spectrum of the speech signal is obtained by concatenating the Mel filter-bank outputs of several adjacent frames (frame length typically 25 ms).
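A minimal NumPy sketch of this preprocessing chain is given below, assuming a 40-band triangular Mel filter bank, a pre-emphasis coefficient of 0.97 and a power-of-two FFT size; apart from the 25 ms window and 15 ms step mentioned above, these defaults are illustrative choices rather than values taken from the patent.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrum(signal, sr, frame_ms=25, hop_ms=15, n_mels=40, preemph=0.97):
    signal = np.asarray(signal, dtype=float)
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])    # pre-emphasis
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_fft = 1 << (frame_len - 1).bit_length()                         # next power of two
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * window, n_fft)) ** 2 / n_fft  # per-frame power spectrum
    # Triangular Mel filter bank; boundaries are evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for l in range(1, n_mels + 1):
        fbank[l - 1, bins[l - 1]:bins[l]] = (np.arange(bins[l - 1], bins[l]) - bins[l - 1]) / max(bins[l] - bins[l - 1], 1)
        fbank[l - 1, bins[l]:bins[l + 1]] = (bins[l + 1] - np.arange(bins[l], bins[l + 1])) / max(bins[l + 1] - bins[l], 1)
    return power @ fbank.T                                            # shape: (n_frames, n_mels)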
Step two: formant extraction and registration
(1) The fundamental-frequency formant of a vowel typically has the highest amplitude among its frequencies, but because pronunciation differs from person to person, large deviations can occur, so denoising is required to detect voiced sounds in the Mel spectrum. If there is no distinct formant, i.e. no frequency band whose amplitude is clearly higher than the others, the three highest-amplitude bands in the filter bank are taken as the first three formants; the three highest-amplitude formants are detected by comparing the local amplitude maxima of the Mel filter bank.
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p_{h-1} is the power amplitude of the (h-1)-th highest-amplitude formant, and p(l) is the amplitude of the Mel filter bank. Likewise, the Mel-scale frequency of the h-th highest-amplitude formant can be calculated as:
[formula shown as an image in the original publication]
wherein f_h is the Mel-scale frequency of the h-th highest-amplitude formant.
The preceding formula gives the center frequency of the formant; however, a formant does not occupy an exact narrow band, so the width of a formant is an important criterion for measuring sound quality and reflects the sharpness of the sound. It is calculated from the frequency range between the neighbouring minima on either side of the peak as:
[formula shown as an image in the original publication]
wherein w_h is the bandwidth of formant h.
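The peak picking and bandwidth measurement described above might be sketched as follows; since the exact formulas are published only as images, band indices stand in for Mel-scale frequencies and the bandwidth is counted in filter-bank bands.

import numpy as np

def top3_formants(mel_frame):
    """Return (band index, amplitude, bandwidth in bands) for up to three highest local maxima."""
    p = np.asarray(mel_frame, dtype=float)
    peaks = [l for l in range(1, len(p) - 1) if p[l] >= p[l - 1] and p[l] > p[l + 1]]
    if not peaks:                                     # no distinct formant: fall back to the 3 largest bands
        peaks = list(np.argsort(p)[-3:])
    peaks = sorted(peaks, key=lambda l: p[l], reverse=True)[:3]
    results = []
    for l in peaks:
        lo = l
        while lo > 0 and p[lo - 1] < p[lo]:           # walk down to the neighbouring minimum on the left
            lo -= 1
        hi = l
        while hi < len(p) - 1 and p[hi + 1] < p[hi]:  # ... and on the right
            hi += 1
        results.append((int(l), float(p[l]), int(hi - lo)))
    return results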
(2) Real-time noise gate
The amplitude information of speech is an important factor in judging the speaker's emotion, and regularizing it can reduce the accuracy of the useful information. In the final stage of feature extraction, the invention uses long-term mean regularization to counteract differences in the environment or in the distance between the speaker and the microphone. Syllable breaks are detected from features such as pauses and formant interruptions, without resorting to computationally heavy methods such as deep learning. To this end, a noise-gating algorithm is proposed whose silencing threshold adapts dynamically to amplitude pulses, avoiding the traditional approach of determining syllable breaks by a constant sustained peak. The upper bound, or minimum voiced amplitude, A_min of the silencing threshold is reset from a decaying pulse amplitude value as
[formula shown as an image in the original publication]
wherein A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value. The attenuation rate is set so that A_imp would fall by its own value within 0.5 seconds, at which point further attenuation stops. Furthermore, only peaks within the 100-1200 Hz band-pass range are allowed to set A_min. The noise gate spectrally filters out formants whose effect would otherwise be insignificant.
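A possible reading of this noise gate is sketched below. Because the A_min formula is published only as an image, the silencing threshold is assumed here to be a fixed fraction of the decaying peak tracker A_imp, and the frame, decay and ratio settings are illustrative.

class NoiseGate:
    """Sketch of the decaying-peak noise gate. The exact A_min formula is published only
    as an image, so the silencing threshold is assumed to be ratio * A_imp."""

    def __init__(self, frame_s=0.015, decay_s=0.5, ratio=0.25, band_hz=(100.0, 1200.0)):
        self.a_imp = 0.0                                # decaying highest-peak amplitude
        self.step = 0.0                                 # linear decay per frame, set when a peak arrives
        self.frames_per_decay = decay_s / frame_s
        self.ratio = ratio                              # assumed A_min / A_imp ratio (not from the patent)
        self.band = band_hz

    def update(self, formants):
        """formants: iterable of (freq_hz, amplitude); returns those above the gate."""
        for f, a in formants:
            if self.band[0] <= f <= self.band[1] and a > self.a_imp:
                self.a_imp = a                          # new in-band peak raises the tracker
                self.step = a / self.frames_per_decay   # so it would empty in roughly 0.5 s
        a_min = self.ratio * self.a_imp                 # silencing threshold (assumed form)
        kept = [(f, a) for f, a in formants if a >= a_min]
        self.a_imp = max(self.a_imp - self.step, 0.0)   # linear decay, never below zero
        return kept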
(3) Formant registration
The speech signal is decomposed into formant features, from which the formants that match well with those in adjacent frames can be selected. This allows the system to filter out portions without formant features along the time axis of the Mel spectrum. Formants lasting longer than one 25 ms frame must be connected across several adjacent frames so that they are restored to their original duration: the sampling window creates a Mel spectrum of fixed length, but the time span of a formant may exceed one frame, so spectral fragments need to be spliced together to reconstruct the formant at its original length. The splicing can be performed by spectral clustering or any agglomerative clustering method within a temporal neighbourhood; the task is accomplished by computing a matching index that measures how close the formants of a newly arrived frame are to those of the nearest frame. Each formant in the newly arrived frame is assigned a label from (h_0, h_1, h_2, ..., h_hmax) according to the maximum matching-index value. The matching index between any two formants (h_a, h_b) of any two frames (t_a, t_b) is
[formula shown as an image in the original publication]
wherein t_b - t_a represents the time difference between the two frames, f_b - f_a represents the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the maximum power amplitude to the minimum power amplitude in the two frames, L_a represents the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants; at the typical frame length of 25 ms, K_t = 10 and K_f = 10.
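A sketch of the registration step is shown below. The patent's matching-index expression is published only as an image, so the score here merely combines the named ingredients (time and frequency Manhattan distance scaled by K_t and K_f, the amplitude ratio, and the track length L_a) in an assumed way; the greedy attachment and the threshold value are also illustrative choices.

def match_index(t_a, f_a, p_a, L_a, t_b, f_b, p_b, K_t=10.0, K_f=10.0):
    # Assumed scoring: closer in time/frequency, similar amplitude and longer
    # existing tracks give a higher score.
    dist = abs(t_b - t_a) / K_t + abs(f_b - f_a) / K_f        # Manhattan distance term
    ratio = max(p_a, p_b) / max(min(p_a, p_b), 1e-9)          # amplitude ratio >= 1
    return (1.0 + L_a) / ((1.0 + dist) * ratio)

def register(tracks, frame_t, new_formants, threshold=0.1):
    """Greedily attach each new formant (freq, power) to the best-matching track,
    or start a new track when no score reaches the threshold."""
    for f, p in new_formants:
        scores = [(match_index(tr[-1][0], tr[-1][1], tr[-1][2], len(tr), frame_t, f, p), i)
                  for i, tr in enumerate(tracks)]
        best, idx = max(scores, default=(0.0, -1))
        if best >= threshold:
            tracks[idx].append((frame_t, f, p))
        else:
            tracks.append([(frame_t, f, p)])
    return tracks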
Step three: syllable segmentation and statistics
(1) Syllable segmentation
There is no clear boundary between two parts of a word or between two adjacent words. To address this, the invention proposes a technique that separates syllables using the maxima and minima of the formant amplitudes. Silences and pauses during speech serve as obvious syllable segmentation markers; otherwise syllable separation rules are difficult to specify, so syllable segmentation is realized by setting an amplitude threshold on the speech frames to adjust the threshold of the syllable-ending parameter.
In most cases, multiple formants contribute to the energy within a frame, so the energy at the formant centers does not translate directly into a directly measurable loudness. At the same time, the overall energy within the frame may contain noise that needs to be removed. The invention therefore calculates a composite energy that mainly takes the energies of the first 3 major formants of each frame into account:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant that increases the energy weight of high-frequency formants, used because at equal amplitude a high frequency carries more energy than a low one. Considering only the first three formants prevents low-energy formants from being mistaken for energy produced when the speaker voices. The composite energy is mainly used to distinguish silences and pauses during speech.
During the initialization phase, the composite energy e_c is tracked. When a rising edge is detected, a plateau is entered, i.e. e_c no longer changes substantially, and detection continues while e_c stays above 50% of its peak. When the peak falls, the maxima and minima of the formant amplitudes in the current frame are continuously detected and the syllable threshold is recorded; when e_c drops below the lower threshold, the syllable or speech segment is cut off. Longer pauses divide the audio into phrases or words, while shorter pauses at least 2 frames long (<50 ms) are used for syllable segmentation.
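The segmentation logic described above might be sketched as the following state machine over the composite energy; since the exact composite-energy formula is published only as an image, its form here is an assumption, and the threshold ratios and the emphasis constant are illustrative.

import numpy as np

def composite_energy(formants_per_frame, h_e=1e-3):
    """Assumed form of the composite energy: the top-3 formant energies per frame, each
    weighted upward with frequency through the emphasis constant H_E (here h_e)."""
    e_c = np.zeros(len(formants_per_frame))
    for t, formants in enumerate(formants_per_frame):
        for freq, power in formants[:3]:
            e_c[t] += power * (1.0 + h_e * freq)
    return e_c

def segment_syllables(e_c, hi_ratio=0.5, lo_ratio=0.2, min_pause_frames=2):
    """Rising edge above the high threshold opens a syllable; a drop below the low
    threshold sustained for at least min_pause_frames frames closes it."""
    hi, lo = hi_ratio * e_c.max(), lo_ratio * e_c.max()
    syllables, start, quiet = [], None, 0
    for t, e in enumerate(e_c):
        if start is None and e >= hi:
            start, quiet = t, 0
        elif start is not None:
            quiet = quiet + 1 if e < lo else 0
            if quiet >= min_pause_frames:
                syllables.append((start, t - quiet + 1))   # (first frame, end frame exclusive)
                start, quiet = None, 0
    if start is not None:
        syllables.append((start, len(e_c)))
    return syllables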
(2) Syllable statistical features
Syllables come in different shapes and sizes. The spectral representation of a syllable depends more on lexical content than on emotional content, so the invention extracts statistical rather than sequential features and proposes a syllable-level emotion recognition method comprising 15 features that estimate the timbre, tone, stress and accent of the syllable within formant h rather than the overall pitch. There are essentially five types of features: formant frequencies, tones, powers, stresses, and signal-to-noise ratios. Formant frequencies, powers and spans are computed at syllable level; each feature is measured over the interval t_s0 ≤ t < t_sn, where t_s0 and t_sn are the first and last frames of the syllable within the whole sentence. Each feature is computed as follows:
1) Frequencies of the first 3 main formants:
Freq A: mean μ(f_h(t)) of the frequency of formant h over the interval t_s0 ≤ t < t_sn;
Freq B: standard deviation σ(f_h(t)) of the frequency of formant h over the interval t_s0 ≤ t < t_sn;
Freq C: mean bandwidth μ(w_h(t)) of formant h over the interval t_s0 ≤ t < t_sn;
2) Tones of the first 3 main formants:
Accent A: rising pitch, the increase of the mean formant frequency along the syllable length:
[formula shown as an image in the original publication]
wherein X_h,rise denotes the rising pitch, f_h(t) - f_h(t-1) is the difference between adjacent formant frequencies, and rise_h,t takes the value 0 or 1, satisfying:
[formula shown as an image in the original publication]
Accent B: falling pitch, the decrease of the mean formant frequency along the syllable length:
[formula shown as an image in the original publication]
wherein X_h,fall denotes the falling pitch and fall_h,t takes the value 0 or 1, satisfying:
[formula shown as an image in the original publication]
3) Powers of the first 3 main formants:
Power A: average power of the syllable;
Power B: standard deviation of the syllable power, a measure of voice quality;
Power C: energy of the syllable relative to the per-frame variable A_imp;
Power D: energy of all voiced frames relative to the per-frame variable A_imp;
4) Stresses of the first 3 main formants:
Stress A: count of the formant power peaks along the syllable time axis;
Stress B: mean of the formant power peaks (μ_peaks);
Stress C: standard deviation of the formant power peaks (σ_peaks);
Stress D: ratio of μ_peaks to the average power.
5) Signal-to-noise ratio of the first 3 main formants:
SNR A: ratio of the energy of the first three detected formants to the total energy of the spectrum;
SNR B: ratio of the maximum sound amplitude to the minimum limit of the voiced formants.
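A sketch of how a representative subset of these 15 statistics could be computed per syllable follows; the array layout and the simplified rise/fall and stress definitions are assumptions, since several of the exact formulas are published only as images.

import numpy as np

def syllable_features(freqs, powers, bandwidths, total_energy):
    """freqs, powers, bandwidths: arrays of shape (n_frames, 3) for the top-3 formants of
    one syllable; total_energy: per-frame total spectral energy. Only a representative
    subset of the 15 statistics is sketched."""
    feats = {}
    for h in range(3):
        f, p, w = freqs[:, h], powers[:, h], bandwidths[:, h]
        feats[f"freq_mean_{h}"] = float(f.mean())                 # Freq A
        feats[f"freq_std_{h}"] = float(f.std())                   # Freq B
        feats[f"bw_mean_{h}"] = float(w.mean())                   # Freq C
        steps = np.diff(f)
        feats[f"accent_rise_{h}"] = float((steps > 0).mean()) if steps.size else 0.0  # Accent A (simplified)
        feats[f"accent_fall_{h}"] = float((steps < 0).mean()) if steps.size else 0.0  # Accent B (simplified)
        interior = p[1:-1]
        peaks = (interior > p[:-2]) & (interior > p[2:])
        feats[f"stress_count_{h}"] = int(peaks.sum())             # Stress A
    feats["power_mean"] = float(powers.mean())                    # Power A
    feats["power_std"] = float(powers.std())                      # Power B
    feats["snr"] = float(powers.sum() / max(np.sum(total_energy), 1e-9))  # SNR A
    return feats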
Step four: syllable emotion classification
Tests with different classifiers and parameter settings showed that complex classifiers perform no better than simple ones, so the invention classifies with the simplest form of multi-layer perceptron, containing only one hidden layer; the loss function used for training is an absolute cross-entropy loss, which can be expressed as:
[formula shown as an image in the original publication]
wherein N_v is the total number of emotion labels, v is the index of an emotion label, the model's v-th scalar output (shown as an image in the original) is the Softmax classification probability, and y_v is the corresponding target value; the neural network model predicts a class probability for each emotion on a single syllable.
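For reference, a single-hidden-layer perceptron of this kind can be set up as follows with scikit-learn; note that MLPClassifier minimises the standard log-loss (cross-entropy) rather than the "absolute" cross-entropy variant named above, and the hidden size of 64 is an illustrative choice rather than a value from the patent.

from sklearn.neural_network import MLPClassifier

# Single hidden layer, as described above; all hyperparameters here are illustrative.
clf = MLPClassifier(hidden_layer_sizes=(64,), activation="relu", max_iter=500)

# X: array of shape (n_syllables, 15) built from the statistics above,
# y: emotion labels per syllable (placeholder names).
# clf.fit(X, y)
# syllable_probs = clf.predict_proba(X_new)   # per-syllable emotion class probabilities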
Step five: statement-level confidence aggregation
When only one emotion label is needed for the whole sentence, estimation is performed through weighted summation
[formula shown as an image in the original publication]
wherein the weight is the square root of the duration T_s of syllable s, P_s,c is the predicted probability of class c, the mean of the predicted probabilities over all classes of the syllable is denoted by the symbol shown as an image in the original, C_u,c is the class-c confidence of utterance u, and N_s is the total number of syllables in the utterance.
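The aggregation can be sketched as below; since the exact weighted-summation formula is published only as an image, this assumes a sqrt(duration)-weighted sum of each syllable's class probability minus its per-syllable mean, averaged over the N_s syllables.

import numpy as np

def aggregate_sentence(probs, durations):
    """probs: (N_s, n_classes) per-syllable class probabilities; durations: (N_s,) in seconds.
    Returns the index of the winning class and the per-class confidences (assumed form)."""
    probs = np.asarray(probs, dtype=float)
    w = np.sqrt(np.asarray(durations, dtype=float))
    centered = probs - probs.mean(axis=1, keepdims=True)     # P_{s,c} minus the syllable's class mean
    conf = (w[:, None] * centered).sum(axis=0) / len(probs)  # C_{u,c}
    return int(np.argmax(conf)), conf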
Embodiment two:
referring to fig. 2, the invention also provides a real-time voice emotion recognition device, which mainly comprises the following modules:
the system comprises a Mel frequency spectrum extraction module 1, a formant extraction module and registration module 2, a syllable segmentation and statistics module 3, a syllable emotion classification module 4 and a statement level confidence aggregation module 5;
in some embodiments, the mel spectrum extraction module 1 specifically includes:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
In some embodiments, the formant extraction and registration module 2 specifically includes:
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising the first main formants through a silencing threshold value of a real-time noise gate to obtain denoised formants;
and the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index.
In some embodiments, the syllable segmentation and statistics module 3 specifically includes:
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
and the syllable characteristic statistics module is used for counting the characteristics in each syllable.
The syllable emotion classification module 4 is used for obtaining emotion type probability of each syllable through a multi-layer perceptron according to the characteristics in the syllables;
the sentence-level confidence aggregation module 5 is configured to obtain a sentence-level emotion recognition result by performing sentence-level confidence aggregation on the emotion category probability of each syllable.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for description and do not represent the advantages or disadvantages of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The terms first, second, third, etc. do not denote any order and are to be interpreted merely as labels.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. The real-time voice emotion recognition method is characterized by comprising the following steps of:
preprocessing an original voice signal, and extracting a Mel frequency spectrum;
extracting formants of each sampling frame from the mel frequency spectrum;
obtaining the first three formants of the amplitude value in each sampling frame as the first main formants by comparing the maximum value of the local amplitude value in the formants of each sampling frame;
denoising each first main formant according to the silencing threshold value of the real-time noise gate to obtain denoised formants;
calculating a matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants according to the matching index to obtain formants with original frame lengths;
obtaining the maximum value and the minimum value of the amplitude of the reconstructed resonance peak;
acquiring the first three formants of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
calculating the composite energy of the second main formant;
taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
counting the characteristics in each syllable;
according to the characteristics in each syllable, the emotion type probability of each syllable is obtained through a multi-layer perceptron;
and carrying out statement-level confidence aggregation on the emotion category probability of each syllable to obtain a statement-level emotion recognition result.
2. The method for recognizing real-time speech emotion according to claim 1, wherein said preprocessing step comprises:
pre-emphasis is carried out on the original voice signal to obtain a pre-emphasized signal;
carrying out framing windowing and Fourier transformation processing on the pre-emphasized signal to obtain a transformed signal;
processing the transformed signal by a Mel filter bank to obtain Mel frequency of each sampling frame;
and connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain Mel frequency spectrums of the voice signals.
3. The method for recognizing real-time speech emotion according to claim 1, wherein said step of obtaining the first three formants of each sample frame by comparing local amplitude maxima among formants of each sample frame as a first main formant comprises the following calculation formulas of relevant parameters of said first main formant:
the calculation formula of the power amplitude of the h highest amplitude resonance peak is as follows:
[formula shown as an image in the original publication]
the calculation formula of the Mel-scale frequency of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the bandwidth of formant h is calculated as:
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p_{h-1} is the power amplitude of the (h-1)-th highest-amplitude formant, p(l) is the amplitude of the Mel filter bank, f_h is the Mel-scale frequency of the h-th highest-amplitude formant, and w_h is the bandwidth of formant h.
4. The method for recognizing real-time voice emotion according to claim 1, wherein the calculation formula of the silencing threshold of the real-time noise gate is:
[formula shown as an image in the original publication]
wherein A_min is the silencing threshold of the real-time noise gate, and A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value.
5. A real-time speech emotion recognition method as defined in claim 1, characterized in that the specific calculation formula of the matching index between any two formants h_a, h_b of any two frames t_a, t_b is as follows:
[formula shown as an image in the original publication]
wherein I_a,b denotes the matching index, t_b - t_a denotes the time difference between the two frames, f_b - f_a denotes the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the minimum power amplitude to the maximum power amplitude in the two frames, L_a denotes the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants.
6. The method for recognizing real-time speech emotion according to claim 1, wherein the calculation of the composite energy of the second main formants is as follows:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant used to increase the energy weight of high-frequency formants; the composite energy is used to distinguish silent pauses during speech.
7. The method of claim 1, wherein the features in the syllable include at least 15.
8. The method of claim 1, wherein the original frame length is 25ms.
9. A real-time speech emotion recognition device, comprising the following modules:
the Mel frequency spectrum extraction module is used for extracting Mel frequency spectrum after preprocessing the original voice signal;
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising each first main formant through a silencing threshold value of the real-time noise gate to obtain denoised formants;
the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index;
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
the syllable characteristic statistics module is used for counting the characteristics in each syllable;
the syllable emotion classification module is used for obtaining emotion type probability of each syllable through the multi-layer perceptron according to the characteristics in each syllable;
and the sentence-level confidence aggregation module is used for carrying out sentence-level confidence aggregation on the emotion category probability of each syllable to obtain a sentence-level emotion recognition result.
10. The apparatus of claim 9, wherein the mel-frequency spectrum extraction module comprises:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
CN202110987593.5A 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device Active CN113611326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110987593.5A CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110987593.5A CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Publications (2)

Publication Number Publication Date
CN113611326A CN113611326A (en) 2021-11-05
CN113611326B true CN113611326B (en) 2023-05-12

Family

ID=78342097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110987593.5A Active CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN113611326B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1256932A2 (en) * 2001-05-11 2002-11-13 Sony France S.A. Method and apparatus for synthesising an emotion conveyed on a sound
CN101419800A (en) * 2008-11-25 2009-04-29 浙江大学 Emotional speaker recognition method based on frequency spectrum translation
CN101930735A (en) * 2009-06-23 2010-12-29 富士通株式会社 Speech emotion recognition equipment and speech emotion recognition method
CN102142253A (en) * 2010-01-29 2011-08-03 富士通株式会社 Voice emotion identification equipment and method
CN102655003A (en) * 2012-03-21 2012-09-05 北京航空航天大学 Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN104077598A (en) * 2014-06-27 2014-10-01 电子科技大学 Emotion recognition method based on speech fuzzy clustering
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium
CN110827857A (en) * 2019-11-28 2020-02-21 哈尔滨工程大学 Speech emotion recognition method based on spectral features and ELM
CN112885378A (en) * 2021-01-22 2021-06-01 中国地质大学(武汉) Speech emotion recognition method and device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101346758B (en) * 2006-06-23 2011-07-27 松下电器产业株式会社 Emotion recognizer
FR3062945B1 (en) * 2017-02-13 2019-04-05 Centre National De La Recherche Scientifique METHOD AND APPARATUS FOR DYNAMICALLY CHANGING THE VOICE STAMP BY FREQUENCY SHIFTING THE FORMS OF A SPECTRAL ENVELOPE


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition with mixed features based on Gammatone filters; Yu Lin; Jiang Nan; Electro-Optic Technology Application (03); full text *
Implementation of a formant vocoder based on the speech spectrum; Wang Kunchi; Jiang Hua; Modern Electronics Technique (21); full text *
Research on extraction and recognition of emotional feature parameters in speech signals; Xu Xiaojun; Journal of Nanjing Institute of Mechanical Technology (03); full text *
A survey of speech emotion feature extraction and dimensionality-reduction methods; Liu Zhentao; Chinese Journal of Computers; full text *

Also Published As

Publication number Publication date
CN113611326A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Ancilin et al. Improved speech emotion recognition with Mel frequency magnitude coefficient
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
Hansen et al. Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification
Fook et al. Comparison of speech parameterization techniques for the classification of speech disfluencies
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
Natarajan et al. Segmentation of continuous speech into consonant and vowel units using formant frequencies
Eray et al. An application of speech recognition with support vector machines
Elminir et al. Evaluation of different feature extraction techniques for continuous speech recognition
Warohma et al. Identification of regional dialects using mel frequency cepstral coefficients (MFCCs) and neural network
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Deekshitha et al. Broad phoneme classification using signal based features
Nancy et al. Audio based emotion recognition using Mel frequency Cepstral coefficient and support vector machine
Unnibhavi et al. LPC based speech recognition for Kannada vowels
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Ouhnini et al. Towards an automatic speech-to-text transcription system: amazigh language
Kumari et al. A new gender detection algorithm considering the non-stationarity of speech signal
CN113611326B (en) Real-time voice emotion recognition method and device
Mahesha et al. Classification of speech dysfluencies using speech parameterization techniques and multiclass SVM
Mazumder et al. Feature extraction techniques for speech processing: A review
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
Razak et al. Towards automatic recognition of emotion in speech
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Sas et al. Gender recognition using neural networks and ASR techniques
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Aggarwal et al. Parameterization techniques for automatic speech recognition system

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant