CN113611326B - Real-time voice emotion recognition method and device - Google Patents

Real-time voice emotion recognition method and device

Info

Publication number
CN113611326B
CN113611326B CN202110987593.5A
Authority
CN
China
Prior art keywords
formants
amplitude
formant
syllable
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110987593.5A
Other languages
Chinese (zh)
Other versions
CN113611326A (en)
Inventor
刘振焘
韩梦婷
曹卫华
黄海
彭志昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110987593.5A priority Critical patent/CN113611326B/en
Publication of CN113611326A publication Critical patent/CN113611326A/en
Application granted granted Critical
Publication of CN113611326B publication Critical patent/CN113611326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/15 - the extracted parameters being formant information
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state

Abstract

The invention provides a real-time speech emotion recognition method and device. Formants are extracted from the Mel spectrum of a speech signal: the three highest-amplitude main formants are detected by comparing local amplitude maxima of the Mel filter bank, formants whose effect is insignificant are filtered out with a real-time noise gate, and the remaining formants are matched to well-fitting formants in adjacent frames. Syllables are then separated using the maxima and minima of the formant amplitude, silent pauses during speech are identified from the composite energy of the first three formants within a frame, syllable segments are detected, and a syllable-level emotion recognition method based on 15 hand-crafted features is applied. Real-time, accurate recognition of speech emotion is thereby achieved.

Description

Real-time voice emotion recognition method and device
Technical Field
The invention relates to the technical field of signal processing, in particular to a real-time voice emotion recognition method and device.
Background
Currently, the signals used in human emotion recognition research include speech, facial expression, physiological signals, body language, etc. Speech is the fastest and most natural means of communication between people, and research on speech emotion recognition is significant for promoting harmonious human-computer interaction.
Speech emotion recognition technology can be applied in many fields, such as medical care, education and business assistance. In medicine, speech emotion recognition is often used to recognize the mental state of patients and to help disabled people speak. In education, it can be used to analyze the key segments that interest students and to detect their emotional state and degree of fatigue in class, helping teachers grasp how well students understand and learn; it can also monitor the emotional state of remote-classroom users during learning so that the teaching emphasis or pace can be adjusted in time. In business assistance, a customer-service system can quickly identify the user's emotion and generate a call-center service-quality report, helping the customer-service center comprehensively improve service quality. In automobile driving, speech emotion recognition can infer the driver's emotional state from information such as voice and speaking rate and then give appropriate prompts, helping to prevent traffic accidents.
Speech emotion recognition is widely applied in many scenarios, and there is an urgent demand at home and abroad for emotionally intelligent human-computer interaction systems, which calls for breakthroughs in related fields such as affective computing and human-computer interaction.
The current mainstream approach to speech emotion recognition is based on deep neural networks: a Mel spectrogram is fed as a feature into a designed deep neural network for learning. This improves recognition accuracy but increases processing time, so the overall model suffers from excessive latency and has limited practicality for real-time recognition. Most speech emotion recognition methods focus on optimizing the extraction of the Mel spectrum and neglect the fundamental preprocessing problem of feature selection; moreover, the methods proposed in some studies rely mainly on textual meaning, understanding emotion through the meanings of sentences and words in the text, which further reduces the generalization ability of the system.
Disclosure of Invention
The invention aims to solve the technical problems that traditional speech emotion recognition methods focus on optimizing feature classification, which makes the computation heavy and real-time operation impossible, and that some methods understand emotion through the meaning of sentences and words in text, which lowers the generalization ability of the system.
In order to achieve the above purpose, the invention provides a real-time speech emotion recognition method based on syllable-level feature extraction and performs classification with a multi-layer perceptron, thereby simplifying computation and meeting the low-latency requirement of real-time speech emotion recognition. Segmenting formant-based syllable features helps to recognize speech better in cross-language and cross-corpus scenarios; the study also found that the influence of vowels on the system is far greater than that of consonants.
According to one aspect of the present invention, there is provided a real-time voice emotion recognition method including the steps of:
preprocessing an original voice signal, and extracting a Mel frequency spectrum;
extracting formants of each sampling frame from the mel frequency spectrum;
obtaining the first three formants of the amplitude value in each sampling frame as the first main formants by comparing the maximum value of the local amplitude value in the formants of each sampling frame;
denoising the first main formants according to the silencing threshold value of the real-time noise gate to obtain denoised formants;
calculating a matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants according to the matching index to obtain formants with original frame lengths;
obtaining the maximum value and the minimum value of the amplitude of the reconstructed resonance peak;
acquiring the first three formants of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
calculating the composite energy of the second main formant;
taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
counting the characteristics in each syllable;
according to the characteristics in each syllable, the emotion type probability of each syllable is obtained through a multi-layer perceptron;
and carrying out statement-level confidence aggregation on the emotion category probability of each syllable to obtain a statement-level emotion recognition result.
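For illustration, the claimed steps can be read as a processing pipeline; the Python sketch below mirrors only that control flow. Every helper name (extract_mel_spectrum, extract_formants, top3_formants_per_frame, noise_gate, match_and_reconstruct, segment_syllables, syllable_features, mlp_classify, aggregate_sentence_confidence) is a hypothetical placeholder for the corresponding step, not a function defined by the patent.

def recognize_emotion(raw_signal, sample_rate):
    # Hypothetical end-to-end flow mirroring the claimed steps; every helper is a placeholder.
    mel_spec = extract_mel_spectrum(raw_signal, sample_rate)      # preprocessing + Mel spectrum
    formants = extract_formants(mel_spec)                         # formants of each sampling frame
    primary = top3_formants_per_frame(formants)                   # first main formants
    gated = noise_gate(primary)                                   # real-time noise gate
    tracks = match_and_reconstruct(gated)                         # matching index + reconstruction
    syllables = segment_syllables(tracks)                         # amplitude extrema + composite energy
    probs = [mlp_classify(syllable_features(s)) for s in syllables]
    return aggregate_sentence_confidence(probs, syllables)        # sentence-level emotion result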
Further, the preprocessing step specifically includes:
pre-emphasis is carried out on the original voice signal to obtain a pre-emphasized signal;
carrying out framing windowing and Fourier transformation processing on the pre-emphasized signal to obtain a transformed signal;
processing the transformed signal by a Mel filter bank to obtain Mel frequency of each sampling frame;
and connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain Mel frequency spectrums of the voice signals.
Further, in the step of obtaining the first three formants of the amplitude values in each sampling frame as the first main formants by comparing the maximum values of the local amplitude values in the formants of each sampling frame, the calculation formula of the relevant parameters of the first main formants is as follows:
the calculation formula of the power amplitude of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the calculation formula of the Mel-scale frequency of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the bandwidth of formant h is calculated as:
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p(l) is the amplitude of the Mel filter bank, f_h is the Mel-scale frequency of the h-th highest-amplitude formant, and w_h is the bandwidth of formant h.
Further, the calculation formula of the silencing threshold of the real-time noise gate is as follows:
[formula shown as an image in the original publication]
wherein A_min is the silencing threshold of the real-time noise gate, and A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value.
Further, the specific calculation formula of the matching index between any two formants h_a, h_b of any two frames t_a, t_b is as follows:
[formula shown as an image in the original publication]
wherein t_b - t_a represents the time difference between the two frames, f_b - f_a represents the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the maximum power amplitude to the minimum power amplitude in the two frames, L_a represents the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants.
Further, the specific calculation formula for calculating the composite energy of the second main resonance peak is as follows:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant used to increase the energy weight of high-frequency formants; the composite energy is used to distinguish silent pauses during speech.
Further, the features within the syllable include at least 15.
Further, the original frame length is 25ms.
According to another aspect of the present invention, the present invention also provides a real-time voice emotion recognition device, including the following modules:
the Mel frequency spectrum extraction module is used for extracting Mel frequency spectrum after preprocessing the original voice signal;
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising the first main formants through a silencing threshold value of a real-time noise gate to obtain denoised formants;
the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index;
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
the syllable characteristic statistics module is used for counting the characteristics in each syllable;
the syllable emotion classification module is used for obtaining emotion type probability of each syllable through the multi-layer perceptron according to the characteristics in the syllables;
and the sentence-level confidence aggregation module is used for carrying out sentence-level confidence aggregation on the emotion category probability of each syllable to obtain a sentence-level emotion recognition result.
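As a rough sketch of how the listed modules might be composed, the class below wires five injected callables in the order described; the class name, argument names and signatures are illustrative assumptions, not part of the patented device.

class RealTimeSERDevice:
    """Illustrative wiring of the five modules listed above. Each module is injected as a
    callable; none of these names or signatures come from the patent itself."""

    def __init__(self, mel_module, formant_module, syllable_module,
                 classify_module, aggregate_module):
        self.mel_module = mel_module              # Mel spectrum extraction module
        self.formant_module = formant_module      # formant extraction, noise gate, matching/reconstruction
        self.syllable_module = syllable_module    # syllable segmentation and feature statistics
        self.classify_module = classify_module    # syllable emotion classification (multi-layer perceptron)
        self.aggregate_module = aggregate_module  # sentence-level confidence aggregation

    def recognize(self, signal, sample_rate):
        mel = self.mel_module(signal, sample_rate)
        tracks = self.formant_module(mel)
        syllables, features = self.syllable_module(tracks)
        probs = [self.classify_module(f) for f in features]
        return self.aggregate_module(probs, syllables)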
Further, the mel spectrum extraction module includes:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
The invention has the beneficial effects that:
(1) Compared with traditional speech emotion recognition methods, formant-based syllable-level features are used, and recognition does not depend on the semantics or order of words and sentences, so cross-corpus emotion recognition becomes possible and the overfitting problem of the system is alleviated to a certain extent.
(2) Traditional speech emotion recognition methods use the Mel spectrogram as the input feature and can obtain good results in an experimental environment, but cannot achieve real-time recognition because the computation is too heavy; the syllable-level statistical features used here greatly reduce the computation and make real-time recognition feasible.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a real-time speech emotion recognition method according to an embodiment of the present invention;
fig. 2 is a block diagram of a real-time speech emotion recognition device according to an embodiment of the present invention.
Detailed Description
For a clearer understanding of technical features, objects and effects of the present invention, a detailed description of embodiments of the present invention will be made with reference to the accompanying drawings.
Embodiment one:
referring to fig. 1, an embodiment of the present invention provides a real-time voice emotion recognition method, including the following steps:
step one: extraction of mel spectrum
After pre-emphasis is performed on the original speech signal, it is processed with a sliding Hamming window of 25 ms (recommended value) and a step of 15 ms; each sampling frame is then processed by a fast Fourier transform (FFT) and a Mel filter bank. The Mel frequency of each sampling frame is obtained with the following conversion between Mel frequency and Hertz-scale frequency:
m = 2595 · log10(1 + f / 700)
where m represents the mel frequency and f represents the hertz scale frequency.
The center frequency of the mel-filter bank can be expressed as:
[formula shown as an image in the original publication]
wherein f(l) denotes the center frequency of Mel filter l on the Hertz scale, m_l is the lower limit of Mel filter l on the Mel scale, and m_{l+1} is the lower limit of the adjacent Mel filter l+1 on the Mel scale.
The Mel spectrum of the speech signal is obtained by concatenating the Mel filter-bank outputs of several adjacent frames (frame length typically 25 ms).
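A minimal NumPy sketch of this preprocessing chain is given below, assuming a 40-band triangular Mel filter bank, a pre-emphasis coefficient of 0.97 and a power-of-two FFT size; apart from the 25 ms window and 15 ms step mentioned above, these defaults are illustrative choices rather than values taken from the patent.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrum(signal, sr, frame_ms=25, hop_ms=15, n_mels=40, preemph=0.97):
    signal = np.asarray(signal, dtype=float)
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])    # pre-emphasis
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_fft = 1 << (frame_len - 1).bit_length()                         # next power of two
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * window, n_fft)) ** 2 / n_fft  # per-frame power spectrum
    # Triangular Mel filter bank; boundaries are evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for l in range(1, n_mels + 1):
        fbank[l - 1, bins[l - 1]:bins[l]] = (np.arange(bins[l - 1], bins[l]) - bins[l - 1]) / max(bins[l] - bins[l - 1], 1)
        fbank[l - 1, bins[l]:bins[l + 1]] = (bins[l + 1] - np.arange(bins[l], bins[l + 1])) / max(bins[l + 1] - bins[l], 1)
    return power @ fbank.T                                            # shape: (n_frames, n_mels)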
Step two: formant extraction and registration
(1) The fundamental-frequency formant of a vowel typically has the highest amplitude among its frequencies, but because pronunciation differs from person to person, large deviations can occur, so denoising is required to detect voiced sounds in the Mel spectrum. If there is no distinct formant, i.e. no frequency band whose amplitude is clearly higher than the others, the three highest-amplitude bands in the filter bank are taken as the first three formants; the three highest-amplitude formants are detected by comparing the local amplitude maxima of the Mel filter bank.
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p_{h-1} is the power amplitude of the (h-1)-th highest-amplitude formant, and p(l) is the amplitude of the Mel filter bank. Likewise, the Mel-scale frequency of the h-th highest-amplitude formant can be calculated as:
[formula shown as an image in the original publication]
wherein f_h is the Mel-scale frequency of the h-th highest-amplitude formant.
The preceding formula gives the center frequency of the formant; however, a formant does not occupy an exact narrow band, so the width of a formant is an important criterion for measuring sound quality and reflects the sharpness of the sound. It is calculated from the frequency range between the neighbouring minima on either side of the peak as:
[formula shown as an image in the original publication]
wherein w_h is the bandwidth of formant h.
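The peak picking and bandwidth measurement described above might be sketched as follows; since the exact formulas are published only as images, band indices stand in for Mel-scale frequencies and the bandwidth is counted in filter-bank bands.

import numpy as np

def top3_formants(mel_frame):
    """Return (band index, amplitude, bandwidth in bands) for up to three highest local maxima."""
    p = np.asarray(mel_frame, dtype=float)
    peaks = [l for l in range(1, len(p) - 1) if p[l] >= p[l - 1] and p[l] > p[l + 1]]
    if not peaks:                                     # no distinct formant: fall back to the 3 largest bands
        peaks = list(np.argsort(p)[-3:])
    peaks = sorted(peaks, key=lambda l: p[l], reverse=True)[:3]
    results = []
    for l in peaks:
        lo = l
        while lo > 0 and p[lo - 1] < p[lo]:           # walk down to the neighbouring minimum on the left
            lo -= 1
        hi = l
        while hi < len(p) - 1 and p[hi + 1] < p[hi]:  # ... and on the right
            hi += 1
        results.append((int(l), float(p[l]), int(hi - lo)))
    return results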
(2) Real-time noise gate
The amplitude information of speech is an important factor in judging the speaker's emotion, and regularizing it can reduce the accuracy of the useful information. In the final stage of feature extraction, the invention uses long-term mean regularization to counteract differences in the environment or in the distance between the speaker and the microphone. Syllable breaks are detected from features such as pauses and formant interruptions, without resorting to computationally heavy methods such as deep learning. To this end, a noise-gating algorithm is proposed whose silencing threshold adapts dynamically to amplitude pulses, avoiding the traditional approach of determining syllable breaks by a constant sustained peak. The upper bound, or minimum voiced amplitude, A_min of the silencing threshold is reset from a decaying pulse amplitude value as
[formula shown as an image in the original publication]
wherein A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value. The attenuation rate is set so that A_imp would fall by its own value within 0.5 seconds, at which point further attenuation stops. Furthermore, only peaks within the 100-1200 Hz band-pass range are allowed to set A_min. The noise gate spectrally filters out formants whose effect would otherwise be insignificant.
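A possible reading of this noise gate is sketched below. Because the A_min formula is published only as an image, the silencing threshold is assumed here to be a fixed fraction of the decaying peak tracker A_imp, and the frame, decay and ratio settings are illustrative.

class NoiseGate:
    """Sketch of the decaying-peak noise gate. The exact A_min formula is published only
    as an image, so the silencing threshold is assumed to be ratio * A_imp."""

    def __init__(self, frame_s=0.015, decay_s=0.5, ratio=0.25, band_hz=(100.0, 1200.0)):
        self.a_imp = 0.0                                # decaying highest-peak amplitude
        self.step = 0.0                                 # linear decay per frame, set when a peak arrives
        self.frames_per_decay = decay_s / frame_s
        self.ratio = ratio                              # assumed A_min / A_imp ratio (not from the patent)
        self.band = band_hz

    def update(self, formants):
        """formants: iterable of (freq_hz, amplitude); returns those above the gate."""
        for f, a in formants:
            if self.band[0] <= f <= self.band[1] and a > self.a_imp:
                self.a_imp = a                          # new in-band peak raises the tracker
                self.step = a / self.frames_per_decay   # so it would empty in roughly 0.5 s
        a_min = self.ratio * self.a_imp                 # silencing threshold (assumed form)
        kept = [(f, a) for f, a in formants if a >= a_min]
        self.a_imp = max(self.a_imp - self.step, 0.0)   # linear decay, never below zero
        return kept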
(3) Formant registration
The speech signal is decomposed into formant features, from which the formants that match well with those in adjacent frames can be selected. This allows the system to filter out portions without formant features along the time axis of the Mel spectrum. Formants lasting longer than one 25 ms frame must be connected across several adjacent frames so that they are restored to their original duration: the sampling window creates a Mel spectrum of fixed length, but the time span of a formant may exceed one frame, so spectral fragments need to be spliced together to reconstruct the formant at its original length. The splicing can be performed by spectral clustering or any agglomerative clustering method within a temporal neighbourhood; the task is accomplished by computing a matching index that measures how close the formants of a newly arrived frame are to those of the nearest frame. Each formant in the newly arrived frame is assigned a label from (h_0, h_1, h_2, ..., h_hmax) according to the maximum matching-index value. The matching index between any two formants (h_a, h_b) of any two frames (t_a, t_b) is
[formula shown as an image in the original publication]
wherein t_b - t_a represents the time difference between the two frames, f_b - f_a represents the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the maximum power amplitude to the minimum power amplitude in the two frames, L_a represents the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants; at the typical frame length of 25 ms, K_t = 10 and K_f = 10.
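A sketch of the registration step is shown below. The patent's matching-index expression is published only as an image, so the score here merely combines the named ingredients (time and frequency Manhattan distance scaled by K_t and K_f, the amplitude ratio, and the track length L_a) in an assumed way; the greedy attachment and the threshold value are also illustrative choices.

def match_index(t_a, f_a, p_a, L_a, t_b, f_b, p_b, K_t=10.0, K_f=10.0):
    # Assumed scoring: closer in time/frequency, similar amplitude and longer
    # existing tracks give a higher score.
    dist = abs(t_b - t_a) / K_t + abs(f_b - f_a) / K_f        # Manhattan distance term
    ratio = max(p_a, p_b) / max(min(p_a, p_b), 1e-9)          # amplitude ratio >= 1
    return (1.0 + L_a) / ((1.0 + dist) * ratio)

def register(tracks, frame_t, new_formants, threshold=0.1):
    """Greedily attach each new formant (freq, power) to the best-matching track,
    or start a new track when no score reaches the threshold."""
    for f, p in new_formants:
        scores = [(match_index(tr[-1][0], tr[-1][1], tr[-1][2], len(tr), frame_t, f, p), i)
                  for i, tr in enumerate(tracks)]
        best, idx = max(scores, default=(0.0, -1))
        if best >= threshold:
            tracks[idx].append((frame_t, f, p))
        else:
            tracks.append([(frame_t, f, p)])
    return tracks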
Step three: syllable segmentation and statistics
(1) Syllable segmentation
There is no clear boundary between two parts of a word or between two adjacent words. To address this, the invention proposes a technique that separates syllables using the maxima and minima of the formant amplitudes. Silences and pauses during speech serve as obvious syllable segmentation markers; otherwise syllable separation rules are difficult to specify, so syllable segmentation is realized by setting an amplitude threshold on the speech frames to adjust the threshold of the syllable-ending parameter.
In most cases, multiple formants contribute to the energy within a frame, so the energy at the formant centers does not translate directly into a directly measurable loudness. At the same time, the overall energy within the frame may contain noise that needs to be removed. The invention therefore calculates a composite energy that mainly takes the energies of the first 3 major formants of each frame into account:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant that increases the energy weight of high-frequency formants, used because at equal amplitude a high frequency carries more energy than a low one. Considering only the first three formants prevents low-energy formants from being mistaken for energy produced when the speaker voices. The composite energy is mainly used to distinguish silences and pauses during speech.
During the initialization phase, the composite energy e_c is tracked. When a rising edge is detected, a plateau is entered, i.e. e_c no longer changes substantially, and detection continues while e_c stays above 50% of its peak. When the peak falls, the maxima and minima of the formant amplitudes in the current frame are continuously detected and the syllable threshold is recorded; when e_c drops below the lower threshold, the syllable or speech segment is cut off. Longer pauses divide the audio into phrases or words, while shorter pauses at least 2 frames long (<50 ms) are used for syllable segmentation.
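The segmentation logic described above might be sketched as the following state machine over the composite energy; since the exact composite-energy formula is published only as an image, its form here is an assumption, and the threshold ratios and the emphasis constant are illustrative.

import numpy as np

def composite_energy(formants_per_frame, h_e=1e-3):
    """Assumed form of the composite energy: the top-3 formant energies per frame, each
    weighted upward with frequency through the emphasis constant H_E (here h_e)."""
    e_c = np.zeros(len(formants_per_frame))
    for t, formants in enumerate(formants_per_frame):
        for freq, power in formants[:3]:
            e_c[t] += power * (1.0 + h_e * freq)
    return e_c

def segment_syllables(e_c, hi_ratio=0.5, lo_ratio=0.2, min_pause_frames=2):
    """Rising edge above the high threshold opens a syllable; a drop below the low
    threshold sustained for at least min_pause_frames frames closes it."""
    hi, lo = hi_ratio * e_c.max(), lo_ratio * e_c.max()
    syllables, start, quiet = [], None, 0
    for t, e in enumerate(e_c):
        if start is None and e >= hi:
            start, quiet = t, 0
        elif start is not None:
            quiet = quiet + 1 if e < lo else 0
            if quiet >= min_pause_frames:
                syllables.append((start, t - quiet + 1))   # (first frame, end frame exclusive)
                start, quiet = None, 0
    if start is not None:
        syllables.append((start, len(e_c)))
    return syllables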
(2) Syllable statistical features
Syllables come in different shapes and sizes. The spectral representation of a syllable depends more on lexical content than on emotional content, so the invention extracts statistical rather than sequential features and proposes a syllable-level emotion recognition method comprising 15 features that estimate the timbre, tone, stress and accent of the syllable within formant h rather than the overall pitch. There are essentially five types of features: formant frequencies, tones, powers, stresses, and signal-to-noise ratios. Formant frequencies, powers and spans are computed at syllable level; each feature is measured over the interval t_s0 ≤ t < t_sn, where t_s0 and t_sn are the first and last frames of the syllable within the whole sentence. Each feature is computed as follows:
1) Frequencies of the first 3 main formants:
Freq A: mean μ(f_h(t)) of the frequency of formant h over the interval t_s0 ≤ t < t_sn;
Freq B: standard deviation σ(f_h(t)) of the frequency of formant h over the interval t_s0 ≤ t < t_sn;
Freq C: mean bandwidth μ(w_h(t)) of formant h over the interval t_s0 ≤ t < t_sn;
2) Tones of the first 3 main formants:
Accent A: rising pitch, the increase of the mean formant frequency along the syllable length:
[formula shown as an image in the original publication]
wherein X_h,rise denotes the rising pitch, f_h(t) - f_h(t-1) is the difference between adjacent formant frequencies, and rise_h,t takes the value 0 or 1, satisfying:
[formula shown as an image in the original publication]
Accent B: falling pitch, the decrease of the mean formant frequency along the syllable length:
[formula shown as an image in the original publication]
wherein X_h,fall denotes the falling pitch and fall_h,t takes the value 0 or 1, satisfying:
[formula shown as an image in the original publication]
3) Powers of the first 3 main formants:
Power A: average power of the syllable;
Power B: standard deviation of the syllable power, a measure of voice quality;
Power C: energy of the syllable relative to the per-frame variable A_imp;
Power D: energy of all voiced frames relative to the per-frame variable A_imp;
4) Stresses of the first 3 main formants:
Stress A: count of the formant power peaks along the syllable time axis;
Stress B: mean of the formant power peaks (μ_peaks);
Stress C: standard deviation of the formant power peaks (σ_peaks);
Stress D: ratio of μ_peaks to the average power.
5) Signal-to-noise ratio of the first 3 main formants:
SNR A: ratio of the energy of the first three detected formants to the total energy of the spectrum;
SNR B: ratio of the maximum sound amplitude to the minimum limit of the voiced formants.
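A sketch of how a representative subset of these 15 statistics could be computed per syllable follows; the array layout and the simplified rise/fall and stress definitions are assumptions, since several of the exact formulas are published only as images.

import numpy as np

def syllable_features(freqs, powers, bandwidths, total_energy):
    """freqs, powers, bandwidths: arrays of shape (n_frames, 3) for the top-3 formants of
    one syllable; total_energy: per-frame total spectral energy. Only a representative
    subset of the 15 statistics is sketched."""
    feats = {}
    for h in range(3):
        f, p, w = freqs[:, h], powers[:, h], bandwidths[:, h]
        feats[f"freq_mean_{h}"] = float(f.mean())                 # Freq A
        feats[f"freq_std_{h}"] = float(f.std())                   # Freq B
        feats[f"bw_mean_{h}"] = float(w.mean())                   # Freq C
        steps = np.diff(f)
        feats[f"accent_rise_{h}"] = float((steps > 0).mean()) if steps.size else 0.0  # Accent A (simplified)
        feats[f"accent_fall_{h}"] = float((steps < 0).mean()) if steps.size else 0.0  # Accent B (simplified)
        interior = p[1:-1]
        peaks = (interior > p[:-2]) & (interior > p[2:])
        feats[f"stress_count_{h}"] = int(peaks.sum())             # Stress A
    feats["power_mean"] = float(powers.mean())                    # Power A
    feats["power_std"] = float(powers.std())                      # Power B
    feats["snr"] = float(powers.sum() / max(np.sum(total_energy), 1e-9))  # SNR A
    return feats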
Step four: syllable emotion classification
Tests with different classifiers and parameter settings showed that complex classifiers perform no better than simple ones, so the invention classifies with the simplest form of multi-layer perceptron, containing only one hidden layer; the loss function used for training is an absolute cross-entropy loss, which can be expressed as:
[formula shown as an image in the original publication]
wherein N_v is the total number of emotion labels, v is the index of an emotion label, the model's v-th scalar output (shown as an image in the original) is the Softmax classification probability, and y_v is the corresponding target value; the neural network model predicts a class probability for each emotion on a single syllable.
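For reference, a single-hidden-layer perceptron of this kind can be set up as follows with scikit-learn; note that MLPClassifier minimises the standard log-loss (cross-entropy) rather than the "absolute" cross-entropy variant named above, and the hidden size of 64 is an illustrative choice rather than a value from the patent.

from sklearn.neural_network import MLPClassifier

# Single hidden layer, as described above; all hyperparameters here are illustrative.
clf = MLPClassifier(hidden_layer_sizes=(64,), activation="relu", max_iter=500)

# X: array of shape (n_syllables, 15) built from the statistics above,
# y: emotion labels per syllable (placeholder names).
# clf.fit(X, y)
# syllable_probs = clf.predict_proba(X_new)   # per-syllable emotion class probabilities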
Step five: statement-level confidence aggregation
When only one emotion label is needed for the whole sentence, estimation is performed through weighted summation
[formula shown as an image in the original publication]
wherein the weight is the square root of the duration T_s of syllable s, P_s,c is the predicted probability of class c, the mean of the predicted probabilities over all classes of the syllable is denoted by the symbol shown as an image in the original, C_u,c is the class-c confidence of utterance u, and N_s is the total number of syllables in the utterance.
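The aggregation can be sketched as below; since the exact weighted-summation formula is published only as an image, this assumes a sqrt(duration)-weighted sum of each syllable's class probability minus its per-syllable mean, averaged over the N_s syllables.

import numpy as np

def aggregate_sentence(probs, durations):
    """probs: (N_s, n_classes) per-syllable class probabilities; durations: (N_s,) in seconds.
    Returns the index of the winning class and the per-class confidences (assumed form)."""
    probs = np.asarray(probs, dtype=float)
    w = np.sqrt(np.asarray(durations, dtype=float))
    centered = probs - probs.mean(axis=1, keepdims=True)     # P_{s,c} minus the syllable's class mean
    conf = (w[:, None] * centered).sum(axis=0) / len(probs)  # C_{u,c}
    return int(np.argmax(conf)), conf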
Embodiment two:
referring to fig. 2, the invention also provides a real-time voice emotion recognition device, which mainly comprises the following modules:
the system comprises a Mel frequency spectrum extraction module 1, a formant extraction module and registration module 2, a syllable segmentation and statistics module 3, a syllable emotion classification module 4 and a statement level confidence aggregation module 5;
in some embodiments, the mel spectrum extraction module 1 specifically includes:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
In some embodiments, the formant extraction and registration module 2 specifically includes:
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising the first main formants through a silencing threshold value of a real-time noise gate to obtain denoised formants;
and the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index.
In some embodiments, the syllable segmentation and statistics module 3 specifically includes:
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
and the syllable characteristic statistics module is used for counting the characteristics in each syllable.
The syllable emotion classification module 4 is used for obtaining emotion type probability of each syllable through a multi-layer perceptron according to the characteristics in the syllables;
the sentence-level confidence aggregation module 5 is configured to obtain a sentence-level emotion recognition result by performing sentence-level confidence aggregation on the emotion category probability of each syllable.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for description and do not represent the advantages or disadvantages of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The terms first, second, third, etc. do not denote any order and are to be interpreted merely as labels.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. The real-time voice emotion recognition method is characterized by comprising the following steps of:
preprocessing an original voice signal, and extracting a Mel frequency spectrum;
extracting formants of each sampling frame from the mel frequency spectrum;
obtaining the first three formants of the amplitude value in each sampling frame as the first main formants by comparing the maximum value of the local amplitude value in the formants of each sampling frame;
denoising each first main formant according to the silencing threshold value of the real-time noise gate to obtain denoised formants;
calculating a matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants according to the matching index to obtain formants with original frame lengths;
obtaining the maximum value and the minimum value of the amplitude of the reconstructed resonance peak;
acquiring the first three formants of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
calculating the composite energy of the second main formant;
taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
counting the characteristics in each syllable;
according to the characteristics in each syllable, the emotion type probability of each syllable is obtained through a multi-layer perceptron;
and carrying out statement-level confidence aggregation on the emotion category probability of each syllable to obtain a statement-level emotion recognition result.
2. The method for recognizing real-time speech emotion according to claim 1, wherein said preprocessing step comprises:
pre-emphasis is carried out on the original voice signal to obtain a pre-emphasized signal;
carrying out framing windowing and Fourier transformation processing on the pre-emphasized signal to obtain a transformed signal;
processing the transformed signal by a Mel filter bank to obtain Mel frequency of each sampling frame;
and connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain Mel frequency spectrums of the voice signals.
3. The method for recognizing real-time speech emotion according to claim 1, wherein said step of obtaining the first three formants of each sample frame by comparing local amplitude maxima among formants of each sample frame as a first main formant comprises the following calculation formulas of relevant parameters of said first main formant:
the calculation formula of the power amplitude of the h highest amplitude resonance peak is as follows:
[formula shown as an image in the original publication]
the calculation formula of the Mel-scale frequency of the h-th highest-amplitude formant is as follows:
[formula shown as an image in the original publication]
the bandwidth of formant h is calculated as:
[formula shown as an image in the original publication]
wherein p_h is the power amplitude of the h-th highest-amplitude formant, p_{h-1} is the power amplitude of the (h-1)-th highest-amplitude formant, p(l) is the amplitude of the Mel filter bank, f_h is the Mel-scale frequency of the h-th highest-amplitude formant, and w_h is the bandwidth of formant h.
4. The method for recognizing real-time voice emotion according to claim 1, wherein the calculation formula of the silencing threshold of the real-time noise gate is:
[formula shown as an image in the original publication]
wherein A_min is the silencing threshold of the real-time noise gate, and A_imp is the attenuated highest-peak amplitude in the Mel spectrum, continuously updated from new peaks in the current incoming frame that are higher than the current attenuated value.
5. A real-time speech emotion recognition method as defined in claim 1, characterized in that the specific calculation formula of the matching index between any two formants h_a, h_b of any two frames t_a, t_b is as follows:
[formula shown as an image in the original publication]
wherein I_a,b denotes the matching index, t_b - t_a denotes the time difference between the two frames, f_b - f_a denotes the frequency difference between the two frames, the ratio term (shown as an image in the original) is the ratio of the minimum power amplitude to the maximum power amplitude in the two frames, L_a denotes the number of formants that have already been connected to other formants, and K_t and K_f are Manhattan distance constants that depend on the horizontal and vertical unit distances of adjacent formants.
6. The method for recognizing real-time speech emotion according to claim 1, wherein the calculation of the composite energy of the second main formants is as follows:
[formula shown as an image in the original publication]
wherein e_c(t) is the composite energy at time coordinate t, e_h(t) is the energy of the h-th formant, f_h(t) is the frequency of the h-th formant, and H_E is an emphasis constant used to increase the energy weight of high-frequency formants; the composite energy is used to distinguish silent pauses during speech.
7. The method of claim 1, wherein the features in the syllable include at least 15.
8. The method of claim 1, wherein the original frame length is 25ms.
9. A real-time speech emotion recognition device, comprising the following modules:
the Mel frequency spectrum extraction module is used for extracting Mel frequency spectrum after preprocessing the original voice signal;
the formant extraction module is used for extracting formants of each sampling frame from the Mel frequency spectrum;
the first main formant acquisition module is used for acquiring formants of the first three amplitude values in each sampling frame as first main formants by comparing the maximum value of the local amplitude values in the formants of each sampling frame;
the real-time noise gate module is used for denoising each first main formant through a silencing threshold value of the real-time noise gate to obtain denoised formants;
the formant matching reconstruction module is used for calculating the matching index between any two formants of any two frames in the denoised formants, and reconstructing the formants with the original frame length according to the matching index;
the amplitude maximum value acquisition module is used for acquiring the maximum value and the minimum value of the reconstructed formant amplitude;
the second main formant acquisition module is used for acquiring formants of the front three of the amplitude values in each sampling frame in the reconstructed formants as second main formants;
a composite energy calculation module for calculating composite energy of the second main formants;
the voice segmentation module is used for taking the maximum value and the minimum value of the reconstructed formant amplitude as obvious silent pause syllable segmentation standards, and carrying out voice segmentation according to the change of the composite energy to obtain a plurality of syllables;
the syllable characteristic statistics module is used for counting the characteristics in each syllable;
the syllable emotion classification module is used for obtaining emotion type probability of each syllable through the multi-layer perceptron according to the characteristics in each syllable;
and the sentence-level confidence aggregation module is used for carrying out sentence-level confidence aggregation on the emotion category probability of each syllable to obtain a sentence-level emotion recognition result.
10. The apparatus of claim 9, wherein the mel-frequency spectrum extraction module comprises:
the pre-emphasis module is used for carrying out pre-emphasis processing on the original voice signal to obtain a pre-emphasized signal;
the framing windowing and Fourier transformation module is used for carrying out framing windowing and Fourier transformation on the pre-emphasized signal to obtain a transformed signal;
the Mel filter module is used for processing the transformed signal through a Mel filter bank to obtain Mel frequency of each sampling frame;
and the adjacent frame connecting module is used for connecting the Mel filter groups of a plurality of adjacent sampling frames to obtain the Mel frequency spectrum of the voice signal.
CN202110987593.5A 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device Active CN113611326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110987593.5A CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110987593.5A CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Publications (2)

Publication Number Publication Date
CN113611326A CN113611326A (en) 2021-11-05
CN113611326B true CN113611326B (en) 2023-05-12

Family

ID=78342097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110987593.5A Active CN113611326B (en) 2021-08-26 2021-08-26 Real-time voice emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN113611326B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1256932A2 (en) * 2001-05-11 2002-11-13 Sony France S.A. Method and apparatus for synthesising an emotion conveyed on a sound
CN101419800A (en) * 2008-11-25 2009-04-29 浙江大学 Emotional speaker recognition method based on frequency spectrum translation
CN101930735A (en) * 2009-06-23 2010-12-29 富士通株式会社 Speech emotion recognition equipment and speech emotion recognition method
CN102142253A (en) * 2010-01-29 2011-08-03 富士通株式会社 Voice emotion identification equipment and method
CN102655003A (en) * 2012-03-21 2012-09-05 北京航空航天大学 Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN104077598A (en) * 2014-06-27 2014-10-01 电子科技大学 Emotion recognition method based on speech fuzzy clustering
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium
CN110827857A (en) * 2019-11-28 2020-02-21 哈尔滨工程大学 Speech emotion recognition method based on spectral features and ELM
CN112885378A (en) * 2021-01-22 2021-06-01 中国地质大学(武汉) Speech emotion recognition method and device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101346758B (en) * 2006-06-23 2011-07-27 松下电器产业株式会社 Emotion recognizer
FR3062945B1 (en) * 2017-02-13 2019-04-05 Centre National De La Recherche Scientifique METHOD AND APPARATUS FOR DYNAMICALLY CHANGING THE VOICE STAMP BY FREQUENCY SHIFTING THE FORMS OF A SPECTRAL ENVELOPE


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition with mixed features based on Gammatone filters; Yu Lin; Jiang Nan; Electro-Optic Technology Application (03); full text *
Implementation of a formant vocoder based on the speech spectrum; Wang Kunchi; Jiang Hua; Modern Electronics Technique (21); full text *
Research on extraction and recognition of emotional feature parameters in speech signals; Xu Xiaojun; Journal of Nanjing Institute of Mechanical Technology (03); full text *
A survey of speech emotion feature extraction and dimensionality-reduction methods; Liu Zhentao; Chinese Journal of Computers; full text *

Also Published As

Publication number Publication date
CN113611326A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Ancilin et al. Improved speech emotion recognition with Mel frequency magnitude coefficient
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
Hansen et al. Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification
Fook et al. Comparison of speech parameterization techniques for the classification of speech disfluencies
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
Natarajan et al. Segmentation of continuous speech into consonant and vowel units using formant frequencies
Eray et al. An application of speech recognition with support vector machines
Elminir et al. Evaluation of different feature extraction techniques for continuous speech recognition
Warohma et al. Identification of regional dialects using mel frequency cepstral coefficients (MFCCs) and neural network
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Deekshitha et al. Broad phoneme classification using signal based features
Nancy et al. Audio based emotion recognition using Mel frequency Cepstral coefficient and support vector machine
Unnibhavi et al. LPC based speech recognition for Kannada vowels
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Ouhnini et al. Towards an automatic speech-to-text transcription system: amazigh language
Kumari et al. A new gender detection algorithm considering the non-stationarity of speech signal
CN113611326B (en) Real-time voice emotion recognition method and device
Mahesha et al. Classification of speech dysfluencies using speech parameterization techniques and multiclass SVM
Mazumder et al. Feature extraction techniques for speech processing: A review
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
Razak et al. Towards automatic recognition of emotion in speech
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Sas et al. Gender recognition using neural networks and ASR techniques
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Aggarwal et al. Parameterization techniques for automatic speech recognition system

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant