LU600124B1 - Voice emotion recognition method and system - Google Patents

Voice emotion recognition method and system

Info

Publication number
LU600124B1
Authority
LU
Luxembourg
Prior art keywords
voice
emotion
signal
voice signal
model
Prior art date
Application number
LU600124A
Other languages
German (de)
Inventor
Hexing Wang
Yanjun Chen
Original Assignee
Univ Northeastern Qinhuangdao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Northeastern Qinhuangdao filed Critical Univ Northeastern Qinhuangdao
Priority to LU600124A priority Critical patent/LU600124B1/en
Application granted granted Critical
Publication of LU600124B1 publication Critical patent/LU600124B1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a voice emotion recognition method and system, where the method includes the following steps: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and obtaining the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information.

Description

LU600124
DESCRIPTION
VOICE EMOTION RECOGNITION METHOD AND SYSTEM
TECHNICAL FIELD
The invention relates to the technical field of voice emotion recognition, in particular to a voice emotion recognition method and system.
BACKGROUND
A key technical problem in voice emotion recognition is how to extract feature parameters that accurately reflect changes of emotional state from voice signals. Traditional feature extraction methods, such as fundamental frequency and energy, are strongly influenced by individual differences between speakers, and their description of voice emotion is not comprehensive enough. Finding new voice emotion features with strong robustness and high discriminative power is therefore very important. This requires digging deep into the physiological and acoustic mechanisms closely related to emotion in the process of voice production, and considering the influence of the speaker's gender, age, cultural background and other factors. The extraction of such new emotional features requires knowledge across many disciplines, such as voice recognition, the psychology of emotion and acoustic phonetics, which poses a new challenge to existing feature extraction frameworks.
SUMMARY
The purpose of the invention is to provide a voice emotion recognition method and system, which fully consider the influence of individual differences of speakers and cultural factors on emotional expression, and significantly improve the accuracy and adaptability of voice emotion recognition.
In order to achieve the above objectives, the invention provides the following scheme:
A voice emotion recognition method includes the following steps: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and acquiring the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain an emotional recognition result.
Optionally, obtaining the preprocessed voice signal includes: converting the voice signal into a digital signal, and framing the digital signal; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining a voice signal from which the mute part is removed and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
Optionally, obtaining the acoustic parameter set includes: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters;
extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameters, the energy parameters and the formant parameters.
Optionally, inputting the acoustic parameter set into a corresponding voice emotion model and obtaining the emotion state set includes: the convolution layer of the deep neural network extracts the local features of the acoustic parameter set, the recurrent layer captures the temporal relationships of the acoustic parameter set, and the fully connected layer maps the local features to different emotion categories to obtain the probability distribution of emotion categories.
Optionally, the method further includes inputting the acoustic parameter set into a universal voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, where the training set of the universal voice emotion model includes voice data of different people with different emotion states.
Optionally, obtaining the cultural background information of the speaker and modifying the emotional state set according to the cultural background information includes: acquiring the cultural background information according to the attribute information; according to the cultural background information, obtaining a corresponding emotional expression feature vector from an emotional expression knowledge base; and carrying out weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector to obtain the emotion recognition result.
Optionally, the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector includes: performing matching calculation on the emotional state set and the emotional expression feature vector, and judging whether correction is needed through a similarity threshold;
if so, performing weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector, and processing the corrected emotion state set through a softmax normalization function to obtain the emotion recognition result.
On the other hand, the invention also provides a voice emotion recognition system, which includes a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; the voice emotion model selection module is used for acquiring the attribute information of a speaker and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set includes voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotional state set according to the cultural background information to obtain an emotional recognition result.
Optionally, obtaining the preprocessed voice signal includes: converting the voice signal into a digital signal, and framing the digital signal; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate;
removing the mute part according to the voice frames, obtaining a voice signal from which the mute part is removed and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
Optionally, obtaining the acoustic parameter set includes: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; acquiring the acoustic parameter set according to the fundamental frequency parameter, the energy parameter and the formant parameter.
The invention has the following beneficial effects. The disclosed voice emotion recognition method first preprocesses an input voice signal and extracts acoustic features; it then selects a corresponding model from a pre-built voice emotion model library for emotion judgment according to attribute information of the speaker, such as gender and age; finally, combined with the speaker's cultural background information, the emotional state is corrected to obtain the final recognition result.
Through multi-level and multi-dimensional analysis, the invention fully considers the influence of individual differences and cultural factors of speakers on emotional expression, and significantly improves the accuracy and adaptability of voice emotion recognition. This method can be widely used in intelligent customer service, emotional computing and other scenes that need to accurately identify voice emotions.
BRIEF DESCRIPTION OF THE FIGURES
In order to explain the embodiments of the invention or the technical scheme in the prior art more clearly, the drawings needed in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and other drawings can be obtained from these drawings by those of ordinary skill in the art without creative work.
Fig. 1 is a flowchart of a voice emotion recognition method according to an embodiment of the invention.
DESCRIPTION OF THE INVENTION
In the following, the technical scheme in the embodiment of the invention will be clearly and completely described with reference to the attached drawings.
Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative labor fall within the scope of protection of the invention.
In order to make the above objects, features and advantages of the invention more obvious and easy to understand, the invention will be further described in detail with the attached drawings and specific embodiments.
Embodiment 1: as shown in Fig. 1, this embodiment provides a voice emotion recognition method, which includes: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a
deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and obtaining the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain the emotional recognition result.
Further, obtaining the preprocessed voice signal includes: converting voice signals into digital signals, and framing the digital signals; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining the voice signal from which the mute part is removed and performing frequency domain transformation to obtain the frequency domain signal; and denoising the frequency domain signal and inverse-transforming it to obtain the preprocessed voice signal.
Specifically, voice signals are acquired and converted into digital signals for processing. The digital signal is framed, with each frame 20-30 ms long and a frame shift of 10-15 ms. The short-time energy and short-time zero-crossing rate of each frame signal are calculated, and whether the frame is a voice frame is judged according to the energy and zero-crossing rate. Endpoint detection is carried out on the voice frames, the starting and ending points of the voice are determined, and the mute parts before and after the voice are removed. A frequency domain transform, such as the FFT, is performed on the voice segment to obtain frequency domain signals. The frequency domain signal is denoised by spectral subtraction: the power spectrum of the noise is estimated, and the estimated noise spectrum is subtracted from the voice spectrum to obtain the denoised voice spectrum. The denoised voice spectrum is inverse-transformed, for example by the IFFT, to obtain the denoised voice signal, which is output as the preprocessed voice signal.
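As a rough illustration of the preprocessing steps above, the following Python sketch frames a signal, computes short-time energy and zero-crossing rate for voice-frame detection, and applies a simple single-block spectral subtraction. The thresholds, FFT size and frame parameters are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (e.g. 25 ms frames, 10 ms shift)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of sample-to-sample sign changes in each frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def is_voice_frame(frames, energy_thr, zcr_thr):
    """A frame counts as voice when its energy is high and its zero-crossing rate low."""
    return (short_time_energy(frames) > energy_thr) & (zero_crossing_rate(frames) < zcr_thr)

def spectral_subtraction(noisy, noise_ref, n_fft=512):
    """Subtract an estimated noise magnitude spectrum, keep the phase, and invert."""
    noise_mag = np.abs(np.fft.rfft(noise_ref[:n_fft]))
    spec = np.fft.rfft(noisy[:n_fft])
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor negative values at zero
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=n_fft)
```

At a 16 kHz sampling rate, `frame_len=400` and `hop=160` correspond to 25 ms frames with a 10 ms shift, within the ranges stated above.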
In voice emotion recognition, preprocessing can highlight personal voice characteristics and improve recognition accuracy. Generally speaking, the application of these preprocessing technologies makes the voice signal more "clean" and "regular", which is beneficial to the subsequent analysis and processing. By eliminating irrelevant noise and highlighting effective information, the performance and robustness of voice processing system can be significantly improved.
Further, obtaining the acoustic parameter set includes: extracting fundamental frequency parameters from the preprocessed voice signal using a pitch detection algorithm; extracting energy parameters using a short-time energy analysis method; extracting formant parameters using linear predictive coding technology; and obtaining the acoustic parameter set from the fundamental frequency, energy and formant parameters.
Specifically, pitch detection is a key step in voice signal processing, used to extract the fundamental frequency parameters of voice. Pitch detection algorithms include the autocorrelation method and the cepstrum method. Taking the autocorrelation method as an example, the fundamental frequency is estimated by calculating the correlation between the voice signal and a delayed version of itself.
For a typical male speaker, the fundamental frequency usually lies between 80 and 160 Hz. Suppose the fundamental frequency of a voice segment is 120 Hz; this means the vocal cords vibrate 120 times per second. Short-time energy analysis is an important method for evaluating the strength of a voice signal: it divides the voice signal into short time frames and calculates the energy of each frame.
For example, for a 20 ms voice frame, the sum of the squared sample values can be taken as its energy value. By observing changes in the energy value, the beginning and end of voice can be judged, and voiced sounds can be distinguished from unvoiced sounds.
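The autocorrelation pitch estimator described above can be sketched as follows; the lag search bounds and frame length are illustrative assumptions of this sketch.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame by autocorrelation.

    The strongest autocorrelation peak is searched between the lag that
    corresponds to fmax (shortest period) and the lag that corresponds
    to fmin (longest period).
    """
    frame = frame - np.mean(frame)
    # Non-negative-lag half of the full autocorrelation sequence.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lag_lo = int(fs / fmax)
    lag_hi = int(fs / fmin)
    lag = lag_lo + int(np.argmax(ac[lag_lo : lag_hi + 1]))
    return fs / lag
```

For the 120 Hz example above, a frame of a 120 Hz sine sampled at 16 kHz yields an estimate within a few hertz of 120 Hz, the resolution being limited by the integer lag grid.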
Linear predictive coding (LPC) is a widely used voice signal analysis technology for extracting formant parameters. LPC assumes that the current voice sample can be predicted as a linear combination of past samples.
By solving for the LPC coefficients, the frequency response of the vocal tract can be obtained and the formant frequencies estimated. For example, for an adult female speaker, the first formant (F1) is usually in the range of 300-800 Hz, and the second formant (F2) in the range of 1000-2500 Hz. The fundamental frequency, energy and formant parameters are combined into an acoustic parameter set, which provides the basis for subsequent acoustic modeling. Together, these parameters describe the pitch, loudness and timbre characteristics of the voice.
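A minimal numpy sketch of this LPC route to formants: autocorrelation-method LPC solved directly from the normal equations, then the angles of the complex poles converted to frequencies. The model order and the root-selection thresholds are this sketch's assumptions.

```python
import numpy as np

def lpc_coeffs(x, order):
    """LPC by the autocorrelation method: solve the normal equations directly."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate([[1.0], -a])        # A(z) = 1 - sum_k a_k z^{-k}

def formant_freqs(x, fs, order=10):
    """Formant candidates: angles of the upper-half-plane poles of the LPC filter."""
    poles = np.roots(lpc_coeffs(x, order))
    freqs = sorted(np.angle(p) * fs / (2 * np.pi) for p in poles if np.imag(p) > 0.01)
    return [f for f in freqs if f > 90.0]     # discard near-DC roots
```

Production systems usually replace the direct solve with Levinson-Durbin recursion and also check pole bandwidths before accepting a root as a formant.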
Further, inputting the acoustic parameter set into the corresponding voice emotion model to obtain the emotion state set includes: the convolution layer of the deep neural network extracts the local features of the acoustic parameter set, the recurrent layer captures the temporal relationships of the acoustic parameter set, and the fully connected layer maps the local features to different emotion categories to obtain the probability distribution of emotion categories.
Specifically, the extracted acoustic parameters are input into the pre-trained deep neural network model, which combines a convolutional neural network with a recurrent neural network. In the deep neural network, the local features of the acoustic parameter set are extracted by the convolution layer, the temporal relationships of the acoustic parameter set are then captured by the recurrent layer, and finally the features are mapped to different emotion categories by the fully connected layer. From the output of the deep neural network, the probability distribution of the voice signal over the emotion categories is obtained. The emotion analysis results are associated with the acoustic parameter set to form emotion labeling data. At the same time, the accurately labeled data are added to the training set to iteratively optimize the deep neural network model and improve its generalization ability.
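The convolution, recurrent and fully connected stages described above can be illustrated with a small numpy sketch. The weights below are randomly initialised stand-ins for a trained model, and the emotion label set is hypothetical; a real system would learn these parameters from the labeled training set.

```python
import numpy as np

rng = np.random.default_rng(0)
EMOTIONS = ["neutral", "happy", "angry", "sad"]   # hypothetical label set

def conv1d(x, w, b):
    """Valid 1-D convolution over time: x is (T, d_in), w is (k, d_in, d_out)."""
    k = w.shape[0]
    out = np.stack([np.tensordot(x[t : t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0] - k + 1)])
    return np.maximum(out + b, 0.0)               # ReLU activation

def simple_rnn(x, wx, wh, b):
    """Plain recurrent layer: h_t = tanh(x_t Wx + h_{t-1} Wh + b)."""
    h = np.zeros(wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + b)
    return h                                      # final hidden state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(params, feats):
    """feats: (T, d) sequence of per-frame acoustic parameters."""
    h = conv1d(feats, params["wc"], params["bc"])                 # local features
    h = simple_rnn(h, params["wx"], params["wh"], params["bh"])   # temporal context
    return softmax(h @ params["wo"] + params["bo"])               # emotion probabilities

# Randomly initialised weights stand in for a trained model.
d_in, d_conv, d_hid = 3, 8, 16
params = {
    "wc": 0.1 * rng.standard_normal((5, d_in, d_conv)),
    "bc": np.zeros(d_conv),
    "wx": 0.1 * rng.standard_normal((d_conv, d_hid)),
    "wh": 0.1 * rng.standard_normal((d_hid, d_hid)),
    "bh": np.zeros(d_hid),
    "wo": 0.1 * rng.standard_normal((d_hid, len(EMOTIONS))),
    "bo": np.zeros(len(EMOTIONS)),
}
```

The output is a probability distribution over the emotion categories, matching the description of the fully connected layer's role.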
Further, the method also includes inputting an acoustic parameter set into a universal voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, where the training set of the universal voice emotion model includes voice data of different people with different emotion states.
Specifically, the speaker's gender, age and other attribute information can be obtained through user registration information, face recognition technology, etc.
People of different genders and age groups have different voice characteristics when expressing their emotions. For example, the tone of young women is usually higher, while the tone of old men is relatively lower, and there may be differences in how emotions are expressed: when young people express excitement, the speaking rate may be faster and the tone may vary more, while old people may sound relatively flat. Therefore, choosing the appropriate voice emotion model according to the speaker's attribute information can improve the accuracy of emotion recognition. For example, suppose the speaker is a young woman, about 25 years old. According to the young woman's attribute information, a search is performed in the pre-built voice emotion model base. This model base is constructed in advance and contains voice emotion models trained for people with different gender, age and other attributes; it may, for example, contain a "young female model", "young male model", "old female model" and "old male model". Assuming a voice emotion model specially trained for "young female" is found, this model is selected for subsequent analysis. If no exactly matching model is found, a general voice emotion model can be chosen. This general model is usually obtained by training on the voice data of a large number of different people; although its accuracy may not match that of a dedicated model, it still provides a useful reference. After the model is selected, it is loaded into memory, much as an application is loaded from the hard disk into memory so that it can run quickly.
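The lookup-with-fallback logic can be sketched like this; the model names, attribute keys and the age cut-off are all illustrative assumptions, not values from the patent.

```python
# Hypothetical model base keyed by (gender, age group); entries are illustrative.
MODEL_BASE = {
    ("female", "young"): "young_female_model",
    ("male", "young"): "young_male_model",
    ("female", "senior"): "senior_female_model",
    ("male", "senior"): "senior_male_model",
}
GENERAL_MODEL = "general_model"  # trained on voice data from many different people

def age_group(age):
    """Illustrative cut-off; a real system would use finer-grained bands."""
    return "young" if age < 40 else "senior"

def select_model(gender, age):
    """Pick the attribute-specific model, falling back to the general one."""
    return MODEL_BASE.get((gender, age_group(age)), GENERAL_MODEL)
```

The general model here plays the role of the universal voice emotion model used when no attribute-specific model exists in the base.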
Further, obtaining the cultural background information of the speaker and modifying the emotional state set according to the cultural background information includes: obtaining the cultural background information according to the attribute information; according to the cultural background information, obtaining the corresponding emotional expression feature vector from the emotional expression knowledge base; and weighting and correcting each emotion category in the emotion state set according to the emotion expression feature vector to obtain the emotion recognition result.
Further, the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector includes: matching the emotional state set against the emotional expression feature vector, and judging whether correction is needed through a similarity threshold; if correction is needed, each emotion category in the emotion state set is weighted according to the emotion expression feature vector, and the corrected emotion state set is processed by a softmax normalization function to obtain the emotion recognition result.
Specifically, obtain the personal attribute information of the speaker, including age, gender, nationality, occupation, etc., and determine the cultural background classification of the speaker according to the attribute information. According to different cultural background classification, the corresponding emotional expression feature vectors are obtained from the emotional expression knowledge base. The emotional state set is matched with the emotional expression feature vector corresponding to the cultural background, and whether adaptive correction is needed is judged through the similarity threshold. If it needs to be corrected, the probability values of each emotion category in the emotional state set are weighted according to the emotional expression characteristics of the cultural background. The adjusted emotional probability value is processed by softmax normalization function, and the modified emotional state set is obtained.
The corrected emotional state set is output as the final emotion recognition result for the speaker. The emotional expression knowledge base is a data set containing emotional expression characteristics under different cultural backgrounds.
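One way to realise the similarity check and weighted correction is sketched below. The use of cosine similarity and a multiplicative weighting followed by softmax normalization are this sketch's assumptions; the patent does not fix a specific matching or weighting scheme.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def correct_emotions(probs, culture_vec, threshold=0.95):
    """Reweight emotion probabilities by a cultural expression feature vector.

    If the model output already matches the cultural profile (similarity at
    or above the threshold), it is returned unchanged; otherwise each
    category is weighted by the profile and re-normalised via softmax in
    log space.
    """
    if cosine(probs, culture_vec) >= threshold:
        return probs
    # softmax of summed logs == normalising the elementwise product
    return softmax(np.log(probs + 1e-12) + np.log(culture_vec + 1e-12))
```

Note that softmax over summed logs is equivalent to normalising the elementwise product `probs * culture_vec`, so categories the cultural profile favours gain probability mass and the result still sums to one.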
Embodiment 2
A voice emotion recognition system includes a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; the voice emotion model selection module is used for acquiring the attribute information of the speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set includes voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotional state set according to the cultural background information to obtain the emotional recognition result.
Further, obtaining the preprocessed voice signal includes: converting voice signals into digital signals, and framing the digital signals; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining the voice signal from which the mute part is removed and performing frequency domain transformation to obtain the frequency domain signal; and denoising the frequency domain signal and inverse-transforming it to obtain the preprocessed voice signal.
Further, obtaining the acoustic parameter set includes: extracting fundamental frequency parameters from the preprocessed voice signal using a pitch detection algorithm; extracting energy parameters using a short-time energy analysis method; extracting formant parameters using linear predictive coding technology; and obtaining the acoustic parameter set from the fundamental frequency, energy and formant parameters.
The above-mentioned embodiment is only a description of the preferred mode of the invention, and does not limit the scope of the invention. Under the premise of not departing from the design spirit of the invention, various modifications and improvements made by ordinary technicians in the field to the technical scheme of the invention shall fall within the protection scope determined by the claims of the invention.

Claims (10)

CLAIMS
1. A voice emotion recognition method, comprising: acquiring a voice signal and preprocessing the voice signal to acquire a preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, wherein the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set comprises voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and acquiring cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain an emotional recognition result.
2. The voice emotion recognition method according to claim 1, wherein obtaining the preprocessed voice signal comprises: converting the voice signal into a digital signal, and framing the digital signal; calculating a short-time energy and a short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing a mute part according to the voice frames, obtaining a voice signal from which the mute part is removed, and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
3. The voice emotion recognition method according to claim 2, wherein obtaining the acoustic parameter set comprises: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameter, the energy parameter and the formant parameter.
4. The voice emotion recognition method according to claim 1, wherein inputting the acoustic parameter set into the corresponding voice emotion model and obtaining the emotion state set comprises: extracting local features of the acoustic parameter set through a convolution layer of the deep neural network; capturing the time sequence relationship of the acoustic parameter set through a recurrent layer; and mapping the local features to different emotion categories through a fully connected layer to obtain a probability distribution of emotion categories.
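For illustration only, the convolution → recurrent → fully-connected forward pass of claim 4 can be sketched in plain NumPy. The weights here are random (untrained), the layer sizes and emotion count are hypothetical, and a trained network would of course replace them; the sketch only shows the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def crnn_forward(params_seq, n_emotions=6, kernel=3, hidden=16):
    # params_seq: (T, D) acoustic parameters per frame.
    T, D = params_seq.shape
    # Convolution layer: local features over a short window of frames.
    Wc = rng.standard_normal((kernel * D, hidden)) * 0.1
    conv = np.array([np.tanh(params_seq[t:t + kernel].ravel() @ Wc)
                     for t in range(T - kernel + 1)])
    # Recurrent layer: capture the time-sequence relationship.
    Wx = rng.standard_normal((hidden, hidden)) * 0.1
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    h = np.zeros(hidden)
    for x in conv:
        h = np.tanh(x @ Wx + h @ Wh)
    # Fully connected layer: map to an emotion-category probability distribution.
    Wo = rng.standard_normal((hidden, n_emotions)) * 0.1
    return softmax(h @ Wo)
```

The output is a length-`n_emotions` vector of non-negative probabilities summing to one — the "emotion state set" of the claims.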
5. The voice emotion recognition method according to claim 1, wherein the method further comprises inputting the acoustic parameter set into a general voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, wherein the training set of the general voice emotion model comprises voice data of different people with different emotion states.
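For illustration only, the claim-5 fallback reduces to a keyed lookup with a default: select the crowd-specific model matching the speaker's attributes, and fall back to the general model trained on voice data of different people when no match exists (attribute keys here are hypothetical):

```python
def select_model(model_base, attributes, general_model):
    # Key the model base on speaker attributes; fall back to the general model.
    key = (attributes.get("gender"), attributes.get("age_group"))
    return model_base.get(key, general_model)
```

A missing combination simply returns the general model rather than failing.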
6. The voice emotion recognition method according to claim 1, wherein obtaining the cultural background information of the speaker and modifying the emotion state set according to the cultural background information comprises: acquiring the cultural background information according to the attribute information; according to the cultural background information, obtaining a corresponding emotional expression feature vector from an emotional expression knowledge base; and carrying out weighted correction on each emotion category in the emotion state set according to the emotional expression feature vector to obtain the emotion recognition result.
7. The voice emotion recognition method according to claim 6, wherein the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector comprises: performing matching calculation on the emotional state set and the emotional expression feature vector, and judging whether correction is needed through a similarity threshold; if so, performing weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector, and processing the corrected emotion state set through a softmax normalization function to obtain the emotion recognition result.
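For illustration only, the claim-6/claim-7 correction step can be sketched as follows. The matching calculation is assumed here to be cosine similarity, and correction is assumed to fire when similarity falls *below* the threshold (i.e. the raw output and the cultural expression profile disagree); the claims do not fix either choice:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def correct_emotions(state, culture_vec, threshold=0.9):
    # Matching calculation between the emotion state set and the
    # cultural emotional-expression feature vector.
    sim = float(state @ culture_vec /
                (np.linalg.norm(state) * np.linalg.norm(culture_vec)))
    if sim >= threshold:
        return state                   # no correction needed
    corrected = state * culture_vec    # per-category weighted correction
    return softmax(corrected)          # softmax normalization (claim 7)
```

When the two vectors already agree the state set passes through unchanged; otherwise each category is reweighted and renormalized into a valid distribution.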
8. A voice emotion recognition system implementing the method according to claim 1, comprising a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set;
the voice emotion model selection module is used for acquiring the attribute information of a speaker and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, wherein the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set comprises voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotion state set according to the cultural background information to obtain an emotion recognition result.
9. The system according to claim 8, wherein obtaining the preprocessed voice signal comprises: converting the voice signal into a digital signal, and framing the digital signal; calculating a short-time energy and a short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frame to obtain a voice signal from which the mute part is removed; performing frequency domain transformation on the voice signal from which the mute part is removed to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
10. The system according to claim 8, wherein obtaining the acoustic parameter set comprises: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameters, the energy parameters and the formant parameters.
LU600124A 2025-01-24 2025-01-24 Voice emotion recognition method and system LU600124B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Publications (1)

Publication Number Publication Date
LU600124B1 true LU600124B1 (en) 2025-07-25

Family

ID=96547871

Family Applications (1)

Application Number Title Priority Date Filing Date
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Country Status (1)

Country Link
LU (1) LU600124B1 (en)

Similar Documents

Publication Publication Date Title
Venkataramanan et al. Emotion recognition from speech
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN116665669A (en) A voice interaction method and system based on artificial intelligence
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
Hu et al. Pitch‐based gender identification with two‐stage classification
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium, and terminal
US11495234B2 (en) Data mining apparatus, method and system for speech recognition using the same
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN114121023A (en) Speaker separation method, speaker separation device, electronic equipment and computer readable storage medium
CN118173092A (en) An online customer service platform based on AI voice interaction
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN108682432B (en) Voice emotion recognition device
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN119517012A (en) A speech recognition method and system for an intelligent speech robot
CN114512133B (en) Method, device, server and storage medium for identifying sound-emitting objects
CN116206593A (en) Voice quality inspection method, device and equipment
CN117831544A (en) Method and system for extracting and identifying bird sound features oriented to complex sound scenes
CN118486297B (en) Response method based on voice emotion recognition and intelligent voice assistant system
LU600124B1 (en) Voice emotion recognition method and system
Hasan et al. Bengali speech emotion recognition: A hybrid approach using B-LSTM
Merzougui et al. Diagnosing spasmodic dysphonia with the power of AI
CN118762718A (en) A method and system capable of dynamically tracking and identifying long-term progressive changes in personal timbre
CN117457005A (en) A voiceprint recognition method and device based on momentum contrast learning

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20250725