US20210142820A1 - Method and system for speech emotion recognition - Google Patents
- Publication number: US20210142820A1 (application US 16/677,324)
- Authority: US (United States)
- Prior art keywords: speech, emotion, text, word, observed
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/26 — Speech to text systems
- G10L2015/027 — Syllables being the recognition units
- G10L2015/221 — Announcement of recognition results
- G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state
- G06F40/103 — Handling natural language data; text processing; formatting, i.e. changing of presentation of documents
- G06F40/30 — Handling natural language data; semantic analysis
Definitions
- the present invention generally relates to speech recognition and more particularly to a method and system for speech recognition to classify and predict voice messages and commands by extracting properties from voice samples and augmenting a presentation of the converted text of the voice samples to the user.
- Speech recognition is the process of converting a speech signal into a sequence of words. It may also be referred to as Automatic Speech Recognition (ASR) or Speech-to-Text (STT).
- speech recognition has become ubiquitous and is used in many aspects of daily life, for example in speech-to-text processing (e.g., word processors or emails) and in personal assistants on mobile phones (e.g., APPLE®'s SIRI® on iOS®, GOOGLE® Home on ANDROID®).
- a method and system are provided for improving how emotion content observed in speech samples is displayed during speech to text chat communications.
- a method for speech emotion recognition for enriching speech to text communications between users in a speech chat session includes: implementing a speech emotion recognition model to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each of a plurality of emotion classes; generating a machine learning (ML) model based on at least one of the set of acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the speech chat session for visual notice of an observed emotion in the speech sample.
- ML machine learning
- the method further includes the enriching text in the speech to text communications by changing a color of a select set of the text of a word in a phrase of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample.
- the method further includes the enriching text in the speech to text communications by including an emoticon symbol and shading with one or more colors different parts of the emoticon symbol in speech to text communications of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample by color shading in one or more of the different parts of the emoticon symbol.
- the method further includes the changing dynamically consistent with a change of the observed emotion in the speech sample, the color shading in the one or more of the different parts of the emoticon symbol wherein the change in color shading occurs gradually.
- the method further includes changing dynamically, consistent with a dramatic change of the observed emotion in the speech sample, the color shading in all the different parts of the emoticon symbol at once.
- the method further includes implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications by: displaying the visual emotion content with text by a color mapping with emoticons; applying a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implementing word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; and representing the perceived excitement in the speech samples by word stretching and by highlighting particular words.
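The gradual versus at-once shading behavior described in these claims can be sketched as a linear interpolation between the colors mapped to the previous and current emotion; the emotion-to-color mapping below is a hypothetical example, not one fixed by the disclosure:

```python
def lerp_color(c1, c2, t):
    """Linearly interpolate between two RGB colors; t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))

# Hypothetical emotion-to-color mapping for emoticon shading.
EMOTION_COLORS = {
    "neutral": (128, 128, 128),
    "happy":   (255, 200, 0),
    "angry":   (220, 20, 20),
}

def shade_steps(prev_emotion, new_emotion, dramatic, steps=4):
    """Gradual shading for small emotion changes; a single instant
    recolor of all emoticon parts for a dramatic change."""
    c1, c2 = EMOTION_COLORS[prev_emotion], EMOTION_COLORS[new_emotion]
    if dramatic:
        return [c2]                       # recolor all parts at once
    return [lerp_color(c1, c2, i / steps) for i in range(1, steps + 1)]
```

A chat client could apply each returned color to successive UI frames to animate the gradual case.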
- the method further includes: highlighting or bolding the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion.
- the method further includes stretching one or more letters in the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion.
- the method further includes computing the duration to syllable ratio based on a first, a second and a third threshold, including: a first syllable count corresponding to a first duration to syllable ratio greater than a first threshold; a second syllable count corresponding to a second duration to syllable ratio greater than a second threshold; and a third syllable count corresponding to a third duration to syllable ratio greater than a third threshold.
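One simplified reading of this three-threshold test can be sketched as follows; the threshold values are hypothetical placeholders, since the claims do not fix them:

```python
def stretch_level(word_duration_s, syllable_count,
                  thresholds=(0.15, 0.25, 0.40)):
    """Return 0-3: how strongly to stretch a word, based on the ratio
    of its spoken duration (seconds) to its syllable count. Each
    threshold exceeded raises the stretch level by one."""
    if syllable_count == 0:
        return 0
    ratio = word_duration_s / syllable_count
    level = 0
    for t in thresholds:          # first, second and third thresholds
        if ratio > t:
            level += 1
    return level
```

A slowly drawn-out monosyllable ("noooo") thus scores higher than the same word spoken quickly, matching the perceived-excitement intent of the claim.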
- a computer program product tangibly embodied in a computer-readable storage device and comprising instructions that when executed by a processor perform a method for speech emotion recognition for enriching speech to text communications between users in a speech chat session.
- the method includes: implementing a speech emotion recognition model to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each emotion class; generating a machine learning (ML) model based on at least one acoustic feature of the set of acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the speech chat session for visual notice of an observed emotion in the speech sample.
- the method further includes the enriching text in the speech to text communications by changing a color of a select set of the text of a word in a phrase of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample.
- the method further includes the enriching text in the speech to text communications by including an emoticon symbol and shading with one or more colors different parts of the emoticon symbol in speech to text communications of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample by color shading in one or more of the different parts of the emoticon symbol.
- the method further includes the changing dynamically consistent with a change of the observed emotion in the speech sample, the color shading in the one or more of the different parts of the emoticon symbol wherein the change in color shading occurs gradually.
- the method further includes the changing dynamically consistent with a dramatic change of the observed emotion in the speech sample, the color shading in all the different parts of the emoticon symbol at once.
- the method further includes the implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications by: displaying the visual emotion content with text by a color mapping with emoticons; applying a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implementing word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; and representing the perceived excitement in the speech samples by word stretching and by highlighting particular words.
- the method further includes the highlighting or bolding the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion.
- the method further includes the highlighting of the particular words further by bolding the word in the phrase for the prominence for visual notice of the perceived excitement, and the stretching one or more letters in the word in the phrase for the prominence for visual notice of the observed emotion.
- the method further includes the computing the duration to syllable ratio based on a first, a second and a third threshold, including a first syllable count corresponding to a first duration to syllable ratio greater than a first threshold; a second syllable count corresponding to a second duration to syllable ratio greater than a second threshold; and a third syllable count corresponding to a third duration to syllable ratio greater than a third threshold.
- a system, in yet another embodiment, includes: at least one processor; and at least one computer-readable storage device comprising instructions that when executed cause the performance of a method for processing speech samples for speech emotion recognition for enriching speech to text communications between users in speech chat.
- the system includes: a speech emotion recognition model implemented by the processor to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications, wherein the processor is configured to: generate a data set of speech samples with labels of a plurality of emotion classes; extract a set of acoustic features from each emotion class; select a machine learning (ML) model based on at least one of the set of acoustic features and data set; train the ML model from a particular acoustic feature from speech samples during speech chat sessions; predict emotion content based on a trained ML model in the observed speech; generate enriched text based on predicted emotion content of the trained ML model; and present the enriched text in speech to text communications between users in the chat session for visual notice of an observed emotion in the speech sample.
- the system further includes the processor further configured to: implement a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications to: display the visual emotion content with text by a color mapping with emoticons; apply a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implement word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; and represent the perceived excitement in the speech samples by word stretching and by highlighting particular words.
- FIG. 1 illustrates an exemplary flowchart for implementing a machine learning training and predicting model for speech emotion recognition in the speech emotion recognition application in accordance with the exemplary embodiments described herein;
- FIG. 2 illustrates an exemplary diagram of visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 3 illustrates an exemplary diagram of visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 4 illustrates an exemplary flowchart of a process of constructing a machine learning (ML) model for word stretching for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 5 illustrates an exemplary flowchart of identifying process to determine whether the word stretching is required for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 6 illustrates an exemplary flowchart for a process to identify which vowel sounds or characters to be stretched in the word for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 7 illustrates an exemplary flowchart of word stretching in the English language based on the number of vowels and syllables for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 8 illustrates an exemplary diagram of word stretching with vowels for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 9 illustrates an exemplary diagram of the detection of prominent words for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 10 illustrates an exemplary diagram of bolding prominent words for highlighting in visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 11 illustrates an exemplary diagram of bolding prominent words and stretching the words to show excitement in visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein;
- FIG. 12 illustrates an exemplary diagram of a cloud network of an emotion content recognition system in a speech to text communications in accordance with the exemplary embodiments described herein.
- Speech recognition is the process of converting a speech signal into a sequence of words. It is also referred to as Automatic Speech Recognition (ASR) or Speech-to-Text (STT).
- the use of speech recognition has become ubiquitous and is found in many aspects of daily life. For example, use may be found in automotive systems or environments in which users are busy with their hands, home automation (e.g., voice command recognition systems), speech-to-text processing (e.g., word processors or emails), and personal assistants on mobile phones (e.g., APPLE®'s SIRI® on iOS, MICROSOFT®'s CORTANA® on WINDOWS® Phone, GOOGLE® Now on ANDROID®).
- Emotions color the language and act as a necessary ingredient for interaction between individuals. For example, one party in a conversation, when listening to the other party, reacts to the other party's emotional state and adapts their responses depending on what kind of emotions are sent and received. With the advent of text and other non-verbal communications, the emotional component is often lacking, and therefore the recipient is unable to react or respond to different emotional states of the transmitting party. For example, in chat communications what is often received is text without expressions of emotions. The recipient cannot determine the text sender's emotional state at a given moment, leading to confusion, misinterpretation, etc. in chat text exchanges.
- the present disclosure provides systems and methods of presenting a rich text output by extracting key properties from voice samples and augmenting the text with additional features to visually illustrate emotion content with the text to the viewer.
- speech to text conversion is a clinical process that focuses on the exactness of the natural language processing of the speech, with little attention to the tone and flavor, i.e. the human emotions, in the speech.
- there has been little emphasis on the human emotion in the delivery of the speech; as a result, part of the emotional text component is lost, which in turn provides the user with an often-bland text presentation devoid of human emotional content.
- the present disclosure describes systems and methods implementing a speech emotion recognition application that re-create emotions present in speech samples by visual representations or changes of the text from the speech sample to generate enriched text output.
- the present disclosure describes systems and methods that implement a speech emotion recognition application in voice chat applications for capturing emotions in speech samples that are converted into text and communicated via chat applications.
- the present disclosure describes systems and methods that implement a speech emotion recognition application with speech Emotion recognition features, text and word stretching detection and stretching, text and word prominence detection and presentations.
- the present disclosure provides a speech emotion recognition application that implements a machine learning (ML) solution to train a model on a large labeled dataset of speech samples with an emotional category as the label (e.g., angry, sad, happy, neutral, fear, etc.).
- the speech emotion recognition application employs a two-step process implemented by the ML solution: first, a training set is created that contains emotion features extracted from speech samples, producing a training model of an emotion dataset; second, labels of the various emotional classes are applied to classify the data set.
- the speech emotion recognition application also employs a rule-based process for enriching the text with emotion content.
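As an illustration of the rule-based enrichment step, a predicted emotion label might be mapped to a text color and an emoticon; the mapping and markup below are hypothetical examples, not the patent's actual rules:

```python
# Hypothetical emotion -> (CSS color, emoticon) rules for enriched text.
ENRICH_RULES = {
    "happy":   ("gold", "🙂"),
    "angry":   ("red", "😠"),
    "sad":     ("steelblue", "🙁"),
    "neutral": (None, None),
}

def enrich_text(text, emotion):
    """Wrap converted speech text with visual emotion content."""
    color, emoticon = ENRICH_RULES.get(emotion, (None, None))
    if color is None:
        return text                         # neutral/unknown: plain text
    return f'<span style="color:{color}">{text}</span> {emoticon}'
```

The trained ML model supplies the `emotion` argument; the rule table only decides how the prediction is rendered in the chat window.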
- FIG. 1 illustrates an exemplary flowchart for implementing a machine learning training and predicting model for speech emotion recognition in accordance with an embodiment.
- a dataset of audio samples with labeled emotion class is created by a speech emotion recognition application.
- the data set must be representative of the domain so that machine learning techniques can infer various emotions from speech and obtain meaningful results.
- (1) the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): this database contains 24 professional actors (12 male, 12 female) speaking two sentences with an American English accent in 8 different emotions: calm, happy, sad, angry, fearful, surprise, disgust and neutral. All the audio samples are sampled at 48 kHz with a bit depth of 16 bits, and the actors are asked to speak each sentence and emotion in two intensities.
- (2) the Surrey Audio-Visual Expressed Emotion (SAVEE) database: recorded for the development of an automatic speech emotion recognition system, this database has 4 male actors speaking English sentences in 7 different emotions (angry, disgust, happy, fear, sad, surprise and neutral), for a total of 480 utterances. The data were recorded in a visual media lab with high-quality equipment and were labeled.
- (3) the Toronto Emotional Speech Set (TESS): two actresses, recruited from the Toronto area and both speaking English as their first language, were asked to speak sentences in 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral), for a total of 2800 audio samples.
- (4) the Berlin emotional speech database (BERLIN): the actors speak in emotional classes such as happy, angry, anxious, fearful, bored, disgusted and neutral. There are more than 10 actors and 500 utterances per actor.
- (5) the Multimodal EmotionLines Dataset (MELD): MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series, with multiple speakers participating in the dialogues. Each utterance in a dialogue has been labeled with one of seven emotions: anger, disgust, sadness, joy, neutral, surprise and fear. The disadvantage of this dataset is the background music and background laughter sounds.
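For a dataset such as RAVDESS, labels can be recovered directly from file names: each file name is a hyphen-separated code whose third field identifies the emotion class (field meanings per the public RAVDESS documentation; verify against the copy you download):

```python
# RAVDESS emotion codes: the third field of the hyphen-separated filename.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_from_ravdess_filename(filename):
    """e.g. '03-01-06-01-02-01-12.wav' -> 'fearful'."""
    emotion_code = filename.split("-")[2]
    return RAVDESS_EMOTIONS[emotion_code]
```

Iterating this over the audio directory yields the (sample, label) pairs that seed the training set.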
- the speech emotion recognition application includes a feature extraction phase and a feature classification phase.
- a set of features are extracted from speech samples or audio samples.
- for example, a mobile device in communication with a set-top box may send PCM audio samples to a voice server that hosts the speech emotion recognition application to perform acoustic feature extraction.
- One type of classification is short term classification based on short-period characteristics such as energy, formants, and pitch.
- the acoustic feature extraction at step 15 can include extracting a set of Mel-Frequency Cepstrum Coefficients (MFCCs) for classification.
- a set of 13 MFCC coefficients is extracted, and a first and second order derivative is derived for each coefficient, resulting in 39 features being classified (13 MFCC+13 DEL+13 DEL DEL), for which statistics like mean, median, variance, maximum and minimum are calculated.
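Assuming a 13-coefficient MFCC matrix has already been computed (e.g., with librosa), the 39-feature expansion and its per-feature statistics can be sketched in NumPy; the first-difference delta used here is a simplification of the regression-based delta most toolkits compute:

```python
import numpy as np

def mfcc_stats(mfcc):
    """mfcc: (13, T) array of coefficients over T frames.
    Returns 39 features x 5 statistics = 195 values."""
    delta = np.diff(mfcc, axis=1, prepend=mfcc[:, :1])      # 13 DEL
    delta2 = np.diff(delta, axis=1, prepend=delta[:, :1])   # 13 DEL DEL
    feats = np.vstack([mfcc, delta, delta2])                # (39, T)
    stats = [np.mean(feats, axis=1), np.median(feats, axis=1),
             np.var(feats, axis=1), np.max(feats, axis=1),
             np.min(feats, axis=1)]
    return np.concatenate(stats)                            # (195,)
```

The resulting fixed-length vector is what the classifier consumes, regardless of the utterance's duration.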
- the building and training of the machine learning (ML) model involves, first, for the building stage: extracting a set of formant parameters, for example the first five formant frequencies and formant bandwidths. From these first five formant frequencies and bandwidths, approximately 10 feature derivatives are calculated, resulting in about 20 features. For each of these 20 features, additional sub-features can be derived that include statistics such as mean, median, variance, maximum and minimum. This, in turn, results in a set of about 100 features in total.
- a set of statistical features for pitch and its derivative can be extracted, resulting in a total set of 10 pitch features.
- similarly, a set of feature statistics for the zero-crossing rate (ZCR) and for derivatives of the ZCR feature are extracted, resulting in a total ZCR feature set of about 10 features.
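The zero-crossing rate itself is simple to compute; a minimal per-frame version (the statistic that libraries such as librosa report frame by frame) looks like:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

High ZCR tends to indicate noisy or fricative segments; its statistics over an utterance help separate emotional speaking styles.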
- a comparable number of features can be derived for other measures, such as RMS energy, voiced and unvoiced energy, and the fundamental frequency, for which statistics and derivatives are likewise extracted.
- the speaking rate, the reciprocal of the average length of the voiced part of the utterance, includes 1 feature.
- the jitter includes 1 feature.
- the shimmer includes 2 features (shimmer and shimmer in dB).
- the harmonic to noise ratio includes 1 feature.
- the speech emotion recognition application may be implemented in Python using libraries that include: soundfile, to read a received .wav file with speech; a parsing library such as parselmouth, to extract features like formants, intensity, jitter, shimmer, shimmer in dB, and harmonic to noise ratio; a feature extraction library like librosa, to extract features like ZCR, MFCC, RMS energy, spectral centroid, spectral bandwidth, spectral flatness, spectral roll-off, and spectral contrast; a pitch extraction library called pysptk, to extract features like pitch and fundamental frequency; and pyAudioAnalysis, to extract spectral flux.
- Various machine learning techniques may be implemented by the speech emotion recognition application to recognize emotions, including an AdaBoost classifier, a decision tree, an extra trees classifier (for classification and feature importance), logistic regression, a neural network, a random forest, a support vector machine, and an XGBoost classifier.
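Of these, a decision tree is the simplest to illustrate. The depth-1 "stump" below is a pure-Python stand-in for a library classifier such as sklearn's DecisionTreeClassifier, not the patent's actual implementation; it exhaustively picks the single feature/threshold split that minimizes training errors:

```python
def fit_stump(X, y):
    """Fit a depth-1 decision tree (a 'stump'): choose the
    (feature, threshold, below-label, above-label) combination with
    the fewest misclassifications on the training data."""
    best = None
    labels = set(y)
    for j in range(len(X[0])):                 # each feature column
        for t in sorted({x[j] for x in X}):    # each candidate threshold
            for lo in labels:
                for hi in labels:
                    errs = sum(
                        1 for x, lab in zip(X, y)
                        if (lo if x[j] <= t else hi) != lab
                    )
                    if best is None or errs < best[0]:
                        best = (errs, j, t, lo, hi)
    return best[1:]

def predict_stump(stump, x):
    feature, threshold, lo, hi = stump
    return lo if x[feature] <= threshold else hi
```

A real tree recursively repeats this split selection down to the configured maximum depth (8, 13, etc. in the models discussed below).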
- Deep learning techniques such as a Deep Boltzmann Machine (DBM), Recurrent Neural Network (RNN), Recursive Neural Network (RNN), Deep Belief Network (DBN), Convolutional Neural Networks (CNN) and Auto Encoder (AE) may also be applied for speech emotion recognition.
- the decision_tree_1 is a decision tree model with maximum depth 8 which is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on all features.
- the classifying output or report of decision_tree_1 model based on validation data includes micro, macro and weighted averages of about 0.78 for classifying the emotional content of the text.
- the decision_tree_1_depth_13 is a decision tree model with maximum depth 13 which is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on all features.
- the classification report of decision_tree_1_depth_13 based on the validation data has micro, macro and weighted averages of 0.8 for classifying the speech emotion features recognized.
- the decision_tree_1_depth_8_features_10 has a maximum depth of 8 and is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on the top 10 features given by the feature indices of the extra trees classifier.
- a classification report of decision_tree_1_depth_8_features_10 on validation data has micro, macro and weighted averages of 0.75, 0.77, and 0.76 respectively for classifying the speech emotion features recognized.
- the decision_tree_2_depth_5 is a decision tree model with maximum depth 5 which is trained on BERLIN, SAVEE, and TESS on all features.
- a classification report of decision_tree_2_depth_5 on validation data has micro, macro and weighted averages of 0.88, 0.87, and 0.88 respectively for classifying all the speech emotion features recognized.
- the decision_tree_2_depth_8 is a decision tree model with maximum depth 8 which is trained on BERLIN, SAVEE, and TESS on all features.
- the classification report of decision_tree_2_depth_8 on validation data has micro, macro and weighted averages of 0.90 for classifying all the speech emotion features recognized.
- the decision_tree_3_depth_8_normalised is a decision tree model with maximum depth 8 which is trained on BERLIN, SAVEE, and RAVDESS on all features with normalization.
- the scaler is fit to the training data and is saved as a .pickle file: ‘decision_tree_3_depth_8_normaliser_scaler.pickle’.
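A possible sketch of this persistence step, using scikit-learn's StandardScaler and Python's pickle module (the filename follows the one named above; the toy feature matrix and the temporary-directory path are illustrative assumptions):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])  # toy features

# fit the scaler to the training data only
scaler = StandardScaler().fit(X_train)

# persist the fitted scaler for later use at inference time
path = os.path.join(tempfile.gettempdir(),
                    "decision_tree_3_depth_8_normaliser_scaler.pickle")
with open(path, "wb") as f:
    pickle.dump(scaler, f)

# reload and normalise feature vectors with the same training statistics
with open(path, "rb") as f:
    restored = pickle.load(f)
X_scaled = restored.transform(X_train)
```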
- a classification report of decision_tree_3_depth_8_normalised on validation data has micro, macro and weighted averages of 0.83, 0.81, and 0.83 respectively for classifying all the speech emotion features recognized.
- example evaluation code: y_pred = forest.predict(X_test); print(confusion_matrix(y_test, y_pred)); print(classification_report(y_test, y_pred)). The resulting confusion matrix is [[236 17 5 4] [21 192 11 6] [2 7 209 10] [0 5 12 224]], and the classification report gives precision, recall and f1-score of 0.91, 0.90, 0.91 for class 0 (support 262); 0.87, 0.83, 0.85 for class 1 (support 230); 0.88, 0.92, 0.90 for class 2 (support 228); and 0.92, 0.93, 0.92 for class 3 (support 241), with a micro average of 0.90, 0.90, 0.90, a macro average of 0.89, 0.90, 0.90, and a weighted average of 0.90, 0.90, 0.90 (support 961).
- the classification report of extra_tree_1_features_25 on validation data has micro, macro and weighted averages of 0.90, 0.89 and 0.90 respectively for classifying all the speech emotion features recognized.
- example evaluation code: y_pred = forest.predict(X_test); print("confusion_matrix:"); print(confusion_matrix(y_test, y_pred)); print("Classification Report"); print(classification_report(y_test, y_pred))
- Confusion matrix [[192 9 2 3] [23 140 6 3] [3 3 151 5] [0 1 14 173]]
- the classification report of random_forest_1_noEstimators_100_features_20 on validation data has micro, macro and weighted averages of 0.90 for classifying all the speech emotion features recognized.
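The micro average reported here is simply the overall accuracy derivable from the confusion matrix above, as this short numpy sketch shows:

```python
import numpy as np

# confusion matrix reported for random_forest_1_noEstimators_100_features_20
cm = np.array([[192,   9,   2,   3],
               [ 23, 140,   6,   3],
               [  3,   3, 151,   5],
               [  0,   1,  14, 173]])

micro_avg = np.trace(cm) / cm.sum()                  # overall accuracy
recall_per_class = np.diag(cm) / cm.sum(axis=1)      # rows are true labels
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # columns are predictions
```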
- a rules-based model based on results of the various saved models has the following rules for implementing the speech emotion recognition application.
- the rules include a calculation of the count (i.e. an “angry,” “happy,” “neutral,” “sad” count) of individual emotion classes among the above 8 models.
- if the angry count is greater than or equal to all other emotion counts, the emotion is declared angry; else if there is any happy output in decision_tree_1, decision_tree_1_depth_13 or decision_tree_2_depth_5, it is declared happy; else if the neutral count is greater than or equal to all other emotion counts, it is declared neutral; else if there is any sad output in decision_tree_1, decision_tree_1_depth_13 or decision_tree_2_depth_5, it is declared sad; and if none of the other rules are satisfied, it is declared neutral.
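A pure-Python sketch of this rule cascade, assuming each of the 8 saved models contributes one label per utterance (model names other than the three decision trees named in the rules are placeholders):

```python
PRIORITY_MODELS = {"decision_tree_1", "decision_tree_1_depth_13",
                   "decision_tree_2_depth_5"}

def combine_predictions(preds):
    """preds: dict mapping model name -> predicted emotion label.
    Implements the rule cascade described above."""
    counts = {e: 0 for e in ("angry", "happy", "neutral", "sad")}
    for label in preds.values():
        counts[label] += 1
    if counts["angry"] >= max(counts.values()):
        return "angry"
    if any(preds[m] == "happy" for m in PRIORITY_MODELS if m in preds):
        return "happy"
    if counts["neutral"] >= max(counts.values()):
        return "neutral"
    if any(preds[m] == "sad" for m in PRIORITY_MODELS if m in preds):
        return "sad"
    return "neutral"                     # fallback when no rule fires
```

For example, even when sad and neutral outnumber happy, a happy vote from one of the three priority decision trees is declared happy as long as angry does not hold the (tied) maximum.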
- FIG. 2 illustrates an exemplary diagram of presentation of an emoticon and emotion text in accordance with an embodiment.
- the emotion status of a user is identified by various algorithmic solutions of the emotion speech recognition application.
- the emotion speech recognition is performed in real-time and displayed in a graphic user interface (GUI) 200 created using PYTHON® code, but can also be rendered in HTML (i.e. in an HTML page by actuation), with a record button 210 in the graphic user interface 200 .
- the speech recorded is transposed into text 210 with various colors to represent an expression corresponding to the text.
- the text “so happy” can be generated in a different color such as “red” (note that the text is not shown in color in FIG. 2 ).
- text such as “I am very scared>confident” can be represented in block 220 with various attributes, such as the block 220 shaded in black and the text displayed in a different color such as yellow, to stand out more, to represent confidence, and to correspond to the text phrase “I am very confident:>confident” 230 displayed below.
- when the emotion of the user changes, the text 230 changes to “I am very scared>confident” and is displayed in the color “blue” to show the change from being very confident to being scared. In this manner, the change of state of emotions is exhibited in real-time.
- the analogy expressed is that emotions associated with a user are not static but dynamic and constantly changing. One moment, a user is confident, and the next moment the user is less than confident.
- an emoticon 240 is presented that changes in real-time from shades of blue in an upper part to shades of green, yellow and orange in a lower part.
- the emoticon 240 mimics dynamic emotion changes that are occurring in the speech of the user as determined by voice emotion analysis solutions of the speech emotion recognition application.
- the output of the speech emotion recognition is configured to be presented in different color shades of the emoticon 240 that is displayed.
- the emoticon 240 changes colors or shades in color to correspond to detected changes in the user's emotion as determined by the emotion speech recognition application.
- the emoticon 240 may be represented in its entirety in the color of red. As the user's detected emotions change, for example as the user becomes more confident, the emoticon 240 may be configured to turn more blackish dynamically. The change in color may occur immediately, with demarked color changes, or may be a gradual change in color over a period of time.
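One simple way to realize the gradual color change just described is linear interpolation between two RGB colors; this is an illustrative assumption, not the patented implementation:

```python
def lerp_color(start_rgb, end_rgb, t):
    """Linearly interpolate between two RGB colors; t in [0, 1]."""
    return tuple(int(a + (b - a) * t) for a, b in zip(start_rgb, end_rgb))

RED, BLACK = (255, 0, 0), (0, 0, 0)

# gradual change: render many small steps; a sudden change would jump
# t directly from 0 to 1
gradient = [lerp_color(RED, BLACK, step / 10) for step in range(11)]
```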
- FIG. 3 illustrates another exemplary diagram of the presentation of an emoticon and emotion text in accordance with an embodiment.
- the emoticon 310 is presented with a very happy smile.
- the emoticon 310 is selected based on the emotion analysis by the speech emotion recognition application which selected an emoticon 310 that is the best fit or best correspondence to the emotion recognized in the observed voice.
- a table or emoticons with different expressions may be stored locally and accessible by the speech emotion recognition application for presenting with or in place of recognized emotions in speech.
- other like emoticons such as GIFs or JPEG images may also be used to convey a presentation of emotions in a display to the user of the recognized emotions.
- FIG. 4 is an exemplary flowchart for building an ML model for word stretching to show an emotion such as excitement by the emotion speech recognition application in accordance with an embodiment.
- at step 410 , a dataset of labeled audio segments of each word is provided
- at step 420 , feature extraction is performed
- at step 430 , the ML model is built and trained
- at step 440 the model is saved.
- the threshold-based approach is found to be more accurate than the machine learning approach.
- the threshold-based approach considers the ratio of the duration of each word to the number of syllables in it; the higher this ratio, the greater the chance that the word is stretched.
- FIG. 5 is a flowchart for word stretching detection for a speech emotion recognition application in accordance with an embodiment.
- voice samples are received.
- speech is converted to text and timestamped at the time of conversion.
- the audio segment corresponding to each word is extracted, after which the flow is divided into either the machine learning process or the threshold-based approach.
- the feature extraction is executed.
- the trained model is loaded and fed the features for processing.
- the output is equal to 1 if the probability is greater than or equal to 0.5, or equal to 0 if the probability is less than 0.5, to determine which words, vowels, etc. are to be stretched.
- the threshold-based approach computes the duration-to-syllable ratio and applies a threshold; it includes computing the duration of syllables of a phrase of words in the speech samples, with each duration corresponding to a perceived excitement in the speech sample.
- the output at step 550 is 1.
- the syllable count is a whole number (e.g., 1, 2, 3) and the duration-to-syllable ratio (dursylratio) is compared against a corresponding threshold (threshold 1, threshold 2, etc.).
- the output at step 560 is 1.
- the output at step 570 is 1. If no, then at step 575 the output is equal to 0.
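The decision steps 550/560/570 can be sketched as follows; the numeric thresholds are hypothetical, since the description only states that each syllable count has its own threshold:

```python
# Hypothetical threshold values in seconds per syllable; the description
# discloses per-syllable-count thresholds but not their tuned values.
THRESHOLDS = {1: 0.35, 2: 0.25, 3: 0.20}

def is_stretched(duration_sec, syllable_count):
    """Steps 550/560/570: compare the word's duration-to-syllable ratio
    against the threshold for its syllable count; output 1 if the word
    is judged stretched (excited speech), else 0 (step 575)."""
    ratio = duration_sec / max(syllable_count, 1)
    threshold = THRESHOLDS.get(min(syllable_count, 3), THRESHOLDS[3])
    return 1 if ratio > threshold else 0
```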
- FIG. 6 is a flowchart for implementing word stretching for the speech emotion recognition application in accordance with an embodiment.
- the audio segment of the individual word is received by the speech emotion recognition application.
- the identification of each syllable nucleus is performed and at step 630 , the speech emotion recognition application computes the duration of each syllable.
- it identifies the syllable whose duration is greater than the threshold.
- the stretched syllable is mapped to the corresponding character in the word.
- each of the relevant characters is repeated to create the stretched syllable of characters in the word.
- the set of rules is designed for stretching English words based on the number of vowels and syllables. Though the implementation is for the English language, it is contemplated that a variety of other languages may also be amenable to the stretching process with or without minor modifications of the described stretching process. In other words, the English-language-based description of stretching characters in words is not limited to just English words.
- FIG. 7 is a flowchart of stretching words with vowels by the speech emotion recognition application in accordance with an embodiment. As indicated, a number of rules are needed in the stretching process; one basic rule for word stretching is that single-character words are left unchanged.
- FIG. 7 illustrates each of the rules, which depend on the number of vowels and the number of syllables in the word; the cases range from 1 to 5. The rules can be described as follows: Case 1 : at 705 , words starting with a vowel that have only one vowel.
- Case 3 : at 715 , words not starting with a vowel that have only one vowel. For this type of word there are two subcases.
- Case 3 a : at 750 , if the word is a single syllable, repeat at 755 the corresponding vowel. For example: way->waaay, hi->hiii.
- Case 3 b : if the word has more than one syllable, at 760 , consider ‘y’ also as a vowel and repeat the corresponding vowel. For example: lady->ladyyy, happy->happyyy.
- Case 4 : at 720 , words that do not start with a vowel and have more than one vowel: repeat at 775 the corresponding vowel, for example: hello->heeello, where->wheeere; but at 770 , if the number of syllables is more than the number of vowels, then consider ‘y’ also as a vowel and repeat the corresponding vowel. For example: agony->agonyyy.
- Case 5 : at 725 , words with no vowels. These words have two subcases: Case 5 a ) at 780 , if ‘y’ is in the word, then at 785 consider ‘y’ as a vowel and repeat the vowel, for example my->myyy; and Case 5 b ) at 790 , if ‘y’ is not in the word, then repeat the last letter.
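The rules above can be sketched in pure Python; this is a hedged, simplified subset (Cases 3 to 5) that uses a naive vowel-run count in place of true syllable nuclei detection:

```python
VOWELS = set("aeiou")

def naive_syllables(word, vowels="aeiouy"):
    """Rough syllable estimate: count runs of consecutive vowels."""
    count, prev = 0, False
    for ch in word:
        is_v = ch in vowels
        if is_v and not prev:
            count += 1
        prev = is_v
    return max(count, 1)

def stretch_word(word, repeats=2):
    """Repeat the 'stretchable' character of a word, following a
    simplified subset of the FIG. 7 rules (Cases 3-5)."""
    w = word.lower()
    if len(w) == 1:                      # single-character words unchanged
        return word
    vowel_idx = [i for i, c in enumerate(w) if c in VOWELS]
    if not vowel_idx:                    # Case 5: no a/e/i/o/u vowels
        i = w.rindex("y") if "y" in w else len(w) - 1
    elif len(vowel_idx) == 1 and naive_syllables(w) > 1 and "y" in w:
        i = w.rindex("y")                # Case 3b: treat trailing 'y' as vowel
    elif naive_syllables(w) > len(vowel_idx) and "y" in w:
        i = w.rindex("y")                # Case 4 subcase at 770
    else:
        i = vowel_idx[0]                 # Cases 3a / 4: repeat first vowel
    return word[:i + 1] + word[i] * repeats + word[i + 1:]
```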
- FIG. 8 is a diagram that illustrates the presentation of the results of word stretching by the speech emotion recognition application in accordance with an embodiment.
- the output of the word stretching is illustrated in FIG. 8 : upon actuating the record button 805 of the GUI 800 , the vowels “aaa” in the phrase “whaaat is this” 810 are stretched (i.e. the corresponding vowel is repeated); in the phrase “Ohhh that was a goaaall” 815 , the last “h” in the word “ohhh” 820 is treated as a vowel sound and repeated.
- FIG. 9 is a flowchart of the detection of prominent words by the speech emotion recognition application in accordance with an embodiment.
- the flowchart in FIG. 9 illustrates word prominence detection and highlighting.
- the detection of prominent words in the conversation can add more meaning and give a little more clarity.
- the phrase ‘I never said she stole my money’ is a 7-word sentence that can take on 7 different meanings depending on which individual word is stressed.
- the following are examples of bolded stressed letters and words in the phrase, where each stressed word conveys a different meaning to the viewer of the phrase.
- the pseudo code for prominent word detection is described below.
- the approach for detection of prominent words begins at step 905 , with the speech or audio segment extraction of each word. Then the acoustic features, as well as the lexical features at step 910 , are extracted. The flow is then divided into the ML approach or the rule-based approach. Briefly, the ML approach includes at step 915 , selecting the relevant features, at step 920 feeding the features to the trained model, and then predicting the result in a range from 0 to 1.
- the alternate path of the rule-based approach includes at step 935 selecting the most helpful features, at step 940 applying the rule to the selected features, at step 945 computing the number of rules satisfied, and at step 950 determining if the majority is satisfied by labeling it as a 1 else if the majority is not satisfied then labeling it with a 0.
- the output of both paths, the ML and the rule-based paths is sent to step 930 for highlighting the word using bold, underline, script etc.
- the rule-based path yielded better results than the ML method, but both solutions are feasible alternatives.
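A minimal sketch of the rule-based path (steps 935 to 950); the specific rules and feature names here are assumptions, since the description states only that rules are applied to selected features and a majority vote is taken:

```python
# Hypothetical hand-picked rules over acoustic features of one word;
# the multipliers and feature names are illustrative assumptions.
RULES = [
    lambda f: f["energy"] > f["mean_energy"] * 1.5,      # louder than context
    lambda f: f["pitch"] > f["mean_pitch"] * 1.2,        # pitch excursion
    lambda f: f["duration"] > f["mean_duration"] * 1.3,  # lengthened word
]

def is_prominent(features):
    """Steps 935-950: label the word 1 (prominent) when a majority of
    the rules is satisfied, else 0."""
    satisfied = sum(rule(features) for rule in RULES)
    return 1 if satisfied > len(RULES) // 2 else 0
```

A word labeled 1 would then be sent to the highlighting step (930) to be rendered in bold, underline, etc.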
- FIG. 10 is a diagram that illustrates the presentation of results of the output of prominent words detection and highlighting by the speech emotion recognition application in accordance with an embodiment.
- the GUI 1000 highlights portions of the phrases shown; for example, the word “what” in the phrase “I know what you mean” is highlighted or bolded.
- FIG. 11 illustrates a diagram of the presentation of results of incorporating emotion, word stretching and word prominence detection by the speech emotion recognition application in accordance with an embodiment.
- the final output is shown as stretching vowels in words, for example the word “happy” as “haaappy” 1100 , and bolding words in the phrase “I am so haaappy I am so happy” 1105 .
- the phrase “so beautiful>tentative” 1110 can be illustrated to show prominence by bolding the words.
- another color can also be used to show prominence (e.g., red text can alternatively indicate the prominence).
- FIG. 12 illustrates a diagram of the network for the speech emotion recognition application in accordance with an embodiment.
- the speech samples are sent from client devices to the cloud
- the enriched text is sent from the cloud to the client device
- the cloud sends the speech samples to the voice cloud server
- the voice cloud server sends the enriched text to the cloud.
- a client device 1205 , which may include a series of client devices, is configured with the application platform to host a chat app 1210 and a microphone 1215 for receiving speech samples, audio packets, etc.
- the client device 1205 can be a set-top box configured with a voice remote, a smartphone or for that matter any mobile device with processors capable of speech processing, text communication and connecting with the network cloud 1220 .
- the client device 1205 has a display 1200 for displaying text communication via the chat app 1210 .
- the client device 1205 sends voice samples from a user to the network cloud 1220 for speech to text conversion and enriching with emotion content.
- the voice samples are received at a voice cloud server 1223 having a speech emotion recognition module containing processors for processing speech to text by implementing a speech emotion recognition application.
- the speech emotion recognition module 1225 is configured with various modules and software, such as a natural language processing module 1235 for converting speech to text; a machine learning module 1230 for implementing various deep learning models and rule-based models and for generating, training and testing emotion recognition models (i.e. the various trained models 1255 ); a data set module 1250 for storing data sets of recognized emotion data; and the various trained models 1255 . Additional modules may be added or removed as required in the implementation of the speech emotion recognition system.
- the speech emotion recognition module 1225 communicates with a server chat app 1260 for enriching the text in the chat session between various client devices 1205 . That is, the voice samples that are received by the voice cloud server 1223 are transposed into enriched text (i.e. stretched, highlighted, colored, with added emoticons, etc.) that is sent to the server chat app 1260 for inclusion, replacement or augmenting of the chat text communication between each of the client devices 1205 .
- DSP digital signal processor
- ASIC application-specific integrated circuit
- FPGA field-programmable gate array
- a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an ASIC.
Abstract
Description
- The present invention generally relates to speech recognition and more particularly to a method and system for speech recognition to classify and predict voice messages and commands by extracting properties from voice samples and augmenting a presentation of the converted text of the voice samples to the user.
- Human voice signals can provide a good correlation with the emotional status of a person. The mood of a person can be figured out by observing the tone of his speech. Speech recognition is the process of converting a speech signal into a sequence of words. It may also be referred to as Automatic Speech Recognition (ASR) or Speech-to-Text (STT). The use of speech recognition has become ubiquitous and is used in many aspects of daily life. For example, speech-to-text processing (e.g., word processors or emails), and personal assistants on mobile phones (e.g., APPLE®'s SIRI® on iOS®, GOOGLE® Home on ANDROID®).
- Current speech recognition systems use speaker-dependent speech engines which depend on knowledge of a particular speaker's voice characteristics to achieve the required accuracy levels. This kind of speech engine must be trained for a particular user before it can recognize the user's speech. Often, when performing speech to text conversions, human emotional content is observed while processing the received speech. However, conventional trained speech recognition systems do not adequately re-create that human emotional content.
- Hence, it is desirable to address these inadequacies in speech recognition in communications, particularly in speech to text chat applications where voice emotion content is observed but is not represented in the text communications. The present disclosure addresses at least this need.
- A method and system are provided for improving the emotion content displayed in speech recognition in speech samples with observed emotion content during a speech to text chat communications.
- In an exemplary embodiment, a method for speech emotion recognition for enriching speech to text communications between users in a speech chat session is provided. The method includes: implementing a speech emotion recognition model to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each of a plurality of emotion classes; generating a machine learning (ML) model based on at least one of the set of acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the speech chat session for visual notice of an observed emotion in the speech sample.
- The method further includes the enriching text in the speech to text communications by changing a color of a select set of the text of word in a phrase of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample. The method further includes the enriching text in the speech to text communications by including an emoticon symbol and shading with one or more colors different parts of the emoticon symbol in speech to text communications of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample by color shading in one or more of the different parts of the emoticon symbol. The method further includes the changing dynamically consistent with a change of the observed emotion in the speech sample, the color shading in the one or more of the different parts of the emoticon symbol wherein the change in color shading occurs gradually.
- The method further includes changing dynamically consistent with a dramatic change of the observed emotion in the speech sample, the color shading in all the different parts of the emoticon symbol at once and implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications by: displaying the visual emotion content with text by a color mapping with emoticons; applying a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implementing word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; and representing the perceived excitement in the speech samples by word stretching and by highlighting particular words.
- The method further includes: highlighting or bolding the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion. The method further includes stretching one or more letters in the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion.
- The method further includes computing the duration to syllable ratio based on a first, a second and a third threshold, including: a first syllable count corresponding to a first duration to syllable ratio greater than a first threshold; a second syllable count corresponding to a second duration to syllable ratio greater than a second threshold; and a third syllable count corresponding to a third duration to syllable ratio greater than a third threshold.
- In another exemplary embodiment, a computer program product tangibly embodied in a computer-readable storage device and comprising instructions that when executed by a processor perform a method for speech emotion recognition for enriching speech to text communications between users in a speech chat session is provided. The method includes: implementing a speech emotion recognition model to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each emotion class; generating a machine learning (ML) model based on at least one acoustic feature of the set of acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the speech chat session for visual notice of an observed emotion in the speech sample.
- The method further includes the enriching text in the speech to text communications by changing a color of a select set of the text of word in a phrase of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample.
- The method further includes the enriching text in the speech to text communications by including an emoticon symbol and shading with one or more colors different parts of the emoticon symbol in speech to text communications of a converted speech sample in the speech to text communications in the chat session for the visual notice of the observed emotion in the speech sample by color shading in one or more of the different parts of the emoticon symbol.
- The method further includes the changing dynamically consistent with a change of the observed emotion in the speech sample, the color shading in the one or more of the different parts of the emoticon symbol wherein the change in color shading occurs gradually.
- The method further includes the changing dynamically consistent with a dramatic change of the observed emotion in the speech sample, the color shading in all the different parts of the emoticon symbol at once. The method further includes the implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications by: displaying the visual emotion content with text by a color mapping with emoticons; applying a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implementing word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; representing the perceived excitement in the speech samples by word stretching and by highlighting particular words. The method further includes the highlighting or bolding the word in the phrase for the prominence based on the threshold duration for visual notice of the observed emotion.
- The method further includes the highlighting of the particular words further by bolding the word in the phrase for the prominence for visual notice of the perceived excitement, and the stretching one or more letters in the word in the phrase for the prominence for visual notice of the observed emotion. The method further includes the computing the duration to syllable ratio based on a first, a second and a third threshold, including a first syllable count corresponding to a first duration to syllable ratio greater than a first threshold; a second syllable count corresponding to a second duration to syllable ratio greater than a second threshold; and a third syllable count corresponding to a third duration to syllable ratio greater than a third threshold.
- In yet another embodiment, a system is provided and includes at least one processor; and at least one computer-readable storage device comprising instructions that when executed causes the performance of a method for processing speech samples for speech emotion recognition for enriching speech to text communications between users in speech chat. The system includes: a speech emotion recognition model implemented by the processor to enable converting observed emotions in a speech sample to enrich text with visual emotion content in speech to text communications, wherein the processor configured to: generate a data set of speech samples with labels of a plurality of emotion classes; extract a set of acoustic features from each emotion class; select a machine learning (ML) model based on at least one of the set of acoustic features and data set; train the ML model from a particular acoustic feature from speech samples during speech chat sessions; predict emotion content based on a trained ML model in the observed speech; generate enriched text based on predicted emotion content of the trained ML model; and present the enriched text in speech to text communications between users in the chat session for visual notice of an observed emotion in the speech sample.
- The system further includes the processor further configured to: implement a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content in speech to text communications to: display the visual emotion content with text by a color mapping with emoticons; apply a threshold to compute duration of syllables of a phrase of words in the speech samples wherein each duration corresponds to a perceived excitement in the speech sample; implement word stretching on a computed word duration and on a ratio of the computed word duration to syllables of phrases of the word in the speech sample to gauge intensity of words in the speech sample at a word level; and represent the perceived excitement in the speech samples by word stretching and by highlighting particular words.
- This summary is provided to describe select concepts in a simplified form that are further described in the Detailed Description.
- This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
- Furthermore, other desirable features and characteristics of the system and method will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and the preceding background.
- The present invention will hereinafter be described in conjunction with the following drawing figures, wherein like numerals denote like elements, and wherein:
-
FIG. 1 illustrates an exemplary flowchart for implementing a machine learning training and predicting model for speech emotion recognition in the speech emotion recognition application in accordance with the exemplary embodiments described herein; -
FIG. 2 illustrates an exemplary diagram of visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 3 illustrates an exemplary diagram of visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 4 illustrates an exemplary flowchart of a process of constructing a machine learning (ML) model for word stretching for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 5 illustrates an exemplary flowchart of an identifying process to determine whether word stretching is required for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 6 illustrates an exemplary flowchart for a process to identify which vowel sounds or characters are to be stretched in the word for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 7 illustrates an exemplary flowchart of word stretching in the English language based on the number of vowels and syllables for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 8 illustrates an exemplary diagram of word stretching with vowels for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 9 illustrates an exemplary diagram of the detection of prominent words for visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 10 illustrates an exemplary diagram of bolding prominent words for highlighting in visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; -
FIG. 11 illustrates an exemplary diagram of bolding prominent words and stretching the words to show excitement in visual presentations of emotion content in a speech to text communications in accordance with the exemplary embodiments described herein; and -
FIG. 12 illustrates an exemplary diagram of a cloud network of an emotion content recognition system in a speech to text communications in accordance with the exemplary embodiments described herein. - The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Thus, any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. All the embodiments described herein are exemplary embodiments provided to enable persons skilled in the art to make or use the invention and not to limit the scope of the invention that is defined by the claims. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, brief summary, or the following detailed description.
- Speech recognition is the process of converting a speech signal into a sequence of words. It is also referred to as Automatic Speech Recognition (ASR) or Speech-to-Text (STT). The use of speech recognition has become ubiquitous and is found in many aspects of daily life. For example, use may be found in automotive systems or environments in which users are busy with their hands, home automation (e.g., voice command recognition systems), speech-to-text processing (e.g., word processors or emails), and personal assistants on mobile phones (e.g., APPLE®'s SIRI® on iOS, MICROSOFT®'s CORTANA® on WINDOWS® Phone, GOOGLE® Now on ANDROID®).
- Emotions color the language and act as a necessary ingredient for interaction between individuals. For example, one party in a conversation, when listening to the other party, reacts to the other party's emotional state and adapts their responses depending on what kind of emotions are sent and received. With the advent of text and other non-verbal communications, the emotional component is often lacking, and therefore the recipient is unable to react or respond to different emotional states of the transmitting party. For example, in chat communications what is often received is text without expressions of emotions. The recipient cannot determine at a given moment the emotional state of the sender of the text, leading to confusion, misinterpretation, etc. in chat text exchanges.
- In various exemplary embodiments, the present disclosure provides systems and methods of presenting a rich text output by extracting key properties from voice samples and augmenting the text with additional features to visually illustrate emotion content with the text to the viewer.
- It is often the case that, when performing a speech to text conversion, the tone and flavor (i.e. the human emotions in speech) in the delivery of the spoken speech are lost in conversion. This is because speech to text is a clinical process that focuses on the exactness of the natural language processing of the speech. There has been little emphasis on the human emotion in the delivery of the speech; as a result, part of the emotional components are lost, and this in turn provides the user with an often bland text presentation devoid of human emotional content.
- In the various exemplary embodiments, the present disclosure describes systems and methods implementing a speech emotion recognition application that re-creates emotions present in speech samples by visual representations or changes of the text from the speech sample to generate enriched text output.
- In various exemplary embodiments, the present disclosure describes systems and methods that implement a speech emotion recognition application in voice chat applications for capturing emotions in speech samples that are converted into text and communicated via chat applications.
- In various exemplary embodiments, the present disclosure describes systems and methods that implement a speech emotion recognition application with speech emotion recognition features, text and word stretching detection and stretching, and text and word prominence detection and presentation.
- It is generally thought that human voice signals and their features have a very good correlation with the emotional status of a person. The mood of a person can be inferred by observing the tone of their speech. The present disclosure provides a speech emotion recognition application that implements a machine learning (ML) solution to train a model on a large labeled dataset of speech samples with an emotion category as the label (e.g. angry, sad, happy, neutral, fear, etc.). The speech emotion recognition application employs a two-step process of training a model by implementing the ML solution: creating a training set that contains emotion features extracted from speech samples, which creates a training model of an emotion dataset, and classifying labels of the various emotion classes to classify the data set. In addition, the speech emotion recognition application also employs a rule-based process for enriching the text with emotion content.
-
FIG. 1 illustrates an exemplary flowchart for implementing a machine learning training and predicting model for speech emotion recognition in accordance with an embodiment. In FIG. 1, at step 10, a dataset of audio samples with labeled emotion classes is created by a speech emotion recognition application. The data set is representative of the text-domain to infer various emotions from speech using machine learning techniques to obtain meaningful results. - Various data sets of emotions in speech have been developed and can include the following: (1) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): This database contains 24 professional actors (12 male, 12 female) speaking two sentences in an American English accent in 8 different emotions: calm, happy, sad, angry, fearful, surprise, disgust and neutral. All the audio samples are sampled at 48 kHz with a bit depth of 16 bits. The actors were asked to speak each sentence and emotion in two intensities; (2) Surrey Audio-Visual Expressed Emotion (SAVEE): This database was recorded for the development of an automatic speech emotion recognition system. It has 4 male actors speaking English sentences in 7 different emotions: angry, disgust, happy, fear, sad, surprise and neutral, with a total of 480 utterances. The data were recorded in a visual media lab with high-quality equipment and were labeled; (3) The Toronto Emotional Speech Set (TESS): This database consists of two female actors, ages 26 and 64, speaking sentences of the format 'Say the word ______' with 2000 target words filled into the sentence. The actresses were recruited from the Toronto area and both speak English as their first language. They were asked to speak the sentences in 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral. There are a total of 2800 audio samples; (4)
The Berlin Database of Emotional Speech (Emo-DB): This database includes a set of actors, both male and female, speaking sentences in German. The actors speak in emotion classes such as happy, angry, anxious, fearful, bored, disgusted and neutral. There are more than 10 actors and 500 utterances per actor; (5) The MELD (Multimodal EmotionLines Dataset): The MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series. Multiple speakers participated in the dialogues. Each utterance in a dialogue has been labeled with one of seven emotions: anger, disgust, sadness, joy, neutral, surprise and fear. The disadvantage of this dataset is the background music and background laughter sounds.
- The speech emotion recognition application includes a feature extraction and a feature classification phase. At
step 15, a set of features is extracted from speech samples or audio samples. For example, PCM audio packets generated from a mobile device in communication with a set-top box may send PCM audio samples to a voice server that hosts the speech emotion recognition application to perform acoustic feature extraction. One type of classification is short term classification based on short-period characteristics such as energy, formants, and pitch. The acoustic feature extraction at step 15 can include extracting a set of Mel-Frequency Cepstrum Coefficients (MFCC) for classification. - That is, in an exemplary embodiment, a set of 13 MFCC coefficients is extracted, and a first and second order derivative is derived for each of the MFCC coefficients, resulting in 39 feature streams (13 MFCC+13 DEL+13 DEL DEL) for which statistics like mean, median, variance, maximum and minimum are calculated. This results in a total of about 195 feature combinations. The building and training of the machine learning (ML) model involve first, for the building stage, creating multiple formants where, as an example, a set of parameters of the first five formant frequencies and formant bandwidths is extracted. From these first five formant frequencies and bandwidths, approximately 10 feature derivatives are calculated, which results in about 20 features. For each of these 20 features, additional sub-features can be derived that include statistics such as mean, median, variance, maximum and minimum calculations. This, in turn, results in a set of about 100 features in total.
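- The MFCC statistics step described above can be sketched as follows. This is an illustrative numpy-only version, not the patent's exact implementation: the delta here is a plain first difference for brevity (a production pipeline would typically obtain the coefficient matrix from librosa.feature.mfcc and smoothed deltas from librosa.feature.delta), and the input matrix is a random stand-in for real audio features.

```python
import numpy as np

def deltas(feat):
    # First-order difference along time as a simple delta approximation.
    return np.diff(feat, axis=1)

def stats(feat):
    # The five statistics named above (mean, median, variance, max, min)
    # computed per coefficient stream.
    return np.concatenate([feat.mean(axis=1), np.median(feat, axis=1),
                           feat.var(axis=1), feat.max(axis=1), feat.min(axis=1)])

def mfcc_feature_vector(mfcc):
    # 13 MFCC + 13 DEL + 13 DEL DEL streams, 5 statistics each = 195 features.
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.concatenate([stats(mfcc), stats(d1), stats(d2)])

# Stand-in for a 13-coefficient MFCC matrix over 100 frames:
mfcc = np.random.default_rng(0).normal(size=(13, 100))
print(mfcc_feature_vector(mfcc).shape)  # (195,)
```

The 195-dimensional result matches the feature count stated above (39 streams times 5 statistics).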
- For example, for a pitch feature, a set of statistic features for pitch and its derivative can be extracted, resulting in a total of 10 pitch features. Likewise, for a zero-crossing rate (ZCR) feature, feature statistics for ZCR and for its derivative are extracted, resulting in a ZCR feature set of about 10 features. The same number of features can be derived for the other features by extracting statistics of each feature and its derivative: RMS energy: 10 features; voiced and unvoiced energy fundamental frequency: 10 features; intensity: 10 features; spectral flux: 10 features; spectral bandwidth: 10 features; spectral centroid: 10 features; spectral contrast: 10 features; spectral flatness: 10 features; spectral roll-off: 10 features.
- Also, the speaking rate (the reciprocal of the average length of the voiced part of the utterance) includes 1 feature, the jitter includes 1 feature, the shimmer includes 2 features (shimmer and shimmer in dB), and the harmonic to noise ratio includes 1 feature.
- Combining all these features results in a total of about 430 features. It should be noted, though, that not every feature was used in each model generated; the features for each model are selected based on importance to the model, particularly in the case of models that are generated with a varying number of features.
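- Selecting features by importance can be sketched with scikit-learn's extra trees classifier, whose feature-importance indices are the ones referenced for the "top N features" models below. The data here is a synthetic stand-in (30 features with labels driven by features 3 and 7 only), not the actual 430-feature acoustic set.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 30))              # stand-in for the acoustic feature matrix
y = ((X[:, 3] + X[:, 7]) > 0).astype(int)   # labels driven by features 3 and 7 only

forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
# A per-model feature subset (e.g. a "top 10 features" model) is then X[:, top10].
```

With labels determined entirely by features 3 and 7, those two indices dominate the importance ranking.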
- The speech emotion recognition application, in an exemplary embodiment, may be implemented in Python using Python libraries that include: soundfile, to read a received .wav file with speech; a parsing library such as parselmouth, to extract features like formants, intensity, jitter, shimmer, shimmer in dB, and harmonic to noise ratio; a feature extraction library like librosa, to extract features like ZCR, MFCC, RMS energy, spectral centroid, spectral bandwidth, spectral flatness, spectral roll-off, and spectral contrast; a pitch extraction library called pysptk, to extract features like pitch and fundamental frequency; and pyAudioAnalysis, to extract spectral flux.
- Various machine learning techniques may be implemented by the speech emotion recognition application to recognize emotions, including an AdaBoost classifier, a decision tree, an extra trees classifier (for classification and feature importance), logistic regression, a neural network, a random forest, a support vector machine, and an XGBoost classifier. Deep learning techniques such as Deep Boltzmann Machine (DBM), Recurrent Neural Network (RNN), Recursive Neural Network (RNN), Deep Belief Network (DBN), Convolutional Neural Networks (CNN) and Auto Encoder (AE) may also be applied to speech emotion recognition.
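- As a minimal sketch of one such classifier (a scikit-learn decision tree with the maximum depth of 8 used by several of the models listed below), trained here on synthetic stand-in feature vectors rather than the actual acoustic dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # stand-in acoustic feature vectors
y = (X[:, 0] > 0).astype(int)         # stand-in emotion labels

clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)
print(clf.score(X, y))                # training accuracy of the fitted tree
```

The same fit/predict pattern applies to the other scikit-learn model families named above (extra trees, random forest, logistic regression, SVM).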
- The following is a list of derived models for classifying features in the speech to text emotion recognition model. The decision_tree_1 is a decision tree model with maximum depth 8 which is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on all features.
-
print(cm)
[[203  29   6   3]
 [ 32 153  29   8]
 [ 13  21 167  12]
 [  3   6  33 156]]
print(classification_report(y_test, dtree_predictions))
              precision    recall  f1-score   support
           0       0.81      0.84      0.83       241
           1       0.73      0.69      0.71       222
           2       0.71      0.78      0.75       213
           3       0.87      0.79      0.83       198
   micro avg       0.78      0.78      0.78       874
   macro avg       0.78      0.78      0.78       874
weighted avg       0.78      0.78      0.78       874
- The classification report of the decision_tree_1 model based on validation data includes micro, macro and weighted averages of about 0.78 for classifying the emotional content of the text.
- The decision_tree_1_depth_13 is a decision tree model with maximum depth 13 which is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on all features.
-
print(cm)
[[194  37   4   6]
 [ 25 161  21  15]
 [ 10  16 174  13]
 [  6   6  13 173]]
print(classification_report(y_test, dtree_predictions))
              precision    recall  f1-score   support
           0       0.83      0.80      0.82       241
           1       0.73      0.73      0.73       222
           2       0.82      0.82      0.82       213
           3       0.84      0.87      0.85       198
   micro avg       0.80      0.80      0.80       874
   macro avg       0.80      0.81      0.80       874
weighted avg       0.80      0.80      0.80       874
- The classification report of decision_tree_1_depth_13 based on the validation data has micro, macro and weighted averages of 0.80 for classifying the speech emotion features recognized.
- The decision_tree_1_depth_8_features_10 has a maximum depth of 8 and is trained on the TESS, BERLIN, RAVDESS and SAVEE datasets on the top 10 features given by the feature indices of the extra trees classifier.
-
[[205  18  13   5]
 [ 49 145  25   3]
 [ 21  13 170   9]
 [ 14   6  45 133]]
              precision    recall  f1-score   support
           0       0.71      0.85      0.77       241
           1       0.80      0.65      0.72       222
           2       0.67      0.80      0.73       213
           3       0.89      0.67      0.76       198
   micro avg       0.75      0.75      0.75       874
   macro avg       0.77      0.74      0.75       874
weighted avg       0.76      0.75      0.75       874
- The classification report of decision_tree_1_depth_8_features_10 on validation data has micro, macro and weighted averages of 0.75, 0.77, and 0.76 respectively for classifying the speech emotion features recognized.
- The decision_tree_2_depth_5 is a decision tree model with maximum depth 5 which is trained on the BERLIN, SAVEE, and TESS datasets on all features. -
print(confusion_matrix(y_test, dtree_prediction))
print(classification_report(y_test, dtree_predictions))
[[144  11   3  13]
 [ 21 128   4   7]
 [  8   1 102  23]
 [  5   1   9 382]]
              precision    recall  f1-score   support
           0       0.81      0.84      0.83       171
           1       0.91      0.80      0.85       160
           2       0.86      0.76      0.81       134
           3       0.90      0.96      0.93       397
   micro avg       0.88      0.88      0.88       862
   macro avg       0.87      0.84      0.85       862
weighted avg       0.88      0.88      0.88       862
- The classification report of decision_tree_2_depth_5 on validation data has micro, macro and weighted averages of 0.88, 0.87, and 0.88 respectively for classifying all the speech emotion features recognized.
- The decision_tree_2_depth_8 is a decision tree model with maximum depth 8 which is trained on BERLIN, SAVEE, and TESS on all features.
-
print(confusion_matrix(y_test, dtree_prediction))
print(classification_report(y_test, dtree_predictions))
[[147  14   1   9]
 [ 19 131   3   7]
 [  4   1 112  17]
 [  3   1   4 389]]
              precision    recall  f1-score   support
           0       0.85      0.86      0.85       171
           1       0.89      0.82      0.85       160
           2       0.93      0.84      0.88       134
           3       0.92      0.98      0.95       397
   micro avg       0.90      0.90      0.90       862
   macro avg       0.90      0.87      0.88       862
weighted avg       0.90      0.90      0.90       862
- The classification report of decision_tree_2_depth_8 on validation data has micro, macro and weighted averages of 0.90 for classifying all the speech emotion features recognized.
- The decision_tree_3_depth_8_normalised is a decision tree model with maximum depth 8 which is trained on BERLIN, SAVEE, and RAVDESS on all features with normalization. The scaler is fit to the training data and is saved as a .pickle file: ‘decision_tree_3_depth_8_normaliser_scaler.pickle’.
-
print(confusion_matrix(y_test, dtree_prediction))
print(classification_report(y_test, dtree_predictions))
[[189  21   7   4]
 [ 43 163  14   8]
 [ 12  13 137  27]
 [  5  11  12 398]]
              precision    recall  f1-score   support
           0       0.76      0.86      0.80       221
           1       0.78      0.71      0.75       228
           2       0.81      0.72      0.76       189
           3       0.91      0.93      0.92       426
   micro avg       0.83      0.83      0.83      1064
   macro avg       0.81      0.81      0.81      1064
weighted avg       0.83      0.83      0.83      1064
- The classification report of decision_tree_3_depth_8_normalised on validation data has micro, macro and weighted averages of 0.83, 0.81, and 0.83 respectively for classifying all the speech emotion features recognized.
- The extra_tree_1_features_25 is an extra trees classifier with the number of estimators=250 which is trained on the TESS, SAVEE, BERLIN and RAVDESS datasets on the top 25 features given by the feature indices of the extra trees classifier.
-
y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, dtree_prediction))
print(classification_report(y_test, dtree_predictions))
[[236  17   5   4]
 [ 21 192  11   6]
 [  2   7 209  10]
 [  0   5  12 224]]
              precision    recall  f1-score   support
           0       0.91      0.90      0.91       262
           1       0.87      0.83      0.85       230
           2       0.88      0.92      0.90       228
           3       0.92      0.93      0.92       241
   micro avg       0.90      0.90      0.90       961
   macro avg       0.89      0.90      0.90       961
weighted avg       0.90      0.90      0.90       961
- The classification report of extra_tree_1_features_25 on validation data has micro, macro and weighted averages of 0.90, 0.89 and 0.90 respectively for classifying all the speech emotion features recognized.
- The random_forest_1_noEstimators_100_features_20 is a random forest model with the number of estimators=100 which is trained on TESS, SAVEE, BERLIN, and RAVDESS on top 20 features given by the feature indices of extra trees classifier.
-
y_pred = forest.predict(x_test)
print("confusion_matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report")
print(classification_report(y_test, y_pred))
Confusion matrix:
[[192   9   2   3]
 [ 23 140   6   3]
 [  3   3 151   5]
 [  0   1  14 173]]
Classification Report
              precision    recall  f1-score   support
           0       0.88      0.93      0.91       206
           1       0.92      0.81      0.86       172
           2       0.87      0.93      0.90       162
           3       0.94      0.92      0.93       188
   micro avg       0.90      0.90      0.90       728
   macro avg       0.90      0.90      0.90       728
weighted avg       0.90      0.90      0.90       728
- The classification report of random_forest_1_noEstimators_100_features_20 on validation data has micro, macro and weighted averages of 0.90 for classifying all the speech emotion features recognized.
- An exemplary embodiment of a rules-based model based on the results of the various saved models has the following rules for implementing the speech emotion recognition application. The rules include a calculation of the count (i.e. an "angry," "happy," "neutral," "sad" count) of individual emotion classes among the above 8 models. If the angry count is greater than or equal to all other emotion counts, then the emotion is declared angry; else, if there is any happy output in decision_tree_1, decision_tree_1_depth_13 or decision_tree_2_depth_5, then it is declared happy; else, if the neutral count is greater than or equal to all other emotion counts, then it is declared neutral; else, if there is any sad output in decision_tree_1, decision_tree_1_depth_13 or decision_tree_2_depth_5, then it is declared sad; and if none of the other rules are satisfied, then it is declared neutral.
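- The voting rules above can be sketched in plain Python as follows. The model names match the derived models listed earlier; the "core" tuple holds the three models named in the happy/sad rules.

```python
from collections import Counter

CORE = ("decision_tree_1", "decision_tree_1_depth_13", "decision_tree_2_depth_5")

def combine(predictions):
    """predictions: dict mapping model name -> predicted emotion label."""
    counts = Counter(predictions.values())
    top = max(counts.values())
    if counts.get("angry", 0) >= top:                       # angry count ties or wins
        return "angry"
    if any(predictions.get(m) == "happy" for m in CORE):    # any core model says happy
        return "happy"
    if counts.get("neutral", 0) >= top:                     # neutral count ties or wins
        return "neutral"
    if any(predictions.get(m) == "sad" for m in CORE):      # any core model says sad
        return "sad"
    return "neutral"                                        # default when no rule fires
```

For example, if two models vote angry and one votes sad, the angry count is maximal and the combined output is angry.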
- In
FIG. 2, an exemplary diagram of the presentation of an emoticon and emotion text in accordance with an embodiment is illustrated. In FIG. 2, the emotion status of a user is identified by various algorithmic solutions of the emotion speech recognition application. For example, in FIG. 2, the emotion speech recognition is performed in real-time and displayed in a graphic user interface (GUI) 200 created using PYTHON® code, but it can also be rendered in HTML (i.e. in an HTML page by actuation) with a record button 210 in the graphic user interface 200. The recorded speech is transposed into text 210 with various colors to represent an expression corresponding to the text. For example, the text "so happy" can be generated in a different color such as red (note: text is not shown in color in FIG. 2) to signify happiness. Text such as "I am very scared>confident" can be represented in block 220 with various attributes, such as the block 220 shaded in black and the text displayed in a different color such as yellow to stand out more and to represent confidence, corresponding to the text phrase displayed below of "I am very confident:>confident" 230. As the emotion of the user changes, in text 230 to "I am very scared>confident", the text 230 is displayed in the color blue to show the change from being very confident to being scared. In this manner, the change of state of emotions is exhibited in real-time. The analogy expressed is that emotions associated with a user are not static but dynamic and constantly changing. One moment a user is confident, and the next moment the user is less than confident. Further, in FIG. 2, in the display page 235, an emoticon 240 is presented that in real-time changes from color in an upper part of shades of bluish color to a lower part of shades of green, yellow and orange.
The emoticon 240 mimics dynamic emotion changes that are occurring in the speech of the user as determined by the voice emotion analysis solutions of the speech emotion recognition application. In other words, the output of the speech emotion recognition is configured to be presented in different color shades in the emoticon 240 that is displayed. As the emotions change, the emoticon 240 changes colors or shades of color to correspond to detected changes in the user's emotion as determined by the emotion speech recognition application. - In an exemplary embodiment, if a user is very happy, the
emoticon 240 may be represented in its entirety in the color red. As the user's detected speech emotions change, for example, as the user becomes more confident, the emoticon 240 may be configured to turn more blackish dynamically. The change in color may occur immediately or suddenly, with demarked or immediate color changes, or may be a gradual change in color over a period of time. -
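- A color mapping like the one described can be sketched as a simple lookup. The palette below is only illustrative, inferred from the colors mentioned in the discussion of FIG. 2 (red for happiness, yellow for confidence, blue for scared); the actual mapping is an implementation choice.

```python
# Illustrative emotion-to-color palette (assumed, not normative).
EMOTION_COLORS = {"happy": "red", "confident": "yellow", "scared": "blue"}
DEFAULT_COLOR = "gray"

def colorize(text, emotion):
    """Wrap recognized text in an HTML span carrying the emotion's color."""
    color = EMOTION_COLORS.get(emotion, DEFAULT_COLOR)
    return f'<span style="color:{color}">{text}</span>'

print(colorize("so happy", "happy"))  # <span style="color:red">so happy</span>
```

The same lookup can drive the emoticon tinting: as the predicted emotion changes between chat utterances, the rendered color changes with it.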
FIG. 3 illustrates another exemplary diagram of the presentation of an emoticon and emotion text in accordance with an embodiment. In FIG. 3, for a text of "so happy" 300, the emoticon 310 is presented with a very happy smile. In other words, the emoticon 310 is selected based on the emotion analysis by the speech emotion recognition application, which selected an emoticon 310 that is the best fit or best correspondence to the emotion recognized in the observed voice. In various exemplary embodiments, a table of emoticons with different expressions may be stored locally and be accessible by the speech emotion recognition application for presenting with or in place of recognized emotions in speech. Further, other like images such as GIFs or JPEG images may also be used to convey a presentation of emotions in a display to the user of the recognized emotions. -
FIG. 4 is an exemplary flowchart for building an ML model for word stretching to show an emotion such as excitement by the emotion speech recognition application in accordance with an embodiment. In FIG. 4, at step 410 a dataset of the labeled audio segments of each word is created, at step 420 feature extraction is performed, at step 430 an ML model is built and trained, and at step 440 the model is saved. The threshold-based approach is found to be more accurate than the machine learning approach. The threshold-based approach considers the ratio of the duration of each word to the number of syllables in it; the higher this ratio is, the higher the chance the word is stretched. The thresholds for this feature were categorized into three classes of words. Single syllable words: threshold=0.57. Two syllable words: threshold=0.41. Three syllable words: threshold=0.32 -
FIG. 5 is a flowchart for word stretching detection for a speech emotion recognition application in accordance with an embodiment. At step 505, voice samples are received. At step 510, speech is converted to text and timestamped at the time of conversion. At step 515, the audio segment corresponding to each word is extracted, after which the flow is divided into either the machine learning process or the threshold-based approach. In the machine learning approach, at step 520, the feature extraction is executed. At step 525, the trained model is loaded and fed the features for processing. At step 530, the stretching status of each of the loaded words is predicted. At step 535, the output is equal to 1 if the probability is greater than or equal to 0.5, or equal to 0 if the probability is less than 0.5, to determine which words, vowels, etc. are to be stretched. The threshold-based approach, at step 540, is to compute the duration-to-syllable ratio and apply a threshold, and includes computing the duration of syllables of a phrase of words in the speech samples with each duration corresponding to a perceived excitement in the speech sample. - At
step 545, if the sylcount is equal to 1 and the dursylratio is greater than threshold 1, then the output at step 550 is 1. In general, the syllable count is a whole number (e.g. 1, 2, 3) and the dursylratio is expressed in terms of threshold 1, threshold 2, etc. At step 555, if the sylcount is equal to 2 and the dursylratio is greater than threshold 2, then the output at step 560 is 1. At step 565, if the sylcount is equal to 3 or greater and the dursylratio is greater than threshold 3, then the output at step 570 is 1. If not, then at step 575 the output is equal to 0. Once the identification of which word to stretch is done, the next step is to implement that stretching, which includes the identification of exactly which vowel in the word was stretched. Hence, by implementing word stretching based on a computed word duration and on the ratio of the computed word duration to the syllables of the word in the speech sample, the intensity of each word can be gauged in the speech sample at a word level, so the perceived excitement in the speech samples can be represented by word stretching and by highlighting particular words. -
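- The duration-to-syllable threshold test of steps 540-575 can be sketched directly from the thresholds given with FIG. 4 (0.57, 0.41, and 0.32 for one-, two-, and three-or-more-syllable words):

```python
# Seconds-per-syllable thresholds from the FIG. 4 description;
# three-or-more-syllable words share the third threshold.
THRESHOLDS = {1: 0.57, 2: 0.41, 3: 0.32}

def is_stretched(word_duration, syl_count):
    """Return 1 if the word's duration/syllable ratio exceeds its class threshold."""
    ratio = word_duration / syl_count
    return 1 if ratio > THRESHOLDS[min(syl_count, 3)] else 0

print(is_stretched(0.9, 1))  # 1: ratio 0.9 exceeds 0.57
print(is_stretched(0.5, 2))  # 0: ratio 0.25 does not exceed 0.41
```

A word flagged with output 1 is then passed to the stretching implementation described next.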
FIG. 6 is a flowchart for implementing word stretching for the speech emotion recognition application in accordance with an embodiment. In FIG. 6, at step 610, the audio segment of the individual word is received by the speech emotion recognition application. At step 620, the identification of each syllable nucleus is performed, and at step 630, the speech emotion recognition application computes the duration of each syllable. At step 640, it identifies the syllable whose duration is greater than the threshold. At step 650, the stretched syllable is mapped to the corresponding character in the word. At step 660, each of the relevant characters is repeated to create the stretched syllable of characters in the word. - In order to implement the stretching process of characters in each word, a set of rules is created. The set of rules is designed for stretching English words based on the number of vowels and syllables. Though the implementation described is for the English language, it is contemplated that a variety of other languages may also be amenable to the stretching process with or without minor modifications. In other words, the stretching of characters in words is not limited to just English words.
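- Steps 630-640 of FIG. 6 reduce to finding the syllable whose measured duration exceeds the threshold. A minimal sketch, assuming the per-syllable durations come from the nuclei identified at step 620:

```python
def stretched_syllable_index(syllable_durations, threshold):
    """Index of the first syllable whose duration exceeds the threshold, else -1."""
    for i, duration in enumerate(syllable_durations):
        if duration > threshold:
            return i
    return -1

print(stretched_syllable_index([0.12, 0.48, 0.15], 0.3))  # 1
```

The returned index is what step 650 maps to a character position in the written word.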
-
FIG. 7 is a flowchart of stretching words with vowels by the speech emotion recognition application in accordance with an embodiment. As indicated, there is a plethora of rules necessitated in the stretching process; one basic rule for word stretching is that single character words are left unchanged. FIG. 7 illustrates each of the rules, which depend on the number of vowels and the number of syllables in the word and which range over a set of cases 1 to 5. The rules can be described as follows: Case 1: at 705, words starting with a vowel that have only one vowel. For this type of word 730, the last character should be repeated; for example: oh->ohhh, and at->attt. Case 2: at 710, words starting with a vowel that have more than one vowel. For this type of word, there are two subcases: Case 2 a) at 735, if the word has two vowels at the start, then at 740 the last character is repeated; for example: out->outtt, ouch->ouchhh; and Case 2 b) else, at 745, the corresponding vowel is repeated; for example: about->abooout, again->agaaain. Case 3: at 715, words not starting with a vowel that have only one vowel. For this type of word, there are two subcases: Case 3 a) at 750, if the word is a single syllable, repeat at 755 the corresponding vowel; for example: way->waaay, hi->hiii; and Case 3 b) if the word has more than one syllable, at 760, consider 'y' also as a vowel and repeat the corresponding vowel; for example: lady->ladyyy, happy->happyyy. Case 4: at 720, words that do not start with a vowel and have more than one vowel: repeat at 775 the corresponding vowel; for example: hello->heeello, where->wheeere; but at 770, if the number of syllables is more than the number of vowels, then consider 'y' also as a vowel and repeat the corresponding vowel; for example: agony->agonyyy.
Case 5: at 725, words with no vowels: These types of words have two subcases, Case 5 a) at 780, if ‘y’ is there in the word: at 785, Consider ‘y’ as a vowel and repeat the vowel. For example, my->myyy and Case 5 b) at 790, if ‘y’ is not there in the word: then repeat the last letter.
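As a concrete illustration, the five cases can be sketched in Python. This is a hedged reading of the rules, not the patent's code: the `stressed_vowel` parameter stands in for the acoustic mapping from the stretched syllable to a vowel (step 650), the syllable counter is a crude vowel-run heuristic, and the number of extra characters (two) is an assumption.

```python
VOWELS = set("aeiou")

def vowel_positions(word, include_y=False):
    vs = VOWELS | ({"y"} if include_y else set())
    return [i for i, ch in enumerate(word) if ch in vs]

def syllable_count(word):
    # crude estimate: count runs of consecutive vowels, treating 'y' as a vowel
    count, prev = 0, False
    for ch in word:
        is_v = ch in VOWELS or ch == "y"
        if is_v and not prev:
            count += 1
        prev = is_v
    return max(count, 1)

def repeat_at(word, i, times=2):
    # insert `times` extra copies of the character at index i
    return word[: i + 1] + word[i] * times + word[i + 1 :]

def stretch(word, stressed_vowel=0):
    """stressed_vowel indexes into the word's vowel list and stands in for the
    acoustically detected stretched syllable (an assumption of this sketch)."""
    w = word.lower()
    if len(w) <= 1:                        # rule: single-character words unchanged
        return w
    vpos = vowel_positions(w)
    if not vpos:                           # Case 5: no a/e/i/o/u in the word
        ypos = vowel_positions(w, include_y=True)
        return repeat_at(w, ypos[0] if ypos else len(w) - 1)
    if w[0] in VOWELS:
        if len(vpos) == 1:                 # Case 1: repeat the last character
            return repeat_at(w, len(w) - 1)
        if vpos[:2] == [0, 1]:             # Case 2a: two leading vowels -> last char
            return repeat_at(w, len(w) - 1)
        # Case 2b: repeat the stressed vowel
        return repeat_at(w, vpos[min(stressed_vowel, len(vpos) - 1)])
    if len(vpos) == 1:
        if syllable_count(w) == 1:         # Case 3a: repeat the single vowel
            return repeat_at(w, vpos[0])
        ypos = vowel_positions(w, include_y=True)   # Case 3b: 'y' counts as a vowel
        return repeat_at(w, ypos[min(stressed_vowel, len(ypos) - 1)])
    # Case 4: repeat the stressed vowel ('y' counts when syllables outnumber vowels)
    pool = vowel_positions(w, include_y=syllable_count(w) > len(vpos))
    return repeat_at(w, pool[min(stressed_vowel, len(pool) - 1)])
```

Under these assumptions the sketch reproduces the examples above: `stretch("oh")` gives "ohhh" (Case 1), `stretch("out")` gives "outtt" (Case 2a), and `stretch("way")` gives "waaay" (Case 3a).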
-
FIG. 8 is a diagram that illustrates the presentation of the results of word stretching by the speech emotion recognition application in accordance with an embodiment. The output of the word stretching is illustrated in FIG. 8, where, upon actuating the record button 805 of the GUI 800, the vowels "aaa" in the phrase "whaaat is this" 810 are stretched (i.e., the corresponding vowel is repeated), and in the phrase "Ohhh that was a goaaall" 815, the last "h" in the word "ohhh" 820 is treated as a vowel sound and repeated. -
FIG. 9 is a flowchart of the detection of prominent words by the speech emotion recognition application in accordance with an embodiment. The flowchart in FIG. 9 illustrates word prominence detection and highlighting. Detecting prominent words in the conversation can add more meaning and give more clarity. For example, the phrase 'I never said she stole my money' is a 7-word sentence that can take on 7 different meanings depending on which individual word is stressed. The following are examples of bolded stressed words in the phrase that can convey different meanings:
- **I** never said she stole my money
- I **never** said she stole my money
- I never **said** she stole my money
- I never said **she** stole my money
- I never said she **stole** my money
- I never said she stole **my** money
- I never said she stole my **money**
- Both the machine learning approach and the rules-based approach were attempted, but the ML approach, because of outside factors, was deemed not applicable to all chat services. In other words, a more general methodology for word prominence can appeal to a wider class of chat applications that fall outside the training set. In this case, since each speaker has his or her own energy level and pace while talking, the detection of word prominence is more easily implemented with a rule-based approach.
- The rules for implementing the word prominence features can be described as follows, evaluated for each word i in range(number of words in the sentence):
- 1. meanIntensity[i] > 3*(mean of all the words' meanIntensity)
- 2. meanDerIntensity[i] > 3*(mean of all the words' meanDerIntensity)
- 3. varianceIntensity[i] > 2*(mean of all the words' varianceIntensity)
- 4. varianceDerIntensity[i] > 0.9*(mean of all the words' varianceDerIntensity)
- 5. durationSyllableRatio[i] > mean of all the words' durationSyllableRatio
- 6. meanRMS[i] > mean of all the words' meanRMS
- 7. maxRMS[i] > mean of all the words' maxRMS
- 8. wordType[i] == 1 (the word is a content word)
- The pseudocode for word prominence detection is:
- function detectProminentWords( ) // identify the stressed words from the user sentence using the word-level features
{
array wordProminenceStatus[numWords] = {0, 0, . . . } // set all words as not prominent initially
int numOfRulesSatisfied; // variable for holding the number of rules satisfied out of the 8 rules
for (i = 0; i < numWords; i++) // looping over all the words of the sentence
{
numOfRulesSatisfied = 0 // reset the value for every iteration
// Rule 1: is the meanIntensity of the current word greater than 3 times the average meanIntensity of all the words in the sentence?
if (meanIntensity[i] > 3*(mean of all the words' meanIntensity)) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 2: is the meanDerIntensity (mean of intensity derivatives) greater than 3 times the average meanDerIntensity of all the words?
if (meanDerIntensity[i] > 3*(mean of all the words' meanDerIntensity)) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 3: is the varianceIntensity greater than 2 times the average varianceIntensity of all the words?
if (varianceIntensity[i] > 2*(mean of all the words' varianceIntensity)) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 4: is the varianceDerIntensity (variance of intensity derivatives) greater than 0.9 times the average varianceDerIntensity of all the words?
if (varianceDerIntensity[i] > 0.9*(mean of all the words' varianceDerIntensity)) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 5: is the durationSyllableRatio greater than the average durationSyllableRatio of all the words?
if (durationSyllableRatio[i] > mean of all the words' durationSyllableRatio) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 6: is the meanRMS greater than the average meanRMS of all the words?
if (meanRMS[i] > mean of all the words' meanRMS) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 7: is the maxRMS greater than the average maxRMS of all the words?
if (maxRMS[i] > mean of all the words' maxRMS) numOfRulesSatisfied = numOfRulesSatisfied + 1
// Rule 8: is the wordType equal to 1? (wordType = 1 => content word, wordType = 0 => functional word)
if (wordType[i] == 1) numOfRulesSatisfied = numOfRulesSatisfied + 1
if (numOfRulesSatisfied >= 4) // if the majority of the rules are satisfied
wordProminenceStatus[i] = 1 // flag this word as prominent (stressed)
}
}
// words whose wordProminenceStatus property is 1 are expressed in BOLD format for highlighting - In FIG. 9, the approach for detecting prominent words begins at step 905 with the extraction of the speech or audio segment of each word. The acoustic features, as well as the lexical features, are then extracted at step 910. The flow then divides into the ML approach or the rule-based approach. Briefly, the ML approach includes, at step 915, selecting the relevant features; at step 920, feeding the features to the trained model; and then predicting the result in a range from 0 to 1. The alternative path of the rule-based approach includes, at step 935, selecting the most helpful features; at step 940, applying the rules to the selected features; at step 945, computing the number of rules satisfied; and at step 950, labeling the word with a 1 if a majority of the rules is satisfied, else with a 0. The output of both paths, ML and rule-based, is sent to step 930 for highlighting the word using bold, underline, script, etc. In exemplary embodiments, the rule-based path yielded superior results to the ML method, but both solutions are feasible alternatives. -
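The rule-based path (steps 935-950) can be turned into a small runnable sketch. The feature names follow the eight rules above; the per-word feature values used in practice would come from audio analysis (intensity contours, RMS energy, syllable counts, content/function word tagging), which is outside this sketch and assumed here.

```python
# Multipliers for rules 1-7: feature must exceed multiplier * sentence mean.
RULE_THRESHOLDS = [
    ("meanIntensity", 3.0),
    ("meanDerIntensity", 3.0),
    ("varianceIntensity", 2.0),
    ("varianceDerIntensity", 0.9),
    ("durationSyllableRatio", 1.0),
    ("meanRMS", 1.0),
    ("maxRMS", 1.0),
]

def detect_prominent_words(features):
    """features: one dict per word carrying the eight features named in the
    rules. Returns a 0/1 prominence flag per word (1 = stressed, render bold)."""
    n = len(features)
    means = {key: sum(w[key] for w in features) / n for key, _ in RULE_THRESHOLDS}
    status = []
    for w in features:
        satisfied = sum(
            1 for key, mult in RULE_THRESHOLDS if w[key] > mult * means[key]
        )
        if w["wordType"] == 1:  # Rule 8: content word (0 = functional word)
            satisfied += 1
        status.append(1 if satisfied >= 4 else 0)  # majority of the 8 rules
    return status
```

With illustrative numbers, a single loud, long content word among four quiet function words satisfies all eight rules and is flagged as prominent, while the others satisfy none.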
FIG. 10 is a diagram that illustrates the presentation of results of the output of prominent-word detection and highlighting by the speech emotion recognition application in accordance with an embodiment. In FIG. 10, the GUI 1000 highlights portions of the phrases shown in the GUI 1000. For example, the word "what" in the phrase "I know what you mean" is highlighted or bolded. -
FIG. 11 illustrates a diagram of the presentation of results of incorporating emotion, word stretching, and word prominence detection by the speech emotion recognition application in accordance with an embodiment. In FIG. 11, the final output shows stretched vowels in words, for example the word "happy" rendered as "haaappy" 1100, and bolded words, as in the phrase "I am so haaappy I am so happy" 1105. Finally, the phrase "so beautiful>tentative" 1110 illustrates prominence by bolding the words. In an exemplary embodiment, another color can also be used to show prominence (e.g., an alternative color such as red). -
FIG. 12 illustrates a diagram of the network for the speech emotion recognition application in accordance with an embodiment. In various exemplary embodiments, the speech samples are sent from client devices to the cloud, the enriched text is sent from the cloud to the client device, the cloud sends the speech samples to the voice cloud server, and the voice cloud server sends the enriched text to the cloud. FIG. 12 illustrates a client device 1205, which may be one of a series of client devices configured with the application platform to host a chat app 1210 and a microphone 1215 for receiving speech samples, audio packets, etc. The client device 1205 can be a set-top box configured with a voice remote, a smartphone, or any mobile device with processors capable of speech processing, text communication, and connecting with the network cloud 1220. The client device 1205 has a display 1200 for displaying text communication via the chat app 1210. In an exemplary embodiment, the client device 1205 sends voice samples from a user to the network cloud 1220 for speech-to-text conversion and enrichment with emotion content. The voice samples are received at a voice cloud server 1223 having a speech emotion recognition module containing processors for converting speech to text by implementing a speech emotion recognition application. The speech emotion recognition module 1225 is configured with various modules and software, such as a natural language processing module 1235 for converting speech to text, a machine learning module 1230 for implementing various deep learning and rule-based models, a module for generating and training/testing emotion recognition (i.e., the various trained models 1255), and a data set module 1250 for storing data sets of recognized emotion data. Additional modules may be added or removed as required in the implementation of the speech emotion recognition system.
In addition, the speech emotion recognition module 1225 communicates with a server chat app 1260 for enriching the text in the chat session between the various client devices 1205. That is, the voice samples received by the voice cloud server 1223 are transposed into enriched text (i.e., stretched, highlighted, colored, with added emoticons, etc.) that is sent to the server chat app 1260 for inclusion in, replacement of, or augmentation of the chat text communication between each of the client devices 1205. - Those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Some of the embodiments and implementations are described above in terms of functional and/or logical block components (or modules) and various processing steps. However, it should be appreciated that such block components (or modules) may be realized by any number of hardware, software, and/or firmware components configured to perform the specified functions. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality.
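The round trip described above (client device to network cloud to voice cloud server, and back through the server chat app) can be sketched as a set of stand-in functions. Everything here is illustrative: the function names, the payload fields, and the pass-through "enrichment" are assumptions for the sketch, not interfaces defined by the patent.

```python
def voice_cloud_server(speech_sample):
    """Stand-in for the speech emotion recognition module 1225: speech-to-text
    and enrichment (stretching, highlighting, emotion tags) would happen here."""
    # Illustrative: a real server runs ASR and the enrichment pipeline; this
    # sketch just passes through a pre-supplied transcript and marks it enriched.
    return {"enriched_text": speech_sample["transcript"], "enriched": True}

def network_cloud(speech_sample):
    """The network cloud 1220 forwards samples to the voice cloud server 1223
    and relays the enriched text back toward the client."""
    return voice_cloud_server(speech_sample)

def client_send(chat_session, speech_sample):
    """Client device 1205 sends a voice sample and appends the enriched reply
    to the chat session shown on its display."""
    reply = network_cloud(speech_sample)
    chat_session.append(reply["enriched_text"])
    return chat_session
```

A client call then mirrors the figure's data flow: the sample travels out, and the enriched text lands in the chat session.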
- Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. For example, an embodiment of a system or a component may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, or the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. In addition, those skilled in the art will appreciate that the embodiments described herein are merely exemplary implementations.
- The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a controller or processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
- In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Numerical ordinals such as “first,” “second,” “third,” etc. simply denote different singles of a plurality and do not imply any order or sequence unless specifically defined by the claim language. The sequence of the text in any of the claims does not imply that process steps must be performed in a temporal or logical order according to such sequence unless it is specifically defined by the language of the claim. The process steps may be interchanged in any order without departing from the scope of the invention as long as such an interchange does not contradict the claim language and is not logically nonsensical.
- Furthermore, depending on the context, words such as “connect” or “coupled to” used in describing a relationship between different elements do not imply that a direct physical connection must be made between these elements. For example, two elements may be connected to each other physically, electronically, logically, or in any other manner, through one or more additional elements.
- While at least one exemplary embodiment has been presented in the foregoing detailed description of the invention, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment of the invention. It is understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/677,324 US11133025B2 (en) | 2019-11-07 | 2019-11-07 | Method and system for speech emotion recognition |
US17/446,385 US11688416B2 (en) | 2019-11-07 | 2021-08-30 | Method and system for speech emotion recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/677,324 US11133025B2 (en) | 2019-11-07 | 2019-11-07 | Method and system for speech emotion recognition |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/446,385 Continuation US11688416B2 (en) | 2019-11-07 | 2021-08-30 | Method and system for speech emotion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210142820A1 true US20210142820A1 (en) | 2021-05-13 |
US11133025B2 US11133025B2 (en) | 2021-09-28 |
Family
ID=75847619
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/677,324 Active 2040-04-01 US11133025B2 (en) | 2019-11-07 | 2019-11-07 | Method and system for speech emotion recognition |
US17/446,385 Active US11688416B2 (en) | 2019-11-07 | 2021-08-30 | Method and system for speech emotion recognition |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/446,385 Active US11688416B2 (en) | 2019-11-07 | 2021-08-30 | Method and system for speech emotion recognition |
Country Status (1)
Country | Link |
---|---|
US (2) | US11133025B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11133025B2 (en) * | 2019-11-07 | 2021-09-28 | Sling Media Pvt Ltd | Method and system for speech emotion recognition |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US2537240A (en) | 1946-02-01 | 1951-01-09 | Bendix Aviat Corp | Air speed indicator |
EP0314838A1 (en) | 1987-11-06 | 1989-05-10 | The Boeing Company | Aircraft's tail section drag compensating for nose-down pitching moment |
US7912720B1 (en) * | 2005-07-20 | 2011-03-22 | At&T Intellectual Property Ii, L.P. | System and method for building emotional machines |
JP2007041988A (en) * | 2005-08-05 | 2007-02-15 | Sony Corp | Information processing device, method and program |
US8788270B2 (en) * | 2009-06-16 | 2014-07-22 | University Of Florida Research Foundation, Inc. | Apparatus and method for determining an emotion state of a speaker |
US9031293B2 (en) * | 2012-10-19 | 2015-05-12 | Sony Computer Entertainment Inc. | Multi-modal sensor based emotion recognition and emotional interface |
WO2015105994A1 (en) * | 2014-01-08 | 2015-07-16 | Callminer, Inc. | Real-time conversational analytics facility |
US9542927B2 (en) * | 2014-11-13 | 2017-01-10 | Google Inc. | Method and system for building text-to-speech voice from diverse recordings |
CN114936856A (en) * | 2017-05-16 | 2022-08-23 | 苹果公司 | User interface for peer-to-peer transmission |
WO2020046831A1 (en) * | 2018-08-27 | 2020-03-05 | TalkMeUp | Interactive artificial intelligence analytical system |
US11011183B2 (en) * | 2019-03-25 | 2021-05-18 | Cisco Technology, Inc. | Extracting knowledge from collaborative support sessions |
US11133025B2 (en) * | 2019-11-07 | 2021-09-28 | Sling Media Pvt Ltd | Method and system for speech emotion recognition |
-
2019
- 2019-11-07 US US16/677,324 patent/US11133025B2/en active Active
-
2021
- 2021-08-30 US US17/446,385 patent/US11688416B2/en active Active
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220300251A1 (en) * | 2019-12-10 | 2022-09-22 | Huawei Technologies Co., Ltd. | Meme creation method and apparatus |
US11941323B2 (en) * | 2019-12-10 | 2024-03-26 | Huawei Technologies Co., Ltd. | Meme creation method and apparatus |
US11694690B2 (en) * | 2020-04-09 | 2023-07-04 | Yp Labs Co., Ltd. | Method and system providing service based on user voice |
US20220351730A1 (en) * | 2020-04-09 | 2022-11-03 | Yp Labs Co., Ltd. | Method and system providing service based on user voice |
US11478704B2 (en) * | 2020-11-04 | 2022-10-25 | Sony Interactive Entertainment Inc. | In-game visualization of spectator feedback |
US20220294750A1 (en) * | 2021-03-15 | 2022-09-15 | Fujifilm Business Innovation Corp. | Information processing apparatus, information processing method, and non-transitory computer readable medium |
US11716298B2 (en) * | 2021-03-15 | 2023-08-01 | Fujifilm Business Innovation Corp. | Information processing apparatus, information processing method, and non-transitory computer readable medium |
CN113241095A (en) * | 2021-06-24 | 2021-08-10 | 中国平安人寿保险股份有限公司 | Conversation emotion real-time recognition method and device, computer equipment and storage medium |
CN113397546A (en) * | 2021-06-24 | 2021-09-17 | 福州大学 | Method and system for constructing emotion recognition model based on machine learning and physiological signals |
CN113988456A (en) * | 2021-11-10 | 2022-01-28 | 中国工商银行股份有限公司 | Emotion classification model training method, emotion prediction method and emotion prediction device |
CN115050395A (en) * | 2022-05-07 | 2022-09-13 | 南京邮电大学 | Noise-containing speech emotion recognition method based on multi-field statistical characteristics and improved CNN |
GB2621873A (en) * | 2022-08-25 | 2024-02-28 | Sony Interactive Entertainment Inc | Content display system and method |
CN115456114A (en) * | 2022-11-04 | 2022-12-09 | 之江实验室 | Method, device, medium and equipment for model training and business execution |
CN118193683A (en) * | 2024-05-14 | 2024-06-14 | 福州掌中云科技有限公司 | Text recommendation method and system based on language big model |
Also Published As
Publication number | Publication date |
---|---|
US11688416B2 (en) | 2023-06-27 |
US20210390973A1 (en) | 2021-12-16 |
US11133025B2 (en) | 2021-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11133025B2 (en) | Method and system for speech emotion recognition | |
US11630999B2 (en) | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions | |
US11514886B2 (en) | Emotion classification information-based text-to-speech (TTS) method and apparatus | |
JP6341092B2 (en) | Expression classification device, expression classification method, dissatisfaction detection device, and dissatisfaction detection method | |
US10452352B2 (en) | Voice interaction apparatus, its processing method, and program | |
US9293133B2 (en) | Improving voice communication over a network | |
CN107818798A (en) | Customer service quality evaluating method, device, equipment and storage medium | |
JP6440967B2 (en) | End-of-sentence estimation apparatus, method and program thereof | |
US12131586B2 (en) | Methods, systems, and machine-readable media for translating sign language content into word content and vice versa | |
TW201214413A (en) | Modification of speech quality in conversations over voice channels | |
JP5506738B2 (en) | Angry emotion estimation device, anger emotion estimation method and program thereof | |
CN110634479B (en) | Voice interaction system, processing method thereof, and program thereof | |
JP7526846B2 (en) | voice recognition | |
CN117043856A (en) | End-to-end model on high-efficiency streaming non-recursive devices | |
CN114328867A (en) | Intelligent interruption method and device in man-machine conversation | |
Chakraborty et al. | Knowledge-based framework for intelligent emotion recognition in spontaneous speech | |
CN108831503B (en) | Spoken language evaluation method and device | |
KR102193656B1 (en) | Recording service providing system and method supporting analysis of consultation contents | |
Eyben et al. | Audiovisual vocal outburst classification in noisy acoustic conditions | |
Chakraborty et al. | Spontaneous speech emotion recognition using prior knowledge | |
CN114792521A (en) | Intelligent answering method and device based on voice recognition | |
US11632345B1 (en) | Message management for communal account | |
Kafle et al. | Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues | |
US11501091B2 (en) | Real-time speech-to-speech generation (RSSG) and sign language conversion apparatus, method and a system therefore | |
EP4006900A1 (en) | System with speaker representation, electronic device and related methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SLING MEDIA PVT LTD, INDIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAIKAR, YATISH JAYANT NAIK;TRIPATHI, VARUNKUMAR;CHITTELLA, KIRAN;AND OTHERS;REEL/FRAME:050951/0875 Effective date: 20191107 |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
AS | Assignment |
Owner name: DISH NETWORK TECHNOLOGIES INDIA PRIVATE LIMITED, INDIA Free format text: CHANGE OF NAME;ASSIGNOR:SLING MEDIA PVT. LTD.;REEL/FRAME:061365/0493 Effective date: 20220609 |