GB2494104A - Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice - Google Patents

Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice

Info

Publication number
GB2494104A
GB2494104A GB1114305.4A GB201114305A GB2494104A GB 2494104 A GB2494104 A GB 2494104A GB 201114305 A GB201114305 A GB 201114305A GB 2494104 A GB2494104 A GB 2494104A
Authority
GB
United Kingdom
Prior art keywords
text
speech
emotion
determined
emotions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1114305.4A
Other versions
GB201114305D0 (en)
Inventor
Simon Mark Adam Bell
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB1114305.4A priority Critical patent/GB2494104A/en
Publication of GB201114305D0 publication Critical patent/GB201114305D0/en
Publication of GB2494104A publication Critical patent/GB2494104A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Abstract

Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice. The invention lies in analyzing characteristics detected in a portion of human speech, in particular in measuring pre-determined characteristics so as to determine the emotional response induced in a listener by a portion of speech. It converts the portion of speech into digital speech signals, processes the digital speech signals to measure a plurality of pre-determined acoustic features, maps the measurements to at least one emotion in a pre-determined plurality of emotions, and presents the emotion.

Description

Emotion Analysis In Speech

The present invention relates to a method and corresponding apparatus for analysing characteristics detected in a portion of human speech. In particular, it relates to the measurement of pre-determined characteristics so as to determine the emotional response induced in a listener by said portion of speech.
The analysis of human speech has been an area of commercial and academic interest for some time. Audio signal processing techniques are known wherein auditory signals (sound) are fed into a device such as a microphone and are subsequently altered so that the sound is represented in either digital or analog form. Typically, the pressure wave-form is expressed in a digital representation (as a sequence of binary digits) so that the signal can then be processed electronically by a computer.
Numerous software applications have been written to enable the computerised analysis and manipulation of digitally-represented speech. Speech may be analysed for a variety of purposes, including the attempted detection of the emotional content of the spoken utterance or the emotional state of the speaker.
For example, WO 2007072485 (A1) discloses a method and system developed by exAudios Technologies Limited for identifying the emotional attitude of a speaker by the analysis of intonation patterns detected in the speaker's voice. This information can then be used by the listener to adapt his own speech in the hope of achieving a pre-determined goal. For example, a call centre operative wishing to sell a product or service to a potential customer may use the system to analyse the customer's speech and assess the customer's emotional state. The listener (the call centre operative) can then adapt his sales approach (i.e. his own speech) accordingly in an attempt to make the sale. For example, if the customer's speech indicates that he is unhappy, the salesperson will want to alter his approach, as an unhappy customer is unlikely to make a purchase.
Another known system is the EmotionSense system devised by St. Andrews and Cambridge Universities (UbiComp 2010, September 26-29, 2010, Copenhagen, Denmark; ACM 978-1-60558-843-8/10/09). EmotionSense provides a mobile sensing platform for social psychology studies based on mobile phone usage, including the sensing of speakers' emotions in relation to the user's location and the time of day. The emotion sensing component of the system records audio samples with a variable sampling interval. Each sample is processed to extract speaker and emotion-related information by comparing it against a set of pre-loaded emotion and speaker-dependent models. As with the exAudios arrangement, EmotionSense is directed to the analysis of speech to determine the speaker's emotional state.
Thus, neither of these disclosures addresses the task of analysing a speaker's voice to determine the emotional state that the speech will induce in the other person (i.e. the listener). Such an approach would enable a speaker to assess the response that his speech is generating in the listener and thus to adapt (or maintain) his subsequent speech in order to influence the listener in accordance with the speaker's motivational goal.
Thus, there is a need for a new approach which provides a measurement of the emotional influence that a speaker's utterances produce on a listener. Such an approach has now been devised.
In addressing this problem, the deviser of the present invention has drawn upon the established theories and principles of Transactional Analysis (TA). TA is an integrative approach to the theory of psychology and psychotherapy developed during the late 1950s by the psychiatrist Eric Berne. Two key concepts within TA theory are 1) the ego-state model and 2) transactions.
TA's ego-state model

TA can be used as a theory of personality to describe how human beings behave, think and feel. Berne's model of Parent-Adult-Child ego-states is used to explain how an individual experiences and manifests his personality by using one of these three states at any given time:
* Parent: a state in which an individual's behaviour or thinking is determined by subconsciously reproducing behaviour observed in their parents;
* Adult: a state which is devoid of major emotional influence, akin to a computer processing information;
* Child: childhood being the source of emotions, creativity, recreation, spontaneity and intimacy, a person in the Child state mimics the way he felt, thought or behaved in childhood.
Transactions

Berne's TA approach focuses upon the content of people's interactions (which Berne calls "games") with one another. According to TA, the analysis and alteration of these transactions provides a means for solving individuals' emotional problems. In TA, a 'transaction' is a flow of communication. As well as being experienced as 'positive' or 'negative', transactions fall into three categories:
1. Reciprocal/Complementary: psychologically balanced interactions in which both partners appropriately address the ego state of the other person: Adult to Adult, Parent to Child, Child to Parent, Child to Child;
2. Duplex/Covert: complex interactions in which understanding the explicit social conversation depends upon knowledge of an implicit psychological transaction; for example, the words used may relate to the Adult state but the body language or gestures of the participant indicate an intent pertaining to the Child state (e.g. humour, flirtation etc.);
3. Crossed: crossed transactions are communication failures in which a dialogue participant addresses an incorrect ego-state. For example, instead of addressing an Adult ego-state, a participant speaks as a Parent would speak to a Child.
Building upon and drawing from the principles of TA, the inventor has developed Berne's Parent-Adult-Child model so as to divide the sound of the human voice into three categories:
* Directive
* Logical
* Passionate
As TA further divides the Child and Parent ego states into positive and negative, plus selfless and selfish sub-states, these give rise to nine ego states in respect of a sound.
Taking these theories further, the inventor has recognised that the middle (logical) sound can also be sub-divided into positive and negative, plus selfless and selfish sub-states, such that between the three ego states it is possible to discern 12 sounds which represent human emotions as expressed in speech. Further still, it was observed that a neutral sound could be recognised in speech. By adding this neutral sound to the selfless and selfish sub-states in each ego state, the inventor has arrived at 18 emotional states which can be considered as corresponding to a recognisable sound within the human voice.
Thus, in accordance with the present invention there is provided a method for determining one or more emotions experienced by a listener in response to a portion of human speech, the method comprising the steps of converting the portion of speech into digital speech signals; processing the digital speech signals to measure a plurality of pre-determined acoustic features; mapping the measurements to at least one emotion in a pre-determined plurality of emotions; presenting the emotion.
Preferably, the plurality of pre-determined acoustic features may include two or more of pace, tone, timing, attonation and/or dtone.
These acoustic features (or 'characteristics') may be expressed as follows:
* PACE - the speed (or 'rate') of speech;
* TONE - the pitch measured in hertz;
* TIMING - the rate of stress markings (how often we emphasise sounds);
* ATTONATION - the relative acceleration or deceleration immediately prior to emphasis;
* DTONE - how we open up or close down our pitch during a conversation; dtone may be measured as 'down', 'flat' or 'up'.
Preferably, the pre-determined plurality of emotions includes one or more of shame, sensuality, sadness, joy, pride, disgust, satisfaction, contempt, anger, guilt, relief, contentment, freedom, amusement, embarrassment, surprise, fear and/or excitement.
However, other emotions may be included.
Preferably, the emotion ('joy', 'fear', etc.) is presented visually on a screen or display device. For example, the emotion may be displayed as a word (or 'tag') on a mobile phone display or a computer screen. Alternatively or additionally, a graphical representation or image, such as an icon, may be used to represent the emotion visually. The image may be an icon depicting a facial expression representative of the relevant emotion. Additionally or alternatively, the emotion may be presented audibly, for example by sound produced via a speaker on the phone, computer or hand-held device.
Preferably, the portion of speech is captured via a microphone over a pre-determined period of time. Thus, the user may input the portion of speech into the computer, mobile phone or hand held device by speaking into a microphone for, say, a 10 second period.
However, the invention is not intended to be limited in respect of the length of time.
The step of processing the digital speech signals may comprise at least one of the following steps:
* zero-meaning the signal;
* computing the log energy in a frame of the portion of speech;
* using the log energy to determine whether the frame should be processed as it contains speech, or skipped; if skipped, no further operations are performed until the next frame;
* computing the power spectral density from the fast Fourier Transform (FFT) of the frame and from that computing the autocorrelation function;
* performing pitch extraction, such as searching the autocorrelation function for a maximum and smoothing this value;
* deciding whether the frame is voiced or unvoiced;
* performing dimensionality reduction to a representation commonly used in speech processing, such as Mel Frequency Cepstral Coefficients (MFCCs);
* computing the rate of speech;
* computing the timing;
* computing the rate of change of pace;
* computing the rate of change of tone.
Preferably, each emotion in the plurality of emotions is associated with a value or range of values for each of the acoustic features. For example, a matrix of values or ranges may be drawn up wherein, for each emotion, a value or range is determined for each of the acoustic features. Thus, a pattern of acoustic measurements is determined for each of the emotions, whereby actual measurements taken from the portion of speech can be cross referenced against the values or ranges in the matrix to identify a particular emotion associated with those recorded measurements.
Thus, the step of mapping the measurements to at least one emotion may include: comparing the measurements obtained from processing the digital speech signals with the values or range of values associated with the plurality of emotions to determine which emotion is associated with the measurements obtained from the portion of speech.
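By way of illustration only, such a comparison against a matrix of ranges might be implemented along the following lines. This is a minimal sketch: the emotion names, the numeric ranges and the function names are assumptions standing in for the content of Figure 1, not values disclosed in this document.

```python
# Illustrative matrix of ranges: (pace wpm, tone Hz, timing stresses/s, dtone).
# The two entries and their boundaries are invented placeholders.
EMOTION_MATRIX = {
    "excitement":  ((180, 200), (300, 400), (3.3, 3.8), "up"),
    "contentment": ((115, 140), (70, 150),  (2.1, 2.6), "flat"),
}

def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high

def lookup_emotion(pace, tone, timing, dtone):
    """Cross-reference measured features against the matrix of values or ranges."""
    for emotion, (p, t, s, d) in EMOTION_MATRIX.items():
        if in_range(pace, p) and in_range(tone, t) and in_range(timing, s) and dtone == d:
            return emotion
    return None

print(lookup_emotion(pace=195, tone=380, timing=3.7, dtone="up"))  # -> "excitement"
```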
However, mapping the measurements to at least one emotion may be performed by using hand labelled data to assign emotional tags to each time frame and then using machine learning techniques such as hidden Markov models, neural networks or support vector machines.
Preferably, the steps described above are performed repeatedly such that an emotion is determined and presented at pre-determined intervals of time. For example, one or more emotions may be displayed every second to indicate the response which the speech is inducing in the listener.
Preferably, a summary of the emotions experienced most commonly by the listener during a pre-determined interval of time is presented. This period of time may be longer than that mentioned in the previous paragraph. Thus, at the end of this extended period of time the emotions which were predicted or identified most frequently during the portion of speech are displayed or spoken as a summary of the overall effect which the speech is likely to have had on the listener. These emotions (tags or images) may be presented in such a way as to indicate their level of frequency. For example, the most commonly identified emotion may be displayed at the top of a list, or may be presented in a particular colour (such as red), or may be displayed more brightly than the other emotions on the display, or may be displayed in a different position.
Also in accordance with the present invention there is provided apparatus arranged and configured to perform the steps of the method described above, the apparatus including: a microphone for capturing the portion of speech; software arranged and configured to convert the portion of speech captured by the microphone into digital speech signals; software arranged and configured to process the digital speech signals so as to measure the plurality of pre-determined acoustic features; software arranged and configured to map the measurements to the at least one emotion in a pre-determined plurality of emotions; a computer processor configured to execute the software for converting the portion of speech, processing the digital speech signals, and/or mapping the measurements; and/or a device for presenting the emotion visually or audibly.
These and other aspects of the present invention will be apparent from, and elucidated with reference to, the embodiment described herein.
A preferred embodiment of the invention will now be described, by way of example only, with reference to the accompanying Figures, in which:
Figure 1 shows a table of 18 emotional states and their corresponding speech patterns as measured in terms of pace, tone, timing, attonation and dtone.
Figure 2 shows an exemplary embodiment of a device in accordance with the present invention, wherein the 18 emotional states of Figure 1 are expressed by graphical representations of facial expressions.
Figure 3 shows the 18 emotional states of Figure 1 and their respective positions on the device of Figure 2.
As explained above, by building upon Transactional Analysis theories the inventor of the present invention has recognised that distinct patterns can be discerned by listening to (and analysing) certain characteristics within the human voice during speech. These patterns can be used to predict or detect the emotional state which the speech is likely to induce in the listener.
The characteristics (or 'features') measured by the invention are as follows:
1. PACE - the speed (or 'rate') of speech; usually 115 to 200 words per minute;
2. TONE - the pitch measured in hertz; usually 70 to 400 hertz;
3. TIMING - the rate of stress markings (or, put another way, how often we emphasise sounds); usually 2.1 to 3.8 times per second;
4. ATTONATION - the relative acceleration or deceleration immediately prior to emphasis;
5. DTONE - how we open up or close down our pitch during a conversation; dtone can be measured as 'down', 'flat' or 'up'.
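For illustration, the five measurements defined above could be carried through the processing chain in a simple container. The following sketch merely restates the definitions and typical ranges given above as a Python data structure; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class SpeechFeatures:
    """One set of measurements of the five characteristics (illustrative container only)."""
    pace: float        # speed of speech in words per minute, usually 115 to 200
    tone: float        # pitch in hertz, usually 70 to 400
    timing: float      # stress markings per second, usually 2.1 to 3.8
    attonation: float  # relative acceleration/deceleration immediately prior to emphasis
    dtone: str         # 'down', 'flat' or 'up'

example = SpeechFeatures(pace=190.0, tone=350.0, timing=3.5, attonation=0.8, dtone="up")
```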
These characteristics can be used by a speaker, consciously or subconsciously, to elicit a desired emotional response in the listener. Thus, a speaker having a particular motivational goal can alter the pattern of these five characteristics in his speech to emotionally manipulate the listener in order to achieve the speaker's goal.
Figure 1 shows a list of 18 emotional states which have been identified by the inventor as being invoked in a listener by a particular pattern of acoustic characteristics. For example, fast speech (200 words per minute) spoken at a high pitch (400 Hz) with frequent emphasis on certain words (3.8 sounds emphasised per second), a great deal of attonation and an increase of pitch at the end of the utterance, would be likely to induce a sense of excitement in the listener.
Thus, it is possible to determine the emotional response produced in the listener as a result of a portion of speech by:
1. listening to the speech to detect and measure each of the acoustic characteristics in this pre-determined set;
2. mapping the measurements to the emotional states (or 'tags') shown in Figure 1 to determine the emotional response which will be induced in the listener by the speech;
3. displaying or otherwise presenting the determined emotional response.
Presentation of the emotional response enables the speaker to adapt or maintain his speech patterns in accordance with his motivations. For example, if the speaker's intention is to sell a product to the listener but the analysis indicates that his speech is inducing fear, then he will most probably wish to alter his speech patterns. Alternatively, if the indication is that the listener is experiencing contentment or satisfaction, then he may wish to maintain his previous speech patterns, as it appears that they are having a positive, desired effect.
Therefore, in general the present invention can be considered to comprise components which handle the signal processing, analysis and presentation aspects of the system. These are discussed in more detail below.
In particular, embodiments of the invention comprise a 'listening' device such as a microphone to receive the speech (sound waves) made by the speaker. The sound waves are converted into digital signals for subsequent analysis by a software component executing upon a processor. The processor may reside in a computer (e.g. a microcomputer or laptop), in a mobile phone, or in a hand-held device, and the invention is not intended to be limited in respect of the type of device in which it is implemented.
The analysis involves detecting and measuring the five acoustic characteristics described above. The software is configured to compare the measurements against pre-determined metrics stored in memory (see Figure 1), determine the emotional response corresponding to the identified pattern of characteristics, and present that result to a human user.
Signal processing

The role of the signal processing component is to take the digital speech signal and provide relevant features to the analysis (or 'classification') stage. Speech is commonly sampled at rates between 8 kHz and 44.1 kHz. In one exemplary implementation 16 kHz is used, although the skilled addressee would understand that other sample rates could be used.
The speech is windowed and, for each frame:
1. features of the frame are used to determine if the frame contains speech and is to be processed further;
2. if speech is detected, then several acoustic features are extracted, which are:
   1. pitch (or tone);
   2. rate (or pace) of speech - more words per minute is a faster rate;
   3. rate of stress markings (or timing);
   4. rate of change of feature 1, pitch;
   5. rate of change of feature 2, rate.
Alternatively or additionally, other acoustic features may be extracted.
The first stage of signal processing is to provide overlapping frames of speech which are windowed to reduce discontinuities. Although 25ms windows with a 10ms frame step are commonly used in speech processing, it has been found that a longer 32ms window with a 12.5ms frame step works particularly well for the present application, as pitch extraction forms a large part of the signal processing. The conventional Hamming window is used.
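As an illustration of this windowing stage, the following sketch (using NumPy, which is assumed to be available) splits a 16 kHz signal into overlapping 32ms Hamming-windowed frames with a 12.5ms step; the function name is illustrative.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, window_ms=32.0, step_ms=12.5):
    """Split a speech signal (1-D NumPy array) into overlapping Hamming-windowed frames."""
    frame_len = int(sample_rate * window_ms / 1000)   # 512 samples at 16 kHz
    frame_step = int(sample_rate * step_ms / 1000)    # 200 samples at 16 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_step
        frames[i] = signal[start:start + frame_len] * window
    return frames

frames = frame_signal(np.random.randn(16000))  # one second of (here random) audio -> 78 frames
```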
Within each frame the following operations are performed:
1. zero-mean the signal;
2. compute the log energy in the frame;
3. using this and potentially other speech features, decide if the frame should be processed as it contains speech, or skipped; if skipped, then no further operations are performed until the next frame;
4. compute the power spectral density from the fast Fourier Transform (FFT) of the frame and from that compute the autocorrelation function;
5. perform pitch extraction, such as searching the autocorrelation function for a maximum and smoothing this value;
6. decide whether the frame is voiced or unvoiced, for example by looking at the relative magnitude of the peak in the autocorrelation function as well as other features;
7. perform dimensionality reduction to a representation commonly used in speech processing, such as Mel Frequency Cepstral Coefficients (MFCCs).
8. Compute the rate of speech by a method such as computing the log of the Euclidean distance between the current MFCCs and those in the last time frame and passing this through a first order filter.
9. Compute the timing information by a method such as computing the weighted standard deviation of the log powers of voiced frames normalised by the mean log power in voiced frames.
10. Compute the rate of change of pace by smoothing the pace information with a first order filter with twice the time constant used in the calculation of pace and computing the relative difference between this and the standard pace measurement.
11. Compute the rate of change of tone in the same way.
The result of these calculations is five emotion-related measurements.
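A minimal sketch of the core per-frame operations (zero-meaning, energy gating, autocorrelation via the power spectrum, pitch-peak search and a crude voiced/unvoiced decision) is given below. The MFCC, pace and timing calculations are omitted for brevity, and the energy threshold, pitch search band and voicing threshold are assumptions rather than values taken from this document.

```python
import numpy as np

def process_frame(frame, sample_rate=16000, energy_threshold=-60.0):
    """Process one windowed frame; returns None if the frame is treated as non-speech."""
    frame = frame - np.mean(frame)                           # 1. zero-mean the signal
    log_energy = 10 * np.log10(np.sum(frame ** 2) + 1e-12)   # 2. log energy of the frame
    if log_energy < energy_threshold:                        # 3. skip frames without speech
        return None

    spectrum = np.fft.rfft(frame, n=2 * len(frame))          # 4. FFT of the (zero-padded) frame
    psd = np.abs(spectrum) ** 2                               #    power spectral density
    autocorr = np.fft.irfft(psd)[: len(frame)]               #    autocorrelation from the PSD

    # 5. pitch extraction: search the autocorrelation for a peak in a plausible lag band
    min_lag = int(sample_rate / 400)                          # assumed 400 Hz upper pitch bound
    max_lag = int(sample_rate / 70)                           # assumed 70 Hz lower pitch bound
    lag = min_lag + int(np.argmax(autocorr[min_lag:max_lag]))
    pitch_hz = sample_rate / lag

    # 6. crude voiced/unvoiced decision from the relative magnitude of the peak (assumed 0.3)
    voiced = autocorr[lag] / (autocorr[0] + 1e-12) > 0.3

    return {"log_energy": log_energy, "pitch": pitch_hz, "voiced": voiced}

# e.g. a 200 Hz test tone, windowed as in the earlier framing sketch
t = np.arange(512) / 16000
print(process_frame(np.sin(2 * np.pi * 200 * t) * np.hamming(512)))
```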
Analysis of Emotional Response

Once the features have been extracted, the analysis phase maps these feature measurements to emotional tags. In essence, a classification process is performed. Emotional tags are words such as: sensual, sadness, joy, pride, disgust, satisfaction, contempt, anger, guilt, relief, contentment, freedom, amusement, embarrassment, surprise, fear and excitement.
The 18 pre-determined tags used by the present invention are shown in Figure 1.
Various methods can be used to map the features into tags. For example, the value of each measured feature can be divided up into contiguous ranges and a tag associated with each range. A voting scheme across features can then be employed to decide which tag (emotion) to display.
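The voting scheme just described could be implemented along the following lines. This is only a sketch: the range boundaries and the tags attached to them are invented placeholders for the real per-feature tables, not values disclosed here.

```python
from collections import Counter

# Illustrative per-feature range-to-tag tables (placeholder boundaries and tags).
RANGES = {
    "pace":   [((115, 150), "contentment"), ((150, 200), "excitement")],
    "tone":   [((70, 200),  "contentment"), ((200, 400), "excitement")],
    "timing": [((2.1, 3.0), "contentment"), ((3.0, 3.8), "excitement")],
}

def vote_for_tag(measurements):
    """Each feature casts a vote for the tag whose range contains its value;
    the most-voted tag is the one to display."""
    votes = Counter()
    for feature, value in measurements.items():
        for (low, high), tag in RANGES.get(feature, []):
            if low <= value <= high:
                votes[tag] += 1
    return votes.most_common(1)[0][0] if votes else None

print(vote_for_tag({"pace": 190, "tone": 350, "timing": 3.5}))  # -> "excitement"
```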
However, the skilled addressee will understand that a variety of methods may be used.
Alternative methods include using hand-labelled data to assign emotional tags to each time frame and then using machine learning techniques such as hidden Markov models, neural networks or support vector machines to classify features.
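As a sketch of this machine-learning alternative, a support vector machine could be trained on hand-labelled frame features, for example with scikit-learn (assumed to be available). The two-frame training set below is a placeholder, and encoding dtone as a number is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

# Each row: [pace, tone, timing, attonation, dtone_encoded]; each label: a hand-assigned tag.
X_train = np.array([[190.0, 350.0, 3.6, 0.8, 1.0],
                    [120.0, 100.0, 2.2, 0.1, 0.0]])
y_train = np.array(["excitement", "contentment"])

classifier = SVC(kernel="rbf")   # train the classifier on the labelled frames
classifier.fit(X_train, y_train)

print(classifier.predict(np.array([[185.0, 330.0, 3.4, 0.7, 1.0]])))  # -> ['excitement']
```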
Presentation of emotional tags

After the emotional response has been determined it can be communicated to, for example, the speaker. The emotional tag may be presented visually and/or audibly.
In one embodiment, the emotional tag is displayed on a computer screen. However, in other embodiments the presentation may be performed on a mobile phone display or on a self-contained device 1 such as that shown in Figure 2.
In use, a user presses a 'start' button and begins speaking into the microphone 2 associated with the device or computer system. For the following 10 seconds, the speech is analysed according to the above approach to identify and measure the pre-determined features.
Throughout this period of time, one or more words will be presented to the user (speaker) every second to indicate the emotion(s) that the listener is likely to respond with. Thus, if the speaker wants the listener to feel joy, he would say hello very differently (i.e. use a different speech pattern) from the way he would say hello if he wanted the listener to feel guilt.
At the end of the ten seconds, the user is presented with a summary of the emotional effect he has had on the listener. The four most common sounds (i.e. emotional effects) made during the 10-second period will be displayed. The emotions which are displayed are those tags which were triggered most often in accordance with the five speech criteria during the 10-second portion of speech. These tags will remain on display until the programme is ended or restarted.
Moreover, the displayed tags (emotions) are coloured red (telling), amber (sharing with) or green (asking), allowing the speaker to train or adapt his speech in accordance with the five acoustic criteria described above so that he can achieve his goal more effectively and/or more often.
In the device 1 shown in Figure 2, the emotions are represented graphically using simple images of facial expressions 3. One portion 4 of the device containing some of the emotions is coloured green, a second portion 5 is coloured amber, and a third portion 6 is coloured red. These will be described in more detail below.
The four words or emotions are presented at 25% brightness in their relevant positions and remain in those positions at that brightness for one second, until another set of words is activated. The next set is likely to be different, and if one word or emotion is triggered twice by the relevant criteria then only three words will be presented, at 50% brightness, in their relevant positions for one second until another set of words or emotions is activated.
If one word or emotion is triggered three times by the relevant criteria then only two words are presented, at 75% brightness, in their relevant positions for one second until another set of words or emotions is activated. It may also be that one word or emotion is triggered four times by the relevant criteria, in which case only one word will be presented, at 100% brightness, in the relevant position for one second, and so on. In addition to this, the words are coloured red, amber or green in accordance with the rate of change of pitch.
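The brightness rule described in the preceding two paragraphs could be expressed as in the following sketch, where the tag names are placeholders and a brightness of 25% per trigger (capped at 100%) is assumed:

```python
from collections import Counter

def display_set(triggered_tags):
    """Return (tag, brightness%) pairs for one display interval: each distinct tag is
    shown at 25% brightness per time it was triggered, so a tag triggered four times
    would be shown alone at 100%."""
    counts = Counter(triggered_tags)
    return [(tag, min(100, 25 * count)) for tag, count in counts.most_common()]

print(display_set(["joy", "pride", "joy", "surprise"]))
# -> [('joy', 50), ('pride', 25), ('surprise', 25)]
```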
It should be noted that 'red' does not necessarily indicate 'wrong'. If one measures a listener's voice and compares it to the speaker's, the listener may also be red, which means that a 'directive' conversation is taking place: both participants are 'accepting' each other's communication, as compared to 'crossing' each other's communication, as indicated below:
* red accepts red;
* amber accepts amber;
* green accepts green;
* red accepts green;
* green accepts red; but
* amber does not accept, or is 'crossed' by, red or green.
When communication is not accepted or is crossed, the conversation will feel awkward.
This would motivate the listener to change his/her behaviour, allowing the speaker to achieve his goal. However, if the listener does not alter his position then the speaker will be required to accept their choice (in accordance with the above list) until the speaker is ready to try to change the listener's behaviour again. It should be noted that in this scenario, it is assumed that the listener did not want to comply with the speaker's wishes in the first instance.
Advantages of the system include the following.
* By understanding the emotional response that his speech is triggering in the listener, the speaker is able to optimise his interaction with that other individual.
* This facilitates collaboration and enhances the likelihood that the speaker will achieve his goal by providing guidance which enables the speaker to adapt his speech so as to produce the effect he desires in the other person.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The words "comprising" and "comprises", and the like, do not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. In the present specification, "comprises" means "includes or consists of" and "comprising" means "including or consisting of". The singular reference of an element does not exclude the plural reference of such elements and vice-versa. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware.
The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (3)

CLAIMS:

1. A method for determining one or more emotions experienced by a listener in response to a portion of human speech, the method comprising the steps of: converting the portion of speech into digital speech signals; processing the digital speech signals to measure a plurality of pre-determined acoustic features; mapping the measurements to at least one emotion in a pre-determined plurality of emotions; presenting the emotion.

2. A method according to claim 1 wherein the plurality of pre-determined acoustic features includes two or more of pace, tone, timing, attonation and/or dtone.

3. A method according to claim 1 or 2 wherein the pre-determined plurality of emotions includes one or more of shame, sensuality, sadness, joy, pride, disgust, satisfaction, contempt, anger, guilt, relief, contentment, freedom, amusement, embarrassment, surprise, fear and/or excitement.

4. A method according to any preceding claim wherein the emotion is presented visually on a screen or display device.

5. A method according to any preceding claim wherein the portion of speech is captured via a microphone over a pre-determined period of time.

6. A method according to any preceding claim wherein processing the digital speech signals comprises at least one of the following steps: zero-meaning the signal; computing the log energy in a frame of the portion of speech; using the log energy to determine whether the frame should be processed as it contains speech, or skipped (if skipped, then no further operations are performed until the next frame); computing the power spectral density from the fast Fourier Transform (FFT) of the frame and from that computing the autocorrelation function; performing pitch extraction, such as searching the autocorrelation function for a maximum and smoothing this value; deciding whether the frame is voiced or unvoiced; performing dimensionality reduction to a representation commonly used in speech processing, such as Mel Frequency Cepstral Coefficients (MFCC); computing the rate of speech; computing the timing; computing the rate of change of pace; computing the rate of change of tone.

7. A method according to any preceding claim wherein each emotion in the plurality of emotions is associated with a value or range of values for each of the acoustic features.

8. A method according to claim 7 wherein mapping the measurements to at least one emotion includes: comparing the measurements obtained from processing the digital speech signals with the values or range of values associated with the plurality of emotions to determine which emotion is associated with the measurements obtained from the portion of speech.

9. A method according to any preceding claim wherein mapping the measurements to at least one emotion is performed by using hand-labelled data to assign emotional tags to each time frame and then using machine learning techniques such as hidden Markov models, neural networks or support vector machines.

10. A method according to any preceding claim wherein the steps are performed repeatedly such that an emotion is determined and presented at pre-determined intervals of time.

11. A method according to claim 10 wherein a summary of the emotions experienced most commonly by the listener during a pre-determined interval of time is presented.

12. Apparatus arranged and configured to perform the steps of any preceding claim, the apparatus including: a microphone for capturing the portion of speech; software arranged and configured to convert the portion of speech captured by the microphone into digital speech signals; software arranged and configured to process the digital speech signals so as to measure the plurality of pre-determined acoustic features; software arranged and configured to map the measurements to the at least one emotion in a pre-determined plurality of emotions; a computer processor configured to execute the software for converting the portion of speech, processing the digital speech signals, and/or mapping the measurements; and/or a device for presenting the emotion visually or audibly.
GB1114305.4A 2011-08-19 2011-08-19 Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice Withdrawn GB2494104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1114305.4A GB2494104A (en) 2011-08-19 2011-08-19 Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1114305.4A GB2494104A (en) 2011-08-19 2011-08-19 Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice

Publications (2)

Publication Number Publication Date
GB201114305D0 GB201114305D0 (en) 2011-10-05
GB2494104A true GB2494104A (en) 2013-03-06

Family

ID=44800554

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1114305.4A Withdrawn GB2494104A (en) 2011-08-19 2011-08-19 Recognizing the emotional effect a speaker is having on a listener by analyzing the sound of his or her voice

Country Status (1)

Country Link
GB (1) GB2494104A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1256937A2 (en) * 2001-05-11 2002-11-13 Sony France S.A. Emotion recognition method and device
EP1318505A1 (en) * 2000-09-13 2003-06-11 A.G.I. Inc. Emotion recognizing method, sensibility creating method, device, and software
WO2007072485A1 (en) * 2005-12-22 2007-06-28 Exaudios Technologies Ltd. System for indicating emotional attitudes through intonation analysis and methods thereof
US20090076811A1 (en) * 2007-09-13 2009-03-19 Ilan Ofek Decision Analysis System
US7940914B2 (en) * 1999-08-31 2011-05-10 Accenture Global Services Limited Detecting emotion in voice signals in a call center

Also Published As

Publication number Publication date
GB201114305D0 (en) 2011-10-05


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)