WO2007148493A1 - Emotion Recognition Device - Google Patents
Emotion Recognition Device
- Publication number
- WO2007148493A1 (PCT/JP2007/060329)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- emotion
- characteristic
- phoneme
- timbre
- speech
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- The present invention relates to an emotion recognition device that recognizes the emotion of a speaker from speech. More specifically, it relates to a speech-based emotion recognition device that recognizes the speaker's emotion by detecting that a characteristic timbre has been produced in the uttered speech due to tension or relaxation of the vocal organs, which changes from moment to moment depending on the speaker's emotion, facial expression, attitude, or speaking style.
- In conversation systems with a spoken-dialogue interface, such as automatic telephone answering systems, electronic secretaries, and dialogue robots,
- understanding the user's emotions from the voice the user utters is an important requirement. For example, when such an automatic telephone answering system or dialogue robot interacts with the user by voice, the dialogue system's speech recognition does not always recognize the voice accurately. If the system misrecognizes, it prompts the user again for voice input. In such a situation the user becomes somewhat angry or frustrated, and all the more so when misrecognitions accumulate.
- Methods have been proposed in which prosodic features such as voice pitch (fundamental frequency), loudness (power), and speech rate are extracted from the speech uttered by the speaker, and emotions are recognized based on judgments such as "the voice is high" or "the voice is loud" (see, for example, Patent Document 1 and Patent Document 2).
- A method has also been proposed that makes determinations such as "the pitch is high" or "the energy in the frequency domain is large" for the entire input speech.
- A method has been proposed that recognizes emotions by obtaining statistical representative values such as the average, maximum value, and minimum value from the sequences of voice power and fundamental frequency (see, for example, Patent Document 3).
- Methods have also been proposed that recognize emotions using prosodic time patterns, such as the intonation and accent of sentences and words (see, for example, Patent Document 4 and Patent Document 5). A rough sketch of the statistical-prosody approach follows.
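As a rough illustration of this family of prior-art prosody-based approaches (not the patented method itself), the sketch below computes the kind of statistical representative values mentioned above from an F0 contour and a power contour and compares them against per-speaker "standard" values; all feature names, numbers, and thresholds are illustrative assumptions.

```python
import numpy as np

def prosodic_summary(f0_hz, power_db):
    """Compute simple statistical representative values of prosody.

    f0_hz    : per-frame fundamental-frequency estimates (0 = unvoiced frame)
    power_db : per-frame powers in dB
    Returns statistics of the kind that prosody-based methods compare
    against speaker- or corpus-level standard values.
    """
    f0 = np.asarray([f for f in f0_hz if f > 0], dtype=float)  # voiced frames only
    pw = np.asarray(power_db, dtype=float)
    return {
        "f0_mean": f0.mean() if f0.size else 0.0,
        "f0_max": f0.max() if f0.size else 0.0,
        "f0_min": f0.min() if f0.size else 0.0,
        "f0_range": (f0.max() - f0.min()) if f0.size else 0.0,
        "power_mean": pw.mean(),
        "power_max": pw.max(),
    }

# Illustrative use: compare against hypothetical "standard" values
# (the per-speaker calibration data that such systems require).
standard = {"f0_mean": 120.0, "power_mean": -30.0}
obs = prosodic_summary([110, 0, 180, 190, 175], [-28, -40, -22, -20, -25])
is_high_pitched = obs["f0_mean"] > 1.2 * standard["f0_mean"]   # "voice is high"
is_loud = obs["power_mean"] > standard["power_mean"] + 6.0      # "voice is loud"
print(obs, is_high_pitched, is_loud)
```

This also makes visible the weakness discussed later: the statistics summarize a whole utterance, so they cannot follow emotion changes within it.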
- FIG. 20 shows a conventional speech-based emotion recognition apparatus described in Patent Document 1.
- the microphone 1 converts input sound into an electrical signal.
- the voice code recognition unit 2 performs voice recognition of the voice input from the microphone 1 and outputs the recognition result to the sensibility information extraction unit 3 and the output control unit 4.
- The speech speed detection unit 31, the fundamental frequency detection unit 32, and the volume detection unit 33 of the sensibility information extraction unit 3 extract the speech speed, the fundamental frequency, and the volume, respectively, from the voice input from the microphone 1.
- The voice level determination criterion storage unit 34 stores criteria for determining the voice level by comparing the speech speed, fundamental frequency, and volume of the input speech with the standard speech speed, fundamental frequency, and volume, respectively.
- The standard voice feature quantity storage unit 35 stores the standard speech speed, fundamental frequency, and volume used as references when determining the voice level.
- The voice level analysis unit 36 determines the voice level, that is, the speech speed level, the fundamental frequency level, and the volume level, based on the ratios between the input voice feature quantities and the standard voice feature quantities.
- The sensitivity level analysis knowledge base storage unit 37 stores rules for determining the sensitivity level from the various voice levels determined by the voice level analysis unit 36. The sensitivity level analysis unit 38 determines the sensitivity level, that is, the type and level of sensitivity, from the output of the voice level analysis unit 36 and the output of the voice code recognition unit 2, based on the rules stored in the sensitivity level analysis knowledge base storage unit 37.
- the output control means 4 controls the output device 5 according to the sensitivity level output by the sensitivity level analysis unit 38, and generates an output corresponding to the sensitivity level of the input voice.
- The information used here to determine the voice level is prosodic information obtained over an entire unit of speech such as an utterance, sentence, or phrase, for example the speech speed expressed as the number of mora spoken per second, the average fundamental frequency, and the volume.
- However, prosodic information is also used to convey linguistic information, and the way it conveys that information differs from one language to another.
- In Japanese, for example, words such as "hashi" (bridge) and "hashi" (chopsticks) have different meanings depending on the pitch accent produced by the fundamental frequency.
- In Chinese, it is known that the same sound can indicate completely different meanings (characters) depending on the movement of the fundamental frequency, known as the four tones.
- In English, accent is expressed by the strength of the voice, called stress, rather than by the fundamental frequency, and the position of the stress serves to convey the meaning of a word or phrase and to distinguish parts of speech.
- Patent Document 1: Japanese Patent Laid-Open No. 9-22296 (pages 6-9, Table 15, Fig. 2)
- Patent Document 2: Japanese Patent Laid-Open No. 2001-83984 (pages 4-5, Fig. 4)
- Patent Document 3: Japanese Patent Laid-Open No. 2003-99084
- Patent Document 4: Japanese Patent Laid-Open No. 2005-39501 (page 12)
- Patent Document 5: Japanese Patent Laid-Open No. 2005-283647
- In emotion recognition based on prosody, because prosodic information in each language is also used to express linguistic information, a large amount of speech data, analysis processing, and statistical processing are required in order to separate prosodic variation used for linguistic expression from prosodic variation used for emotional expression. Moreover, even within the same language, prosody varies greatly between regions and with the age of the individual, and even the voice of a single speaker varies greatly with physical condition. For this reason, when no standard data is available for each user, it is difficult to always obtain stable results for the voices of an unspecified number of speakers, since the prominence of emotional expression differs greatly between regions and individuals.
- The method of preparing standard data for each individual cannot be adopted for a call center or for a guidance system in a public place such as a station, which are assumed to be used by an unspecified number of people, because standard data cannot be prepared for every speaker.
- Furthermore, prosodic data must be analyzed over the entire length of a unit of speech such as an utterance, sentence, or phrase, for example as the number of mora per second, as statistical representative values such as the average and dynamic range, or as time patterns. For this reason, when the characteristics of the speech change over a short time, the analysis cannot follow the change, and there is a problem that emotion recognition from speech cannot be performed with high accuracy.
- The present invention solves the above-described conventional problems. Its object is to provide a speech-based emotion recognition device that can detect emotion in short units, namely phonological units, and that performs highly accurate emotion recognition by using the relationship between the speaker's emotion and characteristic timbres, which show relatively small individual, language, and regional differences.
- An emotion recognition apparatus according to the present invention recognizes the emotion of the speaker of an input speech from the input speech, and comprises: characteristic timbre detection means for detecting, from the input speech, a characteristic timbre related to a specific emotion; speech recognition means for recognizing the types of phonemes contained in the input speech based on the characteristic timbre detected by the characteristic timbre detection means; characteristic timbre generation index calculation means for calculating, for each phoneme, a characteristic timbre generation index indicating how easily that phoneme is uttered with the characteristic timbre, based on the phoneme types recognized by the speech recognition means; and emotion determination means for determining, from the characteristic timbre generation index calculated by the characteristic timbre generation index calculation means, the emotion of the speaker of the input speech at the phonemes where the characteristic timbre occurred, based on a rule that the emotion is stronger as the characteristic timbre generation index is smaller.
- Because of the physical mechanism of speech production, for example the tendency of the lips and tongue to become tense in plosives, where the vocal tract is opened and closed with the lips, tongue, and palate, the phonemes in which a characteristic timbre is likely or unlikely to occur are determined by physiological factors of the vocal organs. A characteristic timbre produced when the vocal organs become tense or relaxed according to the speaker's emotion or speaking attitude can therefore be detected, and based on this detection result it is possible to recognize, in phonological units, the speaker's emotion unaffected by differences in language type, individual differences due to speaker characteristics, or regional differences.
- Preferably, the emotion recognition apparatus described above further comprises emotion intensity determination means for determining the intensity of the emotion at the phoneme in which the characteristic timbre occurred, based on a calculation rule in which the emotion intensity increases as the characteristic timbre generation index decreases.
- Preferably, the emotion intensity determination means compares the characteristic timbre generation index for each phoneme calculated by the characteristic timbre generation index calculation means with the time positions at which the characteristic timbre detected by the characteristic timbre detection means occurred, and determines the emotional intensity at the phonemes where the characteristic timbre occurred based on a calculation rule in which the emotional intensity increases as the characteristic timbre generation index decreases.
- Preferably, the emotion recognition apparatus further includes an acoustic feature quantity database storing acoustic features for each phoneme type, and a language feature quantity database containing language features that represent a word dictionary holding at least readings or phonetic symbols; for words in which the characteristic timbre has been detected, the speech recognition means recognizes the types of phonemes contained in the input speech based on the acoustic feature database and the language feature database while reducing the weight of the acoustic features in the acoustic feature database and increasing the weight of the language features in the language feature database.
- The present invention can be realized not only as an emotion recognition apparatus having such characteristic means, but also as an emotion recognition method whose steps are the characteristic means included in the emotion recognition apparatus, and as a program that causes a computer to execute the characteristic steps included in the emotion recognition method. Needless to say, such a program can be distributed via a recording medium such as a CD-ROM (Compact Disc-Read Only Memory) or via a communication network such as the Internet.
- A characteristic timbre is a voice quality, caused by the vocal organs becoming tense or relaxed depending on the speaker's emotion or speaking attitude, that deviates in certain acoustic characteristics from an average utterance (a voice produced in a normal utterance manner); examples observed in phonological units within speech are "back" (falsetto) voices, "powerful" (pressed) voices, and breathy voices.
- Based on this characteristic timbre detection result, it is possible to recognize speaker emotions unaffected by differences in language type, individual differences due to speaker characteristics, or regional differences, and to follow changes of emotion within an utterance.
- FIG. 1A is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 1 accompanied by the emotional expression of "strong anger".
- FIG. 1B is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 2 accompanied by the emotional expression of "strong anger".
- FIG. 1C is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 1 accompanied by the emotional expression of "medium anger".
- FIG. 1D is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 2 accompanied by the emotional expression of "medium anger".
- FIG. 2A is a graph showing, by phoneme type, the frequency of occurrence of the characteristic timbre "breathy" ("blurred") voice in recorded speech of speaker 1.
- FIG. 2B is a graph showing, by phoneme type, the frequency of occurrence of the characteristic timbre "breathy" ("blurred") voice in recorded speech of speaker 2.
- FIG. 3A is a diagram comparing the positions where characteristic-timbre speech was observed in recorded speech with the time positions of characteristic-timbre speech estimated from an estimation formula.
- FIG. 3B is a diagram comparing, for another utterance, the positions where characteristic-timbre speech was observed in recorded speech with the time positions of characteristic-timbre speech estimated from an estimation formula.
- FIG. 4 is a block diagram of a voice emotion recognition apparatus in Embodiment 1 of the present invention.
- FIG. 5 is a flowchart showing the operation of the voice emotion recognition apparatus according to Embodiment 1 of the present invention.
- Fig. 6 is a diagram showing an example of a rule for calculating a characteristic tone color generation index in the first embodiment of the present invention.
- FIG. 7 is a diagram showing an example of emotion type determination rules according to the first embodiment of the present invention.
- FIG. 8 is a diagram showing an example of emotion strength calculation rules in the first embodiment of the present invention.
- FIG. 9 is a diagram showing the relationship between the value of the generation index and the frequencies of mora uttered with and without "force", and between the value of the index and the strength (weakness) of emotion.
- FIG. 10 is a block diagram of a voice emotion recognition apparatus according to a modification of the first embodiment of the present invention.
- FIG. 11 is a flowchart showing an operation of the emotion recognition apparatus by voice in the modification of the first embodiment of the present invention.
- FIG. 12 is a diagram showing a comparison between the positions where characteristic-timbre speech was observed in the recorded speech and the occurrence of the characteristic timbre for each mora.
- FIG. 13 is a diagram showing an example of emotion type determination rules in a modification of the first embodiment of the present invention.
- FIG. 14 is a block diagram of a voice emotion recognition apparatus according to Embodiment 2 of the present invention.
- FIG. 15 is a flowchart showing the operation of the emotion recognition apparatus using voice according to the second embodiment of the present invention.
- FIG. 16A is a diagram showing a specific example of the speech recognition processing in the second exemplary embodiment of the present invention.
- FIG. 16B is a diagram showing a specific example of the speech recognition processing in the second exemplary embodiment of the present invention.
- FIG. 16C is a diagram showing a specific example of the speech recognition processing in the second exemplary embodiment of the present invention.
- FIG. 17 is a functional block diagram of the emotion recognition apparatus using speech in the third embodiment of the present invention.
- FIG. 18 is a flowchart showing the operation of the emotion recognition apparatus in the third embodiment.
- FIG. 19 is a diagram showing an example of a phoneme input method according to the third embodiment.
- FIG. 20 is a block diagram of a conventional emotion recognition apparatus using voice.
- FIG. 1A is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 1 accompanied by the emotional expression of "strong anger".
- FIG. 1B is a graph showing, for each consonant of the mora, the frequency with which mora were uttered with a "powerful" sound, or "harsh voice", in speech by speaker 2 accompanied by the emotional expression of "strong anger".
- FIG. 1C and FIG. 1D show, for the same speakers as FIG. 1A and FIG. 1B respectively, the frequency of mora uttered with a "powerful" sound, or "harsh voice", in speech accompanied by the emotional expression of "medium anger".
- The graphs in FIG. 1A and FIG. 1B show that the conditions under which "force" appears in speech expressing the emotion of "anger" are common across speakers.
- The occurrence of "force" for the two speakers shown in FIG. 1A and FIG. 1B is biased with the same tendency depending on the type of consonant of the mora.
- That is, the probability of being uttered with a "powerful" sound differs depending on the type of phoneme, and if an utterance with a "powerful" sound is detected in a type of phoneme for which that probability is low, it can be estimated that the degree of the emotion of "anger" is large.
- Next, FIG. 1A and FIG. 1C, which show the appearance frequency of the characteristic timbre "force" for the same person, speaker 1, are compared.
- Some phonemes do not produce a "powerful" sound in the "medium anger" expression shown in FIG. 1C but do produce one in the "strong anger" expression shown in FIG. 1A.
- Other phonemes produce a "powerful" sound only rarely in the "medium anger" expression of FIG. 1C but more frequently in the "strong anger" expression of FIG. 1A. It can thus be seen that, as the intensity of the anger increases, a "powerful" sound is produced even in phonemes in which it is normally hard to occur.
- the phonological bias of the probability of being uttered with “powerful” sounds is common to the speakers.
- FIG. 2A and FIG. 2B are graphs showing, for each consonant of the mora, the frequency with which mora were uttered with a "breathy" sound, that is, a "blurred" or "soft" voice, in speech accompanied by a "cheerful" emotional expression.
- FIG. 2A shows the result for speaker 1, and FIG. 2B shows the result for speaker 2.
- As with "force", the frequency of occurrence of this characteristic timbre varies with the type of consonant of the mora, and the graphs of FIG. 2A and FIG. 2B show the same tendency for both speakers.
- Such a bias in occurrence probability by phoneme, and the commonality of that bias across speakers, are also seen for "back" (falsetto) voices and for voices that "flip over", in addition to "powerful" sounds and "blurred" sounds.
- Voices produced by utterance modes that deviate from the average (normal) utterance mode have acoustic characteristic values different from those of voices produced in the average utterance mode.
- A specific acoustic characteristic value may be distributed at a position statistically separated from the distribution of the majority of voices. Such distributions are observed for specific utterance styles or emotional expressions; for example, the acoustic characteristic values of "breathy" voices tend to belong to voices expressing familiarity. Conversely, by extracting "powerful" sounds or "breathy" sounds from the input speech, as described in Japanese Patent Laid-Open No. …, the speaker's emotion or speaking attitude can be estimated.
- FIG. 3A and FIG. 3B show, for the input "juppun hodo kakarimasu" ("it takes about ten minutes") shown in FIG. 3A and the input "atatamarimashita" ("it has warmed up") shown in FIG. 3B, the ease with which each mora is uttered with a "powerful" sound, estimated with an estimation formula created, using one of the statistical learning methods, from the same data as FIG. 1A to FIG. 1D.
- FIG. 3A shows "powerful" sounds occurring only in mora with a high occurrence probability, indicating that the "anger" is small.
- In FIG. 3B ("atatamarimashita"), "powerful" sounds occur not only in mora with a high or medium occurrence probability of "force" but also in mora where the probability is low, indicating that the "anger" is large.
- As independent variables, information indicating the phoneme type, such as the types of consonant and vowel contained in the mora or the phoneme category, information on the position of the mora within the accent phrase, and information on the preceding and following phonemes are used.
- As the dependent variable, the binary value of whether or not a "powerful" sound or "harsh voice" occurred is used.
- This example shows the result of creating an estimation formula from these independent and dependent variables using Quantification Theory Type II and dividing the occurrence probability into three levels: low, medium, and high.
- This example shows that the degree of a speaker's emotion or speaking attitude can be determined by obtaining, using the speech recognition result, the occurrence probability of the characteristic timbre for each mora of the input speech.
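The estimation formula itself is not reproduced in this text. The sketch below only illustrates the general shape of such a model: it substitutes ordinary logistic regression for Quantification Theory Type II, and every attribute coding and data row is an invented placeholder, not the patent's learning data.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Training rows: (consonant, vowel, position in accent phrase) per mora.
# Label = 1 if the mora was uttered with the characteristic timbre.
X = [["t", "a", "head"], ["k", "a", "mid"], ["n", "o", "tail"],
     ["b", "a", "head"], ["t", "e", "mid"], ["m", "i", "tail"],
     ["d", "o", "head"], ["s", "u", "mid"]]
y = [1, 1, 0, 1, 0, 0, 1, 0]

# Categorical attributes -> one-hot features -> occurrence probability.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      LogisticRegression())
model.fit(X, y)

# Score each mora of a recognized phoneme string for "ease of occurrence",
# then band the probability into the three levels used in the text.
test_moras = [["t", "a", "head"], ["n", "o", "tail"]]
for mora, p in zip(test_moras, model.predict_proba(test_moras)[:, 1]):
    band = "high" if p > 0.66 else "medium" if p > 0.33 else "low"
    print(mora, f"occurrence likelihood={p:.2f} ({band})")
```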
- FIG. 4 is a functional block diagram of the emotion recognition apparatus using voice according to Embodiment 1 of the present invention.
- FIG. 5 is a flowchart showing the operation of the emotion recognition apparatus in the first embodiment.
- FIG. 6 shows an example of the calculation rules stored in the characteristic timbre generation index calculation rule storage unit 110, FIG. 7 shows an example of the determination criteria stored in the emotion type determination criterion storage unit 112, and FIG. 8 shows an example of the emotion intensity calculation rules stored in the emotion intensity calculation rule storage unit 114.
- an emotion recognition device is a device for recognizing emotions from speech.
- It includes a microphone 1, a speech recognition feature quantity extraction unit 101, an inverse filter 102, a periodicity analysis unit 103, a characteristic timbre detection unit 104, a feature quantity database 105, a speech recognition unit 106, a switch 107, a characteristic timbre generation phoneme identification unit 108, a prosody information extraction unit 109, and a characteristic timbre generation index calculation rule storage unit 110, among other components described below.
- the microphone 1 is a processing unit that converts input sound into an electrical signal.
- the feature amount extraction unit 101 for speech recognition is a processing unit that analyzes input speech and extracts a parameter representing a spectral envelope, for example, a mel cepstrum coefficient.
- the inverse filter 102 is an inverse filter of the spectrum envelope information output from the speech recognition feature quantity extraction unit 101, and is a processing unit that outputs the sound source waveform of the audio input from the microphone 1.
- the periodicity analysis unit 103 is a processing unit that analyzes the periodicity of the sound source waveform output from the inverse filter 102 and extracts sound source information.
- The characteristic timbre detection unit 104 is a processing unit that detects, from the sound source information output by the periodicity analysis unit 103, characteristic timbres such as a "powerful" voice, a "back" (falsetto) voice, or a "breathy" ("blurred") voice that appear in the uttered speech depending on the speaker's emotion or speaking attitude, using physical characteristics such as the amplitude fluctuation and the period fluctuation of the sound source waveform.
- the feature quantity database 105 is a storage device that holds a feature quantity for each phoneme type for speech recognition.
- the feature quantity database 105 holds data expressing a distribution of feature quantities for each phoneme as a probability model.
- The feature quantity database 105 is composed of a feature quantity database created from speech data in which no characteristic timbre is observed and feature quantity databases created from speech data in which specific characteristic timbres are observed. Specifically, it consists of a feature quantity database 105a without characteristic timbre, created from speech data in which no characteristic timbre is observed; a feature quantity database 105b with "force", created from speech data containing the characteristic timbre of a "powerful" voice; a feature quantity database 105c with "blur", created from speech data containing the characteristic timbre of a "breathy" ("blurred") voice; and a feature quantity database 105d, created from speech data in which both the "powerful" characteristic timbre and the "breathy" ("blurred") characteristic timbre are observed.
- The speech recognition unit 106 is a processing unit that performs speech recognition by referring to the feature quantity database 105 and matching the feature quantities output by the speech recognition feature quantity extraction unit 101 against the feature quantities stored in the feature quantity database 105.
- The switch 107 switches which of the databases constituting the feature quantity database 105 the speech recognition unit 106 refers to, according to whether the characteristic timbre detection unit 104 has detected a fluctuation of the sound source waveform and according to the type of that fluctuation.
- The characteristic timbre generation phoneme identification unit 108 is a processing unit that identifies at which phoneme of the input speech a characteristic timbre occurred, using the phoneme sequence information output from the speech recognition unit 106 and the time position information of the characteristic timbre in the input speech output from the characteristic timbre detection unit 104.
- the prosodic information extraction unit 109 is a processing unit that extracts the fundamental frequency and power of speech from the sound source waveform output from the inverse filter 102.
- The characteristic timbre generation index calculation rule storage unit 110 is a storage device that stores rules for determining, for each phoneme, an index of how easily a characteristic timbre is generated, from phoneme attributes (for example, the consonant type, the vowel type, the position within the accent phrase or stress phrase and its relation to the accent or stress position, and the absolute value or slope of the fundamental frequency).
- The characteristic timbre generation index calculation unit 111 is a processing unit that calculates a characteristic timbre generation index for each phoneme of the input speech by referring to the characteristic timbre generation index calculation rule storage unit 110, from the phoneme sequence information generated by the speech recognition unit 106 and the prosody information, that is, the fundamental frequency and power, output from the prosody information extraction unit 109.
- the emotion type determination criterion storage unit 112 is a storage device that stores a criterion for determining an emotion type based on a combination of a characteristic tone color type and a characteristic tone color generation index of the mora and the adjacent mora.
- The emotion type determination unit 113 is a processing unit that determines the type of emotion by referring to the criteria in the emotion type determination criterion storage unit 112, based on the characteristic timbre generation position information generated by the characteristic timbre generation phoneme identification unit 108.
- the emotion intensity calculation rule storage unit 114 is a storage device that stores a rule for calculating the degree of emotion or speech attitude from the characteristic tone generation index and the characteristic tone generation position information of the input voice. .
- The emotion intensity calculation unit 115 is a processing unit that outputs the degree of emotion or speaking attitude, the emotion type, and the phoneme sequence by referring to the emotion intensity calculation rule storage unit 114, from the information on the phonemes at which a characteristic timbre occurred in the input speech, generated by the characteristic timbre generation phoneme identification unit 108, and the characteristic timbre generation index for each phoneme calculated by the characteristic timbre generation index calculation unit 111.
- the display unit 116 is a display device that displays the output of the emotion strength calculation unit 115.
- First, speech is input from the microphone 1 (step S1001).
- the voice recognition feature quantity extraction unit 101 analyzes the input voice and extracts a mel cepstrum coefficient as an acoustic feature quantity for voice recognition (step S1002).
- The inverse filter 102 sets its parameters so as to be the inverse filter of the mel cepstrum coefficients generated in step S1002, passes the speech signal input from the microphone in step S1001 through it, and extracts the sound source waveform (step S1003).
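The patent obtains the source waveform by inverse-filtering with the spectral envelope estimated for recognition. As a rough stand-in for that mel-cepstrum-based filter (an assumption, not the patented implementation), the sketch below uses LPC analysis and its inverse filter, a common way to approximate the glottal source residual.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def source_waveform(frame, lpc_order=16):
    """Approximate the vocal-tract inverse filter with LPC analysis.

    frame : 1-D float array holding one analysis frame of speech samples.
    Returns the LPC residual, used here as a stand-in for the sound source
    waveform obtained by inverse filtering the spectral envelope.
    """
    a = librosa.lpc(frame, order=lpc_order)   # A(z) coefficients, a[0] == 1
    residual = lfilter(a, [1.0], frame)       # e[n] = A(z) applied to s[n]
    return residual

# Illustrative use on a synthetic voiced-like frame (150 Hz + harmonic).
sr = 16000
t = np.arange(0, 0.032, 1.0 / sr)
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(source_waveform(frame.astype(np.float64)).shape)
```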
- The periodicity analysis unit 103 obtains the periodicity of the sound source waveform extracted in step S1003 (for example, as in the technique described in Japanese Patent Laid-Open No. 10-197575, from the magnitude of amplitude modulation, the magnitude of frequency modulation, and the strength of the fundamental component of the output of a filter whose cutoff characteristic is gentle on the low-frequency side and steep on the high-frequency side), and outputs the time region of the periodic signal in the input speech as the periodic signal section (step S1004).
- Next, for the periodic signal section extracted by the periodicity analysis unit 103 in step S1004, the characteristic timbre detection unit 104 detects fluctuations of the sound source waveform, namely the fundamental frequency fluctuation (jitter) of the sound source waveform and the fluctuation of the high-frequency component of the sound source waveform (step S1005).
- the fundamental frequency fluctuation is detected by using an instantaneous frequency obtained by the method disclosed in, for example, Japanese Patent Laid-Open No. 10-19757.
- The fluctuation of the high-frequency component of the sound source waveform is detected, for example, as in the technique described in Japanese Patent Application Laid-Open No. 2004-279436, by a method using a normalized amplitude index, in which the peak-to-peak amplitude of the sound source waveform is divided by the minimum value (the maximum negative peak) of the differentiated amplitude of the sound source waveform and normalized by the fundamental frequency.
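As a hedged sketch of the two measures named above, the code below computes (a) a simple period-to-period F0 fluctuation (jitter) from per-frame F0 estimates and (b) a normalized amplitude index of the source waveform following the general description in this text; the exact formulas and thresholds of the cited documents may differ, and all numbers here are toy values.

```python
import numpy as np

def jitter_percent(f0_track_hz):
    """Mean absolute frame-to-frame F0 change, relative to mean F0 (in %)."""
    f0 = np.asarray([f for f in f0_track_hz if f > 0], dtype=float)
    if f0.size < 2:
        return 0.0
    return 100.0 * np.mean(np.abs(np.diff(f0))) / np.mean(f0)

def normalized_amplitude_index(source, f0_hz, sr):
    """Peak-to-peak amplitude of the source waveform divided by the magnitude
    of its most negative derivative, normalized by the fundamental period."""
    d = np.diff(source) * sr                   # time derivative of the source
    peak_to_peak = source.max() - source.min()
    neg_peak = abs(d.min()) + 1e-12            # maximum negative peak of d/dt
    t0 = 1.0 / f0_hz                           # fundamental period [s]
    return (peak_to_peak / neg_peak) / t0

print(jitter_percent([150, 155, 149, 160, 152]))
sr = 16000
t = np.arange(0, 0.02, 1.0 / sr)
src = np.sin(2 * np.pi * 150 * t)
print(normalized_amplitude_index(src, 150.0, sr))
```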
- Next, the switch 107 is set according to the detection result (step S1006). That is, when frequency fluctuation of the sound source waveform is detected in step S1005, the feature quantity database 105b with "force" in the feature quantity database 105 is connected to the speech recognition unit 106 by the switch 107.
- If a fluctuation of the high-frequency component of the sound source waveform, that is, a breathy ("blurred") component, is detected in step S1005, the feature quantity database 105c with "blur" in the feature quantity database 105 is connected to the speech recognition unit 106 by the switch 107. If both the frequency fluctuation of the sound source waveform and the fluctuation of the high-frequency component are detected in step S1005, the feature quantity database 105d in the feature quantity database 105 is connected to the speech recognition unit 106 by the switch 107. If neither is detected in step S1005, the feature quantity database 105a without characteristic timbre in the feature quantity database 105 is connected to the speech recognition unit 106 via the switch 107.
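The switching in steps S1005-S1006 can be pictured as a lookup keyed on which fluctuations were detected. In this minimal sketch the database names mirror the reference numerals in the text, and the two detection flags are assumed inputs produced by the detectors above.

```python
# Sketch of the switch 107 logic (step S1006): select the feature database
# that matches the detected sound-source fluctuations.
FEATURE_DATABASES = {
    (False, False): "feature_db_105a (no characteristic timbre)",
    (True,  False): "feature_db_105b (with 'force')",
    (False, True):  "feature_db_105c (with 'blur' / breathy)",
    (True,  True):  "feature_db_105d (both timbres)",
}

def select_database(freq_fluctuation_detected: bool,
                    high_freq_fluctuation_detected: bool) -> str:
    """Return the feature database the speech recognition unit should use."""
    return FEATURE_DATABASES[(freq_fluctuation_detected,
                              high_freq_fluctuation_detected)]

print(select_database(True, False))   # -> feature_db_105b (with 'force')
```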
- The speech recognition unit 106 refers to the feature quantity database connected by the switch 107 in step S1006, performs speech recognition using the mel cepstrum coefficients extracted in step S1002, and outputs a phoneme string together with its time position information in the input speech as the recognition result (step S1007).
- The characteristic timbre generation phoneme identification unit 108 identifies at which phoneme of the input speech the characteristic timbre occurred, based on the phoneme sequence information with time position information output from the speech recognition unit 106 and the time position information of the characteristic timbre in the input speech output from the characteristic timbre detection unit 104 (step S1008).
- the prosodic information extraction unit 109 analyzes the sound source waveform output from the inverse filter 102 and extracts the fundamental frequency and the sound source power (step S 1009).
- The characteristic timbre generation index calculation unit 111 matches the fundamental frequency pattern, obtained from the fundamental frequency and sound source power information extracted by the prosody information extraction unit 109, against the phoneme sequence with time position information generated by the speech recognition unit 106, and generates accent phrase delimitation and accent information corresponding to the phoneme string (step S1010).
- Furthermore, the characteristic timbre generation index calculation unit 111 calculates a characteristic timbre generation index for each mora of the phoneme string, using the rules stored in the characteristic timbre generation index calculation rule storage unit 110 for determining the ease of generating a characteristic timbre from mora attributes such as the consonant, the vowel, the position of the mora in the accent phrase, and the relative position from the accent nucleus (step S1011). The rules for calculating the characteristic timbre generation index are created, for example, by statistical learning using Quantification Theory Type II, one of the statistical methods for handling qualitative data, with the mora attributes as explanatory variables and the binary value of whether or not a characteristic timbre occurred as the dependent variable, from speech data containing speech with characteristic timbres, so as to produce a model that numerically expresses the ease of occurrence of the characteristic timbre.
- The characteristic timbre generation index calculation rule storage unit 110 stores the statistical learning result for each type of characteristic timbre, as shown in FIG. 6.
- the characteristic timbre generation index calculation unit 111 applies a statistical model stored in the characteristic timbre generation index calculation rule storage unit 110 according to the attribute of each mora, and calculates a characteristic timbre generation index.
- For example, the characteristic timbre generation index calculation unit 111 obtains, for the first mora "a", the score of the consonant attribute (here "no consonant") together with the scores of the other attributes, and calculates the characteristic timbre generation index of the first mora "a" by adding these scores.
- In the same way, the characteristic timbre generation index is calculated as 0.79 for the second mora and 0.908 for the third mora.
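A minimal sketch of this per-mora index computation follows. The score tables stand in for the statistical model of Fig. 6, and every numeric value in them is an invented placeholder rather than a value from the patent.

```python
# Sketch of step S1011: sum attribute scores (learned offline in a
# Quantification-Theory-II-style model) to obtain a characteristic timbre
# generation index for each mora. All scores are made-up placeholders.
CONSONANT_SCORE = {"none": 0.20, "t": 0.45, "k": 0.40, "b": 0.55, "s": -0.10}
VOWEL_SCORE     = {"a": 0.30, "i": 0.05, "u": 0.00, "e": 0.10, "o": 0.25}
POSITION_SCORE  = {"head": 0.35, "mid": 0.10, "tail": -0.05}

def generation_index(consonant, vowel, position_in_accent_phrase):
    """Add the per-attribute scores for one mora."""
    return (CONSONANT_SCORE.get(consonant, 0.0)
            + VOWEL_SCORE.get(vowel, 0.0)
            + POSITION_SCORE.get(position_in_accent_phrase, 0.0))

# e.g. an utterance-initial mora with no consonant and vowel /a/:
print(generation_index("none", "a", "head"))
```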
- Next, the emotion type determination unit 113 identifies the type of characteristic timbre that occurred in the input speech from the characteristic timbre generation positions, described in phoneme units, generated by the characteristic timbre generation phoneme identification unit 108, and identifies the emotion type at each mora where a characteristic timbre occurred, with reference to the information in the emotion type determination criterion storage unit 112, for example as described in FIG. 7 (step S1012).
- That is, only for the mora at which a characteristic timbre occurred, the emotion is judged according to the table in FIG. 7, and changes of emotion in units of mora are recognized.
- For example, in the input speech "juppun hodo kakarimasu" shown in FIG. 3A, "ho" is uttered with a "breathy" ("blurred") timbre, there is no characteristic timbre in the preceding mora, and "force" is uttered in the mora "do" immediately after it. "Ho" is therefore judged from the combination of the "blur" occurrence index 2.26 of that mora and the "force" occurrence index 0.365 of the immediately following mora; according to the table in FIG. 7, the input speech is judged to contain the emotion of "happy excitement" for "ho" and "do". On the other hand, only "force" is detected as the characteristic timbre in the "kaka" part that follows "do", so that part is judged, according to the table in FIG. 7, to contain the emotion of "anger". In this way, emotions that fluctuate during a single utterance can be followed.
- Next, the emotion intensity calculation unit 115 compares the value of the characteristic timbre generation index for each mora calculated in step S1011 (for example, for "atatamarimashita" in FIG. 3B, 1.51 for the first mora "a", 0.79 for the next mora "ta", and 0.908 for the third mora "ta") with the characteristic timbre generation positions, while referring to the emotion intensity calculation rules stored in the emotion intensity calculation rule storage unit 114 shown in FIG. 8, and calculates the emotion intensity for each mora (step S1013).
- For the first mora "a", the "force" occurrence index is 1.51, which is 0.9 or more, so the ease of occurrence of "force" is judged to be "high".
- In FIG. 3B, when this "a" of "atatamarimashita" is uttered with a "powerful" voice, "force" occurs at a mora where it occurs easily, so the strength of the "anger" emotion is judged to be low.
- The next mora "ta" has a "force" occurrence index of 0.79, giving a medium ease of occurrence and therefore a medium "anger", and the third mora "ta" has an occurrence index of 0.908, so "force" occurs easily there and the "anger" is judged to be weak.
- The display unit 116 displays, for each mora, the emotion intensity calculated in step S1013 together with the output of the emotion type determination unit 113 (step S1014).
- For the input shown in FIG. 3A, in step S1012 the "ho" of "juppun hodo" is determined to express "happy excitement" based on its "blur" occurrence index of 2.26 and the "force" occurrence index of 0.365 of the following mora.
- In step S1013, the value obtained by multiplying the "blur" occurrence index 2.26 by the "force" occurrence index 0.365 is 0.8249, so the intensity of the "happy excitement" is judged to be weak.
- For "do", the "force" index is obtained by combining the index of the corresponding mora with half of the index of the adjacent mora, and the "blur" index is taken as half of the index 2.26 of the preceding mora "ho"; multiplying these gives 1.171195, so the intensity of "happy excitement" there is also judged to be weak.
- For the part in which only "force" was detected, the "force" index, computed from the index of that mora plus half of the index of the preceding mora and half of the index of the following mora, is 2.55, and the strength of "anger" is judged to be "weak".
- FIG. 9 is a diagram schematically showing the relationship between the value of the generation index and the strength of emotion.
- The horizontal axis is the index of the ease of occurrence of "force" obtained for each mora, arranged so that "force" occurs more easily toward the right.
- The vertical axes show the frequency of occurrence of mora with and without "force" in the speech and the "force" occurrence probability for each mora: the left axis of the graph shows the frequency of occurrence of mora with and without "force", and the right axis shows the probability of "force" for each mora.
- the solid line is a function created from actual speech data and shows the relationship between the index value and the frequency of occurrence of mora with “force”.
- The dotted line is also a function created from actual speech data, showing the relationship between the index value and the frequency of occurrence of mora without "force".
- From these two functions, the probability that a mora with a given index value is uttered with "force" is obtained, and this "force" occurrence probability, expressed as a percentage, is the "weakness of emotion" shown by the broken line.
- the probability of occurrence, or “weakness of emotion” has a characteristic that emotions become stronger when the occurrence index becomes smaller, and emotions become weaker when the occurrence index becomes larger.
- Emotion intensity ranges are set from the speech data used for learning, the occurrence index corresponding to each boundary of the set intensity ranges is obtained from the function, and a table such as that shown in FIG. 8 is created.
- In FIG. 8, the emotion intensity calculation rule storage unit 114 calculates the emotion intensity using a table created from the "weakness of emotion" function, but it is also possible to store the function shown in FIG. 9 itself and to calculate the "weakness of emotion", that is, the emotion intensity, directly from the function.
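The mapping from occurrence index to emotion strength can therefore be held either as the threshold table of Fig. 8 or as the underlying frequency functions. The sketch below shows both forms; the boundary values and the toy frequency functions are invented placeholders, chosen only so that a smaller index yields a stronger emotion, as the text requires.

```python
import numpy as np

# (a) Table form (cf. Fig. 8): index range -> strength label (placeholder values).
STRENGTH_TABLE = [(0.0, 0.5, "strong"), (0.5, 0.9, "medium"), (0.9, np.inf, "weak")]

def strength_from_table(occurrence_index):
    for lo, hi, label in STRENGTH_TABLE:
        if lo <= occurrence_index < hi:
            return label

# (b) Function form: the "weakness of emotion" as the conditional probability
#     that a mora with this index value is uttered with the timbre, computed
#     from fitted frequency functions for mora with / without the timbre.
def weakness_of_emotion(occurrence_index, freq_with, freq_without):
    w, wo = freq_with(occurrence_index), freq_without(occurrence_index)
    return 100.0 * w / (w + wo)        # percentage; higher means weaker emotion

# Toy stand-ins for the curves fitted to real speech data.
freq_with = lambda x: np.exp(1.5 * x)          # rises with the index
freq_without = lambda x: np.exp(2.5 - 0.5 * x) # falls with the index

print(strength_from_table(1.51), weakness_of_emotion(1.51, freq_with, freq_without))
print(strength_from_table(0.35), weakness_of_emotion(0.35, freq_with, freq_without))
```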
- According to this configuration, fluctuation of the sound source is extracted from the input speech as a characteristic timbre in which emotion is reflected, and the speech recognition accuracy is improved by switching, according to the presence or absence of sound source fluctuation, between a feature quantity database containing the characteristic timbre and one not containing it.
- The emotion and its intensity are then determined from the comparison between the ease of generating the characteristic timbre, calculated from the speech recognition result, and the presence or absence of sound source fluctuation in the actual input speech, that is, whether the characteristic timbre was actually generated in a part where it is likely to be generated.
- The accuracy of speech recognition for the characteristic timbres found in speech with emotional expression is low when only a feature quantity database created from expressionless speech data is used, but by switching to a feature quantity database created from speech data containing characteristic timbres, the speech recognition accuracy is improved.
- As the recognition accuracy improves, the calculation accuracy of the ease of generating characteristic timbres, which is computed from the phoneme sequence, also improves, and consequently the accuracy of the emotion intensity calculation improves as well.
- Furthermore, by detecting characteristic timbres in units of mora and performing emotion recognition in units of mora, changes of emotion in the input speech can be followed mora by mora. This is effective, for example when the system is used for dialogue control, for identifying which event in the course of the dialogue the user reacted to, and how.
- FIG. 10 is a functional block diagram of a modification of the emotion recognition apparatus using voice according to the first embodiment of the present invention.
- FIG. 11 is a flowchart showing the operation of the emotion recognition apparatus using voice in the modification of the first embodiment.
- FIG. 12 schematically shows the phoneme sequence of the input speech, the mora uttered with a characteristic timbre, and the values of the "force" generation index and the "blur" generation index.
- FIG. 13 shows an example of reference information for determining the type of emotion stored in the emotion type determination rule storage unit 132.
- the emotion recognition device shown in FIG. 10 has the same configuration as the emotion recognition device according to Embodiment 1 shown in FIG. 4, but is partially different. That is, the emotion type determination criterion storage unit 112 in FIG. 4 is replaced with the emotion type determination rule storage unit 132. Also, the emotion type determination unit 113 and the emotion strength calculation unit 115 are replaced with an emotion type strength calculation unit 133. Further, the emotion strength calculation rule storage unit 114 is eliminated, and the emotion type strength calculation unit 133 is configured to refer to the emotion type determination rule storage unit 132.
- The voice emotion recognition apparatus configured as described above calculates a characteristic timbre generation index for each mora in step S1011, as in Embodiment 1.
- the emotion type strength calculation unit 133 determines the type and strength of emotion according to the emotion type determination rule as shown in FIG. 13 (step S 1313).
- In Embodiment 1, for FIG. 3B, the fifth mora "ri" has a characteristic timbre generation index of 0.85, and it can be determined from FIG. 8 that the emotion is "anger" and the strength is "strong".
- In this modification, the determination result of the emotion intensity differs from the case of determining it for each mora as in Embodiment 1.
- The modification, in which the dialogue system determines the type and intensity of emotion for the input speech as a whole, is effective when the exchange between the person and the dialogue system is short and simple. For complex exchanges or long utterances, it is very important, as in Embodiment 1, to judge the type and intensity of emotion for each mora and to obtain the changes in emotion type and intensity.
- The numerical value used for emotion determination is calculated, for each characteristic timbre type, as the sum over the mora of the reciprocals of the characteristic timbre generation indices.
- Alternatively, the characteristic timbre generation index values at the positions in the input speech where the characteristic timbre occurred may be averaged for each characteristic timbre type, the proportion of mora with that characteristic timbre among all mora of the input speech may be computed as the characteristic timbre frequency, and the value obtained by multiplying the characteristic timbre frequency by the reciprocal of that average may be used for the emotion determination.
- These numerical values are ways of using the ease of occurrence of a characteristic timbre as a weight in the emotion judgment; the value may also be obtained by other methods, provided that judgment criteria matching the calculation method are stored in the emotion type determination rule storage unit 132.
- It is also possible to obtain, in step S1313, an intensity for each characteristic timbre generation index and to configure the determination rules stored in the emotion type determination rule storage unit 132 from differences in intensity for each characteristic timbre, or from the ratio of the intensities of the characteristic timbre generation indices.
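One possible reading of the utterance-level computation described above is sketched below: for each characteristic timbre type, average the generation indices at the positions where it occurred, compute its frequency over the utterance, and weight the frequency by the reciprocal of the average index. This is an illustrative interpretation, and the exact combination used in the patent may differ.

```python
from collections import defaultdict

def utterance_emotion_scores(moras):
    """moras: list of dicts like {"timbre": "pressed" | "breathy" | None,
                                  "index": float}
    Returns, per characteristic timbre type, frequency * (1 / mean index),
    an illustrative utterance-level value for emotion-type judgment."""
    total = len(moras)
    by_type = defaultdict(list)
    for m in moras:
        if m["timbre"] is not None:
            by_type[m["timbre"]].append(m["index"])
    scores = {}
    for timbre, indices in by_type.items():
        frequency = len(indices) / total          # share of mora with this timbre
        mean_index = sum(indices) / len(indices)  # ease of occurrence where it occurred
        scores[timbre] = frequency * (1.0 / mean_index)
    return scores

moras = [{"timbre": "pressed", "index": 0.35}, {"timbre": None, "index": 1.2},
         {"timbre": "pressed", "index": 0.79}, {"timbre": "breathy", "index": 2.2}]
print(utterance_emotion_scores(moras))
```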
- With this configuration, sound source fluctuation is extracted from the input speech as a characteristic timbre reflecting emotion, speech recognition is performed with improved accuracy by switching the feature quantity database according to the presence or absence of sound source fluctuation, and the ease of occurrence of the characteristic timbre is calculated using the speech recognition result.
- the emotion recognition using the characteristic timbre in the speech according to the present invention can be performed by obtaining the characteristic timbre generation index using the phoneme sequence of the speech recognition result.
- In speech recognition, however, there is a problem that the recognition accuracy is often lowered because characteristic timbres associated with emotions frequently deviate from the general acoustic model.
- Embodiment 1 solves this problem by preparing and switching between multiple acoustic models, including models for characteristic timbres, but this increases the amount of data and the offline work needed to generate the acoustic models.
- Embodiment 2 therefore shows a configuration in which the recognition result from the acoustic model is corrected using a language model to improve the recognition accuracy, and the characteristic timbre generation index is obtained from the phoneme string of the corrected speech recognition result to perform highly accurate emotion recognition.
- FIG. 14 is a functional block diagram of the voice emotion recognition apparatus according to the second embodiment of the present invention.
- FIG. 15 is a flowchart showing the operation of the speech emotion recognition apparatus according to the second embodiment.
- 16A to 16C show specific examples of the operation of the second embodiment.
- In FIG. 14, the description of the parts that are the same as in FIG. 4 is omitted, and only the parts different from FIG. 4 are described. Likewise, in FIG. 15, the description of the parts that are the same as in FIG. 5 is omitted, and only the parts different from FIG. 5 are described.
- The configuration of this emotion recognition device is the same as the functional block diagram of FIG. 4, except that the prosody information extraction unit 109 and the switch 107 are eliminated, the feature quantity database 105 is replaced with an acoustic feature quantity database 205, a language feature quantity database 206 is added, and the speech recognition unit 106 is replaced with a continuous word speech recognition unit 207, which recognizes not only the phonemes but also the linguistic information from the acoustic features and the language features based on a language model.
- Voice is input from the microphone 1 (step S1001), and the voice recognition feature quantity extraction unit 101 extracts mel cepstrum coefficients (step S1002).
- the inverse filter 102 extracts the sound source waveform (step S1003), and the periodicity analysis unit 103 outputs the time domain of the periodic signal in the input speech as the periodic signal section (step S1004).
- the characteristic timbre detection unit 104 detects the fluctuation of the sound source waveform in the periodic signal section, for example, the fundamental frequency fluctuation (jitter) of the sound source waveform and the fluctuation of the high frequency component of the sound source waveform (step S 1005).
- The continuous word speech recognition unit 207 refers to the acoustic feature quantity database 205, which stores the acoustic model, and the language feature quantity database 206, which stores the language model, and performs speech recognition using the mel cepstrum coefficients extracted in step S1002.
- The speech recognition performed by the continuous word speech recognition unit 207 is based on, for example, a probabilistic speech recognition method using an acoustic model and a language model. Recognition can generally be expressed as finding the word sequence W that maximizes log P(Y|W) + α log P(W), where Y is the acoustic observation, P(Y|W) is the acoustic model score, and P(W) is the language model score. Since the balance between the acoustic model and the language model is not always equivalent, it is necessary to weight the two models; the language model weight α expresses the ratio between them, and in general recognition processing α is held at a constant value over time.
- In contrast, the continuous word speech recognition unit 207 acquires information on the occurrence positions of the characteristic timbre detected in step S1005 and changes the language model weight α for each word, performing continuous speech recognition based on a model of the form

  W* = argmax_W [ log P(Y|W) + Σ_i α_i log P(w_i) ]

where α_i is the language model weight applied to the i-th word w_i of the word sequence W.
- That is, for words in which a characteristic timbre has been detected, the weight of the language model is increased and the weight of the acoustic model is relatively reduced (step S2006), and speech recognition is performed with reference to the acoustic feature quantity database and the language feature quantity database (step S2007).
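A minimal sketch of how a per-word language model weight changes the recognition score is given below. Hypothesis scoring is reduced to a simple sum over words; the probabilities are toy values, and the weights 0.9 and 2.3 are taken from the worked example later in this text, not from a real decoder.

```python
import math

def hypothesis_score(words, alpha_default=0.9, alpha_timbre=2.3):
    """Score one recognition hypothesis with a per-word language model weight.

    words: list of dicts with
      "acoustic_p" : P(Y_i | w_i), acoustic likelihood over the word's frames
      "lm_p"       : P(w_i | history), language model probability
      "has_timbre" : True if a characteristic timbre was detected in the word
    The weight alpha_i is raised for words uttered with a characteristic
    timbre, so the (mismatched) acoustic model counts for relatively less.
    """
    score = 0.0
    for w in words:
        alpha = alpha_timbre if w["has_timbre"] else alpha_default
        score += math.log(w["acoustic_p"]) + alpha * math.log(w["lm_p"])
    return score

# "The name is pencil", with "pencil" uttered in a 'powerful' voice (toy numbers).
hyp = [{"acoustic_p": 0.30, "lm_p": 0.20, "has_timbre": False},
       {"acoustic_p": 0.05, "lm_p": 0.15, "has_timbre": True},   # "pencil"
       {"acoustic_p": 0.40, "lm_p": 0.50, "has_timbre": False}]
print(hypothesis_score(hyp))
```

A decoder would evaluate such a score for every competing hypothesis and keep the maximum, which is what the argmax expression above formalizes.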
- the continuous word speech recognition unit 207 estimates the accent phrase boundary and accent position from the word reading information, accent information, and part-of-speech information for the word string and phoneme string as a result of speech recognition of the input speech (step S2010).
- Suppose, for example, that the phoneme sequence of the input speech is "namae wa enpitsu desu" ("the name is pencil"), and that "enpitsu" ("pencil") is uttered with the characteristic timbre "force".
- The continuous word speech recognition unit 207 acquires the information on the occurrence positions of the characteristic timbre detected in step S1005, and for the portions that do not contain the characteristic timbre it uses the language model weight α = 0.9 determined from learning data that does not contain the characteristic timbre.
- In the conventional continuous speech recognition method, the weight of the language model is fixed, so even a portion uttered with a characteristic timbre is recognized with the same weighting, and recognition with an acoustic model created from learning data that does not contain the characteristic timbre tends to fail for that portion.
- In contrast, when input speech containing a characteristic timbre is recognized with an acoustic model created from learning data that does not contain the characteristic timbre, the continuous word speech recognition unit 207 increases, in step S2006, the weight of the language model for "enpitsu" ("pencil"), which was uttered with "force", using the language model weight α = 2.3 determined from data that does contain the characteristic timbre.
- The characteristic timbre generation index calculation unit 111 acquires, from the continuous word speech recognition unit 207, the phoneme sequence, the characteristic timbre generation positions described in phoneme units, and information on the accent phrase boundaries and accent positions of the phoneme sequence.
- Using the acquired information and the rules stored in the characteristic timbre generation index calculation rule storage unit 110, which determine the ease of generating a characteristic timbre from mora attributes such as the consonant, the vowel, the position within the accent phrase, and the relative position from the accent nucleus, the characteristic timbre generation index calculation unit 111 calculates the characteristic timbre generation index for each mora of the phoneme sequence (step S1011).
- The emotion type determination unit 113 identifies the type of characteristic timbre that occurred in the input speech from the characteristic timbre generation positions, described in phoneme units, generated by the characteristic timbre generation phoneme identification unit 208, and identifies the emotion type corresponding to the characteristic timbre type contained in the input speech with reference to the information in the emotion type determination criterion storage unit 112 (step S1012).
- The emotion intensity calculation unit 115 compares the characteristic timbre generation positions of the input speech, described in phoneme units, with the characteristic timbre generation index for each mora calculated by the characteristic timbre generation index calculation unit 111 in step S1011, and calculates the emotion intensity for each mora according to the rules stored in the emotion intensity calculation rule storage unit 114, based on the relationship between the size of each mora's index and the state of the corresponding mora of the input speech (step S1013).
- the display unit 116 displays the emotion intensity for each mora as the output of the emotion type determination unit 113 calculated in step S1013 (step S1014).
- In this description, the weight of the language model applied to frames that do not contain a characteristic timbre is 0.9 and the weight applied to frames uttered with "force" is 2.3, but other values may be used as long as the language model weight is made relatively large in frames containing a characteristic timbre. Language model weights may also be set for characteristic timbres other than "force", such as "blur" and "back voice", or only two kinds of weight may be set: a language model weight applied to frames that contain a characteristic timbre and one applied to frames that do not.
- With this configuration, sound source fluctuation corresponding to a characteristic timbre that reflects emotion is extracted from the input speech, and when sound source fluctuation is present, the weighting factor α of the language model is increased in consideration of the difficulty of matching the acoustic model in the acoustic feature database, so that the relative weight of the acoustic model is reduced. As a result, misrecognition at the phoneme level due to acoustic model mismatch can be prevented, and speech recognition accuracy at the sentence level can be improved. Meanwhile, the type of emotion in the input speech is determined from the presence or absence of sound source fluctuation, and how easily the characteristic timbre occurs is calculated using the speech recognition result.
- If the characteristic timbre is generated in the input speech in a portion where the characteristic timbre occurs easily, the intensity of the emotion is determined to be low, and if the characteristic timbre is generated in a portion where the characteristic timbre is difficult to generate, the intensity of the emotion is determined to be high.
- The weight of the language model merely adjusts the balance between the existing language model and the acoustic model. For this reason, the feature amount database can be generated from a small amount of data, compared with the case of generating an acoustic model that includes characteristic timbres.
- For the characteristic timbre found in voices with emotional expression, speech recognition accuracy is low when an acoustic feature quantity database made from expressionless voice data is used; however, in portions where a characteristic timbre is generated, the weight of the acoustic model is reduced and the weight of the language model is increased. This reduces the effect of applying an inappropriate acoustic model and improves speech recognition accuracy.
- Because the speech recognition accuracy improves, the calculation accuracy of the characteristic timbre generation index computed from the phoneme sequence also improves, and therefore the accuracy of the emotion intensity calculation improves as well. Furthermore, by detecting characteristic timbres in units of phonemes and performing emotion recognition in units of phonemes, it is possible to follow emotional changes in the input speech phoneme by phoneme. For this reason, when the device is used for dialogue control and the like, it is effective for identifying how the speaker, that is, the user, reacted to which event in the course of the dialogue operation.
- FIG. 17 is a functional block diagram of a voice emotion recognition apparatus according to Embodiment 3 of the present invention.
- FIG. 18 is a flowchart showing the operation of the emotion recognition apparatus in the third embodiment.
- FIG. 19 shows an example of a phoneme input method according to the third embodiment.
- In FIG. 17, the description of the parts that are the same as those in FIG. 4 is omitted, and only the parts that differ from FIG. 4 are explained.
- Likewise, in FIG. 18, the description of the parts that are the same as those in FIG. 5 is omitted, and only the parts that differ from FIG. 5 are described.
- In FIG. 17, the speech recognition feature amount extraction unit 101 of FIG. 4 is replaced with a feature amount analysis unit 301, the feature quantity database 105 and the switch 107 are eliminated, and the speech recognition unit 106 is replaced with a phoneme input unit 306; otherwise the configuration is the same as that of FIG. 4.
- The emotion recognition device according to Embodiment 3 is a device that recognizes emotions from speech, and includes a microphone 1, a feature amount analysis unit 301, an inverse filter 102, a periodicity analysis unit 103, a characteristic timbre detection unit 104, a phoneme input unit 306, a characteristic timbre generation phoneme specification unit 108, a prosody information extraction unit 109, a characteristic timbre generation index calculation rule storage unit 110, a characteristic timbre generation index calculation unit 111, an emotion type determination criterion storage unit 112, an emotion type determination unit 113, an emotion intensity calculation rule storage unit 114, an emotion intensity calculation unit 115, and a display unit 116.
- the feature amount analysis unit 301 is a processing unit that analyzes input speech and extracts a parameter representing a spectral envelope, for example, a mel cepstrum coefficient.
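As a rough illustration of this analysis step, the sketch below extracts a framewise spectral-envelope parameterisation. MFCCs from librosa are used as a readily available stand-in for the mel-cepstrum coefficients named in the text; the file name, sampling rate, and framing are assumptions.

```python
# Hypothetical sketch of the spectral-envelope analysis. MFCCs stand in for
# the mel-cepstrum coefficients; file name and framing are assumptions.

import librosa

y, sr = librosa.load("input_speech.wav", sr=16000)
mel_cepstrum_like = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
print(mel_cepstrum_like.shape)  # (13, number_of_frames), one column per 10 ms frame
```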
- The phoneme input unit 306 is an input means by which the user enters the phoneme type corresponding to a specific section of the input waveform, and is, for example, a pointing device such as a mouse or a pen tablet. As shown in FIG. 19, the user designates a section of the input speech waveform presented on the screen with the pointing device, and then enters the phoneme type corresponding to that section from the keyboard, or selects it with the pointing device from a displayed list of phoneme types.
- First, sound is input from the microphone 1 (step S1001).
- The feature amount analysis unit 301 analyzes the input speech and extracts mel-cepstrum coefficients as the acoustic feature amounts representing the spectrum information (step S3001).
- The inverse filter 102 sets its parameters so as to form the inverse filter of the mel-cepstrum coefficients generated in step S3001, passes the speech signal input from the microphone 1 in step S1001 through this filter, and extracts the sound source waveform (step S1003).
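A simplified sketch of inverse filtering is shown below. The patent's filter is derived from the mel cepstrum; here a single whole-signal LPC inverse filter is substituted as a common approximation (a real system would work frame by frame), and the file name and LPC order are assumptions.

```python
# Simplified inverse-filtering sketch: LPC residual as a source-like waveform.

import librosa
from scipy.signal import lfilter

y, sr = librosa.load("input_speech.wav", sr=16000)
a = librosa.lpc(y, order=16)            # all-pole vocal-tract model coefficients
source_waveform = lfilter(a, [1.0], y)  # inverse filter -> source-like residual
```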
- The periodicity analysis unit 103 calculates the fundamental-wave likeness of the sound source waveform extracted in step S1003 and, based on that measure, outputs the time regions of the input speech that contain a periodic signal as periodic signal sections (step S1004).
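One way to picture this step is to flag voiced (periodic) frames and merge them into sections, as in the sketch below. librosa's pYIN estimator is used purely as a convenient periodicity detector; it is not the patent's fundamental-wave-likeness measure, and the file name and frequency range are assumptions.

```python
# Sketch of the periodicity analysis: voiced frames merged into sections.

import librosa

y, sr = librosa.load("input_speech.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
times = librosa.times_like(f0, sr=sr)

sections, start = [], None
for t, voiced in zip(times, voiced_flag):
    if voiced and start is None:
        start = t
    elif not voiced and start is not None:
        sections.append((start, t))
        start = None
if start is not None:
    sections.append((start, times[-1]))
print(sections)  # list of (start_time, end_time) periodic signal sections
```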
- the characteristic timbre detection unit 104 detects fluctuations in the sound source waveform for the periodic signal section extracted by the periodicity analysis unit 103 in step S1004 (step S1005).
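The kind of sound source fluctuation the detector looks for can be approximated, for instance, by the frame-to-frame variability of the fundamental frequency within a periodic section. The sketch below is an assumption-laden stand-in for the actual detection rules; the threshold is arbitrary.

```python
# Rough stand-in for the fluctuation check: strong frame-to-frame F0 variability
# within a periodic section is treated as "sound source fluctuation".

import numpy as np

def has_source_fluctuation(f0_section, rel_threshold=0.05):
    f0 = np.asarray([v for v in f0_section if not np.isnan(v)])
    if f0.size < 3:
        return False
    frame_to_frame = np.abs(np.diff(f0)) / f0[:-1]
    return float(np.mean(frame_to_frame)) > rel_threshold

print(has_source_fluctuation([120.0, 150.0, 118.0, 160.0]))  # True: strong fluctuation
print(has_source_fluctuation([120.0, 121.0, 120.5, 121.5]))  # False: steady voicing
```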
- the phoneme input unit 306 receives a phoneme type corresponding to a specific section of the input speech (step S3002).
- The phoneme input unit 306 outputs the designated section of the input speech and the corresponding phoneme type to the characteristic timbre generation phoneme specification unit 108 as a time position in the input speech and the phoneme information corresponding to that time position.
- Based on the phoneme sequence information with time position information output from the phoneme input unit 306 and the time positions of the characteristic timbre in the input speech output by the characteristic timbre detection unit 104, the characteristic timbre generation phoneme specification unit 108 specifies in which phonemes of the input speech the characteristic timbre is generated (step S1008).
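This step amounts to intersecting the detected timbre intervals with the phoneme segmentation. A minimal sketch, with hypothetical times and labels, is:

```python
# Intersect detected characteristic-timbre intervals with the phoneme segmentation.
# Times (seconds) and labels are hypothetical.

def phonemes_with_timbre(phoneme_segments, timbre_intervals):
    """phoneme_segments: list of (start, end, phoneme); timbre_intervals: list of (start, end)."""
    marked = []
    for p_start, p_end, phoneme in phoneme_segments:
        if any(t_start < p_end and t_end > p_start for t_start, t_end in timbre_intervals):
            marked.append(phoneme)
    return marked

segments = [(0.00, 0.12, "a"), (0.12, 0.25, "k"), (0.25, 0.40, "a")]
print(phonemes_with_timbre(segments, [(0.20, 0.30)]))  # -> ['k', 'a']
```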
- The prosodic information extraction unit 109 analyzes the sound source waveform output from the inverse filter 102 and extracts the fundamental frequency and the sound source power (step S1009).
- The characteristic timbre generation index calculation unit 111 compares the peaks and valleys of the fundamental frequency pattern and the sound source power pattern with the phoneme sequence, using the phoneme sequence with time position information input in step S3002 and the fundamental frequency and sound source power information extracted by the prosody information extraction unit 109, and generates accent phrase boundary and accent information corresponding to the phoneme sequence (step S1010).
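As a very rough illustration of using the contour's valleys, the sketch below picks pronounced minima of an F0 track as candidate accent phrase boundaries. The actual decision described above also uses the power pattern and the phoneme sequence; the prominence value here is an arbitrary assumption.

```python
# Rough illustration only: pronounced F0 valleys as candidate accent phrase boundaries.

import numpy as np
from scipy.signal import find_peaks

def candidate_accent_phrase_boundaries(f0, times):
    f0 = np.asarray(f0, dtype=float)
    filled = np.where(np.isnan(f0), np.nanmin(f0), f0)  # treat unvoiced gaps as deep valleys
    valley_idx, _ = find_peaks(-filled, prominence=10.0)
    return [times[i] for i in valley_idx]

f0 = [180, 200, 190, 150, 140, 175, 195, 160]
times = [0.01 * i for i in range(len(f0))]
print(candidate_accent_phrase_boundaries(f0, times))
```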
- Next, the characteristic timbre generation index calculation unit 111 calculates a characteristic timbre generation index for each phoneme of the phoneme sequence, using the rules stored in the characteristic timbre generation index calculation rule storage unit 110, which determine how easily the characteristic timbre occurs from phoneme attributes such as the consonant, the vowel, the position within the accent phrase, and the relative position from the accent nucleus (step S1011).
- The emotion type determination unit 113 identifies the type of characteristic timbre generated in the input speech from the characteristic timbre generation positions, described in phoneme units, produced by the characteristic timbre generation phoneme specification unit 108, and, referring to the information in the emotion type determination criterion storage unit 112, identifies the emotion type for the phonemes in which a characteristic timbre included in the input speech is generated (step S1012).
- The emotion intensity calculation unit 115 refers to the rules stored in the emotion intensity calculation rule storage unit 114 and calculates the emotion intensity for each phoneme (step S1013). This makes it possible to follow changes in emotion intensity in more detail than the emotion determination in step S1012.
- The display unit 116 displays, together with the output of the emotion type determination unit 113, the emotion intensity for each phoneme calculated in step S1013 (step S1014).
- In step S1013, the emotion intensity for each phoneme is calculated according to the rules stored in the emotion intensity calculation rule storage unit 114; however, it is also possible, as in the modification of the first embodiment, to calculate the characteristic timbre generation index for each phoneme and, based on the result, compute the emotion type and intensity of the entire utterance, as sketched below.
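An utterance-level result can be obtained, for example, by pooling the per-phoneme outputs. The sketch below takes the most frequent emotion type and its mean intensity; both the pooling rule and the example values are assumptions, not the patent's actual aggregation.

```python
# Hypothetical utterance-level pooling of per-phoneme emotion results.

def utterance_emotion(per_phoneme):
    """per_phoneme: list of (emotion_type, intensity) for phonemes with a characteristic timbre."""
    if not per_phoneme:
        return None, 0.0
    types = [t for t, _ in per_phoneme]
    dominant = max(set(types), key=types.count)
    scores = [s for t, s in per_phoneme if t == dominant]
    return dominant, sum(scores) / len(scores)

print(utterance_emotion([("anger", 0.9), ("anger", 0.4), ("cheerfulness", 0.2)]))
```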
- When voices in which the same phoneme is uttered with a characteristic timbre, but whose accent position is shifted by one phoneme from one utterance to the next, are input to the emotion recognition device of the present application, the change in the output confirms that a characteristic timbre generation index taking the phoneme type and prosodic information as parameters is calculated, and that the emotion type and intensity are estimated based on that characteristic timbre generation index.
- In the first to third embodiments, the emotion recognition apparatus using voice acquires the entire input voice before performing its processing; however, the sound input from the microphone 1 may instead be processed sequentially. In that case, the phoneme, which is the processing unit of speech recognition, is used as the unit of sequential processing, while in the second embodiment a linguistically processable unit such as a clause or phrase is used as the unit of sequential processing.
- In the first embodiment and its modifications, the second embodiment, and the third embodiment, the sound source waveform is obtained by the inverse filter of the mel cepstrum; however, methods other than the inverse filter of the mel cepstrum may be used, such as obtaining the vocal tract transfer characteristic based on a vocal tract model and deriving the sound source waveform by its inverse filter, or obtaining the sound source waveform based on a model of the sound source waveform.
- Likewise, although mel-cepstrum parameters are used for the acoustic model of speech recognition, other speech recognition methods may be used; in that case the sound source waveform may still be obtained with the inverse filter of the mel cepstrum, or by other methods.
- In the first embodiment and its modifications, the second embodiment, and the third embodiment, the frequency fluctuation of the sound source and the fluctuation of the high-frequency component of the sound source are detected as the characteristic timbres "pressed" and "husky"; however, it is also possible to detect, from the amplitude fluctuation of the sound source and the like, characteristic timbres other than "pressed" and "husky", such as the falsetto and tense voices listed in the Journal of the Acoustical Society of Japan, Vol. 51, No. 11 (1995), pp. 869-875.
- In the first embodiment and its modifications, the second embodiment, and the third embodiment, the fundamental frequency and the sound source power are extracted in step S1009, that is, immediately before the characteristic timbre generation index calculation unit 111 determines the accent phrase boundaries and accent positions; however, the fundamental frequency and the sound source power may be extracted at any timing after step S1003 and before the characteristic timbre generation index calculation unit 111 determines the accent phrase boundaries and the accent positions.
- The characteristic timbre generation index calculation unit 111 in the first embodiment and its modifications, the second embodiment, and the third embodiment uses a quantification method as the statistical learning technique, with the consonant, the vowel, the position within the accent phrase, and the relative position from the accent nucleus as explanatory variables; however, other statistical learning methods may be used, and the characteristic timbre generation index may also be calculated using continuous quantities such as the fundamental frequency pattern and the duration of the phoneme.
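Any learner that maps the categorical attributes to a likelihood of the timbre occurring could play the role of the quantification model. The sketch below uses one-hot encoding plus logistic regression as a stand-in; the attribute values, labels, and library choice are all assumptions made for illustration.

```python
# Stand-in for the quantification-based learning: one-hot encode attributes and
# fit a logistic regression whose predicted probability serves as the index.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

X_dicts = [
    {"consonant": "b", "vowel": "a", "pos_in_accent_phrase": "initial", "rel_to_nucleus": "0"},
    {"consonant": "t", "vowel": "i", "pos_in_accent_phrase": "final", "rel_to_nucleus": "+2"},
    {"consonant": "m", "vowel": "a", "pos_in_accent_phrase": "medial", "rel_to_nucleus": "-1"},
    {"consonant": "b", "vowel": "a", "pos_in_accent_phrase": "initial", "rel_to_nucleus": "+1"},
]
y = [1, 0, 0, 1]  # 1 = the characteristic timbre occurred on this mora

vec = DictVectorizer(sparse=False)
model = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

new_mora = {"consonant": "b", "vowel": "a", "pos_in_accent_phrase": "medial", "rel_to_nucleus": "0"}
index = model.predict_proba(vec.transform([new_mora]))[0, 1]
print(index)  # probability used as the characteristic timbre generation index
```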
- The input sound is assumed to be input from the microphone 1, but it may also be recorded speech or a speech signal input from outside the device.
- The recognized emotion type and intensity are displayed on the display unit 116, but they may instead be recorded in the device or output to the outside of the device.
- The speech emotion recognition apparatus according to the present invention detects a voice with a characteristic timbre that appears in various places in an utterance depending on the tension or relaxation of the vocal organs, emotion, facial expression, or speech style, and thereby recognizes the emotion or attitude of the speaker. It is useful as a speech dialogue interface for robots and the like, and can also be applied to call centers and automatic telephone answering and exchange systems. In addition, it can be applied to mobile terminals equipped with an application in which the behavior and facial expression of a character image change in accordance with the emotional changes that appear in the voice during voice communication.
Abstract
Description
Claims
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/997,458 US8204747B2 (en) | 2006-06-23 | 2007-05-21 | Emotion recognition apparatus |
JP2007541566A JP4085130B2 (ja) | 2006-06-23 | 2007-05-21 | 感情認識装置 |
CN2007800009004A CN101346758B (zh) | 2006-06-23 | 2007-05-21 | 感情识别装置 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006173937 | 2006-06-23 | ||
JP2006-173937 | 2006-06-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007148493A1 true WO2007148493A1 (ja) | 2007-12-27 |
Family
ID=38833236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/060329 WO2007148493A1 (ja) | 2006-06-23 | 2007-05-21 | 感情認識装置 |
Country Status (4)
Country | Link |
---|---|
US (1) | US8204747B2 (ja) |
JP (1) | JP4085130B2 (ja) |
CN (1) | CN101346758B (ja) |
WO (1) | WO2007148493A1 (ja) |
Families Citing this family (77)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009009722A2 (en) | 2007-07-12 | 2009-01-15 | University Of Florida Research Foundation, Inc. | Random body movement cancellation for non-contact vital sign detection |
JP5327054B2 (ja) * | 2007-12-18 | 2013-10-30 | 日本電気株式会社 | 発音変動規則抽出装置、発音変動規則抽出方法、および発音変動規則抽出用プログラム |
CN101727904B (zh) * | 2008-10-31 | 2013-04-24 | 国际商业机器公司 | 语音翻译方法和装置 |
CN101561868B (zh) * | 2009-05-19 | 2011-08-10 | 华中科技大学 | 基于高斯特征的人体运动情感识别方法 |
US8548807B2 (en) * | 2009-06-09 | 2013-10-01 | At&T Intellectual Property I, L.P. | System and method for adapting automatic speech recognition pronunciation by acoustic model restructuring |
WO2011011413A2 (en) * | 2009-07-20 | 2011-01-27 | University Of Florida Research Foundation, Inc. | Method and apparatus for evaluation of a subject's emotional, physiological and/or physical state with the subject's physiological and/or acoustic data |
JP2011033680A (ja) * | 2009-07-30 | 2011-02-17 | Sony Corp | 音声処理装置及び方法、並びにプログラム |
KR101708682B1 (ko) * | 2010-03-03 | 2017-02-21 | 엘지전자 주식회사 | 영상표시장치 및 그 동작 방법. |
KR101262922B1 (ko) * | 2009-12-10 | 2013-05-09 | 한국전자통신연구원 | 감성 변화에 따른 감성지수 결정 장치 및 그 방법 |
US8412530B2 (en) * | 2010-02-21 | 2013-04-02 | Nice Systems Ltd. | Method and apparatus for detection of sentiment in automated transcriptions |
JP2011209787A (ja) * | 2010-03-29 | 2011-10-20 | Sony Corp | 情報処理装置、および情報処理方法、並びにプログラム |
JP5610197B2 (ja) * | 2010-05-25 | 2014-10-22 | ソニー株式会社 | 検索装置、検索方法、及び、プログラム |
US8595005B2 (en) | 2010-05-31 | 2013-11-26 | Simple Emotion, Inc. | System and method for recognizing emotional state from a speech signal |
FR2962048A1 (fr) * | 2010-07-02 | 2012-01-06 | Aldebaran Robotics S A | Robot humanoide joueur, methode et systeme d'utilisation dudit robot |
CN102479024A (zh) * | 2010-11-24 | 2012-05-30 | 国基电子(上海)有限公司 | 手持装置及其用户界面构建方法 |
EP2659486B1 (en) * | 2010-12-30 | 2016-03-23 | Nokia Technologies Oy | Method, apparatus and computer program for emotion detection |
JP5602653B2 (ja) * | 2011-01-31 | 2014-10-08 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 情報処理装置、情報処理方法、情報処理システム、およびプログラム |
US8630860B1 (en) * | 2011-03-03 | 2014-01-14 | Nuance Communications, Inc. | Speaker and call characteristic sensitive open voice search |
JP5708155B2 (ja) * | 2011-03-31 | 2015-04-30 | 富士通株式会社 | 話者状態検出装置、話者状態検出方法及び話者状態検出用コンピュータプログラム |
US8756061B2 (en) | 2011-04-01 | 2014-06-17 | Sony Computer Entertainment Inc. | Speech syllable/vowel/phone boundary detection using auditory attention cues |
EP2707872A2 (en) * | 2011-05-12 | 2014-03-19 | Johnson Controls Technology Company | Adaptive voice recognition systems and methods |
JP5664480B2 (ja) * | 2011-06-30 | 2015-02-04 | 富士通株式会社 | 異常状態検出装置、電話機、異常状態検出方法、及びプログラム |
US9520125B2 (en) * | 2011-07-11 | 2016-12-13 | Nec Corporation | Speech synthesis device, speech synthesis method, and speech synthesis program |
KR101830767B1 (ko) * | 2011-07-14 | 2018-02-22 | 삼성전자주식회사 | 사용자의 감정 인식 장치 및 방법 |
KR101801327B1 (ko) * | 2011-07-29 | 2017-11-27 | 삼성전자주식회사 | 감정 정보 생성 장치, 감정 정보 생성 방법 및 감정 정보 기반 기능 추천 장치 |
US9763617B2 (en) * | 2011-08-02 | 2017-09-19 | Massachusetts Institute Of Technology | Phonologically-based biomarkers for major depressive disorder |
GB2514943A (en) * | 2012-01-24 | 2014-12-10 | Auraya Pty Ltd | Voice authentication and speech recognition system and method |
US10007724B2 (en) * | 2012-06-29 | 2018-06-26 | International Business Machines Corporation | Creating, rendering and interacting with a multi-faceted audio cloud |
US9031293B2 (en) | 2012-10-19 | 2015-05-12 | Sony Computer Entertainment Inc. | Multi-modal sensor based emotion recognition and emotional interface |
US9020822B2 (en) * | 2012-10-19 | 2015-04-28 | Sony Computer Entertainment Inc. | Emotion recognition using auditory attention cues extracted from users voice |
US9672811B2 (en) | 2012-11-29 | 2017-06-06 | Sony Interactive Entertainment Inc. | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
US9183849B2 (en) * | 2012-12-21 | 2015-11-10 | The Nielsen Company (Us), Llc | Audio matching with semantic audio recognition and report generation |
US9195649B2 (en) | 2012-12-21 | 2015-11-24 | The Nielsen Company (Us), Llc | Audio processing techniques for semantic audio recognition and report generation |
US9158760B2 (en) | 2012-12-21 | 2015-10-13 | The Nielsen Company (Us), Llc | Audio decoding with supplemental semantic audio recognition and report generation |
US9396723B2 (en) | 2013-02-01 | 2016-07-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for acoustic language model training |
CN103971677B (zh) * | 2013-02-01 | 2015-08-12 | 腾讯科技(深圳)有限公司 | 一种声学语言模型训练方法和装置 |
WO2015019345A1 (en) * | 2013-08-06 | 2015-02-12 | Beyond Verbal Communication Ltd | Emotional survey according to voice categorization |
EP3057493B1 (en) * | 2013-10-20 | 2020-06-24 | Massachusetts Institute Of Technology | Using correlation structure of speech dynamics to detect neurological changes |
US20150111185A1 (en) * | 2013-10-21 | 2015-04-23 | Paul Laroche | Interactive emotional communication doll |
CN103531208B (zh) * | 2013-11-01 | 2016-08-03 | 东南大学 | 一种基于短时记忆权重融合的航天应激情感识别方法 |
KR102191306B1 (ko) | 2014-01-22 | 2020-12-15 | 삼성전자주식회사 | 음성 감정 인식 시스템 및 방법 |
WO2015116678A1 (en) | 2014-01-28 | 2015-08-06 | Simple Emotion, Inc. | Methods for adaptive voice interaction |
US9947342B2 (en) | 2014-03-12 | 2018-04-17 | Cogito Corporation | Method and apparatus for speech behavior visualization and gamification |
EP2933067B1 (en) * | 2014-04-17 | 2019-09-18 | Softbank Robotics Europe | Method of performing multi-modal dialogue between a humanoid robot and user, computer program product and humanoid robot for implementing said method |
US9685174B2 (en) * | 2014-05-02 | 2017-06-20 | The Regents Of The University Of Michigan | Mood monitoring of bipolar disorder using speech analysis |
US11289077B2 (en) * | 2014-07-15 | 2022-03-29 | Avaya Inc. | Systems and methods for speech analytics and phrase spotting using phoneme sequences |
US20160042766A1 (en) * | 2014-08-06 | 2016-02-11 | Echostar Technologies L.L.C. | Custom video content |
US9667786B1 (en) | 2014-10-07 | 2017-05-30 | Ipsoft, Inc. | Distributed coordinated system and process which transforms data into useful information to help a user with resolving issues |
WO2016057781A1 (en) | 2014-10-08 | 2016-04-14 | The University Of Florida Research Foundation, Inc. | Method and apparatus for non-contact fast vital sign acquisition based on radar signal |
US9747276B2 (en) | 2014-11-14 | 2017-08-29 | International Business Machines Corporation | Predicting individual or crowd behavior based on graphical text analysis of point recordings of audible expressions |
US9355089B1 (en) * | 2014-12-08 | 2016-05-31 | International Business Machines Corporation | Intention detection in domain-specific information |
CN105989836B (zh) * | 2015-03-06 | 2020-12-01 | 腾讯科技(深圳)有限公司 | 一种语音采集方法、装置及终端设备 |
US9833200B2 (en) | 2015-05-14 | 2017-12-05 | University Of Florida Research Foundation, Inc. | Low IF architectures for noncontact vital sign detection |
US10997226B2 (en) | 2015-05-21 | 2021-05-04 | Microsoft Technology Licensing, Llc | Crafting a response based on sentiment identification |
EP3350806A4 (en) | 2015-09-14 | 2019-08-07 | Cogito Corporation | SYSTEMS AND METHODS FOR IDENTIFYING HUMAN EMOTIONS AND / OR MENTAL HEALTH CONDITIONS BASED ON ANALYZES OF AUDIO INPUTS AND / OR BEHAVIORAL DATA COLLECTED FROM COMPUTING DEVICES |
KR102437689B1 (ko) | 2015-09-16 | 2022-08-30 | 삼성전자주식회사 | 음성 인식 서버 및 그 제어 방법 |
CN106562792B (zh) * | 2015-10-08 | 2021-08-06 | 松下电器(美国)知识产权公司 | 信息提示装置的控制方法和信息提示装置 |
CN105334743B (zh) * | 2015-11-18 | 2018-10-26 | 深圳创维-Rgb电子有限公司 | 一种基于情感识别的智能家居控制方法及其系统 |
JP6306071B2 (ja) | 2016-02-09 | 2018-04-04 | Pst株式会社 | 推定装置、推定プログラム、推定装置の作動方法および推定システム |
CN106228976B (zh) * | 2016-07-22 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | 语音识别方法和装置 |
JP6589838B2 (ja) * | 2016-11-30 | 2019-10-16 | カシオ計算機株式会社 | 動画像編集装置及び動画像編集方法 |
EP3392884A1 (en) * | 2017-04-21 | 2018-10-24 | audEERING GmbH | A method for automatic affective state inference and an automated affective state inference system |
US10339931B2 (en) | 2017-10-04 | 2019-07-02 | The Toronto-Dominion Bank | Persona-based conversational interface personalization using social network preferences |
US10460748B2 (en) | 2017-10-04 | 2019-10-29 | The Toronto-Dominion Bank | Conversational interface determining lexical personality score for response generation with synonym replacement |
KR102525120B1 (ko) * | 2018-04-19 | 2023-04-25 | 현대자동차주식회사 | 데이터 분류 장치, 이를 포함하는 차량, 및 데이터 분류 장치의 제어방법 |
JP7159655B2 (ja) * | 2018-07-09 | 2022-10-25 | 富士フイルムビジネスイノベーション株式会社 | 感情推定システムおよびプログラム |
US11380351B2 (en) * | 2018-09-20 | 2022-07-05 | Samsung Electronics Co., Ltd. | System and method for pulmonary condition monitoring and analysis |
KR102228866B1 (ko) * | 2018-10-18 | 2021-03-17 | 엘지전자 주식회사 | 로봇 및 그의 제어 방법 |
CN110110135A (zh) * | 2019-04-17 | 2019-08-09 | 西安极蜂天下信息科技有限公司 | 声音特征数据库更新方法及装置 |
US11183201B2 (en) | 2019-06-10 | 2021-11-23 | John Alexander Angland | System and method for transferring a voice from one body of recordings to other recordings |
RU2718868C1 (ru) * | 2019-06-19 | 2020-04-15 | Федеральное Государственное Бюджетное Образовательное Учреждение Высшего Образования "Новосибирский Государственный Технический Университет" | Способ диагностики психоэмоционального состояния по голосу |
US11019207B1 (en) * | 2019-11-07 | 2021-05-25 | Hithink Royalflush Information Network Co., Ltd. | Systems and methods for smart dialogue communication |
CN110910903B (zh) * | 2019-12-04 | 2023-03-21 | 深圳前海微众银行股份有限公司 | 语音情绪识别方法、装置、设备及计算机可读存储介质 |
EP4044624A1 (en) | 2021-02-15 | 2022-08-17 | Sonova AG | Tracking happy moments of hearing device users |
CN113611326B (zh) * | 2021-08-26 | 2023-05-12 | 中国地质大学(武汉) | 一种实时语音情感识别方法及装置 |
CN114566189B (zh) * | 2022-04-28 | 2022-10-04 | 之江实验室 | 基于三维深度特征融合的语音情感识别方法及系统 |
CN115460031B (zh) * | 2022-11-14 | 2023-04-11 | 深圳市听见时代科技有限公司 | 一种基于物联网的智能音响控制监管系统及方法 |
Family Cites Families (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0922296A (ja) | 1995-07-05 | 1997-01-21 | Sanyo Electric Co Ltd | 感性情報入力処理装置及びその処理方法 |
JP3112654B2 (ja) | 1997-01-14 | 2000-11-27 | 株式会社エイ・ティ・アール人間情報通信研究所 | 信号分析方法 |
IL122632A0 (en) * | 1997-12-16 | 1998-08-16 | Liberman Amir | Apparatus and methods for detecting emotions |
US6185534B1 (en) * | 1998-03-23 | 2001-02-06 | Microsoft Corporation | Modeling emotion and personality in a computer user interface |
US6275806B1 (en) * | 1999-08-31 | 2001-08-14 | Andersen Consulting, Llp | System method and article of manufacture for detecting emotion in voice signals by utilizing statistics for voice signal parameters |
US7222075B2 (en) * | 1999-08-31 | 2007-05-22 | Accenture Llp | Detecting emotions using voice signal analysis |
US6353810B1 (en) * | 1999-08-31 | 2002-03-05 | Accenture Llp | System, method and article of manufacture for an emotion detection system improving emotion recognition |
US6480826B2 (en) * | 1999-08-31 | 2002-11-12 | Accenture Llp | System and method for a telephonic emotion detection that provides operator feedback |
US6427137B2 (en) * | 1999-08-31 | 2002-07-30 | Accenture Llp | System, method and article of manufacture for a voice analysis system that detects nervousness for preventing fraud |
US6151571A (en) * | 1999-08-31 | 2000-11-21 | Andersen Consulting | System, method and article of manufacture for detecting emotion in voice signals through analysis of a plurality of voice signal parameters |
JP2001083984A (ja) | 1999-09-09 | 2001-03-30 | Alpine Electronics Inc | インタフェース装置 |
US7280964B2 (en) * | 2000-04-21 | 2007-10-09 | Lessac Technologies, Inc. | Method of recognizing spoken language with recognition of language color |
TWI221574B (en) * | 2000-09-13 | 2004-10-01 | Agi Inc | Sentiment sensing method, perception generation method and device thereof and software |
US7139699B2 (en) * | 2000-10-06 | 2006-11-21 | Silverman Stephen E | Method for analysis of vocal jitter for near-term suicidal risk assessment |
US6970820B2 (en) * | 2001-02-26 | 2005-11-29 | Matsushita Electric Industrial Co., Ltd. | Voice personalization of speech synthesizer |
CN1159702C (zh) * | 2001-04-11 | 2004-07-28 | 国际商业机器公司 | 具有情感的语音-语音翻译系统和方法 |
EP1256937B1 (en) * | 2001-05-11 | 2006-11-02 | Sony France S.A. | Emotion recognition method and device |
AU2002230151B2 (en) * | 2001-08-06 | 2006-08-03 | Index Corporation | Apparatus for determining dog's emotions by vocal analysis of barking sounds and method for the same |
US6721704B1 (en) * | 2001-08-28 | 2004-04-13 | Koninklijke Philips Electronics N.V. | Telephone conversation quality enhancer using emotional conversational analysis |
EP1300831B1 (en) * | 2001-10-05 | 2005-12-07 | Sony Deutschland GmbH | Method for detecting emotions involving subspace specialists |
EP1326445B1 (en) * | 2001-12-20 | 2008-01-23 | Matsushita Electric Industrial Co., Ltd. | Virtual television phone apparatus |
JP3673507B2 (ja) * | 2002-05-16 | 2005-07-20 | 独立行政法人科学技術振興機構 | 音声波形の特徴を高い信頼性で示す部分を決定するための装置およびプログラム、音声信号の特徴を高い信頼性で示す部分を決定するための装置およびプログラム、ならびに擬似音節核抽出装置およびプログラム |
JP2004063953A (ja) * | 2002-07-31 | 2004-02-26 | Ube Ind Ltd | ダイシングテ−プ |
EP1391876A1 (en) * | 2002-08-14 | 2004-02-25 | Sony International (Europe) GmbH | Method of determining phonemes in spoken utterances suitable for recognizing emotions using voice quality features |
JP4204839B2 (ja) * | 2002-10-04 | 2009-01-07 | 株式会社エイ・ジー・アイ | 発想モデル装置、自発感情モデル装置、発想のシミュレーション方法、自発感情のシミュレーション方法、およびプログラム |
JP3706112B2 (ja) | 2003-03-12 | 2005-10-12 | 独立行政法人科学技術振興機構 | 音声合成装置及びコンピュータプログラム |
JP2005039501A (ja) | 2003-07-14 | 2005-02-10 | Nec Corp | 携帯電話録音サービスシステム、方法およびプログラム |
JP2005202854A (ja) * | 2004-01-19 | 2005-07-28 | Nec Corp | 画像処理装置、画像処理方法及び画像処理プログラム |
JP2005283647A (ja) | 2004-03-26 | 2005-10-13 | Matsushita Electric Ind Co Ltd | 感情認識装置 |
US7788104B2 (en) * | 2004-09-10 | 2010-08-31 | Panasonic Corporation | Information processing terminal for notification of emotion |
JP4456537B2 (ja) * | 2004-09-14 | 2010-04-28 | 本田技研工業株式会社 | 情報伝達装置 |
US20060122834A1 (en) * | 2004-12-03 | 2006-06-08 | Bennett Ian M | Emotion detection device & method for use in distributed systems |
WO2006132159A1 (ja) * | 2005-06-09 | 2006-12-14 | A.G.I. Inc. | ピッチ周波数を検出する音声解析装置、音声解析方法、および音声解析プログラム |
US8209182B2 (en) * | 2005-11-30 | 2012-06-26 | University Of Southern California | Emotion recognition system |
WO2007072485A1 (en) * | 2005-12-22 | 2007-06-28 | Exaudios Technologies Ltd. | System for indicating emotional attitudes through intonation analysis and methods thereof |
US20070192108A1 (en) * | 2006-02-15 | 2007-08-16 | Alon Konchitsky | System and method for detection of emotion in telecommunications |
KR101014321B1 (ko) * | 2009-02-24 | 2011-02-14 | 한국전자통신연구원 | 최소 분류 오차 기법을 이용한 감정 인식 방법 |
- 2007
- 2007-05-21 CN CN2007800009004A patent/CN101346758B/zh active Active
- 2007-05-21 US US11/997,458 patent/US8204747B2/en active Active
- 2007-05-21 JP JP2007541566A patent/JP4085130B2/ja active Active
- 2007-05-21 WO PCT/JP2007/060329 patent/WO2007148493A1/ja active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11119791A (ja) * | 1997-10-20 | 1999-04-30 | Hitachi Ltd | 音声感情認識システムおよび方法 |
JP2003210833A (ja) * | 2002-01-17 | 2003-07-29 | Aruze Corp | 対話ゲームシステム、対話ゲーム方法及びプログラム |
JP2004037989A (ja) * | 2002-07-05 | 2004-02-05 | Nippon Telegr & Teleph Corp <Ntt> | 音声受付システム |
JP2004259238A (ja) * | 2003-02-25 | 2004-09-16 | Kazuhiko Tsuda | 自然言語解析における感情理解システム |
JP2004310034A (ja) * | 2003-03-24 | 2004-11-04 | Matsushita Electric Works Ltd | 対話エージェントシステム |
JP2005348872A (ja) * | 2004-06-09 | 2005-12-22 | Nippon Hoso Kyokai <Nhk> | 感情推定装置及び感情推定プログラム |
JP2006071936A (ja) * | 2004-09-01 | 2006-03-16 | Matsushita Electric Works Ltd | 対話エージェント |
JP2006106711A (ja) * | 2004-09-10 | 2006-04-20 | Matsushita Electric Ind Co Ltd | 情報処理端末 |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010210730A (ja) * | 2009-03-09 | 2010-09-24 | Univ Of Fukui | 乳幼児の感情診断装置及び方法 |
WO2010148141A2 (en) * | 2009-06-16 | 2010-12-23 | University Of Florida Research Foundation, Inc. | Apparatus and method for speech analysis |
WO2010148141A3 (en) * | 2009-06-16 | 2011-03-31 | University Of Florida Research Foundation, Inc. | Apparatus and method for speech analysis |
US8788270B2 (en) | 2009-06-16 | 2014-07-22 | University Of Florida Research Foundation, Inc. | Apparatus and method for determining an emotion state of a speaker |
US9099088B2 (en) | 2010-04-22 | 2015-08-04 | Fujitsu Limited | Utterance state detection device and utterance state detection method |
JP2011242755A (ja) * | 2010-04-22 | 2011-12-01 | Fujitsu Ltd | 発話状態検出装置、発話状態検出プログラムおよび発話状態検出方法 |
US8935168B2 (en) | 2011-02-10 | 2015-01-13 | Fujitsu Limited | State detecting device and storage medium storing a state detecting program |
CN102737629A (zh) * | 2011-11-11 | 2012-10-17 | 东南大学 | 一种嵌入式语音情感识别方法及装置 |
JPWO2014069075A1 (ja) * | 2012-10-31 | 2016-09-08 | 日本電気株式会社 | 不満会話判定装置及び不満会話判定方法 |
WO2014069075A1 (ja) * | 2012-10-31 | 2014-05-08 | 日本電気株式会社 | 不満会話判定装置及び不満会話判定方法 |
CN105551499A (zh) * | 2015-12-14 | 2016-05-04 | 渤海大学 | 面向语音与面部表情信号的情感可视化方法 |
JP2017111760A (ja) * | 2015-12-18 | 2017-06-22 | カシオ計算機株式会社 | 感情推定器生成方法、感情推定器生成装置、感情推定方法、感情推定装置及びプログラム |
JPWO2018020763A1 (ja) * | 2016-07-26 | 2019-01-17 | ソニー株式会社 | 情報処理装置 |
WO2018020763A1 (ja) * | 2016-07-26 | 2018-02-01 | ソニー株式会社 | 情報処理装置、情報処理方法、およびプログラム |
JP2019124952A (ja) * | 2016-07-26 | 2019-07-25 | ソニー株式会社 | 情報処理装置、情報処理方法、およびプログラム |
WO2018122919A1 (ja) * | 2016-12-26 | 2018-07-05 | 三菱電機株式会社 | 感性表現語による検索装置 |
JP2018159788A (ja) * | 2017-03-22 | 2018-10-11 | カシオ計算機株式会社 | 情報処理装置、方法及びプログラム |
CN108630231A (zh) * | 2017-03-22 | 2018-10-09 | 卡西欧计算机株式会社 | 信息处理装置、感情识别方法以及存储介质 |
CN108630231B (zh) * | 2017-03-22 | 2024-01-05 | 卡西欧计算机株式会社 | 信息处理装置、感情识别方法以及存储介质 |
JP2021032920A (ja) * | 2019-08-15 | 2021-03-01 | 日本電信電話株式会社 | パラ言語情報推定装置、学習装置、それらの方法、およびプログラム |
JP7141641B2 (ja) | 2019-08-15 | 2022-09-26 | 日本電信電話株式会社 | パラ言語情報推定装置、学習装置、それらの方法、およびプログラム |
EP3983875A4 (en) * | 2019-09-16 | 2022-07-27 | Samsung Electronics Co., Ltd. | ELECTRONIC DEVICE AND METHOD FOR PROVIDE INSTRUCTION MANUAL THEREOF |
CN111816213A (zh) * | 2020-07-10 | 2020-10-23 | 深圳小辣椒科技有限责任公司 | 一种基于语音识别的情绪分析方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
JPWO2007148493A1 (ja) | 2009-11-19 |
CN101346758B (zh) | 2011-07-27 |
US20090313019A1 (en) | 2009-12-17 |
US8204747B2 (en) | 2012-06-19 |
JP4085130B2 (ja) | 2008-05-14 |
CN101346758A (zh) | 2009-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4085130B2 (ja) | 感情認識装置 | |
Polzin et al. | Emotion-sensitive human-computer interfaces | |
JP4914295B2 (ja) | 力み音声検出装置 | |
Ten Bosch | Emotions, speech and the ASR framework | |
US7062439B2 (en) | Speech synthesis apparatus and method | |
US6725199B2 (en) | Speech synthesis apparatus and selection method | |
US7062440B2 (en) | Monitoring text to speech output to effect control of barge-in | |
Rudzicz | Adjusting dysarthric speech signals to be more intelligible | |
US7280968B2 (en) | Synthetically generated speech responses including prosodic characteristics of speech inputs | |
US7191132B2 (en) | Speech synthesis apparatus and method | |
JPH09500223A (ja) | 多言語音声認識システム | |
JP2001215993A (ja) | 対話処理装置および対話処理方法、並びに記録媒体 | |
JP5040778B2 (ja) | 音声合成装置、方法及びプログラム | |
Sigmund | Voice recognition by computer | |
Fellbaum et al. | Principles of electronic speech processing with applications for people with disabilities | |
JPH11175082A (ja) | 音声対話装置及び音声対話用音声合成方法 | |
Picart et al. | Analysis and HMM-based synthesis of hypo and hyperarticulated speech | |
Bosch | Emotions: what is possible in the ASR framework | |
Diwakar et al. | Improving speech to text alignment based on repetition detection for dysarthric speech | |
Furui | Robust methods in automatic speech recognition and understanding. | |
Chen et al. | Optimization of dysarthric speech recognition | |
JP2000244609A (ja) | 話者状況適応型音声対話装置及び発券装置 | |
JP2000075894A (ja) | 音声認識方法及び装置、音声対話システム、記録媒体 | |
Schramm et al. | A Brazilian Portuguese language corpus development | |
Bharadwaj et al. | Analysis of Prosodic features for the degree of emotions of an Assamese Emotional Speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200780000900.4 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2007541566 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 11997458 Country of ref document: US |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07743763 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07743763 Country of ref document: EP Kind code of ref document: A1 |