IL108401A - Method and apparatus for indicating the emotional state of a person
- Publication number: IL108401A
- Authority
- IL
- Israel
- Prior art keywords
- speech
- person
- circuit
- cross
- emotional state
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/90 — Pitch determination of speech signals
- G10L17/26 — Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
- G10L25/09 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being zero crossing rates
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Description
METHOD AND APPARATUS FOR INDICATING THE EMOTIONAL STATE OF A PERSON

HASHAVSHEVET (MANUFACTURE) (1988) LTD.

The present invention relates to a method for indicating the emotional state of a person by the person's speech. The invention also relates to a speech analyzer for analyzing a person's speech in order to indicate the emotional state of the person.
Studies on the effects of emotion on the acoustic characteristics of speech have shown that average values and ranges of fundamental frequency (Fo) differ from one emotion to another. A review of the state of the art in this respect is found in an article by Iain R. Murray and John L. Arnott: Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion, appearing in J. Acoust. Soc. Am., 93:1097-1108, 1993. These studies show that the acoustic properties appearing to be among the most sensitive indicators of emotion are attributes that specify the contours of the fundamental frequency Fo throughout an utterance. Respiration is frequently a sensitive indicator in certain emotional situations, and an increase in its rate results in increased subglottal pressure during speech.
The present invention is based on the assumption that the emotional state of the speaker influences the muscle activity in the larynx and the state of the vocal cords more than other parts of the speech generating system such as the tongue, lips and jaws. Thus, any analysis of the speech signal that reflects vocal cord activity is more likely to be influenced by physiological changes brought about by the emotional state of the speaker.
Such physiological changes as increased subglottal pressure generally give rise to a narrowing of individual glottal pulses, and hence to a change in the spectrum of the pulses.
One speech analysis technique for determining emotional stress is disclosed in U.S. Patents 4,093,821 and 4,142,067 by J.D. Williamson. These patents relate to a speech analyzer which determines the emotional state of a person by analyzing the real time frequency or pitch components within the first formant band of human speech. Because of the characteristics of the first formant speech sounds, the system described in these patents analyzes an FM demodulated first formant speech signal and produces an output indicative of nulls or "flat" spots therein. By thus analyzing certain first formant frequency distribution patterns, a qualitative measure of variations in speech-related muscle tension is obtained, which in turn is correlated to the emotional state of the speaker.
This approach to pitch detection thus assumes that the speech signal can be viewed as an FM modulated signal, thereby obtaining a modulated pitch signal. However, this assumption does not always hold, and therefore the results obtained from the signal analysis with such a system do not always correlate with the subject's emotional state.
An object of the present invention is to provide a novel method of indicating the emotional state of a person by the person's speech. Another object of the invention is to provide a speech analyzer for analyzing a person's speech and for indicating the emotional state of the person.
According to one aspect of the present invention, there is provided a method of indicating the emotional state of a person by the person's speech, comprising: detecting speech waves of the person; subjecting the detected speech waves to a backward and forward inverse filtering operation to obtain residual signals; cross-correlating said residual signals to produce a cross-correlated output; integrating with maximum overlapping the backward and forward residual signal, thereby achieving an estimation of the glottal wave; and processing the above glottal wave estimation to provide an indication of the emotional state of the speaker.
According to further features in the preferred embodiment of the invention described below, said cross-correlation output is processed to produce a measurement of the pitch perturbations in the detected speech waves, which pitch perturbations are utilized to indicate the emotional state of the person. In the described preferred embodiment, the above-mentioned estimated glottal wave is also processed to produce measurements of the cepstrum coefficients, energy level, zero crossings, etc., in the detected speech waves.
These measurements are also utilized to indicate the emotional state of the person.
The invention also provides a novel speech analyzer for indicating the emotional state of a person in accordance with the above method.
Further features and advantages of the invention will be apparent from the description below.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:

Fig. 1 is a general block diagram illustrating one form of speech analyzer system constructed in accordance with the present invention;

Fig. 2 is a flow chart explaining the operation of the endpoint detector in the system of Fig. 1;

Fig. 3 is a block diagram illustrating the feature extraction circuit in the system of Fig. 1;

Figs. 4a and 4b, taken together, constitute a flow chart explaining the operation of the feature extraction circuit of Fig. 3;

Fig. 5 illustrates the pre-processed input signal to the feature extraction circuit and to the endpoint detection circuit in Fig. 1;

Fig. 6 illustrates the output signal from the endpoint detection circuit of Fig. 1 to the feature extraction circuit;

Fig. 7 illustrates the input speech signal applied to the feature extraction circuit as illustrated in Fig. 3;

Fig. 8 illustrates the input to the LPC (Linear Predictive Coding) analyzer circuit of Fig. 3;

Figs. 9a and 9b illustrate the outputs of the inverse filter circuit of Fig. 3;

Figs. 10a and 10b illustrate the outputs of the integration circuit in the system of Fig. 3;

Figs. 11a and 11b illustrate the inputs to the cross-correlation estimation circuit in Fig. 3; and

Fig. 12 illustrates the input signal to the cepstrum estimation circuit illustrated in Fig. 3.
Fig. 1 illustrates the overall system for analyzing the speech signals of a person in order to provide an indication of the emotional state of the person. The illustrated system is based on the assumption that emotion is highly correlated with the glottal speech waveform.
Thus, the system illustrated in Fig. 1 includes a detector, such as a microphone or telephone, SD, which detects the speech waves of a person and feeds them to a bandpass filter and analog-to-digital converter circuit 1. When the source of the voice is a telephone line, this circuit passes the frequency band of 60 to 3200 Hz, corresponding to the telephone-line bandwidth.
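The band-limiting stage can be sketched digitally as follows. This is an illustrative reconstruction, not the patent's circuit: it assumes a SciPy Butterworth design, a hypothetical filter order of 4, and a hypothetical 8 kHz sampling rate (typical for telephone speech).

```python
import numpy as np
from scipy.signal import butter, sosfilt

def telephone_bandpass(x, fs=8000, lo=60.0, hi=3200.0, order=4):
    """Pass only the 60-3200 Hz telephone band of a sampled speech signal."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfilt(sos, x)

fs = 8000
t = np.arange(fs) / fs
in_band = np.sin(2 * np.pi * 1000 * t)   # 1 kHz tone: inside the band
out_band = np.sin(2 * np.pi * 10 * t)    # 10 Hz drift: below the band
```

A 1 kHz tone passes nearly unattenuated, while a 10 Hz component is strongly suppressed, mimicking the rejection of line hum and drift below 60 Hz.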
The output from circuit 1 is fed to a processor, generally designated PROC. This processor includes: an endpoint detection circuit 2, whose operation is described with reference to Fig. 2; a feature extraction circuit 3, more particularly described in Fig. 3 and in the flow chart of Figs. 4a and 4b; a decision-making circuit 4; and a reference template 5 which is used in the decision-making process. The output of the decision-making circuit 4 is displayed in a display DISP.
The filtered output of circuit 1 as appearing at port 1b is shown in Fig. 5. This filtered output is fed both to input port 2a of the endpoint detection circuit 2, and to input port 3a of the feature extraction circuit 3.
The endpoint detection circuit 2 detects the start and endpoints of significant signals. For this purpose, it operates according to an algorithm based on a measured-energy criterion. Thus, if the measured energy is greater than the adaptively estimated noise energy threshold, the signal is considered significant. Fig. 2 is a flow chart illustrating the operation of the endpoint detector circuit 2. The output of this circuit, as appearing at output port 2b, is illustrated in Fig. 6, and is applied to input port 3b of the feature extraction circuit 3.
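The energy criterion can be illustrated with a simple frame-energy detector. This is a minimal sketch, not the patent's algorithm: the noise threshold here is estimated once from the first few frames (assumed speech-free) rather than adaptively tracked, and the frame size and factor k are hypothetical.

```python
import numpy as np

def detect_significant_frames(x, frame=256, noise_frames=4, k=3.0):
    """Flag frames whose energy exceeds a multiple of the estimated noise energy."""
    n = len(x) // frame
    energy = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    threshold = k * energy[:noise_frames].mean()   # noise estimate from leading frames
    return energy > threshold

rng = np.random.default_rng(0)
sig = 0.01 * rng.standard_normal(4096)                   # background noise
sig[1024:2048] += np.sin(2 * np.pi * 200 * np.arange(1024) / 8000)  # "speech" burst
mask = detect_significant_frames(sig)                    # True over frames 4-7
```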
It will thus be seen that the feature extraction circuit 3 illustrated in Fig. 1 receives, via its input port 3a, the pre-filtered and digitized signal from circuit 1, and via its input port 3b, the output of the endpoint detection circuit 2. Circuit 3 extracts certain features as described more particularly below with reference to Figs. 3, 4a and 4b, and produces an output, via its output port 3c, to the input port 4a of the decision-making circuit 4, and also to the input port 5a of the reference template circuit 5.
Circuit 5 maintains a number of reference templates regarding the features extracted by circuit 3 and decision rules determined by circuit 4, and with which the outputs of circuit 3 are compared in order to indicate the emotional state of the person whose speech is being analyzed. Thus, the reference templates could be pre-recorded templates of that person, or of the population in general, correlating the extracted features to emotional levels; alternatively, the reference templates could be based on continuously obtained and updated data from the speech of the same person correlating the features extracted by circuit 3 and the decision rules determined by circuit 4 to the emotional level of that person.
The details of the feature extraction circuit 3 are more particularly illustrated in Fig. 3. This circuit includes a gate 6 which receives via its input port 6a (corresponding to port 3a in Fig. 1) the filtered and digitized output from circuit 1; it also receives via its input port 3b (Fig. 1) the output from the endpoint detection circuit 2.
Fig. 6 illustrates the input at input port 3b from the endpoint detection circuit, and Fig. 7 illustrates the input at input port 3a from the filter and digitizer circuit 1.
The output port 6b of gate 6 is connected to the input port 7a of a pre-emphasis circuit (differentiator) 7, and also to the input port 9a of an inverse filter 9. The output of the pre-emphasis circuit 7, as appearing on port 7b, is fed to the input port 8a of an LPC-analyzer circuit 8. This circuit determines the coefficients of the filter representing the vocal tract. The output of this analyzer appears in output port 8b and is fed to the input port 9b of an inverse filter 9. The inverse filter 9 also receives, via its input port 9a, the output of gate 6.
The computed coefficients of the LPC-analyzer are used to determine the inverse filter characteristics. The inverse filter 9 performs a forward and backward filtering operation, obtaining two signals at its output port 9c.
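The LPC analysis and the two-direction inverse filtering can be sketched as follows. This is an illustrative reconstruction under common assumptions (autocorrelation-method LPC, frame-wise processing, an arbitrary model order) that the patent does not fix; filtering the time-reversed frame and reversing the result yields the backward residual.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order=10):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations for A(z)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))       # inverse-filter coefficients A(z)

def forward_backward_residuals(x, order=10):
    """Inverse-filter the frame in both time directions, giving two residuals."""
    a = lpc(x, order)
    forward = lfilter(a, [1.0], x)
    backward = lfilter(a, [1.0], x[::-1])[::-1]
    return forward, backward

# Synthetic "speech": white noise shaped by a known all-pole (AR) filter.
rng = np.random.default_rng(1)
noise = rng.standard_normal(8192)
speech = lfilter([1.0], [1.0, -0.5, 0.25], noise)
fwd, bwd = forward_backward_residuals(speech, order=2)
```

On this synthetic frame the estimated A(z) recovers the known synthesis coefficients, and the residual variance drops below the signal variance, as expected when the vocal-tract contribution has been removed.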
These two signals are the residual errors of the speech signal obtained from the inverse filter, and are shown in Figs. 9a and 9b. The residual error signals appearing in output port 9c of the inverse filter 9 are fed to the input port 10a of an integrator circuit 10. This circuit performs the integration as shown in Fig. 3, and produces an output at its output port 10b as shown in Figs. 10a and 10b.
The outputs of the integrator 10 are fed to the input port 11a of a trend estimator circuit 11. The function of this circuit is to remove any DC bias and trend that might have been built during the integration. After removal of the trend, the output signals, as shown in Figs. 11a and 11b, provide the two residual errors.
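The trend removal can be sketched as a least-squares linear detrend (`scipy.signal.detrend` performs the same operation); whether circuit 11 actually uses a linear model is an assumption here.

```python
import numpy as np

def remove_trend(y):
    """Subtract the best-fit line, removing DC bias and linear drift."""
    n = np.arange(len(y))
    slope, intercept = np.polyfit(n, y, 1)
    return y - (slope * n + intercept)

# An oscillation riding on a DC offset plus a slow drift, as integration produces.
drifting = 0.5 + 0.01 * np.arange(200) + np.sin(0.3 * np.arange(200))
flat = remove_trend(drifting)
```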
Using a predetermined range, the output signals of the trend estimator circuit 11 are inputted via input port 12a to a cross-correlator circuit 12. This circuit performs a cross-correlation operation on the two input signals and produces a cross-correlation output signal at its output port 12b.
The output of the cross-correlator circuit 12 is applied to the input port 13a of a first-peak circuit 13 for detecting the first peak in the cross-correlation output signal from circuit 12, and is also applied to a second-peak detector circuit 16 for detecting the second peak in the cross-correlation output signal.
The output from the first-peak detector circuit 13, as appearing at output port 13b, is applied to the input port 14b of a residual error circuit 14, which also receives, via its input port 14a, the output of the trend estimator circuit 11. Circuit 14 computes a new residual error, illustrated in Fig. 12, by integrating the backward and forward residual errors with overlapping of the lag of the first peak, and outputs this new residual error via its output port 14c to input port 15a of a cepstrum estimator circuit 15.
The "cepstrum" is the inverse Fourier Transform of the logarithm of the power spectrum of a signal. Cepstrum estimator 15 computes the cepstrum coefficient vector "c" using the Fast Fourier Transform (FFT) as follows:

c = IFFT(log |FFT(e(n))|²)

wherein e(n) is the new residual error signal, and IFFT is the inverse FFT. These coefficients are outputted via port 15b to the input port 4a of the decision circuit 4 in Fig. 1.
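With the definition above, the computation takes only a few lines; the small floor added before the logarithm is a numerical safeguard for zero spectral bins, not part of the patent's formula.

```python
import numpy as np

def cepstrum(e):
    """Inverse FFT of the log power spectrum of the residual signal e(n)."""
    power = np.abs(np.fft.fft(e)) ** 2
    return np.fft.ifft(np.log(power + 1e-12)).real

# For a residual repeating every 64 samples, the cepstrum peaks at quefrency 64,
# which is how it exposes the periodicity of the glottal excitation.
e = np.zeros(1024)
e[::64] = 1.0
c = cepstrum(e)
```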
Another output produced by the feature extraction circuit 3 to the input port 4a of the decision circuit 4 includes the perturbations in the pitch of the detected speech waves. These perturbations are derived by processing the output of the cross-correlation from circuit 12 in the second-peak detector circuit 16. The latter circuit receives the cross-correlation output signal via its input port 16b, and also receives the output signal from the first-peak detector circuit 13 via its input port 16a.
Circuit 16 computes the second peak according to the following criteria: (1) the difference between the amplitudes of the first and second peaks of the cross-correlation output should not be significant, and (2) the lag in time between the first and second peaks should be physiologically meaningful. Circuit 16 thus determines the second peak in the cross-correlation output signal from circuit 12, and from that, measures the lag between the first and second peaks. This lag is applied via its output port 16c to a pitch analyzer circuit 17.
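Peak picking over a physiologically plausible lag range can be sketched as below; the 20-400 sample search window (covering roughly 20-400 Hz pitch at an assumed 8 kHz sampling rate) is illustrative, since the patent only requires that the lag be physiologically meaningful.

```python
import numpy as np

def second_peak_lag(xcorr, min_lag=20, max_lag=400):
    """Lag from the zero-lag (first) peak to the strongest later peak."""
    center = len(xcorr) // 2                      # lag 0 of a 'full' correlation
    search = xcorr[center + min_lag: center + max_lag]
    return min_lag + int(np.argmax(search))

period = 80                                       # i.e. 100 Hz at fs = 8 kHz
residual = np.sin(2 * np.pi * np.arange(2048) / period)
xc = np.correlate(residual, residual, mode="full")
lag = second_peak_lag(xc)                         # recovers the pitch period
```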
The pitch analyzer circuit 17 computes the pitch contour characteristics and pitch perturbations, and outputs these signals, via its output port 17b, to the input port 4a (Fig. 1) of the decision circuit 4, with the cepstrum vector signals from the cepstrum estimator circuit 15.
The operation of the feature extraction circuit 3 illustrated in Fig. 3 is more particularly shown in the flow chart of Figs. 4a and 4b which set forth, for purposes of example, an algorithm for each step in the processing of the speech signals.
The decision circuit 4 thus receives the pitch perturbation measurements, the cepstrum coefficient measurements, and other measurements such as energy level, zero crossings, etc., from the feature extraction circuit 3 (Fig. 1), and compares these measurements with the reference templates 5, to thereby provide an indication of the emotional state of the person. As indicated earlier, the reference templates 5 may be pre-recorded references indicating emotional states of the person whose speech is being analyzed, or of the population in general; alternatively, the reference templates could be data continuously obtained and updated in a self-adapted manner from the speech of the same person.
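The comparison against reference templates can be sketched as a nearest-template decision; the feature pairs and state labels below are hypothetical, since the patent leaves the decision rules and template contents open.

```python
import numpy as np

def classify_state(features, templates):
    """Return the label of the reference template nearest to the feature vector."""
    return min(templates, key=lambda label: np.linalg.norm(features - templates[label]))

# Hypothetical templates: (pitch perturbation, energy perturbation) pairs.
templates = {
    "calm": np.array([0.01, 0.05]),
    "stressed": np.array([0.08, 0.20]),
}
state = classify_state(np.array([0.07, 0.18]), templates)
```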
The output of decision circuit 4 is displayed in display DISP.
It will thus be seen that by subjecting the detected speech waves to a backward and forward inverse filtering operation, two residual signals are obtained.
These two residual signals are then cross-correlated to produce a cross-correlation output. Also, a new residual signal is computed by integrating with maximum overlapping the forward and backward residual signals, thereby achieving an estimation of the glottal wave, whose following attributes are then extracted: pitch, cepstrum coefficients, energy level, and zero crossings (unvoiced indication). Each measurement is calculated for both its mean and its perturbation. These parameters (all or in part) provide an indication of the emotional state of the speaker.
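The per-feature mean and perturbation can be sketched as below. Taking perturbation as the mean absolute frame-to-frame change normalized by the mean (the usual jitter-style definition when applied to pitch periods) is an assumption; the patent gives no explicit formula.

```python
import numpy as np

def mean_and_perturbation(track):
    """Mean of a feature track and its relative frame-to-frame perturbation."""
    track = np.asarray(track, dtype=float)
    mean = track.mean()
    perturbation = np.mean(np.abs(np.diff(track))) / mean
    return mean, perturbation

steady_pitch = [80, 80, 81, 80, 79, 80]    # pitch periods in samples
shaky_pitch = [80, 90, 72, 95, 70, 88]     # same mean region, larger perturbation
```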
While the invention has been described with respect to one preferred embodiment, it will be appreciated that this is set forth merely for purposes of example, and that many other variations, modifications and applications of the invention may be made.
Claims (23)
1. A method of indicating the emotional state of a person by the person's speech, comprising: detecting speech waves of the person; subjecting the detected speech waves to a backward and forward inverse filtering operation to obtain residual signals; cross-correlating said residual signals to produce a cross-correlated output; integrating with maximum overlapping the backward and forward residual signal, thereby achieving an estimation of the glottal wave; and processing the above glottal wave estimation to provide an indication of the emotional state of the speaker.
2. The method according to Claim 1, wherein said cross-correlated output is processed to produce a measurement of the pitch perturbations in the detected speech waves, which pitch perturbations are utilized to indicate the emotional state of the person.
3. The method according to Claim 2, wherein said pitch perturbation measurement includes measuring the lag between first and second peaks in said cross-correlated output signal.
4. The method according to either of Claims 2 or 3, wherein said estimated glottal wave is also processed to produce a measurement of the cepstrum coefficients and/or the energy level, and/or the zero crossing in the detected speech waves, which measurement is also utilized to indicate the emotional state of the person.
5. The method according to any one of Claims 1-4, wherein said detected speech waves are subjected to a Linear Predictive Coding Analysis before being subjected to the backward and forward inverse filtering operation.
6. The method according to any one of Claims 1-5, wherein the detected speech waves are filtered by a bandpass filter and digitized, before being subjected to the backward and forward filtering operation.
7. The method according to Claim 6, wherein the end points of the detected signals, after having been filtered and digitized, are detected before being subjected to the backward and forward filtering operation.
8. The method according to any one of Claims 1-7, wherein the results of processing the glottal wave estimation are compared with a pre-recorded reference to provide an indication of the emotional state of the person.
9. The method according to any one of Claims 1-7, wherein the results of processing the glottal wave estimation are compared with a reference which is continuously obtained and updated from the speech of the same person, to provide an indication of the emotional state of the person.
10. A speech analyzer, comprising: a speech detector for detecting speech waves of the person; a backward and forward reverse filter for subjecting the detected speech waves to a backward and forward inverse filtering operation to obtain residual signals; a cross-correlator circuit for cross-correlating the residual signals to produce a cross-correlated output signal; a circuit for integrating with maximum overlapping the backward and forward signals, thereby achieving an estimation of the glottal wave; and a processor for processing the glottal wave estimation to provide an indication of the emotional state of the person.
11. The speech analyzer according to Claim 10, wherein said processor produces a measurement of the pitch perturbations in the detected waves, which pitch perturbations are utilized to indicate the emotional state of the person.
12. The speech analyzer according to Claim 11, wherein said processor also produces measurements of the cepstrum coefficients, energy level and zero crossings (both mean and perturbation) in the detected speech waves, which measurements are also utilized to indicate the emotional state of the person.
13. The speech analyzer according to Claim 10, wherein said processor includes: a first-peak measuring circuit for measuring the first peak in said cross-correlated output signal; a second-peak measuring circuit for measuring the second peak in said cross-correlated output signal, and the lag from said first peak; and a pitch analyzer circuit receiving the output of said second-peak measuring circuit for measuring the pitch perturbations in the original speech waves of the person, which pitch perturbations are utilized for indicating the emotional state of the person.
14. The speech analyzer according to Claim 13, wherein said processor further includes a cepstrum coefficient measuring circuit and/or energy measuring circuit, and/or zero crossing measuring circuit which processes the output of said first-peak measuring circuit to produce a measurement of the cepstrum coefficients in the original speech waves of the person, which measurement is also utilized for indicating the emotional state of the person.
15. The speech analyzer according to any one of Claims 10-14, wherein said processor includes a Linear Predictive Coding Analyzer for analyzing the detected speech waves before they are fed to said backward and forward reverse filter.
16. The speech analyzer according to any one of Claims 10-15, wherein said processor includes an integration circuit which integrates said residual signals outputted by the backward and forward inverse filter before they are cross-correlated in said cross-correlator circuit.
17. The speech analyzer according to Claim 16, wherein said processor further includes a trend estimator circuit receiving the output of said integration circuit and removing DC bias that may have been built up during the integration of said residual signals before they are cross-correlated in said cross-correlator circuit.
18. The speech analyzer according to any one of Claims 10-17, wherein said processor includes an end point detector circuit receiving the output of said speech detector and determining the start and end points of significant signals.
19. The speech analyzer according to any one of Claims 10-17, further including a bandpass filter between said speech detector and said backward and forward reverse filter.
20. The speech analyzer according to any one of Claims 10-19, further including a digitizer for digitizing the output of said speech detector before being fed to said backward and forward reverse filter.
21. The speech analyzer according to any one of Claims 10-20, wherein said processor includes a pre-recorded reference template against which the processed cross- correlation output signal is compared to provide an indication of the emotional state of the person.
22. The speech analyzer according to any one of Claims 10-20, wherein said processor includes a continuously-updated reference template which is continuously updated from the speech of the same person and with which the processed cross-correlated output signal is compared to provide an indication of the emotional state of the person.
23. The speech analyzer according to any one of Claims 11-22, further including a display for displaying the output of said processor.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IL10840194A IL108401A (en) | 1994-01-21 | 1994-01-21 | Method and apparatus for indicating the emotional state of a person |
PCT/US1995/000493 WO1995020216A1 (en) | 1994-01-21 | 1995-01-13 | Method and apparatus for indicating the emotional state of a person |
AU15664/95A AU1566495A (en) | 1994-01-21 | 1995-01-13 | Method and apparatus for indicating the emotional state of a person |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IL10840194A IL108401A (en) | 1994-01-21 | 1994-01-21 | Method and apparatus for indicating the emotional state of a person |
Publications (2)
Publication Number | Publication Date |
---|---|
IL108401A0 IL108401A0 (en) | 1994-04-12 |
IL108401A true IL108401A (en) | 1996-12-05 |
Family
ID=11065722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
IL10840194A IL108401A (en) | 1994-01-21 | 1994-01-21 | Method and apparatus for indicating the emotional state of a person |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU1566495A (en) |
IL (1) | IL108401A (en) |
WO (1) | WO1995020216A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6006188A (en) * | 1997-03-19 | 1999-12-21 | Dendrite, Inc. | Speech signal processing for determining psychological or physiological characteristics using a knowledge base |
IL122632A0 (en) * | 1997-12-16 | 1998-08-16 | Liberman Amir | Apparatus and methods for detecting emotions |
IL129399A (en) * | 1999-04-12 | 2005-03-20 | Liberman Amir | Apparatus and methods for detecting emotions in the human voice |
US6665644B1 (en) * | 1999-08-10 | 2003-12-16 | International Business Machines Corporation | Conversational data mining |
ES2177437B1 (en) * | 2000-12-13 | 2003-09-01 | Neturin S L | ANIMAL ANALYZING DEVICE FOR MAMMALS. |
IL144818A (en) | 2001-08-09 | 2006-08-20 | Voicesense Ltd | Method and apparatus for speech analysis |
US10068588B2 (en) | 2014-07-21 | 2018-09-04 | Microsoft Technology Licensing, Llc | Real-time emotion recognition from audio signals |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3855417A (en) * | 1972-12-01 | 1974-12-17 | F Fuller | Method and apparatus for phonation analysis lending to valid truth/lie decisions by spectral energy region comparison |
US3855418A (en) * | 1972-12-01 | 1974-12-17 | F Fuller | Method and apparatus for phonation analysis leading to valid truth/lie decisions by vibratto component assessment |
US4093821A (en) * | 1977-06-14 | 1978-06-06 | John Decatur Williamson | Speech analyzer for analyzing pitch or frequency perturbations in individual speech pattern to determine the emotional state of the person |
1994
- 1994-01-21 IL IL10840194A patent/IL108401A/en not_active IP Right Cessation

1995
- 1995-01-13 WO PCT/US1995/000493 patent/WO1995020216A1/en active Application Filing
- 1995-01-13 AU AU15664/95A patent/AU1566495A/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
WO1995020216A1 (en) | 1995-07-27 |
AU1566495A (en) | 1995-08-08 |
IL108401A0 (en) | 1994-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP0235181B1 (en) | A parallel processing pitch detector | |
EP0722164B1 (en) | Method and apparatus for characterizing an input signal | |
EP1083541B1 (en) | A method and apparatus for speech detection | |
JPH08505715A (en) | Discrimination between stationary and nonstationary signals | |
JPS62204652A (en) | Audible frequency signal identification system | |
Wendt et al. | Pitch determination and speech segmentation using the discrete wavelet transform | |
IL108401A (en) | Method and apparatus for indicating the emotional state of a person | |
US20010044714A1 (en) | Method of estimating the pitch of a speech signal using an average distance between peaks, use of the method, and a device adapted therefor | |
US6954726B2 (en) | Method and device for estimating the pitch of a speech signal using a binary signal | |
Vieira et al. | Robust F/sub 0/and jitter estimation in pathological voices | |
US20060150805A1 (en) | Method of automatically detecting vibrato in music | |
JP2564821B2 (en) | Voice judgment detector | |
US20010029447A1 (en) | Method of estimating the pitch of a speech signal using previous estimates, use of the method, and a device adapted therefor | |
KR100526110B1 (en) | Method and System for Pith Synchronous Feature Generation of Speaker Recognition System | |
KR100194953B1 (en) | Pitch detection method by frame in voiced sound section | |
JP2648779B2 (en) | Call signal identification device | |
KR19990011286A (en) | Speech Segmentation Method for Speech Recognition in Noisy Environments | |
KR102443221B1 (en) | Apparatus and method for sleep sound analysis | |
US11881200B2 (en) | Mask generation device, mask generation method, and recording medium | |
JPH0114599B2 (en) | ||
JP2557497B2 (en) | How to identify male and female voices | |
KR0171004B1 (en) | Basic frequency using samdf and ratio technique of the first format frequency | |
JP2880683B2 (en) | Noise suppression device | |
CN116229988A (en) | Voiceprint recognition and authentication method, system and device for personnel of power dispatching system | |
JPH0424717B2 (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FF | Patent granted | ||
KB | Patent renewed | ||
RH | Patent void |