GB2354363A - Apparatus detecting the presence of speech - Google Patents

Apparatus detecting the presence of speech

Info

Publication number
GB2354363A
Authority
GB
United Kingdom
Prior art keywords
signal
energy
speech
input signal
energy based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9909423A
Other versions
GB2354363B (en)
GB9909423D0 (en)
Inventor
David Llewellyn Rees
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB9909423A priority Critical patent/GB2354363B/en
Publication of GB9909423D0 publication Critical patent/GB9909423D0/en
Publication of GB2354363A publication Critical patent/GB2354363A/en
Application granted granted Critical
Publication of GB2354363B publication Critical patent/GB2354363B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Speech is detected by treating the average frame energy of an input speech signal as a sampled signal and looking for modulations within the sampled signal that are characteristic of speech. The apparatus includes a compensation unit 77 for compensating for variations in the dynamic range of the input signal caused by different distances between the user and the microphone.

Description

SPEECH PROCESSING APPARATUS AND METHOD

The present invention relates to a speech processing apparatus and method. The invention has particular, although not exclusive, relevance to the detection of speech within an input speech signal.
In some applications, such as speech recognition, speaker verification and voice transmission systems, the microphone used to convert the user's speech into a corresponding electrical signal is continuously switched on. Therefore, even when the user is not speaking, there will constantly be an output signal from the microphone corresponding to silence or background noise. In order (i) to prevent unnecessary processing of this background noise signal; (ii) to prevent misrecognitions caused by the noise; and (iii) to increase overall performance, such systems employ speech detection circuits which continuously monitor the signal from the microphone and which only activate the main speech processing when speech is identified in the incoming signal.
Most prior art devices detect the beginning and end of speech by monitoring the energy within the input signal, since during silence the signal energy is small but during speech it is large. In particular, in the conventional systems speech is detected by comparing the average energy with a threshold and waiting for it to be exceeded, indicating that speech has started. In order for this technique to be able to accurately determine the points at which speech starts and ends (the so-called end points), the threshold has to be set to a value near the noise floor. This system works well in an environment with a low, constant level of noise. However, it is not suitable in many environments where there is a high level of noise which can change significantly with time. Examples of such environments include the inside of a car, near a road, or a crowded public place. The noise in these environments can mask quieter portions of speech, and changes in the noise level can cause noise to be detected as speech.
One aim of the present invention is to provide an alternative system for detecting speech within an input signal.
According to one aspect, the present invention provides an apparatus for detecting the presence of speech within an input signal, comprising: first processing means for processing the input signal to generate an energy based signal which varies with the energy within the input signal; second processing means for processing the energy based signal to compensate for variations in the dynamic range of the input signal; and means for detecting the presence of speech within the input signal using the compensated energy signal. By using such an apparatus, the system can detect the presence of speech for varying distances between the user and the microphone.
Preferably, the compensated energy signal is filtered to remove energy variations which have a frequency below a predetermined frequency and the detecting means detects the presence of speech from the filtered energy signal.
Such an apparatus has the advantage that it can detect the presence of speech more accurately even in environments where there are high levels of noise. This is possible because changes in the noise level are usually relatively slow (less than 1Hz) compared with the energy variations caused by speech (which occur around 4 or 5Hz).
The present invention also provides a corresponding method and computer software for programming a programmable processor to carry out the method of the present invention.
An exemplary embodiment of the invention will now be described with reference to the accompanying drawings in which:
Figure 1 is a schematic view of a computer which may be programmed to operate an embodiment of the present invention;
Figure 2 is a schematic overview of a speech recognition system;
Figure 3 is a block diagram of a preprocessor which forms part of the system shown in Figure 2, which illustrates some of the processing circuitry that is used to process the input speech signal;
Figure 4 is a diagrammatical representation of the division of the input speech signal S(t) into a series of time frames;
Figure 5 is a diagrammatical representation of a typical speech signal for a single time frame;
Figure 6 is a block diagram showing, in more detail, an end point detector which forms part of the preprocessor shown in Figure 3;
Figure 7 is a plot illustrating the way in which the frame magnitude of two utterances of the same word varies with time;
Figure 8a is a plot of a compensation function used to scale the frame magnitudes to reduce the effect of dynamic range variation of the input signal;
Figure 8b is a plot of the frame magnitude of the two input utterances illustrated in Figure 7 after compensation using the compensation function illustrated in Figure 8a;
Figure 8c is a plot illustrating the form of alternative compensation functions which may be used to compensate for dynamic range variations of the input signal;
Figure 9a is a plot of the frame magnitude of an input speech signal after compensation, which illustrates the way in which the frame magnitude changes at the beginning and end of speech within the input signal;
Figure 9b is a plot of the modulation power of the magnitude signal shown in Figure 9a within a frequency band centred around 4Hz;
Figure 10a is a flowchart which illustrates part of the steps taken by a control unit which forms part of the end point detector shown in Figure 6;
Figure 10b is a flowchart which illustrates the remaining steps taken by the control unit shown in Figure 6;
Figure 11 is a plot of the frame magnitude signal shown in Figure 9a after being filtered to remove low frequency variations and the DC offset; and
Figure 12 is a block diagram illustrating the processing circuits employed in an alternative end point detector for detecting the beginning and end of speech within an input signal.
Embodiments of the present invention can be implemented in computer hardware, but the embodiment to be described is implemented in software which is run in conjunction with processing hardware such as a personal computer, workstation, photocopier, facsimile machine or the like.
Figure 1 shows a personal computer (PC) 1 which may be programmed to operate an embodiment of the present invention. A keyboard 3, a pointing device 5, a microphone 7 and a telephone line 9 are connected to the PC 1 via an interface 11. The keyboard 3 and pointing device 5 enable the system to be controlled by a user. The microphone 7 converts the acoustic speech signal of the user into an equivalent electrical signal and supplies this to the PC 1 for processing. An internal modem and speech receiving circuit (not shown) may be connected to the telephone line 9 so that the PC 1 can communicate with, for example, a remote computer or with a remote user.
The programme instructions which make the PC 1 operate in accordance with the present invention may be supplied for use with an existing PC 1 on, for example, a storage device such as a magnetic disc 13, or by downloading the software from the internet (not shown) via the internal modem and the telephone line 9.
The operation of the speech recognition system of this embodiment will now be briefly described with reference to Figure 2. A more detailed description of the speech recognition system can be found in the Applicant's earlier European patent application EP 0789349, the content of which is hereby incorporated by reference. Electrical signals representative of the input speech from, for example, the microphone 7 are applied to a preprocessor 15 which converts the input speech signal into a sequence of parameter frames, each representing a corresponding time frame of the input speech signal. The sequence of parameter frames is supplied, via buffer 16, to a recognition block 17 where the speech is recognised by comparing the input sequence of parameter frames with reference models or word models 19, each model comprising a sequence of parameter frames expressed in the same kind of parameters as those of the input speech to be recognised.
A language model 21 and a noise model 23 are also provided as inputs to the recognition block 17 to aid in the recognition process. The noise model is representative of silence or background noise and, in this embodiment, comprises a single parameter frame of the same type as those of the input speech signal to be recognised. The language model 21 is used to constrain the allowed sequence of words output from the recognition block 17 so as to conform with sequences of words known to the system. The word sequence output from the recognition block 17 may then be transcribed for use in, for example, a word processing package or can be used as operator commands to initiate, stop or modify the action of the PC 1.
A more detailed explanation will now be given of the preprocessor described above.
PREPROCESSOR

The functions of the preprocessor 15 are to extract the information required from the speech and to reduce the amount of data that has to be processed. There are many different types of information which can be extracted from the input signal. In this embodiment the preprocessor 15 is designed to extract "formant" related information. Formants are defined as being the resonant frequencies of the vocal tract of the user, which change as the shape of the vocal tract changes.
Figure 3 shows a block diagram of some of the preprocessing that is performed on the input speech signal. Input speech S(t) from the microphone 7 or the telephone line 9 is supplied to filter block 61, which removes frequencies within the input speech signal that contain little meaningful information. Most of the information useful for speech recognition is contained in the frequency band between 300Hz and 4kHz. Therefore, filter block 61 removes all frequencies outside this frequency band. Since no information which is useful for speech recognition is filtered out by the filter block 61, there is no loss of recognition performance.
Further, in some environments, for example in a motor vehicle, most of the background noise is below 300Hz and the filter block 61 can result in an effective increase in signal-to-noise ratio of approximately 10dB or more. The filtered speech signal is then converted into 16 bit digital samples by the analogue-to-digital converter (ADC) 63. To adhere to the Nyquist sampling criterion, ADC 63 samples the filtered signal at a rate of 8000 times per second. In this embodiment, the whole input speech utterance is converted into digital samples and stored in a buffer (not shown), prior to the subsequent steps in the processing of the speech signals.
After the input speech has been sampled it is divided into non-overlapping equal length frames in block 65. The reason for this division of the input speech into frames will now be described in more detail. As mentioned above, during continuous speech the formant related information changes continuously, the rate of change being directly related to the rate of movement of the speech articulators, which is limited by physiological constraints. Therefore, in order to track the changing formant frequencies, the speech signal must be analysed over short time periods or frames, this method being known in the art of speech analysis as a "short time" analysis of speech. There are two considerations that have to be addressed when performing a short time analysis: (i) at what rate the time frames should be extracted from the speech signal; and (ii) how large a time frame should be used.
The first consideration depends on the rate of movement of the speech articulators, i.e. the frames should be sufficiently close to ensure that important events are not missed and to ensure that there is reasonable continuity. The second consideration is determined by a compromise between the time frame being short enough so that the speech signal's properties during the frame are constant, and the frame being long enough to give sufficient frequency detail so that the formants can be distinguished.
In this embodiment, in order to reduce the amount of computation required, both in the front end processing and later in the recognition stage, non-overlapping frames of 128 samples (corresponding to 16 milliseconds of speech) are directly extracted from the speech without a conventional windowing function. This is illustrated in Figures 4 and 5, which show a portion of an input signal S(t) and the division of the signal into non-overlapping frames and one of these frames S_k(r), respectively. In a conventional system, overlapping frames are usually extracted using a window function which reduces frequency distortions caused by extracting the frames from the speech signal. The applicant has found, however, that with non-overlapping frames, these conventional windowing functions worsen rather than improve recognition performance.
The speech frames S_k(r) output by the block 65 are then written into a circular buffer 66 which can store 62 frames, corresponding to approximately one second of speech. The frames written in the circular buffer 66 are also passed to an endpoint detector 68 which processes the frames to identify when the speech in the input signal begins and, after it has begun, when it ends. Until speech is detected within the input signal, the frames in the circular buffer are not fed to the computationally intensive feature extractor 70. However, when the endpoint detector 68 detects the beginning of speech within the input signal, it signals the circular buffer to start passing the frames received after the start of speech point to the feature extractor 70, which then extracts a set of parameters for each frame representative of the speech signal within the frame.
SPEECH DETECTION

The way in which the endpoint detector 68 operates in this embodiment will now be described with reference to Figures 6 to 11. In this embodiment, speech is detected by treating the frame energy of the input signal as a sampled signal and looking for modulations within that sampled signal that are characteristic of speech. In particular, the energy due to speech is strongly modulated at frequencies around 4Hz, with very little modulation below 1Hz or above 10Hz. In contrast, changes in noise level tend to occur relatively slowly, typically modulating the signal energy at less than 1Hz. In addition, random fluctuations in the noise energy are uncorrelated from frame to frame and are spread over the modulation frequency range from 0Hz to half the frame rate. Therefore, in this embodiment, the endpoint detector 68 is arranged to detect the presence of speech by band-pass filtering a measure of the frame energy in a frequency band between 2Hz and 6Hz, by calculating the modulation power within this frequency band and by applying a detection threshold to the calculated modulation power.
A more detailed description will now be given of the way in which speech is detected in the input signal. As shown in Figure 6, each frame (S_k(r)) of input speech is input to an energy based calculation unit 76 which, in this embodiment, is arranged to calculate the average frame magnitude, since this avoids the need to calculate squares which would be required to calculate the frame energy. In particular, the energy based calculation unit 76 calculates the following for each frame of speech:
e_k = (1/N) · Σ_{r=1}^{N} |S_k(r)|                    (1)

where N is the number of samples in each frame (which in this embodiment is 128). The frame magnitude (e_k) output by the calculation unit 76 is then input to a compensation unit 77 which tries to compensate for the effects of varying dynamic range of the input signal caused by, for example, the user moving away from or closer to the microphone.
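As a concrete illustration of the front end up to this point, the following sketch (in Python with NumPy, which the patent does not prescribe; the function name is illustrative only) divides a sampled signal into non-overlapping 128-sample frames and evaluates equation (1) for each frame:

```python
import numpy as np

FRAME_LEN = 128                  # 16 ms frames at the 8kHz sampling rate
FRAME_RATE = 8000 / FRAME_LEN    # 62.5 frame magnitudes per second

def frame_magnitudes(samples: np.ndarray) -> np.ndarray:
    """Average frame magnitude e_k of equation (1): the mean of |S_k(r)|
    over each non-overlapping 128-sample frame (no window function)."""
    n_frames = len(samples) // FRAME_LEN
    frames = samples[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return np.abs(frames).mean(axis=1)
```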
Figure 7 illustrates the way in which the frame magnitude varies within two utterances of a word, one spoken a few centimetres from the microphone (signal 71) and one spoken a metre from the microphone (signal 73). As shown in Figure 7, the dynamic range of variation 75 of the utterance 71 is much larger than the dynamic range of variation 79 for the utterance 73. As those skilled in the art will appreciate, this can cause problems with speech detectors which detect the beginning and end of speech by comparing the energy with a fixed threshold Th. In particular, such a system would detect the beginning and end of speech to be at times t_0 and t_5 respectively for signal 71, and at times t_1 and t_2 and at times t_3 and t_4 respectively for signal 73. To try to compensate for this, some prior art systems employ a variable threshold Th. However, this only complicates the speech detector and is only as good as the algorithm used to dynamically change the threshold value.
Figure 8a is a plot of the compensation function 81 employed in the compensation unit 77. As shown, the compensation function 81 has a steeper slope for smaller values of the input frame magnitude than it has for larger values of input frame magnitude, i.e. it has a larger gain for smaller frame magnitudes than it does for larger frame magnitudes. This has the effect of increasing the dynamic range of input signals having small dynamic ranges whilst keeping the dynamic range of input signals having large dynamic ranges approximately the same. This is illustrated in Figure 8a by the double-headed arrows 83-1 and 83-2, which represent the input dynamic range and the output dynamic range respectively for an input signal having a relatively small input dynamic range, and 85-1 and 85-2, which represent the input dynamic range and the output dynamic range respectively for an input signal having a relatively large input dynamic range.
Figure 8b is a plot showing how the frame magnitude signals 71 and 73 shown in Figure 7 vary after compensation by the compensation unit 77. As shown, the compensation unit 77 has increased the frame magnitude for the utterance 73 so that the two magnitude signals are much more alike. In this embodiment, the compensation function 81 can be expressed mathematically by:
e_k^out = (e_k^in)^α                    (2)

where, in this embodiment, α is a constant between zero and one. A compensation function of this form ensures that the gradient of the compensation function will be greater for smaller values of input frame magnitude than for larger values of input frame magnitude. This is because changes in the output frame magnitude are scaled by the input frame magnitude. In particular, when α = ½, a change in the input magnitude results in the following change in the output magnitude:
de_k^out = de_k^in / (2·√(e_k^in))                    (3)

As those skilled in the art will appreciate, various compensation functions can be used which will achieve the same result. Figure 8c illustrates the form of three other compensation functions 87, 89 and 91 which can be employed. The compensation function 87 can also be expressed by equation (2) above, but with a value of α closer to zero. The compensation function 89 is a linear piece-wise function having three linear regions and compensation function 91 has two linear regions. Various other compensation functions could be employed, provided that the gradient of the compensation function is steeper for smaller values of input frame magnitude than it is for larger values.
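A minimal sketch of this compensation step follows, again in illustrative Python. The power-law form implements equation (2) with α = ½ as in this embodiment; the piecewise-linear variant is a hypothetical stand-in for functions such as 91, since the patent does not specify any breakpoints or slopes:

```python
import numpy as np

def compensate(e_in: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Equation (2): e_out = e_in ** alpha with 0 < alpha < 1, so the
    effective gain is larger for smaller frame magnitudes."""
    return np.power(e_in, alpha)

def compensate_piecewise(e_in: np.ndarray, knee: float = 0.1,
                         g_low: float = 4.0, g_high: float = 0.5) -> np.ndarray:
    """Two-segment piecewise-linear alternative (cf. function 91): a steep
    slope below the knee and a shallower one above it. All constants here
    are assumptions for illustration only."""
    return np.where(e_in < knee,
                    g_low * e_in,
                    g_low * knee + g_high * (e_in - knee))
```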
Figure 9a is a plot illustrating the way in which the compensated frame magnitude (e_k^out) varies within an example input signal. The input signal comprises two portions 72-1 and 72-2 which correspond to background noise and which bound a speech containing portion 74. As shown in Figure 9a, the frame magnitude during the background noise portions does not fluctuate much with time. In contrast, in the speech containing portion 74 the frame magnitude fluctuates considerably with time and has a larger mean value.
As mentioned above, in this embodiment, the energy based signal shown in Figure 9a is bandpass filtered by a bandpass filter having cut-off frequencies of 2Hz and 6Hz and having a peak response at about 4Hz. The modulation power of the bandpass filtered signal is then determined and this is plotted in Figure 9b for the magnitude signal shown in Figure 9a. As shown, the modulation power in regions 72-1 and 72-2 is relatively small compared with the modulation power during the speech portion 74. This will be the same regardless of the amount of energy within the background noise. Therefore, by comparing this bandpass modulation power for each frame with a fixed detection threshold Th, the start of speech (SOS) and the end of speech (EOS) can be detected more accurately than with the conventional approach described above, especially in noisy environments.
This bandpass filtering and modulation power calculation is performed by the bandpass filter and modulation power calculation unit 82. In particular, the calculation unit 82 takes a sequence of compensated magnitude values defined by a sliding window of fixed size, ending at the compensated magnitude value for the last received frame (k), and computes the first non-DC coefficient of a discrete Fourier transform of the compensated frame magnitudes in the sliding window. That is, the calculation unit 82 calculates a bandpass modulation power, w_k, for frame k, from:
w_k = | Σ_{n=0}^{N-1} e_{k-n}^out · exp(-i·2πn/N) |²                    (4)

where N is the number of frames in the window. In this embodiment N is set to 16, which corresponds to a bandpass filter with peak response at about 4Hz. The value of w_k for each frame is then compared with a detection threshold Th in a threshold circuit 84, which outputs a control signal to the control unit 86 identifying whether or not the bandpass modulation power for the current frame is above or below the detection threshold.
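In code, the sliding-window calculation of equation (4) might look as follows (illustrative Python; the 16-frame window is the value given above, and the helper name is hypothetical). With frames arriving every 16 ms, DFT bin 1 of a 16-frame window sits at 62.5/16 ≈ 3.9Hz, which is where the 4Hz peak response comes from:

```python
import numpy as np

N_WIN = 16   # frames; 16 x 16 ms = 256 ms window, peak response near 4Hz

def bandpass_modulation_power(e_out: np.ndarray, k: int, n: int = N_WIN) -> float:
    """Equation (4): squared magnitude of the first non-DC DFT coefficient
    of the n compensated frame magnitudes ending at frame k (needs k >= n-1)."""
    window = e_out[k - n + 1 : k + 1]
    coeff = np.sum(window * np.exp(-2j * np.pi * np.arange(n) / n))
    return float(np.abs(coeff) ** 2)
```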
Depending on the application, the control unit 86 could cause the feature extractor 70 to commence processing of the input signal as soon as the threshold circuit 84 detects that the bandpass modulation power w_k exceeds the detection threshold Th. However, in this embodiment, a more accurate determination of the start of speech and of the end of speech is made in order to ensure there is minimum processing of background signals by the feature extractor 70, to reduce recognition errors caused by the noise and to improve recognition performance. In this embodiment this is achieved using a maximum likelihood calculation which is performed when the control unit 86 identifies that the bandpass modulation power, w_k, exceeds the detection threshold Th for a predetermined number of frames.
Figure 10 shows the control steps performed by the control unit 86 in deciding when to perform the maximum likelihood calculation. In this embodiment, the control unit 86 has two states, an INSPEECH state and an INSILENCE state. When the control unit 86 is in the INSILENCE state, it searches for the beginning of speech and when it is in the INSPEECH state, it searches for the end of speech. As shown in Figure 10a, in step S1, the control unit 86 determines if it is in the INSPEECH state. If it is not, then processing proceeds to step S3 where the control unit 86 determines if the bandpass modulation power w_k for the current frame k is greater than the detection threshold Th, from the signal received from the threshold circuit 84. If it is not, then processing proceeds to step S5 where k is incremented and the same procedure is carried out again for the next frame. If the bandpass modulation power w_k is greater than the detection threshold Th, then the processing proceeds from step S3 to step S7 where a count (CNTABV) of the number of frames above the detection threshold Th is incremented. This count CNTABV is then compared with a predefined number NDTCT (which indicates that speech has started) in step S9. In this embodiment NDTCT is 18, which corresponds to 288 milliseconds of input speech.
If the number of frames above the threshold, i.e. CNTABV, is not greater than the predetermined number NDTCT, then the frame number k is incremented in step S13 and, in step S15, the control unit 86 determines if the bandpass modulation power w_k for the next frame is above the detection threshold Th. If it is, then the processing returns to step S7 where the count CNTABV of the number of frames above the threshold is incremented. If the bandpass modulation power w_k is less than the threshold at step S15, then processing proceeds to step S17, where the count (CNTBLW) of the number of consecutive frames below the threshold is incremented. Subsequently, in step S19, the count CNTBLW of the number of consecutive frames below the threshold is compared with a predetermined number NHLD (indicating that the control unit 86 should stop counting and wait for the threshold to be exceeded again). In this embodiment, NHLD is 6, which corresponds to 96 milliseconds of input signal.
If the count CNTBLW is greater than the predetermined number NHLD, then both the counts CNTABV and CNTBLW are reset in step S21 and the processing returns to step S5 where the control unit 86 waits, through the action of steps S3 and S5, for the next frame which is above the detection threshold Th. If, at step S19, the number of consecutive frames which are below the threshold is not greater than the predetermined number NHLD, then processing proceeds to step S23 where the frame number k is incremented. In step S25, the control unit 86 then determines if the bandpass modulation power w_k for the next frame is above the detection threshold Th. If it is not, then the processing returns to step S17, where the count CNTBLW of the number of consecutive frames below the threshold is incremented. If, on the other hand, the control unit 86 determines, in step S25, that the bandpass modulation power w_k for the next frame is above the detection threshold Th, then the processing passes from step S25 to step S27, where the count of the number of frames which are below the detection threshold is reset to zero, and the processing returns to step S7, where the count of the number of frames which are above the detection threshold is incremented. Once the count CNTABV is above NDTCT, indicating speech has started, the processing proceeds from step S9 to step S28, where the control unit 86 initiates the calculation of the start of speech point using a maximum likelihood calculation on recent frames.
The state of the control unit 86 is then changed to INSPEECH in step S29 and the processing returns to step S1.
Therefore, to summarise, when the control unit 86 is in the state INSILENCE and when the bandpass modulation power first exceeds the detection threshold Th, the control unit 86 starts counting the number of frames above the threshold and the number of consecutive frames below the threshold. If the number of consecutive frames below the threshold exceeds NHLD, the algorithm stops counting and waits for the threshold to be exceeded again. If this does not happen before the count CNTABV of the number of frames above the threshold exceeds NDTCT, then the state is changed to INSPEECH and the start point is calculated using recent frames. Full processing of the data by the feature extractor 70 can then begin after the start of speech has been calculated.
Once the start of speech has been determined, the control unit 86 is programmed to look for the end of speech. In particular, referring to Figure 10a again, at step S1, after the start of speech has been calculated in step S28 and the state of the controller has been set to INSPEECH, the processing will pass from step S1 to step S31 shown in Figure 10b, where the control unit 86 checks to see if the bandpass modulation power w_k for the current frame k is below the detection threshold Th. If w_k is above the detection threshold, then the processing loops to step S33 where the frame counter k is incremented and the control unit checks the bandpass modulation power of the next frame. When the control unit 86 identifies a frame having a bandpass modulation power below the threshold, the processing proceeds to step S35, where the count CNTBLW of the number of consecutive frames below the threshold is incremented. Processing then proceeds to step S37 where the control unit 86 checks if the number of consecutive frames below the threshold exceeds a predetermined number NEND, which indicates that the speech has ended. In this embodiment, NEND is 14, corresponding to 224 milliseconds.
If the number of consecutive frames is less than NEND, then speech has not ended and the processing proceeds to step S39, where the frame counter k is incremented. Processing then proceeds to step S41 where the control unit 86 determines if the bandpass modulation power for the next frame is below the detection threshold Th. If it is not, then the count CNTBLW of the number of consecutive frames below the detection threshold is reset in step S43 and processing returns to step S33. If, at step S41, the bandpass modulation power is still below the detection threshold, then the processing returns to step S35, where the count of the number of consecutive frames below the threshold is incremented. Once the number of consecutive frames below the threshold has exceeded NEND, the processing proceeds to step S45, where the control unit 86 initiates the calculation of the end point of speech using a maximum likelihood calculation with recent frames. The state of the control unit 86 is then changed to INSILENCE in step S47 and the processing returns to step S1.
Therefore, in summary, after the beginning of speech has been determined, the control unit 86 continuously looks for the end of speech. This is done by the control unit 86 counting the number of consecutive frames below the detection threshold and when this number exceeds a predetermined number, NEND, the control unit 86 changes state to INSILENCE and the end of speech is calculated.
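The two counting loops of Figures 10a and 10b reduce to a small state machine. The sketch below (illustrative Python; the class and method names are hypothetical) consumes one boolean per frame, namely whether w_k exceeded Th, and reports when the maximum likelihood start- or end-point calculation should be triggered:

```python
from typing import Optional

NDTCT, NHLD, NEND = 18, 6, 14   # 288 ms, 96 ms and 224 ms of 16 ms frames

class EndpointControl:
    """Sketch of the control unit 86 of Figures 10a and 10b."""

    def __init__(self) -> None:
        self.state = "INSILENCE"
        self.cnt_abv = 0   # CNTABV: frames above the detection threshold
        self.cnt_blw = 0   # CNTBLW: consecutive frames below the threshold

    def step(self, above_threshold: bool) -> Optional[str]:
        if self.state == "INSILENCE":
            if above_threshold:
                self.cnt_abv += 1
                self.cnt_blw = 0                 # step S27
                if self.cnt_abv > NDTCT:         # steps S9/S28: speech began
                    self.state = "INSPEECH"
                    self.cnt_abv = self.cnt_blw = 0
                    return "START"               # run the ML start-point calc
            elif self.cnt_abv > 0:               # count only once triggered
                self.cnt_blw += 1
                if self.cnt_blw > NHLD:          # step S21: give up and reset
                    self.cnt_abv = self.cnt_blw = 0
        else:                                    # INSPEECH: look for the end
            if above_threshold:
                self.cnt_blw = 0                 # step S43
            else:
                self.cnt_blw += 1
                if self.cnt_blw > NEND:          # step S45: speech ended
                    self.state = "INSILENCE"
                    self.cnt_blw = 0
                    return "END"                 # run the ML end-point calc
        return None
```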
MAXIMUM LIKELIHOOD END-POINT DETECTION

As mentioned above, the beginning and end points of the speech within the input signal are calculated using a maximum likelihood method. In particular, the likelihood of an end point occurring at a particular frame is calculated and the frame with the largest likelihood is chosen as the end point. Again, the compensated frame magnitude signal is used in the likelihood calculation and a simple model for this parameter is assumed.
Referring to Figure 6, when the control unit 86 identifies that speech has started, it outputs a control signal on line 88 to the buffer 78 which causes the N most recent frame magnitudes to be read out of the buffer 78 and input to a high pass filter 90. The filter 90 removes the DC offset and any slowly varying noise contribution in the magnitude signal and outputs the filtered magnitudes to buffer 92. In this embodiment, the filter 90 is a second order recursive filter, with a cut-off frequency of 1Hz. Figure 11 shows the output of the high-pass filter 90 for the magnitude signal shown in Figure 9a. As shown, the filtered frame magnitude fluctuates about zero during the silence portions 72-1 and 72-2 but oscillates during the speech portion 74. As a result, it is assumed that during the silence portions, the filtered frame magnitudes are uncorrelated from frame to frame, whereas in the speech portion, the filtered frame magnitude depends upon the filtered frame magnitude of its neighbouring frames.
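A sketch of this high-pass stage is given below in illustrative Python. The patent specifies only a second order recursive filter with a 1Hz cut-off at the 62.5Hz frame rate; the Butterworth design used here is an assumption, not the patent's coefficients:

```python
import numpy as np
from scipy.signal import butter, lfilter

FRAME_RATE = 8000 / 128      # 62.5 frame magnitudes per second

# Assumed design: 2nd-order Butterworth high-pass with a 1 Hz cut-off.
b, a = butter(2, 1.0, btype="highpass", fs=FRAME_RATE)

def highpass_magnitudes(e_out: np.ndarray) -> np.ndarray:
    """Remove the DC offset and slowly varying noise contribution from the
    compensated frame magnitude signal (filter 90, cf. Figure 11)."""
    return lfilter(b, a, e_out)
```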
The maximum likelihood endpoint calculation unit 94 then processes the N filtered magnitudes stored in the buffer 92 by taking each point as a possible start point (i.e. as being the end point between noise and speech), treating all frames before this point as noise and all frames after this point as speech, and applying each of the designated noise frames to a noise model and each of the designated speech frames to a speech model to give a likelihood score for that point being the end point. This process is performed for each of the N frames in the buffer 92 and the one that gives the best likelihood score is determined to be the end point.
In this embodiment Laplacian statistics are used to model the noise and speech portions, and the likelihood L_1 that frames 1 to M in the buffer 92 are silence is given by:

L_1 = (2σ_1²)^(-M/2) · exp( -(√2/σ_1) · Σ_{i=1}^{M} |y_i| )                    (5)

where y_i is the high-pass filtered magnitude and σ_1² is the silence variance. Similarly, the likelihood L_2 that frames M+1 to N are speech is given by:
L_2 = (2σ_2²)^(-(N-M)/2) · exp( -(√2/σ_2) · Σ_{i=M+1}^{N} |y_i - b·y_{i-1}| )                    (6)

where a first order auto-regressive process with a Laplacian driving term with variance σ_2² has been used.
The parameter b is the prediction coefficient of the auto-regressive model and, in this embodiment, a fixed value of 0.8 is used. The Laplacian statistics were found to be more representative of the data than the more usual Gaussian statistics, lead to more robust estimates and require less computation. However, Gaussian statistics can be used. Multiplying the likelihoods L_1 and L_2 gives the likelihood for a transition from silence to speech at frame M.
The variances σ_1 and σ_2 are unknown, but the values which maximise the likelihood can be calculated from the data by differentiating equations (5) and (6) and finding the values which make the differentials equal to zero. This gives the following expressions for σ_1 and σ_2:
σ_1(M) = (√2/M) · Σ_{i=1}^{M} |y_i|                    (7)

σ_2(M) = (√2/(N-M)) · Σ_{i=M+1}^{N} |y_i - b·y_{i-1}|                    (8)

Substituting these estimates into the likelihood, taking logarithms and neglecting constant terms gives the following log likelihood to be maximised:
l(M) = -M·ln σ_1(M) - (N-M)·ln σ_2(M)                    (9)

This is calculated for each M, and the frame with the largest l(M) is then chosen as the end point.
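The maximum likelihood search of equations (7) to (9) can be sketched as follows (illustrative Python; the function name is hypothetical). Each candidate frame M splits the buffer into noise before and speech after, and the M with the largest l(M) is returned:

```python
import numpy as np

B = 0.8   # fixed prediction coefficient b of the auto-regressive model

def ml_end_point(y: np.ndarray) -> int:
    """Return the frame index M maximising the log likelihood l(M) of
    equation (9) over the high-pass filtered magnitudes y[0..N-1]: frames
    before M are modelled as Laplacian noise (equation (7)), frames from M
    onwards as a first-order auto-regressive process (equation (8)).
    For the end of speech, pass the time-reversed buffer y[::-1]."""
    n = len(y)
    best_m, best_l = 1, -np.inf
    for m in range(1, n - 1):
        sigma1 = np.sqrt(2.0) / m * np.sum(np.abs(y[:m]))
        sigma2 = np.sqrt(2.0) / (n - m) * np.sum(np.abs(y[m:] - B * y[m - 1:-1]))
        if sigma1 <= 0.0 or sigma2 <= 0.0:   # degenerate split, skip it
            continue
        log_l = -m * np.log(sigma1) - (n - m) * np.log(sigma2)
        if log_l > best_l:
            best_m, best_l = m, log_l
    return best_m
```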
The same algorithm is used to calculate the end of speech (EOS), except that the data is time reversed. Additionally, it is important to ensure that there are enough frames of silence and enough frames of speech included in the window of N frames to allow a reliable end point estimate. This is ensured by dynamically choosing the window size (N) to include a sufficient number of silence and speech frames. This is achieved by taking all the frames since the first time the detection threshold Th is exceeded up until the control unit decides that speech has started, together with the 16 frames which immediately precede the first frame which exceeded the detection threshold.
FEATURE EXTRACTION

As mentioned above, after the endpoint detector detects the beginning of speech, it signals the circular buffer 66 shown in Figure 3 to start passing the speech frames received from that point to the feature extractor 70. The particular operation of the feature extractor 70 and of the subsequent recognition processing is not relevant to the present invention. Accordingly, a description of these features will not be given here. The reader is directed to the applicant's earlier European application EP 0789349 for a description of these features.
Figure 12 is a block diagram illustrating the form of a different speech detector for detecting the beginning and end of speech within an input signal. As shown, the frames of speech are input to the energy-based calculation unit 76 which, as in the first embodiment, calculates the average frame magnitude (e_k) which is passed to the compensation unit 77. The compensated frame magnitudes are then passed to a threshold circuit 84 which compares the compensated frame magnitudes with a fixed threshold (Th) and the result is passed to the control unit 86. As in the first embodiment, the control unit controls the feeding of the signal frames to the feature extractor in dependence upon the output from the threshold circuitry. In this embodiment, the control unit sends frames to the feature extractor unit as soon as the threshold is exceeded by the compensated frame magnitude.
In the above embodiments, the average frame magnitude was used as a measure of the average frame energy. This was because the calculation of the average frame magnitude does not require the calculation of squares which would be required to calculate the average frame energy. However, as those skilled in the art will appreciate, the average frame energy could be used.
In the above embodiments, a compensation function having the form of equation (2) was employed. As those skilled in the art will appreciate, the value of the constant α in equation (2) can be user definable (via an appropriate user interface), so that the device will work over a desired range of variation of the input signal's dynamic range. Alternatively, the value of α can be dynamically changed in dependence upon the past use of the device or on a measure of the recognition performance, such as the number of recognition errors.
As those skilled in the art will appreciate, this type of end point detection can be employed in various other applications, such as speaker verification systems and the like.

Claims (65)

CLAIMS:
1. An apparatus for detecting the presence of speech within an input signal, comprising:
means for receiving the input signal; first processing means for processing the received signal to generate an energy based signal which varies with local energy within the received signal; second processing means for processing the energy based signal to compensate for varying dynamic range variations of the input signal to provide a compensated energy based signal; and means for detecting the presence of speech in said input signal using said compensated energy based signal.
2. An apparatus according to claim 1, wherein said detecting means comprises means for filtering said energy signal to remove energy variations which have a frequency below a predetermined frequency to provide a filtered energy signal and means for detecting the presence of speech in the input signal using said filtered energy signal.
3. An apparatus according to claim 2, wherein said filtering means is operable to remove energy variations which have a frequency above a predetermined frequency.
4. An apparatus according to claim 3, wherein said filtering means is operable to filter out energy variations below 2Hz and above 10Hz.
5. An apparatus according to claim 3 or 4, wherein said filtering means is operable to pass energy variations which have a frequency of approximately 4Hz.
6. An apparatus according to any of claims 2 to 4, wherein said detecting means is operable to compare said filtered energy signal with a predetermined threshold and to detect the presence of speech in dependence upon the result of said comparison.
7. An apparatus according to any of claims 2 to 6, wherein said processing means is operable to divide the input speech signal into a number of successive time frames and to determine a measure of the energy of the input signal in each of said time frames to generate said energy based signal.
8. An apparatus according to claim 7, wherein the input signal is sampled and wherein said measure of the energy is the average magnitude of the samples within the time frame.
9. An apparatus according to claim 7, wherein said input signal is sampled and wherein said measure of the energy is the average sample energy within the time frame.
10. An apparatus according to any of claims 7 to 9, further comprising modulation power determination means for determining the modulation power of the filtered signal within a predetermined frequency band.
11. An apparatus according to claim 10, wherein said filtering means and said modulation power determining means are operable to filter and determine the modulation power in discrete portions of said energy based signal.
12. An apparatus according to claim 11, wherein said filtering means and said power modulation determining means are formed by a discrete Fourier transform means which is operable to determine the first non-DC coefficient of a discrete Fourier transform of each discrete portion of said energy based signal.
13. An apparatus according to claim 1, wherein said detecting means is operable to compare said compensated energy signal with a predetermined threshold and to detect the presence of speech in dependence upon the result of said comparison.
14. An apparatus according to claim 1 or 13, wherein said first processing means is operable to divide the input speech signal into a number of successive time frames and to determine an energy based measurement of the input signal in each of said time frames to generate said energy based signal.
15. An apparatus according to claim 14, wherein the input signal is sampled and wherein said measure of the energy is the average magnitude of the samples within the time frame.
16. An apparatus according to claim 14, wherein said input signal is sampled and wherein said measure of the energy is the average sample energy within the time frame.
17. An apparatus according to any preceding claim, further comprising means for determining the boundary between a speech containing portion and a background noise containing portion in said input signal.
18. An apparatus according to claim 17, wherein said boundary determining means is operable for determining the likelihood that said boundary is located at each of a plurality of possible locations within said compensated energy signal and means for determining the location which has the largest likelihood associated therewith.
19. An apparatus according to claim 18 when dependent upon claim 6, wherein said boundary determining means is only operable to determine said likelihoods when said filtered energy signal exceeds said threshold.
20. An apparatus according to claim 18, wherein said boundary determining means is only operable to determine said likelihoods when said compensated energy signal exceeds a predetermined threshold.
21. An apparatus according to any preceding claim, wherein said second processing means is arranged to process said energy based signal so that a change in the compensated energy based signal caused by a given change in the input energy based signal is greater at lower input energy levels than at higher input energy levels.
22. An apparatus according to any preceding claim, wherein said second processing means is operable to apply a variable gain to said energy based signal to generate said compensated energy based signal.
23. An apparatus according to claim 22, wherein said second processing means is operable to apply a gain to said energy based signal which is inversely proportional to the energy level of the energy based signal.
24. An apparatus according to any preceding claim, wherein said second processing means is operable to generate said compensated energy signal using the following equation:
e_k^out = (e_k^in)^α

where e_k^out is the compensated energy based signal, e_k^in is the input energy based signal to be compensated and α has a value between zero and one.
25. An apparatus according to claim 24, wherein α is a constant between zero and one.
26. An apparatus according to claim 25, wherein α is user definable.
27. An apparatus according to claim 25, wherein α equals a half.
28. An apparatus according to claim 24, further comprising means for varying α in dependence upon a characteristic of a subsequent speech processing apparatus which processes the detected speech.
29. An apparatus for detecting the presence of speech within an input signal, comprising: means for receiving the input signal and for dividing the input signal into a sequence of time frames; means for determining an energy based measurement representative of the energy of the input signal within each time frame to generate an energy based signal which varies over the input signal; means for processing said energy based signal to compensate for variations in the dynamic range of variation of the input signal, to provide a compensated energy based signal; and means for detecting the presence of speech in said input signal using said compensated energy based signal.
30. An apparatus for detecting the presence of speech within an input signal, comprising: means for receiving a sequence of energy based measurements determined from a corresponding sequence of time portions of the input signal; means for processing each of said energy based measurements to compensate for variations in the dynamic range of variation of the input signal, to provide a compensated sequence of energy based measurements; and means for detecting the presence of speech in said input signal using said compensated sequence of energy based measurements.
31. A speech processing apparatus comprising:
means for receiving an input signal comprising a speech containing portion bounded by noise portions representative of background noise; an apparatus according to any of claims 1 to 30 for detecting the speech containing portion in the input signal; and means for comparing the speech containing portion with one or more reference models representative of speech signals and for outputting a comparison result.
32. A method of detecting the presence of speech within an input signal, comprising the steps of: receiving the input signal; a first processing step of processing the received signal to generate an energy based signal which varies with local energy within the received signal; a second processing step of processing the energy based signal to compensate for varying dynamic range variations of the input signal to provide a compensated energy based signal; and detecting the presence of speech in said input signal using said compensated energy based signal.
33. A method according to claim 32, wherein said detecting step comprises the steps of filtering said energy signal to remove energy variations which have a frequency below a predetermined frequency to provide a filtered energy signal and detecting the presence of speech in the input signal using said filtered energy signal.
34. A method according to claim 33, wherein said filtering step removes energy variations which have a frequency above a predetermined frequency.
35. A method according to claim 34, wherein said filtering step filters out energy variations below 2Hz and above 10Hz.
36. A method according to claim 34 or 35, wherein said filtering step passes energy variations which have a frequency of approximately 4Hz.
37. A method according to any of claims 33 to 35, wherein said detecting step compares said filtered energy signal with a predetermined threshold and detects the presence of speech in dependence upon the result of said comparison.
38. A method according to any of claims 33 to 37, wherein said processing step divides the input speech signal into a number of successive time frames and determines a measure of the energy of the input signal in each of said time frames to generate said energy based signal.
39. A method according to claim 38, wherein the input signal is sampled and wherein said measure of the energy is the average magnitude of the samples within the time frame.
40. A method according to claim 38, wherein said input signal is sampled and wherein said measure of the energy is the average sample energy within the time frame.
41. A method according to any of claims 38 to 40, further comprising the step of determining the modulation power of the filtered signal within a predetermined frequency band.
42. A method according to claim 41, wherein said filtering step and said modulation power determining step filter and determine the modulation power in discrete portions of said energy based signal.
43. A method according to claim 42, wherein said filtering step and said power modulation determining step determines the first non-DC coefficient of a discrete Fourier transform of each discrete portion of said energy based signal.
44. A method according to claim 32, wherein said detecting step compares said compensated energy signal with a predetermined threshold and detects the presence of speech in dependence upon the result of said comparison.
45. A method according to claim 32 or 44, wherein said first processing step divides the input speech signal into a number of successive time frames and determines an energy based measurement of the input signal in each of said time frames to generate said energy based signal.
46. A method according to claim 45, wherein the input signal is sampled and wherein said measure of the energy is the average magnitude of the samples within the time frame.
47. A method according to claim 45, wherein said input signal is sampled and wherein said measure of the energy is the average sample energy within the time frame.
48. A method according to any of claims 32 to 47, further comprising the step of determining the boundary between a speech containing portion and a background noise containing portion in said input signal.
49. A method according to claim 48, wherein said boundary determining step determines the likelihood that said boundary is located at each of a plurality of possible locations within said compensated energy signal and determines the location which has the largest likelihood associated therewith.
50. A method according to claim 49 when dependent upon claim 37, wherein said boundary determining step determines said likelihoods when said filtered energy signal exceeds said threshold.
51. A method according to claim 49, wherein said boundary determining step only determines said likelihoods when said compensated energy signal exceeds a predetermined threshold.
52. A method according to any of claims 32 to 51, wherein said second processing step processes said energy based signal so that a change in the compensated energy based signal caused by a given change in the input energy based signal is greater at lower input energy levels than at higher input energy levels.
53. A method according to any of claims 32 to 52, wherein said second processing step applies a variable gain to said energy based signal to generate said compensated energy based signal.
54. A method according to claim 53, wherein said second processing step applies a gain to said energy based signal which is inversely proportional to the energy level of the energy based signal.
55. A method according to any preceding claim, wherein said second processing step generates said compensated energy signal using the following equation:
e_k^out = (e_k^in)^α

where e_k^out is the compensated energy based signal, e_k^in is the input energy based signal to be compensated and α has a value between zero and one.
56. A method according to claim 55, wherein α is a constant between zero and one.
57. A method according to claim 56, wherein α is user definable.
58. A method according to claim 56, wherein α equals a half.
59. A method according to claim 53, wherein α is varied in dependence upon a characteristic of a subsequent speech processing system which processes the detected speech.
60. A method of detecting the presence of speech within an input signal, comprising the steps of:
receiving the input signal and dividing the input signal into a sequence of time frames; determining an energy based measurement representative of the energy of the input signal within each time frame to generate an energy based signal which varies over the input signal; processing said energy based signal to compensate for variations in the dynamic range of variation of the input signal, to provide a compensated energy based signal; and detecting the presence of speech in said input signal using said compensated energy based signal.
61. A method of detecting the presence of speech within an input signal, comprising the steps of: receiving a sequence of energy based measurements determined from a corresponding sequence of time portions of the input signal; processing each of said energy based measurements to compensate for variations in the dynamic range of variation of the input signal, to provide a compensated sequence of energy based measurements; and detecting the presence of speech in said input signal using said compensated sequence of energy based measurements.
62. A computer readable medium carrying instructions for configuring a programmable processor to be configured as a speech detecting apparatus according to any of claims 1 to 31.
63. A signal carrying instructions for configuring a programmable processing circuit as a speech detection apparatus according to any of claims 1 to 31.
64. A computer readable medium storing computer executable process steps for detecting the presence of speech within an input signal, the process steps comprising:
steps for receiving the input signal; steps for processing the received signal to generate an energy based signal which varies with local energy within the received signal; steps for processing the energy based signal to compensate for varying dynamic range variations of the input signal to provide a compensated energy based signal; and steps for detecting the presence of speech in the input signal using said compensated energy based signal.
65. Computer executable process steps for detecting the presence of speech within an input signal, the process steps comprising:
steps for receiving the input signal; steps for processing the received signal to generate an energy based signal which varies with local energy within the received signal; steps for processing the energy based signal to compensate for varying dynamic range variations of the input signal to provide a compensated energy based signal; and steps for detecting the presence of speech in the input signal using said compensated energy based signal.
GB9909423A 1999-04-23 1999-04-23 Speech processing apparatus and method Expired - Fee Related GB2354363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB9909423A GB2354363B (en) 1999-04-23 1999-04-23 Speech processing apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB9909423A GB2354363B (en) 1999-04-23 1999-04-23 Speech processing apparatus and method

Publications (3)

Publication Number Publication Date
GB9909423D0 GB9909423D0 (en) 1999-06-23
GB2354363A true GB2354363A (en) 2001-03-21
GB2354363B GB2354363B (en) 2003-09-03

Family

ID=10852171

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9909423A Expired - Fee Related GB2354363B (en) 1999-04-23 1999-04-23 Speech processing apparatus and method

Country Status (1)

Country Link
GB (1) GB2354363B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5826230A (en) * 1994-07-18 1998-10-20 Matsushita Electric Industrial Co., Ltd. Speech detection device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5826230A (en) * 1994-07-18 1998-10-20 Matsushita Electric Industrial Co., Ltd. Speech detection device

Also Published As

Publication number Publication date
GB2354363B (en) 2003-09-03
GB9909423D0 (en) 1999-06-23

Similar Documents

Publication Publication Date Title
EP0996110B1 (en) Method and apparatus for speech activity detection
US5826230A (en) Speech detection device
US5970441A (en) Detection of periodicity information from an audio signal
EP0996111B1 (en) Speech processing apparatus and method
JPH08506427A (en) Noise reduction
EP1116224A1 (en) Noise suppression for low bitrate speech coder
WO2000036592A1 (en) Improved noise spectrum tracking for speech enhancement
WO1996002911A1 (en) Speech detection device
CA2192397C (en) Method and system for performing speech recognition
JP3105465B2 (en) Voice section detection method
EP1001407B1 (en) Speech processing apparatus and method
US6965860B1 (en) Speech processing apparatus and method measuring signal to noise ratio and scaling speech and noise
JP4965891B2 (en) Signal processing apparatus and method
JPH08160994A (en) Noise suppression device
GB2354363A (en) Apparatus detecting the presence of speech
JPH09171397A (en) Background noise eliminating device
KR20000056849A (en) method for recognizing speech in sound apparatus
KR100345402B1 (en) An apparatus and method for real-time speech detection using pitch information
JPH04230798A (en) Noise predicting device
JPH0844390A (en) Voice recognition device
Hadri et al. Improvement of Arab Digits Recognition Rate Based in the Parameters Choice
Trompetter et al. Noise reduction algorithms for cochlear implant systems
JPH06208393A (en) Voice recognizing device
MXPA96006483A (en) Method and system for performing speech recognition
JPH05289695A (en) Voice recognition system under noise

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20160423