MXPA97006033A

MXPA97006033A - Detection of activity of

Info

Publication number: MXPA97006033A
Application number: MXPA/A/1997/006033A
Authority: MX
Inventors: Anthony Bridges James
Original assignee: Anthony Bridges James; British Telecommunications Public Limited Company
Priority date: 1995-02-15
Filing date: 1997-08-07
Publication date: 1998-07-03

Abstract

The present invention relates to a speech activity detector comprising an input for receiving an output speech signal transmitted from a speech system to a user and an input for receiving an input signal from the user. Both the output and input signals are divided into time-limited frames. Means are provided for calculating an aspect of each frame of the input signal and for forming a function of the calculated aspect and a threshold. Based on the function, it is determined whether or not the input signal includes speech. Means are also provided for determining the echo return loss during an output speech signal from the interactive speech system and for controlling the threshold depending on the measured echo return loss.

Description

VOICE ACTIVITY DETECTION

DESCRIPTION OF THE INVENTION

This invention relates to the detection of speech activity. Many automatic systems depend on speech detection for their operation, for example automatic speech systems and cellular radio coding systems. Such systems monitor the transmission paths from the user's equipment for the occurrence of speech and, when speech occurs, take appropriate action. Unfortunately, transmission paths are rarely noise-free. Systems that are simply arranged to detect activity on the path may therefore take action incorrectly if noise is present. The usual noise is line noise (i.e. noise that is present regardless of whether or not a signal is being transmitted) and background noise from the telephone conversation, such as a barking dog, the sound of a television, the noise of a car engine, etc.

Another source of noise in communication systems is echo. For example, echo in a public switched telephone network (PSTN) is essentially caused by electrical and/or acoustic coupling, for example at the four-wire to two-wire interface of a conventional exchange, or by acoustic coupling in a telephone handset from the earpiece to the microphone. The acoustic echo varies during a call because the air path varies, i.e. when the speaker alters the position of his head between the microphone and the loudspeaker. Similarly, in telephone kiosks, the interior of the kiosk has a limited damping characteristic and is reverberant, which results in resonant behavior. Again, this causes the acoustic echo path to vary if the speaker moves around the kiosk, or indeed with any movement of the air. Acoustic echo is becoming an increasingly important issue because of the growing use of hands-free telephones.
The effect of the overall echo or reflection path is to attenuate, delay and filter a signal. The echo path depends on the line, the switching route and the type of telephone. This means that the transfer function of the reflection path can vary between calls, because any of the line, the switching route and the telephone may change from call to call as different switching equipment is selected to make the connection.

Various techniques are known for improving echo control in person-to-person speech communication systems. There are three main techniques. First, insertion losses can be added in the speaker's transmission path to reduce the level of the outgoing signal. However, insertion losses can make the received signal intolerably low for the listener. Second, echo suppressors operate on the principle of detecting the signal levels in the transmit and receive paths and then comparing the levels to determine how to operate switchable insertion loss pads. High attenuation is placed in the transmit path when speech is detected on the receive path. Echo suppressors are usually used on longer-delay connections, such as international telephone links, where suitable fixed insertion losses would be insufficient. Third, echo cancellers are voice-operated devices that use adaptive signal processing to reduce or eliminate echo by estimating an echo path transfer function. An outgoing signal is fed into the device and the resulting output signal is subtracted from the received signal. Provided that the model is representative of the real echo path, the echo should theoretically be cancelled. However, echo cancellers suffer from stability problems and are computationally expensive. Echo cancellers are also very sensitive to noise bursts during training.

An example of an automatic speech system is the telephone answering machine, which records the messages that a caller leaves.
Usually, when a user calls an automatic speech system, a prompt is played to the user that usually requires a response. An outgoing signal from the speech system is therefore passed along the transmission line to the loudspeaker of the user's telephone. The user then provides a response to the prompt, which is passed to the speech system, which then takes appropriate action. It has been proposed that allowing a caller to an automatic speech system to interrupt outgoing prompts from the system greatly improves the usability of the system for callers who are familiar with its dialogue. This facility is often called "barge-in" or "over-ride".

If a user speaks during a prompt, the spoken words may be preceded or corrupted by an echo of the outgoing prompt. Essentially clean, isolated vocabulary utterances from the user are thereby transformed into embedded vocabulary utterances (in which the vocabulary word is contaminated with additional sounds). In automatic speech systems that include automatic speech recognition, this reduces recognition performance because of the limitations of current speech recognition technology. If a user has never used the service provided by the automatic speech system, the user will need to listen to the prompts provided by the speech generator in their entirety. However, once the user has become familiar with the service and with the information required at each stage, the user may wish to provide the required response before the prompt has ended. If a speech recognizer or recording means is turned off until the prompt ends, no attempt will be made to recognize an early response from the user. If, on the other hand, the speech recognizer or recording means is turned on all the time, the input will include both the echo of the outgoing prompt and the response provided by the user.

Such a signal is unlikely to be recognized by a speech recognizer. Voice activity detectors (VADs) have therefore been developed to detect voice activity on the path. Known voice activity detectors rely on generating an estimate of the noise in an input signal and comparing the input signal with that estimate, which is either fixed or updated during periods in which there is no speech. Examples of such voice-activated systems are described in U.S. Patent No. 5155760 and U.S. Patent No. 4410763. Voice activity detectors are used to detect speech in the input signal, and to interrupt the outgoing prompt and turn on the recognizer when such speech is detected. The user then hears a shortened prompt. This is satisfactory if the user has in fact interrupted. If, however, the voice activity detector has detected speech incorrectly, the user will hear a shortened prompt and will have no instructions on how to proceed with the system. This is obviously undesirable.

The present invention provides a speech activity detector for use with a speech system, the speech activity detector comprising an input for receiving an output speech signal transmitted from a speech system to a user and an input for receiving an input signal from the user, both the input and output signals being divided into time-limited frames, means for calculating an aspect of each frame of the input signal, means for forming a function of the calculated aspect and a threshold and, based on the function, determining whether or not the input signal includes speech, characterized in that means are provided to determine the echo return loss during an output speech signal from the interactive speech system and to control the threshold depending on the measured echo return loss. The echo return loss is derived from the difference between the level of the output signal and the level of the echo of the output signal received by the speech activity detector.
The echo return loss is a measure of the attenuation of the outgoing prompt by the transmission path. Controlling the threshold on the basis of the measured echo return loss reduces the number of times the speech activity detector triggers on a line that has a large amount of echo, including occasions when the user does give a response. Although this seems unattractive, it should be appreciated that it is preferable for the voice activity detector not to trigger when the user interrupts than for it to trigger when the user has not interrupted, which would leave the user with a shortened prompt and no further help.

The threshold can be a function of the echo return loss and the maximum possible power of the output signal. Both are long-term characteristics of the line (although the echo return loss can be re-measured from time to time). Preferably the threshold is the difference between the maximum power and the echo return loss. Alternatively, it may be preferred that the threshold be a function of the echo return loss and the calculated aspect of each frame of the output speech signal (i.e. the threshold represents an attenuated version of each frame of the output signal).

Preferably the calculated aspect is the average power of each frame of a signal, although other aspects, such as frame power, may be used. More than one aspect of the input signal can be calculated and several functions can be formed. The voice activity detector may also include data relating to statistical models that represent the calculated aspect for at least one signal containing speech substantially free of noise and for a noise signal, the function of the calculated aspect and the threshold being compared with the statistical models. The statistical models of the noise signal may represent line noise and/or typical background noise and/or an echo of the output signal.
According to the invention, a voice activity detection method is also provided, comprising receiving an output speech signal transmitted from a speech system to a user and receiving an input signal from the user, both the input and output signals being divided into time-limited frames, calculating an aspect of each frame of the input signal, forming a function of the calculated aspect and a threshold and, based on the function, determining whether or not the input signal includes speech, characterized by measuring the echo return loss during an output speech signal from the speech system and controlling the threshold depending on the measured echo return loss. Preferably, the threshold is a function of the echo return loss and the maximum possible power of the output signal. As mentioned above, the threshold can instead be a function of the echo return loss and the same aspect calculated from a frame of the output speech signal. The calculated aspect can be the average power of each frame of a signal.

The invention will now be described by way of example with reference to the accompanying drawings, in which:

Figure 1 shows an automatic speech system including a speech activity detector according to the invention; and

Figure 2 shows the components of a speech activity detector according to the invention.

Figure 1 shows an automatic speech system 2, which includes a speech activity detector according to the invention, connected through a public switched telephone network to a user terminal, which is usually a telephone 4. The automatic speech system is preferably located in an exchange in the network. The automatic speech system 2 is connected to a hybrid transformer 6 through an output line 8 and an input line 10. The user's telephone is connected to the hybrid through a two-way line 12. Echoes in the PSTN are essentially caused by electrical and/or acoustic coupling, for example at the four-wire to two-wire interface in the hybrid transformer 6 (indicated by arrow 7). Acoustic coupling in the telephone set 4, from the earpiece to the microphone, causes acoustic echo (indicated by arrow 9).

The automatic speech system 2 comprises a speech generator 22, a speech recognizer 24 and a speech activity detector (VAD) 26. The types of speech generator 22 and speech recognizer 24 will not be discussed further because they do not form part of the invention. It will be clear to the person skilled in the art that any suitable speech generator, for example one using text-to-speech technology or pre-recorded messages, may be used. Likewise, any suitable type of speech recognizer may be used.

In use, when a user calls the automatic speech system, the speech generator 22 plays a prompt to the user, which usually requires a response. An outgoing speech signal is therefore passed from the speech system along the transmission line 8 to the hybrid transformer 6, which passes the signal to the loudspeaker of the user's telephone 4.
At the end of the prompt, the user provides a response that is passed to the speech recognizer 24 through the hybrid 6 and the input line 10. The speech recognizer 24 then attempts to recognize the response and takes appropriate action in response to the recognition result. If the user has never used the service provided by the automatic speech system, the user will need to hear the prompts provided by the speech generator 22 in their entirety. However, once the user has become familiar with the service and with the information required at each stage, the user may wish to provide the appropriate response before the prompt ends. If the speech recognizer 24 is turned off until the prompt ends, no attempt will be made to recognize the user's early response. If, on the other hand, the speech recognizer 24 is on all the time, the input to the speech recognizer will include both the echo of the outgoing prompt and the response provided by the user. Such a signal is very unlikely to be recognized by the speech recognizer. The speech activity detector 26 is provided to detect direct speech (i.e. the speech of the user) in the input signal. The speech recognizer 24 is kept in an inoperative mode until speech is detected by the speech activity detector 26.

An output signal from the speech activity detector 26 passes to the speech generator 22, which is then interrupted (shortening the prompt), and to the speech recognizer 24, which is activated in response.

Figure 2 shows the speech activity detector 26 of the invention in more detail. The speech activity detector 26 has an input 260 for receiving the outgoing prompt signal from the speech generator 22 and an input 261 for receiving the signal received through the input line 10. For each signal, the voice activity detector includes a frame sequencer 262 that divides the signal into data frames comprising 256 contiguous samples. Because speech energy is relatively static for 15 milliseconds, frames of 32 ms are preferred, with an overlap of 16 ms between adjacent frames. This has the effect of making the VAD more robust with respect to impulsive noise. Each data frame is then passed to an aspect generator 263 that calculates the average power of the frame. The average power of a frame of a signal is determined by the following equation:

$$P_{av} = 10 \log_{10}\left(\frac{1}{N}\sum_{n=0}^{N-1} x(n)^2\right)$$

where $x(n)$ is the n-th sample of the frame and N is the number of samples in a frame, in this case 256.

The echo return loss is a measure of the attenuation, that is, the difference (in decibels) between the outgoing signal and the reflected signal. The echo return loss (ERL) is the difference between the aspects calculated for the outgoing prompt and for the returned echo, that is:

$$\mathrm{ERL} = P_{av}|_{\text{outgoing prompt}} - P_{av}|_{\text{returned echo}} = 10 \log_{10}\left(\frac{\sum_{i=1}^{N} P_i|_{\text{outgoing prompt}}}{\sum_{i=1}^{N} P_i|_{\text{returned echo}}}\right)$$

where $P_i = x(i)^2$ is the power of the i-th sample and N is the number of samples over which the average power is calculated. N should be as high as possible.

As can be seen from Figure 2, the echo return loss is determined by subtracting the average power of a frame of the returned echo from the average power of a frame of the outgoing signal. This is achieved by exciting the transmission path 8, 10 with a prompt from the system, such as a welcome prompt.
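The framing and average-power steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the 8 kHz sampling rate is only implied by 256 samples spanning 32 ms, and the function names and the small `floor` guard against log(0) are assumptions introduced here.

```python
import math

FRAME_LEN = 256  # samples per frame (32 ms if the sampling rate is 8 kHz)
HOP = 128        # a 16 ms overlap leaves a 16 ms (128-sample) step between frames


def split_into_frames(samples, frame_len=FRAME_LEN, hop=HOP):
    """Split a sample sequence into overlapping frames.

    A trailing partial frame is dropped (an assumption; the text does not say).
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def average_power_db(frame, floor=1e-12):
    """Average frame power: 10*log10 of the mean squared sample value.

    `floor` is a hypothetical guard so that an all-zero frame does not
    raise a math domain error.
    """
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))


if __name__ == "__main__":
    signal = [10.0] * 512
    frames = split_into_frames(signal)
    print(len(frames))                  # 3 frames, starting at samples 0, 128, 256
    print(average_power_db(frames[0]))  # 20.0 (mean square 100 -> 20 dB)
```

Both the outgoing prompt and the received signal would pass through the same two steps, which gives one power value in decibels per frame for each signal.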
The levels of the outgoing prompt signal and of the returned echo are then calculated as described above by the frame sequencers 262 and the aspect generators 263. The resulting signal levels are subtracted by subtractor 264 to form the echo return loss. The echo return loss is then subtracted by the subtractor 265 from the maximum possible power for the transmission path, i.e. the subtractor 265 calculates the threshold signal:

$$\text{Threshold} = \text{Maximum Possible Power} - \mathrm{ERL}$$

The typical echo return loss is about 12 dB, although the range is of the order of 6-30 dB. The maximum possible power of a signal on a telephone line is around 72 dB.
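As a worked illustration of subtractors 264 and 265, the sketch below derives the ERL from per-frame power levels (in dB) and then forms the fixed threshold. The function names are hypothetical, and averaging the per-frame differences over several frames is an assumption; the text only says that the two levels are subtracted.

```python
MAX_POWER_DB = 72.0  # approximate maximum signal power on a telephone line (from the text)


def echo_return_loss_db(prompt_power_db, echo_power_db):
    """ERL in dB: outgoing prompt level minus returned echo level.

    Both arguments are equal-length lists of per-frame average powers in dB.
    The per-frame differences are averaged (an assumption made here).
    """
    if not prompt_power_db or len(prompt_power_db) != len(echo_power_db):
        raise ValueError("need matching, non-empty lists of frame powers")
    differences = [p - e for p, e in zip(prompt_power_db, echo_power_db)]
    return sum(differences) / len(differences)


def fixed_threshold_db(erl_db, max_power_db=MAX_POWER_DB):
    """Threshold = maximum possible power - ERL (the role of subtractor 265)."""
    return max_power_db - erl_db


if __name__ == "__main__":
    erl = echo_return_loss_db([60.0, 62.0], [48.0, 50.0])
    print(erl)                      # 12.0, the typical value quoted in the text
    print(fixed_threshold_db(erl))  # 60.0
```

With the typical figures quoted above, a 12 dB ERL on a line with a 72 dB maximum gives a threshold of 60 dB, which is the loudest echo the line could return.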

The ERL is calculated from approximately the first 50 frames of the outgoing prompt, although more or fewer frames can be used. Once the ERL has been calculated, the switch 267 is switched to pass the data relating to the input line to the subtractor 266. For the rest of the call, the threshold signal is then subtracted by the subtractor 266 from the average power of each frame of the input signal. The output of the subtractor 266 is therefore:

$$P_{av}|_{\text{input signal}} - (\text{Maximum Possible Power} - \mathrm{ERL})$$

The output of the subtractor 266 is passed to a comparator 268, which compares it with a threshold. If the result is above the threshold, the input signal is considered to include the direct speech of the user, and a signal is output from the speech activity detector to deactivate the speech generator 22 and activate the speech recognizer 24. If the result is below the threshold, no signal is output from the speech activity detector and the speech recognizer remains inoperative.

In another embodiment of the invention, the output of the subtractor 266 is passed to a classifier (not shown) that classifies the input signal as speech or non-speech. This can be achieved by comparing the output of the subtractor 266 with statistical models that represent the same aspect for typical speech and non-speech signals.

In a further embodiment, the threshold signal is formed according to the following equation:

$$\text{Threshold} = P_{av}|_{\text{outgoing prompt}} - \mathrm{ERL}$$

The resulting threshold signal is input to the subtractor 266 to form the result:

$$P_{av}|_{\text{input signal}} - (P_{av}|_{\text{outgoing prompt}} - \mathrm{ERL})$$

The echo return loss is calculated at the beginning of at least the first prompt from the speech system. The echo return loss can be calculated from a single frame if necessary, since it is calculated on a frame-by-frame basis. Therefore, even if a user speaks almost immediately, it is still possible to calculate the echo return loss.
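The decision stage (subtractor 266 followed by comparator 268) can be sketched for both threshold variants as below. The function names are hypothetical, and the comparator level `margin_db` is an assumed parameter: the text does not give the value against which comparator 268 compares the result, so 0 dB is used purely for illustration.

```python
def fixed_threshold_db(erl_db, max_power_db=72.0):
    """First variant: threshold = maximum possible power - ERL."""
    return max_power_db - erl_db


def per_frame_threshold_db(prompt_power_db, erl_db):
    """Further variant: threshold = average power of the current prompt frame - ERL.

    This is the level at which the echo of that prompt frame is expected to return.
    """
    return prompt_power_db - erl_db


def user_speech_detected(input_power_db, threshold_db, margin_db=0.0):
    """Subtract the threshold from the input frame power and compare the result.

    `margin_db` stands in for the unspecified comparator threshold (assumption).
    """
    return (input_power_db - threshold_db) > margin_db


if __name__ == "__main__":
    erl = 12.0  # typical ERL quoted in the text

    # Fixed threshold: 72 - 12 = 60 dB
    print(user_speech_detected(55.0, fixed_threshold_db(erl)))  # False
    print(user_speech_detected(65.0, fixed_threshold_db(erl)))  # True

    # Per-frame threshold: a 50 dB prompt frame is expected to echo at 38 dB
    print(user_speech_detected(45.0, per_frame_threshold_db(50.0, erl)))  # True
    print(user_speech_detected(36.0, per_frame_threshold_db(50.0, erl)))  # False
```

The example shows the trade-off between the two variants: a 45 dB input frame is rejected by the fixed 60 dB threshold but accepted by the per-frame threshold, because the latter tracks how loud the prompt actually is at that moment.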
The frame sequencers 262 and the aspect generators 263 have been described as an integral part of the speech activity detector. It will be clear to the person skilled in the art that this is not an essential aspect of the invention, and either or both of these may be separate components. It is also not necessary for a separate frame sequencer and aspect generator to be provided for each signal. A single frame sequencer and a single aspect generator may be sufficient to generate an aspect from each signal.

Claims

1. A speech activity detector for use with a speech system, the speech activity detector comprising an input to receive an output speech signal transmitted from the speech system to a user and an input to receive an input signal from the user, both output and input signals being divided into time-limited frames, means for calculating an aspect from each frame of the input signal, means for forming a function of the calculated aspect and a threshold and, based on the function, determining whether or not the input signal includes speech, characterized in that means are provided to determine the echo return loss during an output speech signal from the speech system and to control the threshold depending on the measured echo return loss.

2. The speech activity detector according to claim 1, characterized in that the threshold is a function of the echo return loss and the maximum possible power of the output signal.

3. The speech activity detector according to claim 1, characterized in that the threshold is a function of the echo return loss and an aspect calculated from a frame of the output speech signal.

4. The speech activity detector according to any of claims 1, 2 or 3, characterized in that the calculated aspect is the average power of each frame of a signal.

5. The speech activity detector according to any of the preceding claims, characterized in that it further comprises data relating to statistical models representing the calculated aspect for at least one signal containing speech substantially free of noise and for a noise signal, the function of the calculated aspect and the threshold being compared with the statistical models.

6. The speech activity detector according to claim 5, characterized in that the statistical models of the noise signal represent line noise, typical background noise and / or an echo of the output signal.

7. A method of detecting speech activity which comprises receiving an output signal transmitted from a speech system to a user and receiving an input signal from the user, both output and input signals being divided into time-limited frames, calculating an aspect of each frame of the input signal, forming a function of the calculated aspect and a threshold and, based on the function, determining whether or not the input signal includes speech, characterized by measuring the echo return loss during an output speech signal from the speech system and controlling the threshold depending on the measured echo return loss.

8. The method according to claim 7, characterized in that the threshold is a function of the echo return loss and the maximum possible power of the output signal.

9. The method according to claim 7, characterized in that the threshold is a function of the echo return loss and the same aspect calculated from a frame of the output speech signal.

10. The method according to any of claims 7 to 9, characterized in that the calculated aspect is the average power of each frame of a signal.