AU662616B2

AU662616B2 - Speech detection circuit

Info

Publication number: AU662616B2
Application number: AU36710/93A
Authority: AU
Inventors: Jose Fernando Chicharo; Andrew Perkis; Bernt Ribbum
Original assignee: Alcatel Australia Ltd
Current assignee: Nokia Services Ltd
Priority date: 1992-04-06
Filing date: 1993-04-02
Publication date: 1995-09-07
Anticipated expiration: 2013-04-02
Also published as: AU3671093A

Description

P/00/011 28/5/91 Regulation 3.2 AUSTRALIA Patents Act 1990

ORIGINAL

COMPLETE SPECIFICATION STANDARD PATENT Invention Title: "SPEECH DETECTION CIRCUIT"

The following statement is a full description of this invention, including the best method of performing it known to us:

Technical Field This invention relates to a method of and apparatus for identifying the presence of speech signals, and is of particular application where the speech signals may be intermingled with other signals. A specific application of the invention is the discrimination of speech and tone signals in telephone circuits.

Figure 1 is a chart of typical characteristics of telephone signal tones.

Background Art The article "Telephone Answer Detection for Automated Answering Systems" Western Electronic Show and Convention, WESCON, Conference Record SS/2, 1985 by RM. Baugerter describes a method for detecting the onset of speech from a preceding tone sequence based on the count and processing of inflection points of the input signal.

Disclosure of the Invention This invention makes use of the fact that speech signals exhibit a pitch element having a long-term correlation trajectory characteristic of speech. The invention proposes to make use of the pitch correlation characteristic to detect the presence of speech.

In the context of a telephone system, a correlation threshold can be set to discriminate speech from background noise and signal tones. In one embodiment the threshold may be varied in response to the detection of signals on the line.

To improve the discrimination between speech and tone signals, a further embodiment of the invention includes obtaining a measure of the wideband signal energy from which the tone signals have been filtered. A technique for accurately filtering the tone signals is disclosed in our co-pending Australian Patent Application No. PL 1733 entitled "A Tone Filter".

Preferably the signal processing is carried out digitally. Typically, telephone signals are sampled at 8 kHz.

Brief Description of the Drawings Figure 1 is a chart of typical telephone signalling tones.

Figure 2(a) shows a speech signal spectrum.

Figure 2(b) shows the pitch lag for the signal in Figure 2(a).

Figure 2(c) shows the auto-correlation for the signal in Figure 2(a).

Figure 3 shows search windows for detecting pitch lag.

Figure 4 shows a typical graph of correlation vs. pitch period.

Figure 5 shows the adjusted pitch windows used in re-evaluation.

Figure 6 shows a block diagram of an arrangement embodying the present invention.

Figure 7 shows a block diagram of an arrangement including a speech detector embodying the invention and a tone detector.

Figures 8(a) to 8(d) are flow diagrams illustrating the operation of the present invention.

Figure 9 shows a further embodiment of the invention.

Figure 10 shows a flow chart setting out the operation of the embodiment of Fig. 9.

Best Mode of Carrying out the Invention.

The invention will now be described with reference to the drawings.

The chart of Figure 1 specifies the tolerances for live voice and allowable noise in row 8. The frequency and frequency deviation columns indicate the broad range of frequencies covered by speech signals, and the tight tolerances placed on the tone signals.

With reference to Figure 6 the signals on line 60 are digitised in A/D converter 61 and stored in memory 62 under the control of processor 64. The digitised signals are stored in frames of 16 bits. The stored information is then analysed in pitch detector 63 as described below.

The wideband energy of the digitised signals is measured in a process schematically represented at 65. Essentially the measurement of wideband energy consists in filtering the tone signals out of the digitised signals and squaring the remainder to obtain a measure of the energy.

The outputs of the pitch detector 63 and the energy measuring circuit 65 are applied to decision logic 66, which assesses whether or not the signal is speech.

In the embodiment shown in Figure 7, the digitised input signal is also applied to an array 70 of tone filters 71 as described in our co-pending application. The output of these filters is also applied via squaring means 72 to the decision circuit 73 so individual tones can be identified by comparison of the signal levels.

The pitch detector 76 includes autocorrelation means 77, selection means 78 to select the maximum from the autocorrelation means 77, pitch selection means 79, and means to update parameters 79a.

The operation of the speech recognition circuit will now be described in more detail.

According to the Dual-Input LPC model, a voiced segment of speech can be modelled as a pulse train that is frequency shaped by an autoregressive filter.

Without regard to the filter, a long term correlation analysis can be made directly on the signal to extract information about the periodicity. In selecting a suitable method for measuring this, a balance must be struck to keep the complexity down, as an autocorrelation search is a time-consuming process.

A recent and accurate method for pitch detection is one based on a method suggested by Medan, Yair and Chazan (IEEE Transactions on Signal Processing, Vol. 39, pp. 40-48, January 1991), here referred to as the Superpitch method.

This method is based on measuring the autocorrelation in a signal, with a frame length equal to the pitch period. As an example, when the incoming signal is x, a normalised correlation factor ρN is calculated for a suspected pitch period of N as follows: [equation not reproduced in this text] where LAST is the index of the most recent sample in the frame buffers.

All calculations may be performed in 16/32 bit fixed point arithmetic, avoiding division and root extractions.

The correlation coefficients are calculated for a set of possible pitch lags (or periods) ranging from 20 through 135 samples, where only every third possible lag is tested due to the complexity involved in the calculation. That is, the autocorrelation is calculated for N = 20, 23, ..., 134. While the discarding of 2/3 of the samples causes some loss of information, this is done without low-pass filtering of the input signal and this leads to fewer false detections on non-speech signals. A speech segment is deemed voiced if the maximum value of ρ exceeds a predetermined threshold (see below). The limits for the pitch search are 20 and 134. The lower limit of 20 is chosen since it corresponds to a fundamental frequency of 400 Hz (at a sampling rate of 8 kHz), which is well above what is expected from the human vocal tract. The upper limit of 135 is half the number of buffered samples (270); the algorithm cannot search longer intervals without buffering more samples. This interval corresponds to a fundamental frequency of 59.3 Hz.
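The lag grid and the frequency limits quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the names `FS`, `lag_grid` and `lag_to_hz` are illustrative and do not come from the specification.

```python
FS = 8000  # sampling rate in Hz, as stated in the description


def lag_grid(first=20, last=135, step=3):
    """Candidate pitch lags: every third lag from `first` up to at most `last`."""
    return list(range(first, last + 1, step))


def lag_to_hz(lag, fs=FS):
    """Fundamental frequency corresponding to a pitch lag given in samples."""
    return fs / lag


lags = lag_grid()
print(len(lags), lags[0], lags[1], lags[-1])    # 39 20 23 134
print(lag_to_hz(20), round(lag_to_hz(135), 1))  # 400.0 59.3
```

With a step of three, the last lag actually tested is 134, which is why the text gives both 134 and 135 as upper limits.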

Note that the segments used in the calculations are right-aligned, to allow the most recent samples to be used for the measures. This is depicted in Figure 3.
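The correlation formula itself is not reproduced in this text, so the sketch below is an assumption rather than the patented calculation. It computes an ordinary normalised cross-correlation between the most recent N samples and the N samples immediately before them, both right-aligned on the newest sample. It uses floating point and a square root for readability, whereas the specification states that the fixed point implementation avoids division and root extraction. The function name `normalised_correlation` is illustrative.

```python
import math


def normalised_correlation(x, n):
    """Correlation between the last n samples of x and the n samples before them.

    Assumed form (not taken from the specification):
        sum(a * b) / sqrt(sum(a * a) * sum(b * b))
    Returns 0.0 when either segment has no energy.
    """
    if n <= 0 or len(x) < 2 * n:
        raise ValueError("need at least 2*n buffered samples")
    recent = x[len(x) - n:]                  # ends at the newest sample
    earlier = x[len(x) - 2 * n:len(x) - n]   # the n samples before that
    num = sum(a * b for a, b in zip(recent, earlier))
    den = math.sqrt(sum(a * a for a in recent) * sum(b * b for b in earlier))
    return num / den if den > 0 else 0.0


# A 200 Hz tone sampled at 8 kHz repeats every 40 samples.
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(270)]
print(round(normalised_correlation(tone, 40), 3))  # close to 1 at the true period
print(round(normalised_correlation(tone, 20), 3))  # close to -1 at half the period
```

A buffer of 270 samples is just long enough for the largest lag of 135, which matches the buffer size given in the description.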

The maximum value of ρ will in some cases not correspond to the 'true' pitch period, but rather a harmonic. Finding the correct pitch period is important, especially when pitch tracking is involved. Therefore, an array of the different ρN values is kept, one for each possible lag value in the range indicated above. If more than one of the ρN are above the threshold, this array is subsequently searched for local maxima N1 ... NP (Figure 4), and the different ρ's are re-evaluated for those N, but now with a period length equal to that of NP (Figure 5). Using this extended length, the pitch period is chosen as the smallest N for which ρN exceeds the set threshold.
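One possible reading of this re-evaluation step is sketched below in Python. The helper `reevaluate(lag, length)` is hypothetical: it stands for recomputing the correlation at `lag` over a window of `length` samples. The text leaves open exactly which window length is used, so taking the largest candidate lag is an assumption, as is returning 0 when no candidate survives.

```python
def local_maxima(corr_by_lag):
    """Lags whose correlation is strictly greater than that of both grid
    neighbours (end points only need to beat their single neighbour)."""
    lags = sorted(corr_by_lag)
    peaks = []
    for i, lag in enumerate(lags):
        left = corr_by_lag[lags[i - 1]] if i > 0 else float("-inf")
        right = corr_by_lag[lags[i + 1]] if i + 1 < len(lags) else float("-inf")
        if corr_by_lag[lag] > left and corr_by_lag[lag] > right:
            peaks.append(lag)
    return peaks


def choose_pitch(corr_by_lag, threshold, reevaluate):
    """Smallest candidate lag that still exceeds the threshold after
    re-evaluation over the extended window, or 0 if there is none."""
    candidates = [n for n in local_maxima(corr_by_lag)
                  if corr_by_lag[n] > threshold]
    if not candidates:
        return 0
    if len(candidates) == 1:
        return candidates[0]
    length = max(candidates)  # assumed extended window length
    for n in candidates:      # ascending order, so the first hit is the smallest
        if reevaluate(n, length) > threshold:
            return n
    return 0                  # assumption: no survivor means no valid pitch


# Made-up correlation values with peaks at lags 23, 41 and 80.
example = {20: 0.1, 23: 0.2, 26: 0.1, 41: 0.9, 44: 0.5, 80: 0.95, 83: 0.3}
print(choose_pitch(example, 0.77, lambda n, length: 0.9))  # 41
```

In the example the peak at lag 80 has the highest raw correlation, but lag 41 is chosen because it is the smallest lag that passes the second test.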

As mentioned above a threshold is set, against which the ρ values are compared. Since pitch onset should be detected with high accuracy, a slight delay in the detection may be tolerated, and a fixed threshold, THIGH, is used for onsets, or Unvoiced/Voiced transitions. The value used for THIGH is 0.77. Once a voiced segment is detected, however, the threshold should be altered, so that the voiced decision is retained throughout the voiced part. This is done by setting the threshold for the next frame to a new value T = MAX(TMIN, ρ × TFACTOR), where TFACTOR is set to 0.77, and TMIN is set to 0.72. The value TMIN is used to ensure the threshold never falls too low.

An additional element is included to allow for V/V transitions, that is, where one voiced sound follows another, and where the autocorrelation factor falls momentarily low due to the slight change in the speech waveform. This means that after a voiced segment is established, ρ is allowed to fall below T during one segment only; the new threshold will be TTRANS, which is set to 0.70.

During a suspected V/V transition, the pitch tracking is turned off.
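The threshold rules described above can be summarised as a small state machine. The Python sketch below is one interpretation only: the constants are the values quoted in the text, `rho` stands for the frame's maximum correlation value, and the class name and the exact handling of the single tolerated dip are assumptions.

```python
T_HIGH = 0.77    # fixed onset threshold (unvoiced to voiced)
T_FACTOR = 0.77  # fraction of the current correlation used for the next threshold
T_MIN = 0.72     # floor for the adaptive threshold
T_TRANS = 0.70   # threshold applied after a suspected V/V dip


class VoicingThreshold:
    """Tracks the decision threshold from frame to frame (illustrative only)."""

    def __init__(self):
        self.threshold = T_HIGH
        self.voiced = False
        self.in_transition = False  # pitch tracking would be off while True

    def update(self, rho):
        """Classify one frame and adapt the threshold for the next frame."""
        if rho > self.threshold:
            self.voiced = True
            self.in_transition = False
            self.threshold = max(T_MIN, rho * T_FACTOR)
        elif self.voiced and not self.in_transition:
            # First dip inside a voiced run: tolerate it for one frame.
            self.in_transition = True
            self.threshold = T_TRANS
        else:
            # Unvoiced, or a second consecutive dip: back to the onset threshold.
            self.voiced = False
            self.in_transition = False
            self.threshold = T_HIGH
        return self.voiced


tracker = VoicingThreshold()
print([tracker.update(r) for r in [0.5, 0.9, 0.6, 0.71, 0.6, 0.6]])
# [False, True, True, True, True, False]
```

The run shows a voiced decision surviving one low frame, recovering against the lower transition threshold, and ending only after two low frames in a row.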

The pitch analysis is made on the most recent 3 signal frames, which are stored in memory. Due to the fixed point implementation, signal scaling is necessary both to retain accuracy and to avoid overflows. Each new frame is checked for maximum amplitude, and a scale factor is attached to the frame, stating how the frame should be scaled to obtain a maximum amplitude close to 0.125 of the absolute max for an input sample. This is the optimum scaling for the frame, allowing for maximum accuracy with guaranteed no overflow.

However, since the three frames that are searched must be scaled to the same factor, a re-scaling of past frames takes place whenever necessary, to at any point ensure the maximum allowable signal amplitude.

For example, if at one particular point the three previous frames in the buffers did not need any scaling, the samples buffered are all unscaled. If the next frame is of low amplitude and needs to be scaled UP, no scaling will be done since an overflow might occur from the buffered frames. If, on the contrary, the amplitude of the new buffer is too high, the previous frames will be down-scaled by the same factor as the new frame.
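The scaling rule in this example can be sketched as follows. This is an illustration under stated assumptions, not the specified implementation: samples are taken to be 16-bit integers, scale factors are taken to be powers of two expressed as shift counts, and the names `TARGET_PEAK`, `MAX_SHIFT`, `frame_shift`, `common_shift` and `apply_shift` are hypothetical.

```python
TARGET_PEAK = 4096  # 0.125 of 32768; 16-bit samples are an assumption
MAX_SHIFT = 12      # shift that would bring a peak of 1 up to TARGET_PEAK


def frame_shift(frame):
    """Largest power-of-two shift (negative means scale down) that keeps
    the frame's peak amplitude at or below TARGET_PEAK."""
    peak = max((abs(s) for s in frame), default=0)
    if peak == 0:
        return MAX_SHIFT  # a silent frame places no limit on the others
    shift = 0
    while peak * 2 ** (shift + 1) <= TARGET_PEAK:
        shift += 1
    while peak * 2 ** shift > TARGET_PEAK:
        shift -= 1
    return shift


def common_shift(frames):
    """All buffered frames share one factor, so the most restrictive wins."""
    return min(frame_shift(f) for f in frames)


def apply_shift(frame, shift):
    """Scale integer samples by 2**shift."""
    if shift >= 0:
        return [s << shift for s in frame]
    return [s >> -shift for s in frame]


print(common_shift([[4096], [3000], [100]]))    # 0: the quiet frame is not scaled up
print(common_shift([[4096], [3000], [20000]]))  # -3: the loud frame forces down-scaling
```

Taking the minimum shift reproduces both cases in the text: a quiet new frame leaves the buffer unscaled, and a loud new frame pulls the earlier frames down with it.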

After pitch has been deemed present over a number of frames, pitch tracking is turned on, and the range of pitch lags searched is limited to an area of 39 non-decimated values centred around the previously detected lag. The reason for this limitation is to lower the computational complexity only. In a normal pitch tracker this would help avoiding spurious 'glitches' in the pitch detection, but for this application this is of less concern.
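As a rough illustration, the tracking window could be generated as below. Clipping the window to the overall search limits of 20 and 135 is an assumption, since the text does not say what happens near the edges, and the function name `tracking_lags` is hypothetical.

```python
MIN_LAG, MAX_LAG = 20, 135  # overall search limits from the description


def tracking_lags(previous_lag, width=39):
    """Every lag in a window of `width` values centred on the previous lag,
    clipped to the overall search limits (the clipping is an assumption)."""
    half = width // 2
    low = max(MIN_LAG, previous_lag - half)
    high = min(MAX_LAG, previous_lag + half)
    return list(range(low, high + 1))


window = tracking_lags(60)
print(window[0], window[-1], len(window))  # 41 79 39
```

The decimated search over 20, 23, ..., 134 also tests 39 lags, so under this reading the tracked search evaluates the same number of candidates at single-sample resolution.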

A sample speech segment (samples taken from the Harvard Database; American Male voice) is shown in Figure 2(a). A plot of detected pitch lag, autocorrelation values, and thresholds is given in Figures 2(b) and 2(c) for the same segment. The correlation and threshold values in the figures are integer numbers scaled by a factor of 2^14 = 16384 from the floating point numbers given above. Pitch is detected when the autocorrelation value is above the threshold, as shown between the dotted lines in the figures. Voiced segments are detected in segments 96 and 111; this corresponds quite well with the waveform samples 8640 and 9990. In some frames we can see that the maximum autocorrelation value found is zero; this is due to the combined factors of: signal scaling (the signal amplitude is low in those frames, and especially scaling due to the neighbouring frames causes loss of accuracy); and the decimation used in the search, which will sometimes give negative values only for the autocorrelation; those are never processed and a zero is returned.
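The 2^14 scaling can be illustrated with a short Python sketch. The conversion of the quoted thresholds follows directly from the scale factor. The comparison function is an assumption: it shows one way a normalised correlation could be tested against a threshold without division or root extraction, by squaring both sides. Python integers do not overflow, so a 16/32 bit implementation would need additional scaling that is not shown here.

```python
Q = 14
ONE = 1 << Q  # 16384


def to_q14(value):
    """Floating point value to the nearest integer on the 2**14 scale."""
    return round(value * ONE)


def exceeds_threshold(num, energy_a, energy_b, threshold_q14):
    """Division-free, root-free test of
        num / sqrt(energy_a * energy_b) > threshold.
    Both sides are squared, which is valid because a non-positive
    numerator can never exceed a positive threshold."""
    if num <= 0:
        return False
    return num * num * ONE * ONE > threshold_q14 * threshold_q14 * energy_a * energy_b


print(to_q14(0.77), to_q14(0.72), to_q14(0.70))       # 12616 11796 11469
print(exceeds_threshold(8, 10, 10, to_q14(0.77)))     # True, since 0.8 > 0.77
print(exceeds_threshold(7, 10, 10, to_q14(0.77)))     # False, since 0.7 < 0.77
```

Returning False for a non-positive numerator mirrors the remark that negative autocorrelation values are never processed.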

The Speech Detection is based on a trajectory of the pitch measured over a number of signal frames. The main aim is to make sure that voiced speech (or pitch) is indeed present, and in addition make sure there is enough stability in the signal. Thus incorrect detection due to few, but high, autocorrelation values is substantially eliminated.

For each speech frame, both the Wideband Energy and the pitch lag are measured; a low maximum autocorrelation is signalled with a lag of zero.

These values are buffered for the last three frames. Speech is deemed present when the following statements are TRUE: the WBE is above the set minimum for all three frames; and a valid pitch is detected for all three frames.
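The three-frame rule can be expressed compactly. The Python sketch below assumes that the wideband energy is available in dBm and that a pitch lag of zero marks a frame without valid pitch, as stated above; the class name `SpeechDecision` and the example minimum of -35 dBm are illustrative.

```python
from collections import deque


class SpeechDecision:
    """Declares speech when every buffered frame has enough wideband
    energy and a valid (non-zero) pitch lag."""

    def __init__(self, min_wbe_dbm, frames=3):
        self.min_wbe_dbm = min_wbe_dbm
        self.history = deque(maxlen=frames)  # (wbe_dbm, pitch_lag) pairs

    def update(self, wbe_dbm, pitch_lag):
        self.history.append((wbe_dbm, pitch_lag))
        if len(self.history) < self.history.maxlen:
            return False  # not enough frames buffered yet
        return all(w > self.min_wbe_dbm and lag != 0
                   for w, lag in self.history)


decision = SpeechDecision(min_wbe_dbm=-35.0)
print([decision.update(-30.0, 40) for _ in range(3)])  # [False, False, True]
print(decision.update(-30.0, 0))                       # False: one frame lacks pitch
```

A single frame without pitch or with low energy blocks the decision until it has left the three-frame buffer.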

The flow charts in Figure 8 describe the operation of the arrangement of Figure 7. In particular: Figure 8(a) shows the steps involved in energy and pitch calculation; Figure 8(b) shows elimination of low energy tones and early signalling of off-hook tone; Figure 8(c) shows elimination of high twist tones and a check for offset Bong tone; Figure 8(d) shows the final Tone and Speech decision process.

By operating on tone energy information, wide band energy information and information on the periodic structure of the waveform (pitch information) obtained from the other subsystems, a decision is arrived at as to what (if any) signal is present. In many ways this can be considered as an expert system, i.e., present and past information is applied to a set of rules or conditions which categorise the current signal.

The accompanying flow chart (Figures 8(a) to 8(d)) is one representation of the decision circuit. The flow chart describes the majority of the logic used to determine the identity of the current signal. The flowchart is self explanatory.

Information which is available consists of: energies of each tone frequency; wideband energy (all energy except tone energy); and pitch period.

The system starts by determining whether the current output should continue to signal that speech is present. This is accomplished by checking that speech was signalled in the last frame and that the mean wide band energy for the last three frames continues to remain above the noise threshold (set to -48 dBm). Thus, once all residual wideband energy has disappeared, the circuit is enabled for further speech or tone detection.
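The continuation check can be sketched as follows. The text does not say whether the mean is taken over linear power or over dB values; the sketch assumes linear power, converted through the standard definition of dBm as power relative to 1 mW. The function names are illustrative.

```python
import math

NOISE_THRESHOLD_DBM = -48.0  # noise threshold quoted in the description


def dbm_to_mw(dbm):
    """Power in milliwatts for a level in dBm."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw):
    """Level in dBm for a positive power in milliwatts."""
    return 10 * math.log10(mw)


def speech_continues(speech_last_frame, wbe_dbm_last3):
    """True if speech was signalled in the last frame and the mean wideband
    energy of the buffered frames is still above the noise threshold."""
    if not speech_last_frame:
        return False
    mean_mw = sum(dbm_to_mw(v) for v in wbe_dbm_last3) / len(wbe_dbm_last3)
    return mw_to_dbm(mean_mw) > NOISE_THRESHOLD_DBM


print(speech_continues(True, [-40.0, -45.0, -50.0]))  # True
print(speech_continues(True, [-60.0, -60.0, -60.0]))  # False
```

Averaging in the linear domain lets one strong frame keep the mean up, which suits a check that is meant to wait until residual energy has died away.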

The energy associated with each tone needs to be determined in order to establish which, if any, is present. Since the tones consist of either a single or several frequencies, the tone energy is determined by adding the relevant energies measured by the narrow band signal detector. Subsequently, the tone with the greatest energy is chosen.
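A minimal sketch of this selection step is given below. The tone names and frequencies in `tone_table` are placeholders invented for the example and are not taken from Figure 1; `band_energy` stands for the per-frequency outputs of the narrow band signal detector.

```python
def strongest_tone(band_energy, tone_table):
    """Sum the measured energy of each tone's component frequencies and
    return the tone with the greatest total together with that total."""
    totals = {
        tone: sum(band_energy.get(f, 0.0) for f in freqs)
        for tone, freqs in tone_table.items()
    }
    best = max(totals, key=totals.get)
    return best, totals[best]


# Placeholder table: two dual-frequency tones and one single-frequency tone.
tone_table = {
    "tone_a": (350, 440),
    "tone_b": (480, 620),
    "single_tone": (440,),
}
band_energy = {350: 1.0, 440: 2.0, 480: 0.5, 620: 0.1}
print(strongest_tone(band_energy, tone_table))  # ('tone_a', 3.0)
```

Because tones may share a component frequency, as `tone_a` and `single_tone` do here, the later validity tests are still needed to separate them.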

The second section simply sets a flag indicating that sufficient wideband energy is present for speech detection to occur. Note that in principle this threshold is tone dependent, i.e., the threshold level depends on which tone has the maximum energy. In fact all tones except the single frequency tones (Ready L, and Carrier) are set to the same level (approximately -35 dBm, the minimum speech energy) while the single frequency tones set a slightly lower threshold of -42 dBm. This extra leeway helps reduce false detections with low energy voiced speech (both single frequency tones are centred in high speech activity regions).

The following parts test the validity of the tone. A number of tests are made on the energy to determine if it falls within the given specifications. The prime motivation for such tight checks is to distinguish tones from speech with the assumption that if the signal meets all the specifications for a tone, the probability of the signal being speech is low.

This energy is compared with a threshold to determine whether the energy is above the minimum for that tone (minimum tone signal levels range from -20 dBm to -52 dBm). If the energy is below the minimum, the detection is set to none.

If the chosen tone is the Dial tone, the energy of the Bong tone is compared to a threshold. If this level is exceeded, the tone is set to Bong. This accounts for the possibility that a Bong tone may exist but the predominant energy lies within the Dial tone that accompanies it.

If the chosen tone is the Offhook tone, the wideband energy flag (SP) is set to zero. Due to the high Offhook tone amplitude (0 to -20 dBm) any deviation of frequency would lift the wideband energy considerably. Such an occurrence would indicate that speech is possibly present, which would prevent tone signalling. Thus, the SP flag is set to zero upon the detection of any Offhook energy. By using this method, the notch filters centred on the four Offhook frequencies in the wideband energy filter are unnecessary, thus reducing the complexity of this section by a third.

The Ring Back and Busy tones are difficult to isolate due to a number of reasons: their energies lie in regions of speech that typically contain high amounts of energy/activity; the specified amplitude levels are very low (-47 and -52 dBm); and relatively large frequency offsets are possible.

The first problem is solved by conducting a number of tests based on the tone energy itself and the wideband energy. Should the wideband energy flag (SP) be set, the tone is immediately declared not present. In addition, if sufficiently high wideband energy is present (-42 dBm) and this energy is greater than the tone energy, the tone is set to none. This accounts for the possibility of very low energy voiced speech having sufficient Ringback or Busy tone energy. The second and third difficulties are overcome by careful adjustment of the notch filter bandwidths.

A further elimination stage utilises the signal twist characteristics. If the twist of the selected tone exceeds the specifications, the tone is declared not present. If the current tone is the Bong tone, a check is made to ensure the twist of the accompanying Dial tone is within limits.
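The specification does not define twist. The sketch below assumes the usual telephony meaning, namely the level difference in dB between the two frequency components of a dual-frequency tone. The limit passed as `max_twist_db` is a hypothetical parameter, not a value from the specification.

```python
import math


def twist_db(energy_a, energy_b):
    """Level difference in dB between two components (linear power inputs)."""
    return abs(10 * math.log10(energy_a / energy_b))


def twist_ok(energy_a, energy_b, max_twist_db):
    """True if both components are present and their twist is within limits."""
    if energy_a <= 0 or energy_b <= 0:
        return False  # a missing component is treated as failing the check
    return twist_db(energy_a, energy_b) <= max_twist_db


print(round(twist_db(2.0, 1.0), 2))  # 3.01
print(twist_ok(10.0, 1.0, 6.0))      # False: 10 dB exceeds the 6 dB limit
```

A twist test of this kind rejects speech that happens to put energy at only one of a tone's two frequencies.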

The Bong tone presents a problem similar to that of the Offhook tone. A possibly large frequency deviation coupled with a relatively high amplitude will cause the wideband energy to be high (SP set). If this is the case and the current tone is, or the previous tone was, the Bong tone, and the wideband energy is less than the Bong or the Dial tone, the SP flag is reset to zero. This configuration also prevents a glitch occurring when the Bong tone is turned off after 100 ms, yet the Dial tone is still present. Normally, in this case, the wideband energy increases for a short time due to the falling step function.

The final stage bases a signalling decision on the current and past tone and speech detections. This stage is presented with a tone designator and an SP flag. If the SP flag is set, a check is made to ensure that the SP flag was set for the previous four frames. If this is true the algorithm checks that a valid pitch detection has occurred, otherwise a restart occurs. Finally, the pitch information from the last three frames is checked for zero components (a lag of zero signals that no valid pitch was found). If none have occurred, speech is considered present and the appropriate output signal is made.

If the SP flag is not set and the present tone is identical to the previous frame's tone, a check is made to ensure the tone is not the Ring Back or Busy tone. If it is not, the appropriate tone is signalled on the output. A check is made to ensure that if the Bong tone occurred it remains signalled until the Dial tone ceases. If the tone was either the Ring Back or the Busy tone and the SP flag has not been set in the last four frames and at least 11 of the last 13 frames have been consistently Ring Back or Busy, the appropriate tone is signalled.
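The stricter rule for the Ring Back and Busy tones can be sketched as a pair of sliding windows. This is an interpretation: the text does not say whether the current frame counts among the four SP frames or the 13 tone frames, and the sketch includes it in both. The class name `WeakToneConfirmer` is hypothetical.

```python
from collections import deque


class WeakToneConfirmer:
    """Confirms a low-level tone only after a quiet SP history and a
    consistent run of detections (11 of the last 13 frames by default)."""

    def __init__(self, window=13, required=11, quiet_frames=4):
        self.tones = deque(maxlen=window)
        self.sp_flags = deque(maxlen=quiet_frames)
        self.required = required

    def update(self, tone, sp_flag):
        """Record one frame; return the tone to signal, or None."""
        self.tones.append(tone)
        self.sp_flags.append(bool(sp_flag))
        if len(self.sp_flags) < self.sp_flags.maxlen or any(self.sp_flags):
            return None  # SP was set within the last four frames
        if (len(self.tones) == self.tones.maxlen
                and self.tones.count(tone) >= self.required):
            return tone
        return None


confirmer = WeakToneConfirmer()
outputs = [confirmer.update("Busy", False) for _ in range(13)]
print(outputs[11], outputs[12])  # None Busy
```

Allowing two misses in 13 frames tolerates brief dropouts of a tone near its minimum level without signalling on isolated detections.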

The check for the SP flag is necessary to eliminate glitches when a high energy tone turns off (falling step function). When this happens, the SP flag is set (high wideband energy) for a short time, which causes tone signalling to stop. Because of the very low amplitudes possible with these two tones, there may still be enough residual tone energy left (due to the filter transient delay) once the wideband energy dies away (SP not set) to cause repeat tone signalling. This check ensures that the signalled tone is terminated precisely when the tone finishes.

We have found that in some applications, such as where the signal is noisy, the time-domain analysis of the Superpitch routine could produce erroneous results, such as identifying a DTMF signal as speech. In addition the pitch detection uses an autocorrelation search, and this uses a complex algorithm requiring considerable computational overhead especially as the accuracy of the estimation of the periodicity of voiced speech is increased.

Thus, in an alternative embodiment, we have replaced the pitch detection function with a modified version of the wideband energy detector used to detect speech.

Fig. 9 shows this embodiment, and Figs. 10a to 10d show a flow chart for the embodiment.

The arrangement shown in Fig. 9 is similar to the arrangement of Fig. 7, except that the pitch detector block has been removed, the operation of the wideband energy detector being modified to remove the need for the pitch detector.

The wideband energy measure is based on a set of notch filters connected in cascade, to remove energy at the selected frequencies. This enables re-use of the notch filter routines from the tone detection algorithm, with only a moderate increase in program size.

The detection time for speech is specified in Fig. 1B as 200 ms. A reliable measure for speech can thus be a sufficiently high wideband energy for a number of frames. We have chosen 6 frames (Fig. 10d) at a selected threshold of e.g. -35 dBm (Fig. 10). The increase to 6 frames from the 4 frames used with the embodiment of Figs. 7 and 8d improves the accuracy of the wideband energy detection process to the extent that the pitch detector of Fig. 7 can be eliminated.

Fig. 9 illustrates an embodiment of the detection algorithm.

The analog input signal is digitized at 90 to produce signal s(z) which is applied to the narrow band signal detector 91 and the wideband energy detector 95. The outputs from detectors 91 and 95 are applied to the telephone line signal decision matrix 99. The operational steps are shown in the flowchart of Fig. 10.

In the narrow band signal detector 91, the following detection process is carried out for each one of the set of signal frequencies. The signal s(z) is subjected to a notch filter process at 92 to eliminate the frequency of interest. The filtered signal w(z) is then subtracted from s(z) at 93 to produce the wanted signal n(z). A measure of the energy of n(z) is then obtained at 94 and this measured energy signal is applied to the decision matrix 99.

In the wideband energy detector 95 each of the set of signal frequencies is eliminated from s(z) via the series notch filters 96, and the energy of the remaining signal is detected at 97 and applied to the decision matrix 99.
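The two detectors can be illustrated with a generic second-order notch filter. The sketch below is not the filter of the co-pending application: the filter structure, the pole radius `r`, the use of mean-square value as the energy measure and all function names are assumptions made for the example.

```python
import math


def notch(signal, freq_hz, fs=8000, r=0.95):
    """Second-order IIR notch with zeros on the unit circle at freq_hz.
    The pole radius r sets the bandwidth and is an illustrative value."""
    c = 2 * math.cos(2 * math.pi * freq_hz / fs)
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = x0 - c * x1 + x2 + r * c * y1 - r * r * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def mean_square(signal):
    return sum(v * v for v in signal) / len(signal)


def narrowband_energy(s, freq_hz):
    """Energy of n = s - w, where w is s with freq_hz notched out."""
    w = notch(s, freq_hz)
    return mean_square([a - b for a, b in zip(s, w)])


def wideband_energy(s, tone_freqs):
    """Energy left after notching out every tone frequency in cascade."""
    for f in tone_freqs:
        s = notch(s, f)
    return mean_square(s)


fs = 8000
tone_440 = [math.sin(2 * math.pi * 440 * i / fs) for i in range(800)]
tone_1000 = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(800)]
print(narrowband_energy(tone_440, 440) > wideband_energy(tone_440, [440]))    # True
print(wideband_energy(tone_1000, [440]) > narrowband_energy(tone_1000, 440))  # True
```

A signal at the notch frequency shows up almost entirely in the narrow band measure, while a signal elsewhere in the band passes through the cascade and shows up in the wideband measure.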

The steps of the process are explained with reference to the flow chart in Fig. 10. At the start of each detection cycle a new frame of samples of s(z) is examined and the energy of each of the set of control signal frequencies (tones) is computed. The wideband energy (WBE) of the input signal minus the control signal frequencies is computed.

The wideband energy is examined with the WBE of the previous two frames and if the mean of the three frames is greater than a predetermined WBE level (-45 dBm) then the new frame is recognized as speech.

If the input is not recognized as speech, the tone energies are examined to determine which, if any, are present.

The energy E of the tone with the highest energy is chosen for testing.

This is first compared with a minimum energy level (tone floor) to determine if there is a tone signal present.

If the presence of tone is indicated, the nature of the tone can be identified from its characteristic frequency, i.e. the notch filter from which it is produced. The flow chart steps through the process of identifying the tone.

The arrangement may be implemented using an ASIC or by using a microprocessor to carry out the digital filtering and signal analysis.

Claims

2. A method as claimed in claim 1 including the steps of converting the input signal from an analog signal to a first digital signal in which digital information is grouped in frames, and for each frame: digitally filtering each of the tone signals out of the first digital signal to produce a modified digital signal; squaring the modified digital signal to obtain a measure of the energy of the modified digital signal; and comparing the energy of the modified digital signal with a threshold value; generating a speech indication signal when the energy of the modified signal exceeds the threshold value.
3. A method as claimed in claim 2 including the steps of storing a predetermined number of measures of energy of the modified digital signals obtained from preceding frames; calculating the mean of the measure of the energy of the modified digital signal and said stored measures of energy; comparing said mean with the threshold value; generating a speech indication signal when said mean exceeds the threshold value.
4. A method as claimed in claim 3, including the steps of storing the result of the speech indication signal analysis from the immediately preceding frame; and generating the speech indication signal only if said result was a speech indication signal.
5. A method as claimed in any one of claims 1 to 4 including the steps of detecting the pitch lag of the first digital signal; determining if the pitch lag is within an allowable range for a predetermined number of consecutive frames; generating a speech indication signal if and only if the pitch lag is within said allowable range for a predetermined number of consecutive frames.
6. A method of detecting the presence of speech signals substantially as herein described with reference to the accompanying drawings.
7. An arrangement for detecting speech signals in an input signal which may contain one or more of a set of tone signals each of known frequency, the arrangement including: filter means to filter out the frequency of each of the set of tone signals from the input signals to produce a modified signal; energy detector means to produce an energy signal representative of the energy of the modified signal; and comparator means to compare the energy signal with a threshold value; wherein a speech indication signal is produced when the energy signal exceeds the threshold value.
8. An apparatus as claimed in claim 7, including an analog to digital converter to convert the input signal to a first digital signal in which digital information is grouped in frames, wherein the filter means comprise digital filter means in which the tone signals are digitally filtered out of the first digital signal to produce a modified digital signal; and wherein the energy detector means comprise digital squaring means.
9. An arrangement as claimed in claim 8 including store means to store a predetermined number of energy signals obtained from preceding frames; and digital arithmetic calculating means to calculate the mean of the stored energy signals and the present energy signal; wherein said mean is compared with the threshold value.
10. An arrangement as claimed in claim 9, wherein the result of the speech indication signal analysis from the preceding frame is stored in the store means; and wherein a speech indication signal is generated only if said result indicated speech was present.
11. An arrangement as claimed in any one of claims 6 to 10 including pitch lag detector means to detect the pitch lag of the first digital signal; and gauge means to determine whether the pitch lag is within an allowable range for a predetermined number of consecutive frames; wherein a speech indication signal is generated only if the pitch lag was within the allowable range for the predetermined number of consecutive frames.
12. An arrangement for detecting speech signals substantially as herein described with reference to the accompanying drawings.

DATED THIS TWELFTH DAY OF FEBRUARY 1993 ALCATEL AUSTRALIA LIMITED

ABSTRACT A speech detection arrangement to discriminate between speech signals and a plurality of known control tones includes an array of narrow band signal detectors each detecting the energy of a respective one of the control tones in the input signal, a wideband energy detector detecting the total energy in the input signal, and subtraction means to subtract the energy of each of the control tones from the total energy, to produce a remainder signal representative of the energy of the speech signal in the input signal. Figure 9