CA1335003C - Voice activity detection - Google Patents
Voice activity detection
- Publication number
- CA1335003C CA000593386A
- Authority
- CA
- Canada
- Prior art keywords
- signal
- speech
- measure
- voice activity
- input signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
- Telephone Function (AREA)
- Mobile Radio Communication Systems (AREA)
- Noise Elimination (AREA)
- Geophysics And Detection Of Objects (AREA)
- Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Signal Processing Not Specific To The Method Of Recording And Reproducing (AREA)
- Indexing, Searching, Synchronizing, And The Amount Of Synchronization Travel Of Record Carriers (AREA)
- Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
Abstract
A voice activity detector (VAD) for use in an LPC coder in a mobile radio system uses autocorrelation coefficients R0, R1, ... of the input signal, weighted and combined, to provide a measure M which depends on the power within that part of the spectrum containing no noise, and which is thresholded against a variable threshold to provide a speech/no-speech logic output. The measure is
Description
VOICE ACTIVITY DETECTION
A voice activity detector is a device which is supplied with a signal with the object of detecting periods of speech, or periods containing only noise.
Although the present invention is not limited thereto, one application of particular interest for such detectors is in mobile radio telephone systems, where knowledge of the presence or absence of speech can be exploited by a speech coder to improve the efficient utilisation of radio spectrum, and where the noise level (from a vehicle-mounted unit) is also likely to be high.
The essence of voice activity detection is to locate a measure which differs appreciably between speech and non-speech periods. In apparatus which includes a speech coder, a number of parameters are readily available from one or other stage of the coder, and it is therefore desirable to economise on the processing needed by utilising some such parameter. In many environments, the main noise sources occur in known defined areas of the frequency spectrum. For example, in a moving car much of the noise (eg, engine noise) is concentrated in the low frequency regions of the spectrum. Where such knowledge of the spectral position of noise is available, it is desirable to base the decision as to whether speech is present or absent upon measurements taken from that portion of the spectrum which contains relatively little noise. It would, of course, be possible in practice to pre-filter the signal before analysing it to detect speech activity, but where the voice activity detector follows the output of a speech coder, prefiltering would distort the voice signal to be coded.
According to a first aspect of the invention there is provided voice activity detection apparatus comprising means for receiving an input signal, means for estimating the noise signal component of the input signal, means for continually forming a measure M of the spectral similarity between a portion of the input signal and the noise signal, and means for comparing a parameter derived from the measure M with a threshold value T to produce an output indicating the presence or absence of speech in dependence upon whether or not that value is exceeded.
According to a second aspect of the invention there is provided voice activity detection apparatus comprising:
means for continually forming a spectral distortion measure of the similarity between a portion of the input signal and earlier portions of the input signal, and means for comparing the degree of variation between successive values of the measure with a threshold value to produce an output indicating the presence or absence of speech in dependence upon whether or not that value is exceeded.
Preferably, the measure is the Itakura-Saito Distortion Measure.
Other aspects of the present invention are as defined in the claims.
Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram of a first embodiment of the invention;
Figure 2 shows a second embodiment of the invention;
Figure 3 shows a third, preferred embodiment of the invention.
The general principle underlying a first Voice Activity Detector according to a first embodiment of the invention is as follows.
A frame of n signal samples (s0, s1, s2, s3, s4 ... sn-1) will, when passed through a notional fourth order finite impulse response (FIR) digital filter of impulse response (1, h0, h1, h2, h3), result in a filtered signal (ignoring samples from previous frames)
$$
\begin{aligned}
s' = {} & (s_0),\\
& (s_1 + h_0 s_0),\\
& (s_2 + h_0 s_1 + h_1 s_0),\\
& (s_3 + h_0 s_2 + h_1 s_1 + h_2 s_0),\\
& (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0),\\
& (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1),\\
& (s_6 + h_0 s_5 + h_1 s_4 + h_2 s_3 + h_3 s_2),\\
& (s_7 + \dots), \dots
\end{aligned}
$$
The zero order autocorrelation coefficient is the sum of each term squared, which may be normalised, i.e. divided by the total number of terms (for constant frame lengths it is easier to omit the division); that of the filtered signal is thus
$$R'_0 = \sum_{i=0}^{n-1} (s'_i)^2$$
and this is therefore a measure of the power of the notional filtered signal s', in other words, of that part of the signal s which falls within the passband of the notional filter.
Expanding, neglecting the first 4 terms,
$$
\begin{aligned}
R'_0 = {} & (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0)^2 + (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1)^2 + \dots\\
= {} & s_4^2 + h_0 s_4 s_3 + h_1 s_4 s_2 + h_2 s_4 s_1 + h_3 s_4 s_0\\
& + h_0 s_4 s_3 + h_0^2 s_3^2 + h_0 h_1 s_3 s_2 + h_0 h_2 s_3 s_1 + h_0 h_3 s_3 s_0\\
& + h_1 s_4 s_2 + h_0 h_1 s_3 s_2 + h_1^2 s_2^2 + h_1 h_2 s_2 s_1 + h_1 h_3 s_2 s_0\\
& + h_2 s_4 s_1 + h_0 h_2 s_3 s_1 + h_1 h_2 s_2 s_1 + h_2^2 s_1^2 + h_2 h_3 s_1 s_0\\
& + h_3 s_4 s_0 + h_0 h_3 s_3 s_0 + h_1 h_3 s_2 s_0 + h_2 h_3 s_1 s_0 + h_3^2 s_0^2 + \dots\\
= {} & R_0 (1 + h_0^2 + h_1^2 + h_2^2 + h_3^2)\\
& + R_1 (2h_0 + 2h_0 h_1 + 2h_1 h_2 + 2h_2 h_3)\\
& + R_2 (2h_1 + 2h_1 h_3 + 2h_0 h_2)\\
& + R_3 (2h_2 + 2h_0 h_3)\\
& + R_4 (2h_3)
\end{aligned}
$$
So R'0 can be obtained from a combination of the autocorrelation coefficients Ri, weighted by the bracketed constants which determine the frequency band to which the value of R'0 is responsive. In fact, the bracketed terms are the autocorrelation coefficients of the impulse response of the notional filter, so that the expression above may be simplified to
$$R'_0 = R_0 H_0 + 2 \sum_{i=1}^{N} R_i H_i \qquad (1)$$
where N is the filter order and Hi are the (un-normalised) autocorrelation coefficients of the impulse response of the filter.
In other words, the effect on the signal autocorrelation coefficients of filtering a signal may be simulated by producing a weighted sum of the autocorrelation coefficients of the (unfiltered) signal, using the impulse response that the required filter would have had.
Thus, a relatively simple algorithm, involving a small number of multiplication operations, may simulate the effect of a digital filter requiring typically a hundred times this number of multiplication operations.
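The relationship in equation 1 can be checked numerically. The following is a minimal sketch and not part of the original disclosure: the function names, the frame values and the filter taps are illustrative assumptions. It treats the frame as zero outside its limits and keeps the filter tail, in which case the weighted sum agrees with the directly filtered power to within rounding, rather than only approximately.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation of sequence x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


def filtered_power(R, H):
    """Equation 1: R'0 = R0*H0 + 2 * sum over i = 1..N of Ri*Hi."""
    return R[0] * H[0] + 2.0 * sum(r * h for r, h in zip(R[1:], H[1:]))


def fir_filter_full(s, h):
    """Direct FIR filtering (full convolution), used here only as a reference."""
    out = [0.0] * (len(s) + len(h) - 1)
    for i, si in enumerate(s):
        for j, hj in enumerate(h):
            out[i + j] += si * hj
    return out


# Illustrative values only: a short frame and a fourth order response (1, h0..h3).
frame = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.9, 0.6, -0.2]
impulse = [1.0, -0.9, 0.4, -0.2, 0.1]
N = len(impulse) - 1                      # filter order

R = autocorr(frame, N)                    # signal autocorrelation R0..RN
H = autocorr(impulse, N)                  # autocorrelation of the impulse response
direct = sum(v * v for v in fir_filter_full(frame, impulse))
weighted = filtered_power(R, H)
```

The weighted sum needs only N + 1 multiplications per frame once R is available, whereas the reference filter multiplies every sample by every tap.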
This filtering operation may alternatively be viewed as a form of spectrum comparison, with the signal spectrum being matched against a reference spectrum (the inverse of the response of the notional filter). Since the notional filter in this application is selected so as to approximate the inverse of the noise spectrum, this operation may be viewed as a spectral comparison between speech and noise spectra, and the zeroth autocorrelation coefficient thus generated (i.e. the energy of the inverse filtered signal) as a measure of dissimilarity between the spectra. The Itakura-Saito distortion measure is used in LPC to assess the match between the predictor filter and the input spectrum, and in one form is expressed as
$$M = R_0 A_0 + 2 \sum_{i=1}^{N} R_i A_i$$
where A0 etc are the autocorrelation coefficients of the LPC parameter set. It will be seen that this is closely similar to the relationship derived above, and when it is remembered that the LPC coefficients are the taps of an FIR filter having the inverse spectral response of the input signal, so that the LPC coefficient set is the impulse response of the inverse LPC filter, it will be apparent that the Itakura-Saito Distortion Measure is in fact merely a form of equation 1, wherein the filter response H is the inverse of the spectral shape of an all-pole model of the input signal.
In fact, it is also possible to transpose the spectra, using the LPC coefficients of the test spectrum and the autocorrelation coefficients of the reference spectrum, to obtain a different measure of spectral similarity.
The I-S Distortion Measure is further discussed in "Speech Coding based upon Vector Quantisation" by A Buzo, A H Gray, R M Gray and J D Markel, IEEE Trans on ASSP, Vol ASSP-28, No 5, October 1980.
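To illustrate the form of the measure given above, the following sketch derives an LPC inverse filter from a frame by the Levinson-Durbin recursion (a standard technique assumed here, since the text does not say how the LPC analysis is performed) and evaluates M = R0A0 + 2ΣRiAi. The function names and frame values are hypothetical. When the filter is derived from the same frame, M equals the prediction residual energy, which is the smallest value any filter with a leading tap of 1 can give.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation of sequence x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(R, order):
    """Return (a, err): inverse-filter taps a = [1, a1..aN] and residual energy."""
    a = [1.0] + [0.0] * order
    err = R[0]
    for m in range(1, order + 1):
        acc = R[m] + sum(a[j] * R[m - j] for j in range(1, m))
        k = -acc / err                     # reflection coefficient for stage m
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def is_measure(R, A):
    """M = R0*A0 + 2 * sum over i = 1..N of Ri*Ai."""
    return R[0] * A[0] + 2.0 * sum(r * c for r, c in zip(R[1:], A[1:]))


# Illustrative frame and order (assumed values).
frame = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.9, 0.6, -0.2, 1.1, -0.7]
order = 4
R = autocorr(frame, order)
a, err = levinson_durbin(R, order)
A = autocorr(a, order)                     # autocorrelation of the LPC taps
m_own = is_measure(R, A)                   # filter matched to this frame
m_flat = is_measure(R, autocorr([1.0, 0.0, 0.0, 0.0, 0.0], order))  # no filtering
```

A mismatched filter, such as the flat response used for `m_flat`, leaves more of the frame's power unfiltered and so gives a larger value.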
Since the frames of signal have only a finite length, and a number of terms (N, where N is the filter order) are neglected, the above result is an approximation only; it gives, however, a surprisingly good indicator of the presence or absence of speech and thus may be used as a measure M in speech detection. In an environment where the noise spectrum is well known and stationary, it is quite possible simply to employ fixed h0, h1 etc coefficients to model the inverse noise filter.
However, apparatus which can adapt to different noise environments is much more widely useful.
Referring to Figure 1, in a first embodiment, a signal from a microphone (not shown) is received at an input 1 and converted to digital samples s at a suitable sampling rate by an analogue to digital converter 2. An LPC analysis unit 3 (in a known type of LPC coder) then derives, for successive frames of n (eg 160) samples, a set of N (eg 8 or 12) LPC filter coefficients Li which are transmitted to represent the input speech. The speech signal s also enters a correlator unit 4 (normally part of the LPC coder 3, since the autocorrelation vector Ri of the speech is also usually produced as a step in the LPC analysis, although it will be appreciated that a separate correlator could be provided). The correlator 4 produces the autocorrelation vector Ri, including the zero order correlation coefficient R0 and at least 2 further autocorrelation coefficients R1, R2, R3. These are then supplied to a multiplier unit 5.
A second input 11 is connected to a second microphone located distant from the speaker so as to receive only background noise. The input from this microphone is converted to a digital input sample train by AD converter 12 and LPC analysed by a second LPC analyser 13. The "noise" LPC coefficients produced from analyser 13 are passed to correlator unit 14, and the autocorrelation vector thus produced is multiplied term by term with the autocorrelation coefficients Ri of the input signal from the speech microphone in multiplier 5, and the weighted coefficients thus produced are combined in adder 6 according to Equation 1, so as to apply a filter having the inverse shape of the noise spectrum from the noise-only microphone (which in practice is the same as the shape of the noise spectrum in the signal-plus-noise microphone) and thus filter out most of the noise. The resulting measure M is thresholded by thresholder 7 to produce a logic output 8 indicating the presence or absence of speech; if M is high, speech is deemed to be present.
This embodiment does, however, require two microphones and two LPC analysers, which adds to the expense and complexity of the equipment necessary.
Alternatively, another embodiment uses a corresponding measure formed using the autocorrelations from the noise microphone 11 and the LPC coefficients from the main microphone 1, so that an extra autocorrelator rather than an LPC analyser is necessary.
These embodiments are therefore able to operate within different environments having noise at different frequencies, or within a changing noise spectrum in a given environment.
Referring to Figure 2, in the preferred embodiment of the invention, there is provided a buffer 15 which stores a set of LPC coefficients (or the autocorrelation vector of the set) derived from the microphone input 1 in a period identified as being a "non-speech" (ie noise only) period. These coefficients are then used to derive a measure using equation 1, which also of course corresponds to the Itakura-Saito Distortion Measure, except that a single stored frame of LPC coefficients corresponding to an approximation of the inverse noise spectrum is used, rather than the present frame of LPC coefficients.
The LPC coefficient vector Li output by analyser 3 is also routed to a correlator 14, which produces the autocorrelation vector of the LPC coefficient vector. The buffer memory 15 is controlled by the speech/non-speech output of thresholder 7, in such a way that during "speech" frames the buffer retains the "noise" autocorrelation coefficients, but during "noise" frames a new set of LPC coefficients may be used to update the buffer, for example by a multiple switch 16, via which outputs of the correlator 14, carrying each autocorrelation coefficient, are connected to the buffer 15. It will be appreciated that correlator 14 could be positioned after buffer 15. Further, the speech/no-speech decision for coefficient update need not be from output 8, but could be (and preferably is) otherwise derived.
Since frequent periods without speech occur, the LPC coefficients stored in the buffer are updated from time to time, so that the apparatus is thus capable of tracking changes in the noise spectrum. It will be appreciated that such updating of the buffer may be necessary only occasionally, or may occur only once at the start of operation of the detector, if (as is often the case) the noise spectrum is relatively stationary over time, but in a mobile radio environment frequent updating is preferred.
In a modification of this embodiment, the system initially employs equation 1 with coefficient terms corresponding to a simple fixed high pass filter, and then subsequently starts to adapt by switching over to using "noise period" LPC coefficients. If, for some reason, speech detection fails, the system may return to using the simple high pass filter.
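The buffer behaviour just described (retain the stored noise coefficients during speech, replace them during noise, and fall back to a fixed high pass filter if detection fails) might be sketched as follows. The class and method names are hypothetical, and the first-difference response (1, -1) stands in for the "simple fixed high pass filter", whose coefficients the text does not give.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation of sequence x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


class NoiseModel:
    """Holds the inverse-noise-filter autocorrelation vector used in equation 1."""

    def __init__(self, fallback_taps, order):
        self.order = order
        self.fallback = autocorr(fallback_taps, order)
        self.A = list(self.fallback)       # start from the fixed high pass filter

    def update(self, lpc_taps, is_speech):
        """Replace the stored vector only for frames judged to be noise."""
        if not is_speech:
            self.A = autocorr(lpc_taps, self.order)

    def reset(self):
        """Return to the fixed filter, e.g. if speech detection has failed."""
        self.A = list(self.fallback)


# Assumed example: (1, -1) is a first-difference high pass response.
model = NoiseModel([1.0, -1.0], order=2)
initial = list(model.A)
model.update([1.0, -0.5, 0.25], is_speech=True)    # speech frame: buffer retained
after_speech = list(model.A)
model.update([1.0, -0.5, 0.25], is_speech=False)   # noise frame: buffer updated
after_noise = list(model.A)
model.reset()
after_reset = list(model.A)
```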
It is possible to normalise the above measure by dividing through by R0, so that the expression to be thresholded has the form
$$M = A_0 + \frac{2 \sum_{i=1}^{N} R_i A_i}{R_0}$$
This measure is independent of the total signal energy in a frame and is thus compensated for gross signal level changes, but gives rather less marked contrast between "noise" and "speech" levels and is hence preferably not employed in high-noise environments.
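The level independence of the normalised form can be seen in a short sketch. The names and values below are illustrative assumptions: the same frame is evaluated at two gains against one fixed set of inverse-filter taps.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation of sequence x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


def normalised_measure(R, A):
    """M = A0 + 2 * (sum over i = 1..N of Ri*Ai) / R0, the energy-normalised form."""
    return A[0] + 2.0 * sum(r * a for r, a in zip(R[1:], A[1:])) / R[0]


taps = [1.0, -0.9, 0.4]                    # assumed inverse-filter taps
A = autocorr(taps, 2)
quiet = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.9]
loud = [10.0 * v for v in quiet]           # same spectrum, 20 dB higher level

m_quiet = normalised_measure(autocorr(quiet, 2), A)
m_loud = normalised_measure(autocorr(loud, 2), A)
```

Both calls return the same value to within rounding, because every Ri scales by the same factor as R0.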
Instead of employing LPC analysis to derive the inverse filter coefficients of the noise signal (from either the noise microphone or noise only periods, as in the various embodiments described above), it is possible to model the inverse noise spectrum using an adaptive filter of known type; as the noise spectrum changes only slowly (as discussed below) a relatively slow coefficient adaption rate common for such filters is acceptable. In one embodiment, which corresponds to Figure 1, LPC analysis unit 13 is simply replaced by an adaptive filter (for example a transversal FIR or lattice filter), connected so as to whiten the noise input by modelling the inverse filter, and its coefficients are supplied as before to autocorrelator 14.
In a second embodiment, corresponding to that of Figure 2, LPC analysis means 3 is replaced by such an adaptive filter, and buffer means 15 is omitted, but switch 16 operates to prevent the adaptive filter from adapting its coefficients during speech periods.
A second Voice Activity Detector in accordance with another aspect of the invention will now be described.
From the foregoing, it will be apparent that the LPC coefficient vector is simply the impulse response of an FIR filter which has a response approximating the inverse spectral shape of the input signal. When the Itakura-Saito Distortion Measure between adjacent frames is formed, this is in fact equal to the power of the signal, as filtered by the LPC filter of the previous frame. So if spectra of adjacent frames differ little, a correspondingly small amount of the spectral power of a frame will escape filtering and the measure will be low.
Correspondingly, a large interframe spectral difference produces a high Itakura-Saito Distortion Measure, so that the measure reflects the spectral similarity of adjacent frames. In a speech coder, it is desirable to minimise the data rate, so frame length is made as long as possible; in other words, if the frame length is long enough, then a speech signal should show a significant spectral change from frame to frame (if it does not, the coding is redundant). Noise, on the other hand, has a slowly varying spectral shape from frame to frame, and so in a period where speech is absent from the signal the Itakura-Saito Distortion Measure will correspondingly be low, since applying the inverse LPC filter from the previous frame "filters out" most of the noise power.
Typically, the Itakura-Saito Distortion Measure between adjacent frames of a noisy signal containing intermittent speech is higher during periods of speech than periods of noise; the degree of variation (as illustrated by the standard deviation) is also higher, and less intermittently variable.
It is noted that the standard deviation of the standard deviation of M is also a reliable measure; the effect of taking each standard deviation is essentially to smooth the measure.
In this second form of Voice Activity Detector, the measured parameter used to decide whether speech is present is preferably the standard deviation of the Itakura-Saito Distortion Measure, but other measures of variance and other spectral distortion measures (based for example on FFT analysis) could be employed.
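As a rough illustration of using the standard deviation of the measure as the decision parameter, the sketch below computes a running standard deviation over the most recent values. The window length, the use of the population standard deviation, and the sequence of measure values are all assumptions made for the example; the text does not specify them.

```python
from collections import deque
from statistics import pstdev


def rolling_std(values, window):
    """Standard deviation of the most recent `window` values, one output per input."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(pstdev(list(recent)))
    return out


# Synthetic interframe measure values (invented for illustration): steady during
# the first four frames, then fluctuating strongly as a speech burst would.
measures = [1.0, 1.1, 0.9, 1.0, 4.0, 0.5, 6.0, 1.2]
spread = rolling_std(measures, window=4)
```

Applying `rolling_std` a second time to its own output corresponds to the "standard deviation of the standard deviation" mentioned above.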
It is found advantageous to employ an adaptive threshold in voice activity detection. Such thresholds must not be adjusted during speech periods or the speech signal will be thresholded out. It is accordingly necessary to control the threshold adapter using a speech/non-speech control signal, and it is preferable that this control signal should be independent of the output of the threshold adapter.
The threshold T is adaptively adjusted so as to keep the threshold level just above the level of the measure M when noise only is present. Since the measure will in general vary randomly when noise is present, the threshold is varied by determining an average level over a number of blocks, and setting the threshold at a level proportional to this average. In a noisy environment this is not usually sufficient, however, and so an assessment of the degree of variation of the parameter over several blocks is also taken into account.
The threshold value T is therefore preferably calculated according to
$$T = M' + K\,d$$
where M' is the average value of the measure over a number of consecutive frames, d is the standard deviation of the measure over those frames, and K is a constant (which may typically be 2).
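The threshold rule can be written directly. In this minimal sketch the function name is hypothetical, and the population standard deviation is assumed for d, since the text does not say which estimator is intended.

```python
from statistics import mean, pstdev


def adaptive_threshold(noise_measures, K=2.0):
    """T = M' + K*d over a run of consecutive noise-only frames."""
    return mean(noise_measures) + K * pstdev(noise_measures)


# Illustrative measure values from eight noise-only frames (assumed data):
# the mean is 5 and the population standard deviation is 2.
T = adaptive_threshold([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

With K = 2 the threshold sits two standard deviations above the average noise level, here at 9.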
In practice, it is preferred not to resume adaptation immediately after speech is indicated to be absent, but to wait to ensure the fall is stable (to avoid rapid repeated switching between the adapting and non-adapting states).
Referring to Figure 3, in a preferred embodiment of the invention incorporating the above aspects, an input 1 receives a signal which is sampled and digitised by analogue to digital converter (ADC) 2, and supplied to the input of an inverse filter analyser 3, which in practice is part of a speech coder with which the voice activity detector is to work, and which generates coefficients Li (typically 8) of a filter corresponding to the inverse of the input signal spectrum. The digitised signal is also supplied to an autocorrelator 4 (which is part of analyser 3) which generates the autocorrelation vector Ri of the input signal (or at least as many low order terms as there are LPC coefficients). Operation of these parts of the apparatus is as described in Figures 1 and 2.
Preferably, the autocorrelation coefficients Ri are then averaged over several successive speech frames (typically 5-20 ms long) to improve their reliability. This may be achieved by storing each set of autocorrelation coefficients output by autocorrelator 4 in a buffer 4a, and employing an averager 4b to produce a weighted sum of the current autocorrelation coefficients Ri and those from previous frames stored in and supplied from buffer 4a. The averaged autocorrelation coefficients Rai thus derived are supplied to weighting and adding means 5, 6 which receives also the autocorrelation vector Ai of stored noise-period inverse filter coefficients Li from an autocorrelator 14 via buffer 15, and forms from Rai and Ai a measure M preferably defined as:
$$M = A_0 + \frac{2 \sum_{i=1}^{N} Ra_i A_i}{R_0}$$
This measure is then thresholded by thresholder 7 against a threshold level, and the logical result provides an indication of the presence or absence of speech at output 8.
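The buffer 4a and averager 4b arrangement might be sketched as below. The class name, the choice of weights, and the division by the sum of the weights in use (so that the result stays on the same scale while the buffer is filling) are assumptions for illustration; the text says only that a weighted sum is formed.

```python
from collections import deque


class AutocorrAverager:
    """Buffer 4a plus averager 4b: weighted sum of recent autocorrelation vectors."""

    def __init__(self, weights):
        self.weights = list(weights)       # weights[0] applies to the newest frame
        self.history = deque(maxlen=len(self.weights))

    def push(self, R):
        """Store the new vector and return the averaged vector Ra."""
        self.history.appendleft(list(R))
        used = self.weights[:len(self.history)]
        total = sum(used)
        return [
            sum(w * frame[i] for w, frame in zip(used, self.history)) / total
            for i in range(len(R))
        ]


# Assumed example: equal weighting of the current and the previous frame.
averager = AutocorrAverager([0.5, 0.5])
first = averager.push([2.0, 0.0])          # only one frame available so far
second = averager.push([4.0, 2.0])         # average of the two frames
```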
In order that the inverse filter coefficients Li correspond to a fair estimate of the inverse of the noise spectrum, it is desirable to update these coefficients during periods of noise (and, of course, not to update during periods of speech). It is, however, preferable that the speech/non-speech decision on which the updating is based does not depend upon the result of the updating, or else a single wrongly identified frame of signal may result in the voice activity detector subsequently going "out of lock" and wrongly identifying following frames.
Preferably, therefore, there is provided a control signal generating circuit 20, effectively a separate voice activity detector, which forms an independent control signal indicating the presence or absence of speech to control inverse filter analyser 3 (or buffer 8) so that the inverse filter autocorrelation coefficients Ai used to form the measure M are only updated during "noise" periods. The control signal generator circuit 20 includes LPC analyser 21 (which again may be part of a speech coder and, specifically, may be performed by analyser 3), which produces a set of LPC coefficients Mi corresponding to the input signal, and an autocorrelator 21a (which may be performed by autocorrelator 3a) which derives the autocorrelation coefficients Bi of Mi. If analyser 21 is performed by analyser 3, then Mi = Li and Bi = Ai.
These autocorrelation coefficients are then supplied to weighting and adding means 22, 23 (equivalent to 5, 6) which receive also the autocorrelation vector Ri of the input signal from autocorrelator 4. A measure of the spectral similarity between the input speech frame and the preceding speech frame is thus calculated; this may be the Itakura-Saito distortion measure between Ri of the present frame and Bi of the preceding frame, as disclosed above, or it may instead be derived by calculating the Itakura-Saito distortion measure for Ri and Bi of the present frame, and subtracting (in subtractor 25) the corresponding measure for the previous frame stored in buffer 24, to generate a spectral difference signal (in either case, the measure is preferably energy-normalised by dividing by R0). The buffer 24 is then, of course, updated. This spectral difference signal, when thresholded by a thresholder 26, is, as discussed above, an indicator of the presence or absence of speech. We have found, however, that although this measure is excellent for distinguishing noise from unvoiced speech (a task of which prior art systems are generally incapable) it is in general rather less able to distinguish noise from voiced speech. Accordingly, there is preferably further provided within circuit 20 a voiced speech detection circuit comprising a pitch analyser 27 (which in practice may operate as part of a speech coder, and in particular may measure the long term predictor lag value produced in a multipulse LPC coder).
The pitch analyser 27 produces a logic signal which is "true" when voiced speech is detected, and this signal, together with the thresholded measure derived from thresholder 26 (which will generally be "true" when unvoiced speech is present), are supplied to the inputs of a NOR gate 28 to generate a signal which is "false" when speech is present and "true" when noise is present. This signal is supplied to buffer 8 (or to inverse filter analyser 3) so that inverse filter coefficients Li are only updated during noise periods.
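The combination performed by NOR gate 28 reduces to a single boolean expression. The function and argument names in this sketch are hypothetical; the two inputs stand for the outputs of pitch analyser 27 and thresholder 26.

```python
def noise_update_enable(voiced_detected, spectral_change_detected):
    """NOR of the two speech indicators: True only when neither reports speech."""
    return not (voiced_detected or spectral_change_detected)


# Truth table of the gate, keyed by (pitch analyser 27, thresholder 26).
table = {
    (v, u): noise_update_enable(v, u)
    for v in (False, True)
    for u in (False, True)
}
```

Only the entry for (False, False) is True, so the stored noise coefficients are refreshed only when both detectors agree that no speech is present.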
Threshold adapter 29 is also connected to receive the non-speech signal control output of control signal generator circuit 20. The output of the threshold adapter 29 is supplied to thresholder 7. The threshold adapter operates to increment or decrement the threshold in steps which are a proportion of the instant threshold value, until the threshold approximates the noise power level (which may conveniently be derived from, for example, weighting and adding circuits 22, 23). When the input signal is very low, it may be desirable that the threshold is automatically set to a fixed, low, level since at the low signal levels the effect of signal quantisation produced by ADC 2 can produce unreliable results.
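One way to picture the proportional stepping of the threshold adapter is the sketch below. The function name, the 5% step size and the example levels are assumptions; the text states only that each step is a proportion of the current threshold and that adaptation takes place under control of the non-speech signal.

```python
def step_threshold(threshold, noise_level, adapting, step=0.05):
    """Move the threshold one proportional step towards the noise level."""
    if not adapting:                       # speech present: hold the threshold
        return threshold
    if threshold < noise_level:
        return threshold * (1.0 + step)    # increment by a fraction of itself
    return threshold * (1.0 - step)        # decrement by a fraction of itself


# Assumed example: the threshold climbs towards a noise level of 2.0 while
# the control signal indicates noise only.
t = 1.0
trace = []
for _ in range(20):
    t = step_threshold(t, noise_level=2.0, adapting=True)
    trace.append(t)
```

Because each step scales with the threshold itself, the approach is equally fast in relative terms at high and low noise levels; once close, the value hovers within one step of the target.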
There may be further provided "hangover" generating means 30, which operates to measure the duration of indications of speech after thresholder 7 and, when the presence of speech has been indicated for a period in excess of a predetermined time constant, the output is held high for a short "hangover" period. In this way, clipping of the middle of low-level speech bursts is avoided, and appropriate selection of the time constant prevents triggering of the hangover generator 30 by short spikes of noise which are falsely indicated as speech.
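The hangover behaviour can be sketched as a small state machine operating on per-frame decisions. The function name and the frame counts standing in for the "time constant" and the "hangover period" are assumptions; the text gives no values.

```python
def apply_hangover(decisions, min_burst, hangover):
    """Extend speech decisions by `hangover` frames after sufficiently long bursts."""
    out = []
    run = 0            # length of the current run of speech decisions
    hold = 0           # hangover frames still to be output
    for speech in decisions:
        if speech:
            run += 1
            if run > min_burst:
                hold = hangover            # burst long enough: arm the hangover
            out.append(True)
        else:
            run = 0
            if hold > 0:
                hold -= 1
                out.append(True)           # hold the output high
            else:
                out.append(False)
    return out


F, T = False, True
# Assumed example: an isolated one-frame spike, then a three-frame burst.
raw = [F, T, F, F, T, T, T, F, F, F, F]
held = apply_hangover(raw, min_burst=2, hangover=2)
```

The single-frame spike at the start passes through without arming the hangover, whereas the three-frame burst is extended by two frames.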
It will of course be appreciated that all the above functions may be executed by a single suitably programmed digital processing means such as a Digital Signal Processing (DSP) chip, as part of an LPC codec thus implemented (this is the preferred implementation), or as a suitably programmed microcomputer or microcontroller chip with an associated memory device.
Conveniently, as described above, the voice detection apparatus may be implemented as part of an LPC codec.
Alternatively, where autocorrelation coefficients of the signal or related measures (partial correlation, or "parcor", coefficients) are transmitted to a distant station, the voice detection may take place distantly from the codec.
,
A voice activity detector is a device which is supplied with a signal ~ith the object of detecting periods of speech, or periods containing only noise~.
Although the present invention is not limited thereto, one application of particular interest for such detectors is in mobile radio telephone systems where the knowledge as to the presence or otherwise of speech can be used exploited by a speech coder to improve the efficient utilisation of radio spectrum, and where also the noise level (from a vehicle-mounted unit) is likely to be high.
The essence of voice activity detection is to locate a me~sure which differs appreciably -between speech and non-speech periods. In apparatus which includes a speech coder, a number of parameters are readily available from one or other stage of the coder, and it is therefore desirable to economise on processing needed by utilising some such parameter. In many environments, the main noise sources occur in known defined areas of the frequency spectrum. For example, in a moving car much of the noise (eg, engine noise) is concentrated in the low frequency regions of the spectrum. Where such knowledge of the spectral position of noise is available, it is desirable to ~ase the decision as to whether speech is present or absent upon measurements taken fro~-that portion of the spectrum which contains relatively little noise. It would, of course, be possible in practice to pre-filter the signal before analysing to detect speech activity, but where the voice activity detector follows the output of a speech coder, prefiltering-would distort the voice signal to be coded.
~, _ - 2 - 1335 00~
According to a f irst aspect of the invention there is provided voice activity detection apparatus comprising means for receiving an input signal, means for estimating the noise signal component of the input signal, means for continually forming a measure ~ of the spectral sim~larity between a portion of the input signal and the noise signal, and means- for comparing a p~rameter derived from the measure M with a threshoId va~ue T to produce an output to indicate the presence or absence of speech in dependence upon whether or not that value is exceeded.
According to a second aspect of the invention there is provided voice activity detection apparatus comprising:
means for continually forming a spectral distortion measure of the similarity between a portion of the input signal and earlier portions of the input signal and means for comparing the degree of variation between successive values of the measure with a threshold value to produce an output indicating the presence or absence of speech in dependence upon whether or not that value is exceeded.
Preferably, the measure is the Itakura-Saito Distortion Measure.
Other aspects of the present invention are as defined in the claims.
Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram of a first embodiment of the invention;
Figure 2 shows a second embodiment of the invention;
Figure 3 shows a third, preferred embodiment of the invention.
The general principle underlying a first Voice Activity Detector according to a first embodiment of the invention is as follows.
A frame of n signal samples $(s_0, s_1, s_2, s_3, s_4, \ldots, s_{n-1})$ will, when passed through a notional fourth order finite impulse response (FIR) digital filter of impulse response $(1, h_0, h_1, h_2, h_3)$, result in a filtered signal (ignoring samples from previous frames)

$$
\begin{aligned}
s' = {} & (s_0),\ (s_1 + h_0 s_0),\ (s_2 + h_0 s_1 + h_1 s_0),\ (s_3 + h_0 s_2 + h_1 s_1 + h_2 s_0), \\
& (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0),\ (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1), \\
& (s_6 + h_0 s_5 + h_1 s_4 + h_2 s_3 + h_3 s_2),\ (s_7 + \ldots),\ \ldots
\end{aligned}
$$

The zero order autocorrelation coefficient is the sum of each term squared, which may be normalized, i.e. divided by the total number of terms (for constant frame lengths it is easier to omit the division); that of the filtered signal is thus

$$R'_0 = \sum_{i=0}^{n-1} (s'_i)^2$$

and this is therefore a measure of the power of the notional filtered signal s' - in other words, of that part of the signal s which falls within the passband of the notional filter.
Expanding, neglecting the first 4 terms,

$$
\begin{aligned}
R'_0 = {} & (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0)^2 + (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1)^2 + \ldots \\
= {} & s_4^2 + h_0 s_4 s_3 + h_1 s_4 s_2 + h_2 s_4 s_1 + h_3 s_4 s_0 \\
& + h_0 s_4 s_3 + h_0^2 s_3^2 + h_0 h_1 s_3 s_2 + h_0 h_2 s_3 s_1 + h_0 h_3 s_3 s_0 \\
& + h_1 s_4 s_2 + h_0 h_1 s_3 s_2 + h_1^2 s_2^2 + h_1 h_2 s_2 s_1 + h_1 h_3 s_2 s_0 \\
& + h_2 s_4 s_1 + h_0 h_2 s_3 s_1 + h_1 h_2 s_2 s_1 + h_2^2 s_1^2 + h_2 h_3 s_1 s_0 \\
& + h_3 s_4 s_0 + h_0 h_3 s_3 s_0 + h_1 h_3 s_2 s_0 + h_2 h_3 s_1 s_0 + h_3^2 s_0^2 + \ldots \\
= {} & R_0 (1 + h_0^2 + h_1^2 + h_2^2 + h_3^2) + R_1 (2h_0 + 2h_0 h_1 + 2h_1 h_2 + 2h_2 h_3) \\
& + R_2 (2h_1 + 2h_1 h_3 + 2h_0 h_2) + R_3 (2h_2 + 2h_0 h_3) + R_4 (2h_3)
\end{aligned}
$$

So $R'_0$ can be obtained from a combination of the autocorrelation coefficients $R_i$, weighted by the bracketed constants which determine the frequency band to which the value of $R'_0$ is responsive. In fact, the bracketed terms are the autocorrelation coefficients of the impulse response of the notional filter, so that the expression above may be simplified to

$$R'_0 = R_0 H_0 + 2\sum_{i=1}^{N} R_i H_i, \qquad (1)$$

where N is the filter order and $H_i$ are the (un-normalised) autocorrelation coefficients of the impulse response of the filter.
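As a rough numerical check of equation 1, the following Python sketch compares the energy of a directly filtered frame with the weighted sum of autocorrelation coefficients. The filter taps, the frame length of 160 samples and all function names are illustrative assumptions and are not taken from this specification.

```python
import random

def autocorr(x, max_lag):
    """Un-normalised autocorrelation R[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(max_lag + 1)]

def filtered_energy_direct(s, h):
    """Energy of s after FIR filtering with impulse response h.

    The first len(h)-1 outputs are skipped because they would need
    samples from a previous frame (the neglected terms in the text).
    """
    order = len(h) - 1
    total = 0.0
    for t in range(order, len(s)):
        y = sum(h[j] * s[t - j] for j in range(order + 1))
        total += y * y
    return total

def filtered_energy_eq1(s, h):
    """Equation 1: R'0 is approximated by R0*H0 + 2*sum(Ri*Hi, i=1..N)."""
    order = len(h) - 1
    R = autocorr(s, order)   # signal autocorrelation coefficients Ri
    H = autocorr(h, order)   # autocorrelation of the impulse response, Hi
    return R[0] * H[0] + 2.0 * sum(R[i] * H[i] for i in range(1, order + 1))

random.seed(0)
frame = [random.gauss(0.0, 1.0) for _ in range(160)]  # assumed frame length
h = [1.0, -0.9, 0.2, 0.05, -0.01]                     # (1, h0, h1, h2, h3), made-up taps

direct = filtered_energy_direct(frame, h)
approx = filtered_energy_eq1(frame, h)
rel_err = abs(direct - approx) / direct
print(direct, approx, rel_err)
```

For a frame much longer than the filter the two values should be close; the remaining gap comes from the handful of edge terms that equation 1 neglects, and it vanishes when the frame starts and ends with at least N zero samples.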
In other words, the effect on the signal autocorrelation coefficients of filtering a signal may be simulated by producing a weighted sum of the autocorrelation coefficients of the (unfiltered) signal, using the impulse response that the required filter would have had.
Thus, a relatively simple algorithm, involving a small number of multiplication operations, may simulate the effect of a digital filter requiring typically a hundred times this number of multiplication operations.
This filtering operation may alternatively be viewed as a form of spectrum comparison, with the signal spectrum being matched against a reference spectrum (the inverse of the response of the notional filter). Since the notional filter in this application is selected so as to approximate the inverse of the noise spectrum, this operation may be viewed as a spectral comparison between speech and noise spectra, and the zeroth autocorrelation coefficient thus generated (i.e. the energy of the inverse filtered signal) as a measure of dissimilarity between the spectra. The Itakura-Saito distortion measure is used in LPC to assess the match between the predictor filter and the input spectrum, and in one form is expressed as

$$M = R_0 A_0 + 2\sum_{i=1}^{N} R_i A_i,$$

where $A_0$ etc are the autocorrelation coefficients of the LPC parameter set. It will be seen that this is closely similar to the relationship derived above, and when it is remembered that the LPC coefficients are the taps of an FIR filter having the inverse spectral response of the input signal, so that the LPC coefficient set is the impulse response of the inverse LPC filter, it will be apparent that the Itakura-Saito Distortion Measure is in fact merely a form of equation 1, wherein the filter response H is the inverse of the spectral shape of an all-pole model of the input signal.
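To make the link between the LPC inverse filter and a measure of the form M = R0A0 + 2ΣRiAi concrete, here is a minimal Python sketch. It assumes the LPC coefficients are obtained with the standard Levinson-Durbin recursion (the text does not say which LPC method is used), and the test signals, order of 8 and function names are hypothetical.

```python
import math
import random

def autocorr(x, max_lag):
    """Un-normalised autocorrelation R[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(max_lag + 1)]

def lpc_inverse_filter(R, order):
    """Levinson-Durbin recursion (assumed LPC method).

    Returns (a, err): a = [1, a1, ..., aN] is the impulse response of the
    inverse (whitening) filter and err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = R[0]
    for m in range(1, order + 1):
        acc = R[m] + sum(a[j] * R[m - j] for j in range(1, m))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def measure(R, a):
    """M = R0*A0 + 2*sum(Ri*Ai), with Ai the autocorrelation of a."""
    order = len(a) - 1
    A = autocorr(a, order)
    return R[0] * A[0] + 2.0 * sum(R[i] * A[i] for i in range(1, order + 1))

random.seed(1)
ORDER = 8
noise = [random.gauss(0.0, 1.0) for _ in range(160)]
tone = [math.sin(0.3 * t) + 0.1 * random.gauss(0.0, 1.0) for t in range(160)]

R_noise = autocorr(noise, ORDER)
R_tone = autocorr(tone, ORDER)
a_noise, _ = lpc_inverse_filter(R_noise, ORDER)
a_tone, err_tone = lpc_inverse_filter(R_tone, ORDER)

m_own = measure(R_tone, a_tone)     # frame matched against its own inverse filter
m_cross = measure(R_tone, a_noise)  # same frame against the "noise" inverse filter
print(m_own, err_tone, m_cross)
```

With the autocorrelation method, the measure formed from a frame and its own inverse filter equals the LPC residual energy, and any other inverse filter of the same order can only give a value at least as large. That is why a frame whose spectrum differs from the stored noise spectrum produces a higher measure.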
In fact, it is also possible to transpose the spectra, using the LPC coefficients of the test spectrum and the autocorrelation coefficients of the reference spectrum, to obtain a different measure of spectral similarity.
The I-S Distortion Measure is further discussed in "Speech Coding based upon Vector Quantisation" by A Buzo, A H Gray, R M Gray and J D Markel, IEEE Trans on ASSP, Vol ASSP-28, No 5, October 1980.
Since the frames of signal have only a finite length, and a number of terms (N, where N is the filter order) are neglected, the above result is an approximation only; it gives, however, a surprisingly good indicator of the presence or absence of speech and thus may be used as a measure M in speech detection. In an environment where the noise spectrum is well known and stationary, it is quite possible to simply employ fixed h0, h1 etc coefficients to model the inverse noise filter.
However, apparatus which can adapt to different noise environments is much more widely useful.
Referring to Figure 1, in a first embodiment, a signal from a microphone (not shown) is received at an input 1 and converted to digital samples s at a suitable sampling rate by an analogue to digital converter 2. An LPC analysis unit 3 (in a known type of LPC coder) then derives, for successive frames of n (eg 160) samples, a set of N (eg 8 or 12) LPC filter coefficients Li which are transmitted to represent the input speech. The speech signal s also enters a correlator unit 4 (normally part of the LPC coder 3, since the autocorrelation vector Ri of the speech is also usually produced as a step in the LPC analysis, although it will be appreciated that a separate correlator could be provided). The correlator 4 produces the autocorrelation vector Ri, including the zero order correlation coefficient R0 and at least 2 further autocorrelation coefficients R1, R2, R3. These are then supplied to a multiplier unit 5.
A second input 11 is connected to a second microphone located distant from the speaker so as to receive only background noise. The input from this microphone is converted to a digital input sample train by AD converter 12 and LPC analysed by a second LPC analyser 13. The "noise" LPC coefficients produced from analyser 13 are passed to correlator unit 14, and the autocorrelation vector thus produced is multiplied term by term with the autocorrelation coefficients Ri of the input signal from the speech microphone in multiplier 5, and the weighted coefficients thus produced are combined in adder 6 according to Equation 1, so as to apply a filter having the inverse shape of the noise spectrum from the noise-only microphone (which in practice is the same as the shape of the noise spectrum in the signal-plus-noise microphone) and thus filter out most of the noise. The resulting measure M is thresholded by thresholder 7 to produce a logic output 8 indicating the presence or absence of speech; if M is high, speech is deemed to be present.
This embodiment does, however, require two microphones and two LPC analysers, which adds to the expense and complexity of the equipment necessary.
Alternatively, another embodiment uses a corresponding measure formed using the autocorrelations from the noise microphone 11 and the LPC coefficients from the main microphone 1, so that an extra autocorrelator rather than an LPC analyser is necessary.
These embodiments are therefore able to operate within different environments having noise at different frequencies, or within a changing noise spectrum in a given environment.
Referring to Figure 2, in the preferred embodiment of the invention, there is provided a buffer 15 which stores a set of LPC coefficients (or the autocorrelation vector of the set) derived from the microphone input 1 in a period identified as being a "non speech" (ie noise only) period. These coefficients are then used to derive a measure using equation 1, which also of course corresponds to the Itakura-Saito Distortion Measure, except that a single stored frame of LPC coefficients corresponding to an approximation of the inverse noise spectrum is used, rather than the present frame of LPC coefficients.
The LPC coefficient vector Li output by analyser 3 is also routed to a correlator 14, which produces the autocorrelation vector of the LPC coefficient vector. The buffer memory 15 is controlled by the speech/non-speech output of thresholder 7, in such a way that during "speech" frames the buffer retains the "noise" autocorrelation coefficients, but during "noise" frames a new set of LPC coefficients may be used to update the buffer, for example by a multiple switch 16, via which outputs of the correlator 14, carrying each autocorrelation coefficient, are connected to the buffer 15. It will be appreciated that correlator 14 could be positioned after buffer 15. Further, the speech/no-speech decision for coefficient update need not be from output 8, but could be (and preferably is) otherwise derived.
Since frequent periods without speech occur, the LPC coefficients stored in the buffer are updated from time to time, so that the apparatus is thus capable of tracking changes in the noise spectrum. It will be appreciated that such updating of the buffer may be necessary only occasionally, or may occur only once at the start of operation of the detector, if (as is often the case) the noise spectrum is relatively stationary over time, but in a mobile radio environment frequent updating is preferred.
In a modification of this embodiment, the system initially employs equation 1 with coefficient terms corresponding to a simple fixed high pass filter, and then subsequently starts to adapt by switching over to using "noise period" LPC coefficients. If, for some reason, speech detection fails, the system may return to using the simple high pass filter.
It is possible to normalise the above measure by dividing through by R0, so that the expression to be thresholded has the form

$$M = A_0 + \frac{2\sum_{i=1}^{N} R_i A_i}{R_0}$$

This measure is independent of the total signal energy in a frame and is thus compensated for gross signal level changes, but gives rather less marked contrast between "noise" and "speech" levels and is hence preferably not employed in high-noise environments.
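The level independence of the normalised measure can be illustrated with a short Python sketch. The three-tap filter, the synthetic frame and the gain of 4 are arbitrary assumptions chosen only for the demonstration.

```python
import math

def autocorr(x, max_lag):
    """Un-normalised autocorrelation R[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(max_lag + 1)]

def normalised_measure(R, A):
    """M = A0 + 2*sum(Ri*Ai, i=1..N) / R0 (measure divided through by R0)."""
    return A[0] + 2.0 * sum(R[i] * A[i] for i in range(1, len(A))) / R[0]

h = [1.0, -0.9, 0.2]                 # hypothetical inverse-noise filter taps
A = autocorr(h, len(h) - 1)          # autocorrelation of the filter taps

frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.7 * t) for t in range(160)]
loud = [4.0 * x for x in frame]      # same frame at a higher signal level

m_quiet = normalised_measure(autocorr(frame, len(h) - 1), A)
m_loud = normalised_measure(autocorr(loud, len(h) - 1), A)
print(m_quiet, m_loud)
```

Scaling the frame multiplies every Ri, including R0, by the same factor, so the ratio and hence the normalised measure are unchanged.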
Instead of employing LPC analysis to derive the inverse filter coefficients of the noise signal (from either the noise microphone or noise only periods, as in the various embodiments described above), it is possible to model the inverse noise spectrum using an adaptive filter of known type; as the noise spectrum changes only slowly (as discussed below) a relatively slow coefficient adaptation rate common for such filters is acceptable. In one embodiment, which corresponds to Figure 1, LPC analysis unit 13 is simply replaced by an adaptive filter (for example a transversal FIR or lattice filter), connected so as to whiten the noise input by modelling the inverse filter, and its coefficients are supplied as before to autocorrelator 14.
In a second embodiment, corresponding to that of Figure 2, LPC analysis means 3 is replaced by such an adaptive filter, and buffer means 15 is omitted, but switch 16 operates to prevent the adaptive filter from adapting its coefficients during speech periods.
A second Voice Activity Detector in accordance with another aspect of the invention will now be described.
From the foregoing, it will be apparent that the LPC coefficient vector is simply the impulse response of an FIR filter which has a response approximating the inverse spectral shape of the input signal. When the Itakura-Saito Distortion Measure between adjacent frames is formed, this is in fact equal to the power of the signal, as filtered by the LPC filter of the previous frame. So if spectra of adjacent frames differ little, a correspondingly small amount of the spectral power of a frame will escape filtering and the measure will be low.
Correspondingly, a large interframe spectral difference produces a high Itakura-Saito Distortion Measure, so that the measure reflects the spectral similarity of adjacent frames. In a speech coder, it is desirable to minimise the data rate, so frame length is made as long as possible; in other words, if the frame length is long enough, then a speech signal should show a significant spectral change from frame to frame (if it does not, the coding is redundant). Noise, on the other hand, has a slowly varying spectral shape from frame to frame, and so in a period where speech is absent from the signal the Itakura-Saito Distortion Measure will correspondingly be low, since applying the inverse LPC filter from the previous frame "filters out" most of the noise power.
Typically, the Itakura-Saito Distortion Measure between adjacent frames of a noisy signal containing intermittent speech is higher during periods of speech than periods of noise; the degree of variation (as illustrated by the standard deviation) is higher, and less intermittently variable.
It is noted that the standard deviation of the standard deviation of M is also a reliable measure; the effect of taking each standard deviation is essentially to smooth the measure.
In this second form of Voice Activity Detector, the measured parameter used to decide whether speech is present is preferably the standard deviation of the Itakura-Saito Distortion Measure, but other measures of variance and other spectral distortion measures (based for example on FFT analysis) could be employed.
It is found advantageous to employ an adaptive threshold in voice activity detection. Such thresholds must not be adjusted during speech periods or the speech signal will be thresholded out. It is accordingly necessary to control the threshold adapter using a speech/non-speech control signal, and it is preferable that this control signal should be independent of the output of the threshold adapter.
The threshold T is adaptively adjusted so as to keep the threshold level just above the level of the measure M when noise only is present. Since the measure will in general vary randomly when noise is present, the threshold is varied by determining an average level over a number of blocks, and setting the threshold at a level proportional to this average. In a noisy environment this is not usually sufficient, however, and so an assessment of the degree of variation of the parameter over several blocks is also taken into account.
The threshold value T is therefore preferably calculated according to $T = M' + K \cdot d$, where M' is the average value of the measure over a number of consecutive frames, d is the standard deviation of the measure over those frames, and K is a constant (which may typically be 2).
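The threshold rule above can be sketched in a few lines of Python. The use of the population standard deviation and the sample values are assumptions made for illustration; the text does not specify which estimator of the standard deviation is intended.

```python
import statistics

def adaptive_threshold(noise_measures, k=2.0):
    """T = M' + K*d over a window of noise-only frames.

    noise_measures: values of the measure M from consecutive noise frames.
    k: the constant K (the text suggests a typical value of 2).
    """
    mean = statistics.mean(noise_measures)    # M', the average of the measure
    dev = statistics.pstdev(noise_measures)   # d, population std dev (assumed)
    return mean + k * dev

# Hypothetical measure values from eight consecutive noise-only frames.
window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
T = adaptive_threshold(window)
print(T)
```

A frame would then be flagged as speech when its measure exceeds T, with the window refreshed only during periods indicated as noise.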
In practice, it is preferred not to resume adaptation immediately after speech is indicated to be absent, but to wait to ensure the fall is stable (to avoid rapid repeated switching between the adapting and non-adapting states).
Referring to Figure 3, in a preferred embodiment of the invention incorporating the above aspects, an input 1 receives a signal which is sampled and digitised by analogue to digital converter (ADC) 2, and supplied to the input of an inverse filter analyser 3, which in practice is part of a speech coder with which the voice activity detector is to work, and which generates coefficients Li (typically 8) of a filter corresponding to the inverse of the input signal spectrum. The digitised signal is also supplied to an autocorrelator 4 (which is part of analyser 3), which generates the autocorrelation vector Ri of the input signal (or at least as many low order terms as there are LPC coefficients). Operation of these parts of the apparatus is as described in Figures 1 and 2.
Preferably, the autocorrelation coefficients Ri are then averaged over several successive speech frames (typically 5-20 ms long) to improve their reliability. This may be achieved by storing each set of autocorrelation coefficients output by autocorrelator 4 in a buffer 4a, and employing an averager 4b to produce a weighted sum of the current autocorrelation coefficients Ri and those from previous frames stored in and supplied from buffer 4a. The averaged autocorrelation coefficients Rai thus derived are supplied to weighting and adding means 5, 6 which receives also the autocorrelation vector Ai of stored noise-period inverse filter coefficients Li from an autocorrelator 14 via buffer 15, and forms from Rai and Ai a measure M preferably defined as:

$$M = A_0 + \frac{2\sum_{i=1}^{N} Ra_i A_i}{R_0}$$

This measure is then thresholded by thresholder 7 against a threshold level, and the logical result provides an indication of the presence or absence of speech at output 8.
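The buffering and averaging of autocorrelation vectors described for buffer 4a and averager 4b might look like the following Python sketch. Equal weights and a history depth of four frames are assumptions; the text only says that a weighted sum of current and previous frames is formed.

```python
from collections import deque

class AutocorrAverager:
    """Keeps the most recent autocorrelation vectors and averages them."""

    def __init__(self, depth=4):
        # depth is a hypothetical number of frames to average over
        self.history = deque(maxlen=depth)

    def update(self, R):
        """Store the current vector Ri and return the averaged vector Rai."""
        self.history.append(list(R))
        count = len(self.history)
        return [sum(vec[i] for vec in self.history) / count
                for i in range(len(R))]

averager = AutocorrAverager(depth=2)
first = averager.update([2.0, 0.0])    # only one frame stored so far
second = averager.update([4.0, 2.0])   # mean of the two stored frames
third = averager.update([8.0, 4.0])    # oldest frame has dropped out
print(first, second, third)
```

Unequal weights (for example, favouring the current frame) would be a straightforward variation on the same structure.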
In order that the inverse filter coefficients Li correspond to a fair estimate of the inverse of the noise spectrum, it is desirable to update these coefficients during periods of noise (and, of course, not to update during periods of speech). It is, however, preferable that the speech/non-speech decision on which the updating is based does not depend upon the result of the updating, or else a single wrongly identified frame of signal may result in the voice activity detector subsequently going "out of lock" and wrongly identifying following frames.
Preferably, therefore, there is provided a control signal generating circuit 20, effectively a separate voice activity detector, which forms an independent control signal indicating the presence or absence of speech to control inverse filter analyser 3 (or buffer 8) so that the inverse filter autocorrelation coefficients Ai used to form the measure M are only updated during "noise" periods. The control signal generator circuit 20 includes LPC analyser 21 (which again may be part of a speech coder and, specifically, may be performed by analyser 3), which produces a set of LPC coefficients Mi corresponding to the input signal, and an autocorrelator 21a (which may be performed by autocorrelator 3a) which derives the autocorrelation coefficients Bi of Mi. If analyser 21 is performed by analyser 3, then Mi = Li and Bi = Ai.
These autocorrelation coefficients are then supplied to weighting and adding means 22, 23 (equivalent to 5, 6) which receive also the autocorrelation vector Ri of the input signal from autocorrelator 4. A measure of the spectral similarity between the input speech frame and the preceding speech frame is thus calculated; this may be the Itakura-Saito distortion measure between Ri of the present frame and Bi of the preceding frame, as disclosed above, or it may instead be derived by calculating the Itakura-Saito distortion measure for Ri and Bi of the present frame, and subtracting (in subtractor 25) the corresponding measure for the previous frame stored in buffer 24, to generate a spectral difference signal (in either case, the measure is preferably energy-normalised by dividing by R0). The buffer 24 is then, of course, updated. This spectral difference signal, when thresholded by a thresholder 26 is, as discussed above, an indicator of the presence or absence of speech. We have found, however, that although this measure is excellent for distinguishing noise from unvoiced speech (a task which prior art systems are generally incapable of) it is in general rather less able to distinguish noise from voiced speech. Accordingly, there is preferably further provided within circuit 20 a voiced speech detection circuit comprising a pitch analyser 27 (which in practice may operate as part of a speech coder, and in particular may measure the long term predictor lag value produced in a multipulse LPC coder).
The pitch analyser 27 produces a logic signal which is "true" when voiced speech is detected, and this signal, together with the thresholded measure derived from thresholder 26 (which will generally be "true" when unvoiced speech is present) are supplied to the inputs of a NOR gate 28 to generate a signal which is "false" when speech is present and "true" when noise is present. This signal is supplied to buffer 8 (or to inverse filter analyser 3) so that inverse filter coefficients Li are only updated during noise periods.
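The gating logic of NOR gate 28 reduces to a few lines of Python. The function and argument names below are hypothetical, and the coefficient lists stand in for whatever the buffer actually stores.

```python
def noise_only(voiced_detected, spectral_change_detected):
    """NOR of the two speech indicators: True only when neither fires."""
    return not (voiced_detected or spectral_change_detected)

def maybe_update(stored, candidate, voiced_detected, spectral_change_detected):
    """Return the coefficients to keep: refresh only on noise-only frames."""
    if noise_only(voiced_detected, spectral_change_detected):
        return list(candidate)   # noise frame: adopt the new coefficients
    return stored                # speech frame: retain the stored noise model

stored = [1.0, -0.5]
kept = maybe_update(stored, [1.0, -0.7], voiced_detected=True,
                    spectral_change_detected=False)
refreshed = maybe_update(stored, [1.0, -0.7], voiced_detected=False,
                         spectral_change_detected=False)
print(kept, refreshed)
```

Because either indicator alone blocks the update, a voiced frame missed by the spectral difference test still cannot contaminate the stored noise estimate.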
Threshold adapter 29 is also connected to receive the non-speech signal control output of control signal generator circuit 20. The output of the threshold adapter 29 is supplied to thresholder 7. The threshold adapter operates to increment or decrement the threshold in steps which are a proportion of the instant threshold value, until the threshold approximates the noise power level (which may conveniently be derived from, for example, weighting and adding circuits 22, 23). When the input signal is very low, it may be desirable that the threshold is automatically set to a fixed, low, level since at the low signal levels the effect of signal quantisation produced by ADC 2 can produce unreliable results.
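One possible reading of the proportional stepping performed by threshold adapter 29 is sketched below in Python. The step fraction of 5% and the starting values are assumptions, as is the choice to step on every non-speech frame.

```python
def step_threshold(threshold, noise_level, speech_absent, step=0.05):
    """Move the threshold one proportional step towards the noise level.

    The threshold is frozen while speech is indicated; otherwise it is
    raised or lowered by a fraction `step` of its current value.
    """
    if not speech_absent:
        return threshold                   # never adapt during speech
    if threshold < noise_level:
        return threshold * (1.0 + step)    # increment
    return threshold * (1.0 - step)        # decrement

threshold = 1.0
for _ in range(200):                       # 200 hypothetical noise-only frames
    threshold = step_threshold(threshold, 10.0, speech_absent=True)
frozen = step_threshold(3.0, 10.0, speech_absent=False)
print(threshold, frozen)
```

Because each step is a proportion of the current value, the threshold converges at the same relative rate from any starting level and then hovers within one step of the target.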
There may be further provided "hangover" generating means 30, which operates to measure the duration of indications of speech after thresholder 7 and, when the presence of speech has been indicated for a period in excess of a predetermined time constant, the output is held high for a short "hangover" period. In this way, clipping of the middle of low-level speech bursts is avoided, and appropriate selection of the time constant prevents triggering of the hangover generator 30 by short spikes of noise which are falsely indicated as speech.
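The hangover behaviour can be sketched as a small frame-by-frame state machine in Python. The burst length of 3 frames and hangover of 2 frames are hypothetical parameters, and cancelling any pending hangover on a short spike is a design choice of this sketch rather than something stated in the text.

```python
def apply_hangover(raw_flags, min_burst=3, hangover=2):
    """Extend speech decisions after sufficiently long bursts.

    raw_flags: per-frame booleans from the thresholder (True = speech).
    min_burst: consecutive speech frames needed before hangover is earned.
    hangover:  number of extra frames the output is held high afterwards.
    """
    out = []
    run = 0    # length of the current run of raw speech frames
    hold = 0   # hangover frames still available
    for flag in raw_flags:
        if flag:
            run += 1
            # Only bursts of at least min_burst frames earn a hangover.
            hold = hangover if run >= min_burst else 0
            out.append(True)
        else:
            run = 0
            if hold > 0:
                hold -= 1
                out.append(True)   # hold the output high during hangover
            else:
                out.append(False)
    return out

raw = [False, True, False, False, True, True, True, False, False, False]
smoothed = apply_hangover(raw)
print(smoothed)
```

In this example the isolated spike in the second frame earns no hangover, while the three-frame burst is extended by two frames.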
It will of course be appreciated that all the above functions may be executed by a single suitably programmed digital processing means such as a Digital Signal Processing (DSP) chip, as part of an LPC codec thus implemented (this is the preferred implementation), or as a suitably programmed microcomputer or microcontroller chip with an associated memory device.
Conveniently, as described above, the voice detection apparatus may be implemented as part of an LPC codec.
Alternatively, where autocorrelation coefficients of the signal or related measures (partial correlation, or "parcor", coefficients) are transmitted to a distant station, the voice detection may take place distantly from the codec.
Claims (24)
1. Voice activity detection apparatus comprising:
(i) means for receiving a first, input, signal;
(ii) means for periodically adaptively generating a second signal representing an estimated noise signal component of the first signal;
(iii) means for periodically forming from the first and second signals a measure M of the spectral similarity between a portion of the input signal and the said estimated noise signal component; and (iv) means for comparing the measure M with a threshold value T to produce an output indicating the presence or absence of speech; and (v) analysis means operable to produce the coefficients of a filter having a spectral response which is the inverse of the frequency spectrum of one of the said two signals;
wherein the measure M is proportional to the zero-order autocorrelation R'o of a signal obtained by filtering of the other of the said two signals by a filter having the said coefficients.
2. Apparatus according to claim 1 in which the analysis means includes an adaptive filter.
3. Apparatus according to claim 1 in which the generating means are operable to compute the autocorrelation coefficients Ai of the impulse response of the said coefficients and the measure forming means comprises means for computing the autocorrelation coefficients Ri of the said other signal, and means connected to receive Ri and Ai, and to calculate the measure M therefrom.
4. Apparatus according to claim 2 in which the means for computing the autocorrelation coefficients Ri of the said other signal are arranged to do so in dependence upon the autocorrelation coefficients of several successive portions of the signal.
5. Apparatus according to claim 3 or 4 in which $M = R_0 A_0 + 2\sum R_i A_i$, where $A_i$ represents the i-th autocorrelation coefficient of the impulse response of said filter.
6. Apparatus according to claim 3 or 4 in which [formula missing from source], where $A_i$ represents the i-th autocorrelation coefficient of the impulse response of said filter.
7. Apparatus according to claim 1, 2, 3 or 4 in which the said one signal is the second, noise representing, signal and the said other signal is the first, input signal.
8. Apparatus according to claim 7, further comprising an input arranged to receive a second input signal, similarly subject to noise, from which speech is absent, in which the generating means comprise LPC analysis means for deriving values of Ai from the second input signal.
9. Apparatus according to claim 1, 2, 3 or 4 further comprising a buffer connected to store data from which the autocorrelation coefficients Ai of the said filter response may be obtained, in which the said filter response is periodically calculated from the signal by LPC analysis means, the apparatus being so connected and controlled that the measure M is calculated using the said stored data, and the said stored data is updated only from periods in which speech is indicated to be absent.
10. Apparatus according to claim 9 further comprising means for indicating the absence of speech to control the updating of the stored data, the means for indicating the absence of speech being a second voice activity detection means.
11. Apparatus according to claim 1, 2, 3, 4, 8, or 10, further comprising means for adjusting the said threshold value T during periods when speech is indicated to be absent.
12. Apparatus according to claim 11 in which the threshold value T is, when adjusted, adjusted to be equal to the mean of the measure plus a term which is a fraction of the standard deviation of the measure.
13. Apparatus according to claim 12, further comprising second voice activity detection means arranged to prevent adjustment of the threshold value when speech is present.
14. Apparatus according to claim 10 or 13 in which the said second voice activity detection means controls both threshold adjusting and data updating.
15. Apparatus according to claim 10 in which said second voice activity detection means comprises means for generating a measure of the spectral similarity between a portion of the input signal and earlier portions of the input signal.
16. Apparatus according to claim 15 in which the similarity measure generating means comprises means for providing, from LPC filter data and autocorrelation data relating to a present portion of the input signal, a present distortion measure; means for providing an equivalent past frame distortion measure corresponding to a preceding portion of the input signal, and means for generating a signal indicating the degree of similarity therebetween as an indicator of speech presence or absence.
17. Apparatus according to claim 15 or 16, in which said second voice activity detection means further comprises voiced speech detection means comprising pitch analysis means, for generating a signal indicative of the presence of voiced speech, upon which the output of said second voice activity detection means also depends.
18. A method of detecting voice activity in a first, input, signal, comprising the steps of:
(a) periodically adaptively generating a second signal representing an estimated noise signal component of the first signal;
(b) periodically forming from the first and second signals a measure M of the spectral similarity between a portion of the input signal and the said estimated noise signal component;
(c) comparing the measure M with a threshold value T to produce an output indicating the presence or absence of speech;
(d) producing the coefficients of a filter having a spectral response which is the inverse of the frequency spectrum of one of the said two signals; and wherein the measure M is proportional to the zero-order autocorrelation Ro' of a signal obtained by filtering of the other of the said two signals by a filter having the said coefficients.
19. Apparatus for encoding speech signals including apparatus according to any one of claims 1, 2, 3, 4, 8, 10, 12, 13, 15 or 16.
20. Mobile telephone apparatus including apparatus according to any one of claims 1, 2, 3, 4, 8, 10, 12, 13, 15 or 16.
21. A voice activity detection apparatus comprising:
(i) a first voice activity detector which operates by forming a measure of the spectral similarity between an input signal and a stored portion of input signal deemed to be speech free to produce an output signal indicating the presence or absence of speech in the input signal;
(ii) a store for containing the stored portion of signal; and (iii) an auxiliary voice activity detector, wherein the auxiliary voice activity detector alone controls the updating of the store, the auxiliary voice activity detector operating by forming a measure of the spectral similarity between the current signal and an earlier portion of signal.
22. Voice activity detection apparatus comprising:
(i) means for receiving an input signal;
(ii) a store for storing a noise representing signal;
(iii) means for periodically forming from the input signal and the stored noise representing signal a measure of the spectral similarity between a portion of the input signal and the said estimated noise signal component;
(iv) means for comparing the measure with a threshold value to produce an output indicating the presence or absence of speech;
(v) an auxiliary voice activity detector; and
(vi) store updating means for updating the store from the input signal;
wherein the auxiliary voice activity detector is operable in dependence on a measure of spectral similarity between the input signal and a preceding portion of the input signal to produce a control signal indicating the presence or absence of speech, and the store updating means is operable to update the store from the input signal only when said control signal indicates that speech is absent.
23. Apparatus according to claim 22, further comprising means for adjusting the said threshold value during periods when speech is indicated by said control signal to be absent.
24. Apparatus according to claim 22 or 23, in which said auxiliary voice activity detector further comprises voiced speech detection means comprising pitch analysis means for generating a signal indicative of the presence of voiced speech, upon which the control signal produced by the auxiliary voice activity detector also depends.
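The arrangement of claims 21 and 22 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed apparatus: the spectral similarity measure used here (squared distance between normalised autocorrelation values at lags 1 to 4) is a stand-in chosen for brevity, and the class name `DualVad`, its thresholds, and the frame-by-frame store replacement are hypothetical. The threshold adjustment of claim 23 and the pitch analysis of claim 24 are not modelled.

```python
def norm_autocorr(frame, max_lag=4):
    """Autocorrelation at lags 1..max_lag, normalised by the frame energy."""
    r0 = sum(s * s for s in frame)
    if r0 == 0.0:
        return [0.0] * max_lag
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k)) / r0
        for k in range(1, max_lag + 1)
    ]


def spectral_distance(frame_a, frame_b):
    """Assumed similarity measure: small when the two spectra are alike."""
    ra, rb = norm_autocorr(frame_a), norm_autocorr(frame_b)
    return sum((p - q) ** 2 for p, q in zip(ra, rb))


class DualVad:
    """Main detector compares input with the store; auxiliary detector gates updates."""

    def __init__(self, initial_noise, main_threshold, aux_threshold):
        self.store = list(initial_noise)     # stored noise-representing signal
        self.previous = list(initial_noise)  # preceding portion of the input
        self.main_threshold = main_threshold
        self.aux_threshold = aux_threshold

    def process(self, frame):
        # Main detector: input frame versus the stored speech-free signal.
        speech = spectral_distance(frame, self.store) > self.main_threshold
        # Auxiliary detector: input frame versus the preceding input frame.
        aux_speech = spectral_distance(frame, self.previous) > self.aux_threshold
        # The store is updated only when the auxiliary detector reports no speech.
        if not aux_speech:
            self.store = list(frame)
        self.previous = list(frame)
        return speech, aux_speech
```

Because only the auxiliary decision gates the update, a wrong decision by the main detector cannot by itself corrupt the stored noise estimate. A frame-to-frame comparison this simple would also treat a sustained vowel as stationary, which is the kind of case the voiced speech detection of claim 24 addresses.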
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB888805795A GB8805795D0 (en) | 1988-03-11 | 1988-03-11 | Voice activity detector |
GB8805795 | 1988-03-11 | | |
GB8813346.7 | 1988-06-06 | | |
GB888813346A GB8813346D0 (en) | 1988-06-06 | 1988-06-06 | Voice activity detection |
GB888820105A GB8820105D0 (en) | 1988-08-24 | 1988-08-24 | Voice activity detection |
GB8820105.8 | 1988-08-24 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1335003C true CA1335003C (en) | 1995-03-28 |
Family
ID=27263821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000593386A Expired - Lifetime CA1335003C (en) | 1988-03-11 | 1989-03-10 | Voice activity detection |
Country Status (16)
Country | Link |
---|---|
EP (2) | EP0548054B1 (en) |
JP (2) | JP3321156B2 (en) |
KR (1) | KR0161258B1 (en) |
AU (1) | AU608432B2 (en) |
BR (1) | BR8907308A (en) |
CA (1) | CA1335003C (en) |
DE (2) | DE68929442T2 (en) |
DK (1) | DK175478B1 (en) |
ES (2) | ES2188588T3 (en) |
FI (2) | FI110726B (en) |
HK (1) | HK135896A (en) |
IE (1) | IE61863B1 (en) |
NO (2) | NO304858B1 (en) |
NZ (1) | NZ228290A (en) |
PT (1) | PT89978B (en) |
WO (1) | WO1989008910A1 (en) |
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2643593B2 (en) * | 1989-11-28 | 1997-08-20 | 日本電気株式会社 | Voice / modem signal identification circuit |
CA2040025A1 (en) * | 1990-04-09 | 1991-10-10 | Hideki Satoh | Speech detection apparatus with influence of input level and noise reduced |
US5241692A (en) * | 1991-02-19 | 1993-08-31 | Motorola, Inc. | Interference reduction system for a speech recognition device |
FR2697101B1 (en) * | 1992-10-21 | 1994-11-25 | Sextant Avionique | Speech detection method. |
SE470577B (en) * | 1993-01-29 | 1994-09-19 | Ericsson Telefon Ab L M | Method and apparatus for encoding and / or decoding background noise |
JPH06332492A (en) * | 1993-05-19 | 1994-12-02 | Matsushita Electric Ind Co Ltd | Method and device for voice detection |
SE501305C2 (en) * | 1993-05-26 | 1995-01-09 | Ericsson Telefon Ab L M | Method and apparatus for discriminating between stationary and non-stationary signals |
EP0633658A3 (en) * | 1993-07-06 | 1996-01-17 | Hughes Aircraft Co | Voice activated transmission coupled AGC circuit. |
IN184794B (en) * | 1993-09-14 | 2000-09-30 | British Telecomm | |
SE501981C2 (en) * | 1993-11-02 | 1995-07-03 | Ericsson Telefon Ab L M | Method and apparatus for discriminating between stationary and non-stationary signals |
US5742734A (en) * | 1994-08-10 | 1998-04-21 | Qualcomm Incorporated | Encoding rate selection in a variable rate vocoder |
FR2727236B1 (en) * | 1994-11-22 | 1996-12-27 | Alcatel Mobile Comm France | DETECTION OF VOICE ACTIVITY |
GB2317084B (en) * | 1995-04-28 | 2000-01-19 | Northern Telecom Ltd | Methods and apparatus for distinguishing speech intervals from noise intervals in audio signals |
GB2306010A (en) * | 1995-10-04 | 1997-04-23 | Univ Wales Medicine | A method of classifying signals |
FR2739995B1 (en) * | 1995-10-13 | 1997-12-12 | Massaloux Dominique | METHOD AND DEVICE FOR CREATING COMFORT NOISE IN A DIGITAL SPEECH TRANSMISSION SYSTEM |
US5794199A (en) * | 1996-01-29 | 1998-08-11 | Texas Instruments Incorporated | Method and system for improved discontinuous speech transmission |
DE69716266T2 (en) | 1996-07-03 | 2003-06-12 | British Telecommunications P.L.C., London | VOICE ACTIVITY DETECTOR |
US6618701B2 (en) | 1999-04-19 | 2003-09-09 | Motorola, Inc. | Method and system for noise suppression using external voice activity detection |
DE10052626A1 (en) * | 2000-10-24 | 2002-05-02 | Alcatel Sa | Adaptive noise level estimator |
CN1617606A (en) * | 2003-11-12 | 2005-05-18 | 皇家飞利浦电子股份有限公司 | Method and device for transmitting non voice data in voice channel |
US7155388B2 (en) * | 2004-06-30 | 2006-12-26 | Motorola, Inc. | Method and apparatus for characterizing inhalation noise and calculating parameters based on the characterization |
US7139701B2 (en) * | 2004-06-30 | 2006-11-21 | Motorola, Inc. | Method for detecting and attenuating inhalation noise in a communication system |
FI20045315A (en) * | 2004-08-30 | 2006-03-01 | Nokia Corp | Detection of voice activity in an audio signal |
US8708702B2 (en) * | 2004-09-16 | 2014-04-29 | Lena Foundation | Systems and methods for learning using contextual feedback |
US8775168B2 (en) | 2006-08-10 | 2014-07-08 | Stmicroelectronics Asia Pacific Pte, Ltd. | Yule walker based low-complexity voice activity detector in noise suppression systems |
US8175871B2 (en) | 2007-09-28 | 2012-05-08 | Qualcomm Incorporated | Apparatus and method of noise and echo reduction in multiple microphone audio systems |
US8954324B2 (en) | 2007-09-28 | 2015-02-10 | Qualcomm Incorporated | Multiple microphone voice activity detector |
US8223988B2 (en) | 2008-01-29 | 2012-07-17 | Qualcomm Incorporated | Enhanced blind source separation algorithm for highly correlated mixtures |
US8275136B2 (en) | 2008-04-25 | 2012-09-25 | Nokia Corporation | Electronic device speech enhancement |
US8244528B2 (en) | 2008-04-25 | 2012-08-14 | Nokia Corporation | Method and apparatus for voice activity determination |
US8611556B2 (en) | 2008-04-25 | 2013-12-17 | Nokia Corporation | Calibrating multiple microphones |
ES2371619B1 (en) * | 2009-10-08 | 2012-08-08 | Telefónica, S.A. | VOICE SEGMENT DETECTION PROCEDURE. |
EP2491549A4 (en) | 2009-10-19 | 2013-10-30 | Ericsson Telefon Ab L M | Detector and method for voice activity detection |
CN108985277B (en) * | 2018-08-24 | 2020-11-10 | 广东石油化工学院 | Method and system for filtering background noise in power signal |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3509281A (en) * | 1966-09-29 | 1970-04-28 | Ibm | Voicing detection system |
US4052568A (en) * | 1976-04-23 | 1977-10-04 | Communications Satellite Corporation | Digital voice switch |
US4358738A (en) * | 1976-06-07 | 1982-11-09 | Kahn Leonard R | Signal presence determination method for use in a contaminated medium |
JPS5636246A (en) * | 1979-08-31 | 1981-04-09 | Nec Corp | Stereo signal demodulating circuit |
JPS59115625A (en) * | 1982-12-22 | 1984-07-04 | Nec Corp | Voice detector |
EP0127718B1 (en) * | 1983-06-07 | 1987-03-18 | International Business Machines Corporation | Process for activity detection in a voice transmission system |
JPS6196817A (en) * | 1984-10-17 | 1986-05-15 | Sharp Corp | Filter |
-
1989
- 1989-03-10 IE IE77489A patent/IE61863B1/en not_active IP Right Cessation
- 1989-03-10 PT PT89978A patent/PT89978B/en not_active IP Right Cessation
- 1989-03-10 NZ NZ228290A patent/NZ228290A/en unknown
- 1989-03-10 ES ES93200015T patent/ES2188588T3/en not_active Expired - Lifetime
- 1989-03-10 DE DE68929442T patent/DE68929442T2/en not_active Expired - Lifetime
- 1989-03-10 AU AU33554/89A patent/AU608432B2/en not_active Expired
- 1989-03-10 KR KR1019890702099A patent/KR0161258B1/en not_active IP Right Cessation
- 1989-03-10 DE DE68910859T patent/DE68910859T2/en not_active Expired - Lifetime
- 1989-03-10 JP JP50377289A patent/JP3321156B2/en not_active Expired - Lifetime
- 1989-03-10 ES ES89302422T patent/ES2047664T3/en not_active Expired - Lifetime
- 1989-03-10 EP EP93200015A patent/EP0548054B1/en not_active Expired - Lifetime
- 1989-03-10 BR BR898907308A patent/BR8907308A/en not_active IP Right Cessation
- 1989-03-10 WO PCT/GB1989/000247 patent/WO1989008910A1/en active IP Right Grant
- 1989-03-10 EP EP89302422A patent/EP0335521B1/en not_active Expired - Lifetime
- 1989-03-10 CA CA000593386A patent/CA1335003C/en not_active Expired - Lifetime
-
1990
- 1990-09-07 DK DK199002156A patent/DK175478B1/en not_active IP Right Cessation
- 1990-09-07 FI FI904410A patent/FI110726B/en not_active IP Right Cessation
- 1990-09-10 NO NO903936A patent/NO304858B1/en not_active IP Right Cessation
-
1996
- 1996-07-25 HK HK135896A patent/HK135896A/en not_active IP Right Cessation
-
1998
- 1998-06-04 NO NO982568A patent/NO316610B1/en not_active IP Right Cessation
-
1999
- 1999-11-18 JP JP32819899A patent/JP3423906B2/en not_active Expired - Lifetime
-
2001
- 2001-05-04 FI FI20010933A patent/FI115328B/en not_active IP Right Cessation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA1335003C (en) | Voice activity detection | |
US5276765A (en) | Voice activity detection | |
US4630304A (en) | Automatic background noise estimator for a noise suppression system | |
KR100278423B1 (en) | Identification of normal and abnormal signals | |
US4074069A (en) | Method and apparatus for judging voiced and unvoiced conditions of speech signal | |
CA1123955A (en) | Speech analysis and synthesis apparatus | |
JP2995737B2 (en) | Improved noise suppression system | |
JPH09212195A (en) | Device and method for voice activity detection and mobile station | |
GB1533337A (en) | Speech analysis and synthesis system | |
KR102012325B1 (en) | Estimation of background noise in audio signals | |
US5579432A (en) | Discriminating between stationary and non-stationary signals | |
US5632004A (en) | Method and apparatus for encoding/decoding of background sounds | |
CA1336208C (en) | Adaptive threshold voiced detector | |
CA1336212C (en) | Distance measurement control of a multiple detector system | |
Itoh et al. | A new artificial speech signal for objective quality evaluation of speech coding systems | |
Hansen et al. | Use of objective speech quality measures in selecting effective spectral estimation techniques for speech enhancement | |
Openshaw et al. | Noise robust estimate of speech dynamics for speaker recognition | |
Prasad et al. | A 2.4 Kilobits Per Second Linear Prediction Vocoder | |
Openshaw et al. | Reducing the environmental sensitivity of cepstral features for speaker recognition | |
NZ286953A (en) | Speech encoder/decoder: discriminating between speech and background sound |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKEX | Expiry |