GB2384670A - Voice activity detector and validator for noisy environments - Google Patents


Publication number
GB2384670A
Authority
GB
United Kingdom
Prior art keywords
speech
input
frame
communication unit
voice activity
Prior art date
Legal status
Granted
Application number
GB0201585A
Other versions
GB0201585D0 (en)
GB2384670B (en)
Inventor
Douglas Ralph Ealey
Holly Louise Kelleher
David John Benjamin Pearce
Current Assignee
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date
Filing date
Publication date
Application filed by Motorola Inc
Priority to GB0201585A (GB2384670B)
Publication of GB0201585D0
Priority to JP2003562919A (JP2005516247A)
Priority to PCT/EP2003/000271 (WO2003063138A1)
Priority to KR1020097022615A (KR100976082B1)
Priority to CNB038026821A (CN1307613C)
Priority to KR10-2004-7011459A (KR20040075959A)
Publication of GB2384670A
Application granted
Publication of GB2384670B
Priority to FI20041013A (FI124869B)
Priority to JP2009251650A (JP2010061151A)
Current legal status: Expired - Lifetime

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783 - Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 - Adaptive threshold

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Telephone Function (AREA)

Abstract

A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures an energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement. A method of detecting voice and a method of deciding whether an input signal is voice or noise are also described. Using an energy acceleration based voice activity detector and validator, particularly for noisy environments, provides the advantages of noise robustness, fast response and independence of the level of input speech.

Description

Voice Activity Detector And Validator For Noisy Environments

Field of the Invention

This invention relates to detection of speech (commonly known as voice activity detection (VAD)) within a noisy environment. The invention is applicable to, but not limited to, energy acceleration measurement of voice signals in a speech detection system.
Background of the Invention

Many voice communications systems, such as the global system for mobile communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech-processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analogue speech pattern into a suitable digital format for transmission. A speech decoder converts a received digital speech signal into an audible analogue speech pattern.
Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates under the assumption that speech is present only in part of the audio signal. This assumption is usually correct, since there are many audio signal intervals that exhibit only silence or background noise.
A voice activity detector can be used for many purposes.
These include suppressing overall transmission activity in a transmission system, when there is no speech, thus potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, it can reinitiate transmission activity.
A voice activity detector can also be used in conjunction with speech storage devices, by differentiating audio portions which include speech from those that are "speechless". The portions including speech are then stored in the storage device and the "speechless" portions are discarded.
Conventional methods for detecting voice are based, at least in part, on methods for detecting and assessing the power of a speech signal. The estimated power is compared to either a constant or an adaptive threshold, in order to make a decision on whether the signal was speech or not.
The main advantage of these methods is their low complexity, which makes them suitable for low-processing-resource implementations. The main disadvantage of such methods is that background noise can inadvertently result in "speech" being detected when no "speech" is actually present. Alternatively, "speech" that is present may not be detected because it is obscured, and difficult to detect, due to the background noise.
Some methods for detecting speech activity are directed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content from the signal, prior to the final decision. The frequency spectrum and noise level may vary because the method will be used for different speakers and in different environments. Hence, the input filter and thresholds are often adaptive so as to track these variations.
Examples of these methods are provided in GSM specification 06.42, Voice Activity Detector (VAD), for the half rate, full rate and enhanced full rate speech traffic channels respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" as proposed in ITU-T G.729 Annex B. These methods are more accurate in a noisy environment but are significantly more complex to implement.
All of these methods require the speech signal to be input. Some applications employing speech decompression schemes require the carrying out of speech detection during the speech decompression process.
European Patent application No. EP-A-0785419 by Benyassine et al. is directed to a method for voice activity detection that includes the following steps: (i) extracting a predetermined set of parameters from the incoming speech signal for each frame; and (ii) making a frame voicing decision of the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.
The VAD in cellular systems is biased in order to ensure that when a party speaks, the radio, including the speech codec and RF circuitry etc., will be active to convey that speech to the other party in the presence of background noise and other impairments. However, this leads to transmission of data when a party is not speaking. The cost of this is slightly lower battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.
In these systems, there is no concept of a finite resource being available to the duplex call. It is entirely possible and consistent for the uplink and downlink, which are usually on different carriers, to be simultaneously utilising the full bandwidth.
In the field of this invention it is known that some voice activity or voice onset detectors (VADs/VODs) attempt to use characteristics of the speech, such as harmonic structure (e.g., via autocorrelation), to distinguish voiced speech. However, in noise these structural indicators can fail, either due to disruption of the speech structure or due to structure in the noise. This might be, e.g., engine, tyre or air-conditioning noise in a car. Finally, these methods are poor at detecting unvoiced speech.
The alternative is simply to use the frame energy level to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to denote speech. However, this approach fails in more realistic noise conditions.
For unnormalised databases, or in real applications, it is likely that noise levels in one set of examples may be greater than speech levels in another; this makes it impossible to set a single threshold value. The traditional method to overcome this is to average the first 100 msec or so of an utterance on the assumption that this is representative of noise, creating an ad hoc threshold for that utterance. Again, however, this is insufficient for non-stationary noise, where the noise may rapidly diverge from the initial estimate, where the noise has high variance, or where the first few frames actually contain speech rather than the presumed noise.
A need therefore exists for an improved voice activity detector and validator for noisy environments wherein the abovementioned disadvantages may be alleviated.
Statement of Invention

In accordance with a first aspect of the present invention, there is provided a communication unit, as claimed in claim 1.
In accordance with a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.
In accordance with a third aspect of the present invention, there is provided a method of deciding whether a signal input to a communication unit is speech or noise, as claimed in claim 14.
Further aspects of the present invention are as claimed in the claims dependent therefrom.
In summary, the present invention aims to address the case of arbitrary amplitude, non-stationary noise, by the use of an energy acceleration measurement in preference to an energy amplitude measurement to denote the presence, or absence, of speech.
Brief Description of the Drawings

Exemplary embodiments of the present invention will now be described, with reference to the accompanying drawings, in which:

FIG. 1 illustrates a block diagram of a communication unit adapted to perform the voice activity detection and validation of the preferred embodiment of the present invention;

FIG. 2 illustrates a flowchart of an energy acceleration based voice activity detector for noisy environments in accordance with a preferred embodiment of the present invention;

FIG. 3 illustrates a flowchart of an energy acceleration based voice activity validation for noisy environments in accordance with a preferred embodiment of the present invention; and

FIG. 4 illustrates a buffer operation in accordance with a preferred embodiment of the present invention.
Description of Preferred Embodiments

Voiced speech has a comparatively high energy-acceleration value, as its onset is dependent upon the activation of the vocal cords, which are either vibrating or still.
Similarly, unvoiced onsets (e.g. plosives) also have high energy accelerations.
The inventors have recognised that, in a representational domain emphasising voicing, such as a narrowband power spectrum or the Mel-spectrum, the resultant energy acceleration of speech is significantly higher than that of non-stationary noise. The only significant exceptions are impulsive noises (e.g. a hand clap).
Hence, in accordance with the preferred embodiments of the present invention, the inventors have appreciated that one can additionally discriminate against these noises by concentrating on energy in the frequency region that is likely to contain a fundamental pitch of the voice signal.
In particular, the inventors of the present invention propose to use an unstructured characteristic of speech, namely energy acceleration (or acceleration of some metric reflecting the speech energy or components thereof).
In particular, a preferred application for the inventive concepts described herein is the distributed speech recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.
Referring now to FIG. 1, a block diagram of an audio subscriber unit 100, adapted to support the inventive concepts of the preferred embodiments of the present invention, is shown.
The preferred embodiment of the present invention is described with respect to a wireless audio communication unit, for example one capable of operating in the 3rd generation partnership project (3GPP) standard for future cellular wireless communication systems and offering DSR capabilities. However, it is within the contemplation of the invention that the inventive concepts herein described, relating to voice activity detection and validation thereof, are equally applicable to any electronic device that responds to voice signals, and which may benefit from improved voice activity detection circuitry.
As known in the art, the audio subscriber unit 100 contains an antenna 102 preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between receive and transmit chains within the audio subscriber unit 100.
The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or base-band frequency conversion). The front-end circuit 106 is serially coupled to a signal processing function (generally realised by a digital signal processor (DSP)) 108. The signal processing function 108 performs signal demodulation, error correction and formatting. Recovered data from the signal processing function 108 is serially coupled to an audio processing function 109, which formats the received signal in a suitable manner to send to an audio enunciator/display 111.
In different embodiments of the invention, the signal processing function 108 and audio processing function 109 may be provided within the same physical device. A controller 114 is configured to control the information flow and operational state of the elements of the subscriber unit 100.
As regards the transmit chain, this essentially includes an audio input device 120 coupled in series through the audio processing function 109, signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, transmitter/modulation circuitry 122 and the power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104, and antenna 102, to radiate the final radio frequency signal.
In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130 operably coupled to a voice activity decision function 135. In accordance with the preferred embodiments of the present invention, the VAD function 130 and voice activity decision function 135 have been adapted to provide an improved voice detection and decision mechanism, the operation of which is further described with respect to FIG. 2 and FIG. 3. Notably, the voice activity detector function 130 includes a frame-by-frame detection stage consisting of three measurements. The three frequency-range measurements include: (i) whole spectrum; (ii) spectral sub-bands; and (iii) spectral variance.
Subsequently, the voice activity decision function 135 performs a decision based on a buffer of measurements, which are analysed for their speech likelihood. The final decision from the decision stage is applied retrospectively to the earliest frame in the buffer.
In the preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.
The signal processor function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as distinct, operably-coupled, processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a yet further alternative embodiment, the aforementioned functions may be implemented as a mixture of hardware, software or firmware elements, using application specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).
Of course, the various components within the audio subscriber unit 100 can be realised in discrete or integrated component form, with an ultimate structure therefore being merely an arbitrary selection.
To this end, there are several methods by which to achieve an energy acceleration measure for use in the preferred embodiment of the present invention.
(i) The theoretically ideal method is to literally double-differentiate the energy level over successive frames of the utterance. The disadvantage with this approach is that it is likely to introduce delays, as one needs to analyse a number of frames on each side of the frame under analysis.
(ii) A zero-delay estimation of the energy acceleration can be obtained by comparing a ratio of a short-term average with an instantaneous value, for example:

using a frame average:

A = x_t / ((1/t) * (x_1 + x_2 + ... + x_t))  [1]

or using a rolling average:

A = x_t / d_t, where d_t = a*x_t + b*d_t-1  [2]

In each case the method returns values that can be interpreted as 'deceleration' < 1 < 'acceleration'. One can then find empirical values for the coefficients and a denominator length that best differentiate speech from noise.
The inventors of the present invention have recognised that a preferred optimal solution is to find a denominator that can track non-stationary noise quickly, but which is too long to track voice onset. A suggested value sequence for the rolling average is a = 0.2, b = 0.8*a, c = 0.8*b, etc., which can be simply expressed as the recursion:

d_t = 0.2*x_t + 0.8*d_t-1  [3]

Then:

A = x_t / d_t  [4]

The preferred VAD and parameter initialisation systems within the detection stage are summarised in the flowchart of FIG. 2. In non-stationary noise, long-term energy thresholds are not a reliable indicator of speech.
Similarly, in high noise conditions the structure of the speech (e. g. harmonics) cannot be wholly relied upon as an indicator, as they may be corrupted by noise, or structured noises may confuse the detector. Hence, the preferred voice activity detector uses a noise-robust characteristic of the speech, namely the energy acceleration associated with voice onset.
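The rolling-average estimate of equations [3] and [4] can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are not from the patent, and the first-frame seeding choice (denominator initialised to the first input) is an assumption.

```python
def make_accel_estimator(alpha=0.2):
    """Zero-delay energy-acceleration estimate via a rolling average.

    Implements d_t = alpha*x_t + (1 - alpha)*d_t-1 and returns
    A = x_t / d_t for each frame energy x_t.  Values above 1 indicate
    acceleration; values below 1 indicate deceleration.
    """
    state = {"d": None}

    def step(x):
        d = state["d"]
        # Seed the denominator with the first input so A starts at 1
        # (an assumption; the patent does not specify initialisation).
        state["d"] = x if d is None else alpha * x + (1 - alpha) * d
        return x / state["d"]

    return step
```

Because the denominator adapts over several frames, it tracks slowly varying noise but cannot follow the abrupt energy rise at a voice onset, which is what makes the ratio useful as a detector.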
Referring now to FIG. 2, a flowchart 200 of the preferred detection process is illustrated. As indicated above, the process includes a frame-by-frame analysis. The preferred VAD mechanism relates to a 'whole spectrum' measurement process. A frame counter is initially assessed to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example of a preferred embodiment, 'N' is set to '15', assuming it has been established that each frame increments by, say, 10 msec. If the frame counter is less than 'N' in step 205, then the rolling average for an initial acceleration test is updated, as in step 210. If the frame counter is not less than 'N' in step 205, then step 210 is skipped.
A determination is then made to assess whether the energy acceleration measurement is within one or more specified margin(s), as shown in step 235. If the energy acceleration measurement is within one or more specified margin(s) in step 235, then the rolling average is updated with the results of a further energy acceleration test, as in step 240. If the energy acceleration measurement is not within one or more specified margin(s) in step 235, then step 240 is skipped.
A determination is then made to assess whether the energy acceleration measurement is greater than a specified threshold, as shown in step 260. If the energy acceleration measurement is greater than the specified threshold in step 260, then the frame is assumed to be a speech frame, as in step 265. If the energy acceleration measurement is not greater than the specified threshold in step 260, then the frame is assumed to be a noise frame, as in step 270.
The frame counter is then incremented, as in step 275, and the process repeats from step 205.
As an improvement to this process, instead of, or in addition to, the whole spectrum measurement process, a sub-region measurement process shown in optional steps 215 and 245 may be performed. A particular sub-region of the spectrum is selected as that sub-region most likely to contain the fundamental pitch.
In the sub-region process, once the rolling average for the initial acceleration test is updated in step 210 in a whole spectrum measurement, a determination is made to check whether the energy acceleration measurement is greater than the threshold value, as shown in step 220.
If the energy acceleration measurement is greater than the threshold value in step 220, the process of initialising other parameters is suspended, as shown in step 225. If the energy acceleration measurement is not greater than the threshold value in step 220, the initialisation of other parameters is updated, as in step 230. The process then returns to step 235 as shown.
A further preferred determination is made after the determination to assess whether the energy acceleration measurement is within one or more specified margin(s) in step 235. The deceleration value is assessed to determine if it is 'high' in step 250 and, if so, the rolling average for the energy acceleration test is slowly updated, as shown in step 255. The process then returns to the whole spectrum method in step 260.
In this manner, the generally higher signal-to-noise ratios (SNRs) of the sub-region detector make it highly noise-robust. However, it is vulnerable to adverse microphone and speaker changes as well as band-limited noise. Thus, the measurements should not be relied upon in all circumstances. Consequently, the preferred embodiment of the present invention incorporates the sub-region detector in order to augment the whole spectrum measurement.
A further measurement process is preferably performed using the 'acceleration' of the variance of values within, for example, the lower half of the spectrum of each frame.
The variance measure detects structure within the lower half of the spectrum, making it highly sensitive to voiced speech. The variance measurement follows the approach of the sub-region process, with the lower half of the spectrum being the particular sub-region selected. This variance measurement further complements the whole spectrum measurement approach, which is better able to detect unvoiced and plosive speech.
All three measurements take their raw input from the spectral representation of the filter gains generated by the first stage of a double Wiener filter, as described in US Patent Application no. US 09/427497, applicant Motorola Inc., and inventor Yan-Ming Chen. As described above, each measurement uses a different aspect of this data.
In particular, the whole-spectrum detector uses the known Mel-filtered spectral representation of the filter gains generated by the first stage of the double Wiener filter.
A single input value is obtained by squaring the sum of the Mel filter banks.
The whole-spectrum detector, in the preferred embodiment of the invention, applies the following process to all frames, as described below. Step one initialises the noise estimate Tracker in the following manner:

If Frame < 15 AND Acceleration < 2.5, then Tracker = MAX(Tracker, Input).
The energy acceleration measure prevents the Tracker being updated if speech occurs within the lead-in time of 15 frames.
Step two updates the Tracker value if the current input is similar to the noise estimate, in the following manner:

If Input < Tracker*UpperBound AND Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input.

Step three provides a failsafe mechanism for those instances where there is speech or an uncharacteristically large noise content within the first few frames. This causes the resulting erroneously high noise estimate to decay. Step three preferably functions in the following manner:

If Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input.

Step four returns a 'true' speech determination if the current input is more than 165% of the Tracker, in the following manner:

If Input > Tracker*Threshold, then output TRUE, else output FALSE.
The ratio of the instantaneous input to the short-term mean Tracker is a function of the energy acceleration of successive inputs.
Where, in the above: a = 0.8 and b = 0.97; UpperBound is 150% and LowerBound is 75%; Floor is 50%; and Threshold is 165%.
Notably, there is no update if the input is greater than Tracker*UpperBound, or between Tracker*LowerBound and Tracker*Floor. Furthermore, the energy acceleration input, as indicated above, can be calculated either as a double-differentiation of successive inputs, or estimated by tracking the ratio of two rolling averages of the inputs.
Notably, the ratio of fast and slow-adapting rolling averages reflects the energy acceleration of successive inputs.
By way of example, the contribution rates for the averages used above were: (i) 0*mean + 1*input, and (ii) ((Frame-1)*mean + 1*input)/Frame, making the energy acceleration measure increasingly sensitive over the first fifteen frames.
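Steps one to four of the whole-spectrum detector can be gathered into a single sketch. The parameter names follow the description above; the function itself, and the passing-in of a pre-computed acceleration value, are illustrative assumptions rather than the patent's implementation.

```python
def make_whole_spectrum_vad(a=0.8, b=0.97, upper=1.5, lower=0.75,
                            floor=0.5, threshold=1.65,
                            lead_in=15, accel_limit=2.5):
    """Whole-spectrum detector, steps one to four (a sketch)."""
    state = {"tracker": 0.0, "frame": 0}

    def step(inp, accel):
        t = state["tracker"]
        # Step one: seed the noise estimate during the lead-in frames,
        # unless the acceleration measure suggests speech is present.
        if state["frame"] < lead_in and accel < accel_limit:
            t = max(t, inp)
        # Step two: track slowly while the input resembles the noise estimate.
        if t * lower < inp < t * upper:
            t = a * t + (1 - a) * inp
        # Step three: failsafe decay for an erroneously high estimate.
        if inp < t * floor:
            t = b * t + (1 - b) * inp
        state["tracker"] = t
        state["frame"] += 1
        # Step four: flag speech when the input exceeds 165% of the Tracker.
        return inp > t * threshold

    return step
```

Fed a steady noise floor, the Tracker converges to that floor and the output stays FALSE; a sudden jump well above 165% of the Tracker flags speech immediately.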
The sub-band detector preferably uses the average of the second, third and fourth Mel-filter banks derived for the 'whole spectrum' measurement. The detector then applies the following process to all frames, in the manner described below:

(i) Input = p*CurrentInput + (1-p)*PreviousInput;
(ii) If Frame < 15, then Tracker = MAX(Tracker, Input);
(iii) If Input < Tracker*UpperBound AND Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input;
(iv) If Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input;
(v) If Input > Tracker*Threshold, then output TRUE, else output FALSE.
Where, in the sub-region measurement, p = 0.75. All other parameters are the same as for the whole spectrum measurement, except Threshold, which equals 3.25.
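The selection of Mel banks two to four and the inter-frame smoothing of step (i) can be sketched as two small helpers. The helper names, and the use of zero-based indexing for the banks, are assumptions for illustration.

```python
def mel_subband_mean(mel_banks):
    """Mean of the second, third and fourth Mel filter banks
    (indices 1-3 with zero-based indexing -- an assumption)."""
    return sum(mel_banks[1:4]) / 3.0


def subband_input(current, previous, p=0.75):
    """Step (i): inter-frame smoothing of the sub-band measurement,
    Input = p*CurrentInput + (1-p)*PreviousInput."""
    return p * current + (1 - p) * previous
```

The smoothed value then feeds the same Tracker process as the whole-spectrum detector, with the higher Threshold of 3.25.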
For the spectral variance measurement, the variance of the values comprising the lower frequency half of the narrowband spectral representation of the gain for each frame is used as an input. The detector then applies exactly the same process as for the whole spectrum measurement.
The variance is calculated as:

Var = (1/N) * sum_{i=1..N} (w_i - w_mean)^2  [5]

where N = FFT Length/4, w_i are the values of the narrowband spectral representation of the gain, and w_mean is their mean.
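The variance over the lower-frequency half of the narrowband spectrum can be computed directly; this small sketch uses illustrative names and assumes the gains are ordered from the lowest frequency bin upward.

```python
def lower_half_variance(gains, fft_length):
    """Variance of the lower-frequency half of the narrowband
    spectral representation of the filter gains (N = FFT length / 4)."""
    n = fft_length // 4
    w = gains[:n]                     # lower-frequency half of the spectrum
    mean = sum(w) / n
    return sum((x - mean) ** 2 for x in w) / n
```

A structured (voiced) frame yields a high variance, while a flat noise spectrum yields a value near zero.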
In accordance with the preferred embodiment of the present invention, the three measures detailed above are presented to a VAD decision algorithm, as shown in the flowchart of FIG. 3. Successive inputs are presented to a buffer, which provides contextual analysis. This introduces a frame delay equal to the length of the buffer minus one frame.
Referring now to FIG. 3, a flowchart 300 of an acceleration-based voice activity validation process for noisy environments is illustrated, in accordance with a preferred embodiment of the present invention.
For an N=7 frame buffer, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number, and preferably each, of the following steps:

Step 1: V_N = Measure 1 OR Measure 2 OR Measure 3;

Input V_N is defined as 'true' (T) if any of the three measurements returns a true speech indication.
Step 2:

M = MAX(C_i), 0 <= i < N, where C_i = C_i-1 + 1 if V_i = TRUE, and C_i = 0 if V_i = FALSE  [6]

The algorithm searches for the longest contiguous sequence of 'true' values in the buffer, as in step 310. Hence, for example, for the sequence 'T T F T T T F', M would equal '3'.
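Equation [6] amounts to finding the longest contiguous run of 'true' values in the buffer, which can be sketched directly (the function name is illustrative):

```python
def longest_true_run(buffer):
    """Length of the longest contiguous run of True values in the
    buffer -- the quantity M of step 2."""
    best = run = 0
    for v in buffer:
        run = run + 1 if v else 0   # C_i = C_i-1 + 1 if true, else 0
        best = max(best, run)       # M = MAX(C_i)
    return best
```

For the example sequence 'T T F T T T F' this returns 3.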
Step 3: If M >= Sp AND T < Ls, then T = Ls;

where Sp equates to the first threshold in step 315. If the longest sequence of true (T) speech values is equal to or exceeds the first threshold in step 315, i.e. Sp = 3 or more contiguous 'true' values, the buffer is judged to contain 'possible' speech. A short timer T of, say, Ls = 5 frames (Time_1) is activated, in step 325, if it is not already present (or exceeded) from the determination in step 320.
Step 4: If M >= SL AND F > Fs, then T = LM, else T = LL;

where SL equates to the second threshold in step 330. If there are SL = 4 or more contiguous 'true' values, the buffer is again judged to contain 'likely' speech. A medium timer T of, say, LM = 22 frames (Time_2) is activated in step 340 if the current frame F is outside an initial lead-in safety period Fs, as determined in step 335. Otherwise, a failsafe long timer T of, say, LL = 40 frames (Time_3) is used in step 345. Such an arrangement is used as the early presence of speech in the utterance may cause the initial noise estimate of the VAD to be too high.
Step 5: If M < Sp AND T > 0, then T--;

If the process determines that there are fewer than Sp = 3 contiguous 'true' values, in step 350, and the timer is greater than zero, in step 355, then the timer is decremented in step 360.
Step 6: If T > 0 output TRUE else output FALSE; If the timer is greater than zero, in step 365, the process outputs a 'true' speech decision, as shown in step 370. Alternatively, if the timer is not greater than zero, in step 365, the process outputs a 'noise' decision, as shown in step 375.
Step 7: Frame++, Shift buffer left and return to step 1.
In preparation for the next frame in step 380, the buffer is left-shifted to accommodate the next input, as shown with respect to FIG. 4. The output speech decision is applied to the frame being ejected from the buffer. The process then repeats again, at step 305, for the next true/false input to the data buffer.
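Steps 1 to 7 above can be sketched together as a small per-frame state machine. This is a hedged illustration rather than the patented implementation: the class and method names are ours; S_P=3, S_L=4, L_S=5 and L_L=40 follow the worked figures in the text, while the medium timer value (here 20) and the lead-in safety period F_S (here 10 frames) are assumed values, since the text leaves them open.

```python
from collections import deque


class VadDecision:
    """Per-frame speech/noise decision over a buffer of true/false
    speech indications, following steps 1-7 (illustrative sketch)."""

    def __init__(self, n=7, s_p=3, s_l=4, l_s=5, l_m=20, l_l=40, f_s=10):
        self.buf = deque([False] * n, maxlen=n)   # N-frame true/false buffer
        self.s_p, self.s_l = s_p, s_l             # Threshold_1 < Threshold_2
        self.l_s, self.l_m, self.l_l = l_s, l_m, l_l  # Time_1 < Time_2 < Time_3
        self.f_s = f_s                            # lead-in safety period (assumed)
        self.timer = 0
        self.frame = 0

    def step(self, v_n):
        """Take the newest true/false input V_N; return the speech (True)
        or noise (False) decision for the frame ejected from the buffer."""
        self.buf.append(v_n)                      # step 1: newest input at position N
        m = self._longest_run()                   # step 2: M per equation [6]
        if m >= self.s_p and self.timer < self.l_s:
            self.timer = self.l_s                 # step 3: 'possible' speech, short timer
        if m >= self.s_l:                         # step 4: 'likely' speech
            self.timer = self.l_m if self.frame > self.f_s else self.l_l
        if m < self.s_p and self.timer > 0:
            self.timer -= 1                       # step 5: decrement hangover timer
        self.frame += 1                           # step 7: Frame++ (deque shifts itself)
        return self.timer > 0                     # step 6: TRUE = speech, FALSE = noise

    def _longest_run(self):
        longest = current = 0
        for v in self.buf:
            current = current + 1 if v else 0
            longest = max(longest, current)
        return longest
```

Note that a Python deque with `maxlen` performs the left-shift of step 7 implicitly: appending the newest input ejects the oldest frame, and it is to that ejected frame that the returned speech/noise decision applies.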
It is within the contemplation of the invention that alternative mechanisms for making a speech or noise decision, based on the energy acceleration process described above, can be implemented.
For example, the decision mechanism may not be based on one or more timers, and may instead make a decision purely on whether one or more energy acceleration thresholds are exceeded.
Referring now to FIG. 4, an example of the buffer operation 400 according to the preferred embodiment of the
present invention is shown in greater detail. Let us assume that the first threshold is set for three contiguous 'true' values. At a time 't' 410, let us assume that only the current input (frame #7) 425 and previous input (frame #6) 420 were 'true'. Consequently, when the buffer is shifted, the first frame (frame #1) 415 will be marked as 'false'.
At a time 't+1' 430, a third 'true' input (frame #8) 450 has been received, to supplement the earlier two 'true' inputs 440, 445. Consequently, when the buffer is shifted, the next output frame (frame #2) 435 will be marked as 'true'.
It should be noted that in the above decision process, the only constraints are: (i) Time_1 < Time_2 < Time_3; and (ii) Threshold_1 < Threshold_2.
Assuming that only these three inputs (frame #6, frame #7 and frame #8) are 'true', the full output sequence will be:

F T T T T T T T T T T T T T T T T F F F F F
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

where frames #2-#5 indicate 'true' due to the buffer lead-in function, frames #6-#8 indicate 'true' as the positions of the actual original 'true' speech inputs, frames #9-#12 indicate 'true' due to the buffer lead-out function, and frames #13-#18 indicate 'true' in response to
the timer hangover that is used. Once all frames in the utterance have been input, the buffer shifts 'false' entries (frames #19-#L_M) until empty.
It is within the contemplation of the invention that the buffer length and hangover timers can be adjusted dynamically to suit the audio communication unit's needs.
As such, the preferred embodiment's buffer length 'N' of 8 and hangover timer of five frames are used for explanatory purposes only. However, it should be noted that the buffer length 'N' should always be chosen such that N >= S_L.

In addition to its use as a VAD in its own right, it is within the contemplation of the invention that the energy acceleration measure performed in the method steps of FIG.
2 can be used to validate the initialisation of other parameters. For example, a spectral subtraction scheme requires an initial estimate of the noise, based on the first ten frames (typically 100 msec) of speech. Even in stationary noise, several events may occur to invalidate the initial estimate. Examples of such events include: (a) Ramp-up of the signal: Due to a variety of possible causes, the very start of a recording may 'ramp-up' to full volume within the period under assessment. The reasons behind such a full ramp-up include buffer-fills in digital systems, and capacitance or tape-head engagement in analogue systems.
Such events will invalidate the estimate.
Hence, an energy acceleration measure may be used to detect such a ramp-up and so prevent the error.
(b) Spikes in the initial signal: A common 'spike' occurs with a full deployment of the press-to-talk (PTT) button on a subscriber radio unit, where the electrical contact fractionally precedes the button hitting the back of the switch. An energy acceleration measure, as described above, can be used to suspend the estimation process, as shown in step 225 of FIG. 2, when such events occur.
(c) Speech in the initial signal: Another common occurrence, with PTT systems in particular, is that the user starts speaking as soon as they depress the PTT button. In this manner, electrical contact is made after the speech has commenced. The energy acceleration measure can identify this and so suspend noise-based initialisations, as shown in step 225 of FIG. 2, or force the use of default estimates.
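The three validation cases (a)-(c) can all be caught by the same test: if the frame-to-frame energy change is itself changing sharply during the initial frames, the noise estimate is suspect and estimation should be suspended or defaults used. A minimal sketch, assuming per-frame energies are already available; the function name and threshold value are our own, not the patent's:

```python
def noise_estimate_valid(frame_energies, accel_threshold=2.0):
    """Check the initial frames used for a noise estimate.

    Returns False when the change of energy between frames itself
    changes sharply (a crude 'energy acceleration'), suggesting a
    ramp-up, a PTT spike, or early speech -- cases (a)-(c) above.
    Illustrative sketch only; the threshold is an assumed value.
    """
    # First differences of energy, then magnitude of second differences.
    deltas = [b - a for a, b in zip(frame_energies, frame_energies[1:])]
    accels = [abs(b - a) for a, b in zip(deltas, deltas[1:])]
    return all(a <= accel_threshold for a in accels)
```

A flat initial segment passes, while a spiked one (e.g. a PTT contact bounce) fails and would trigger suspension of the estimate or the use of defaults.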
It is also within the contemplation of the invention that alternative measurements may be used to determine the energy acceleration. Furthermore, although the preferred expression used throughout the description is energy acceleration, the inventive concepts are believed to be applicable to any matter or element that corresponds to, or reflects, or is representative of, energy changes. In addition, the aforementioned measurements are provided as preferred examples only, in determining the energy change in an input signal.
In summary, a communication unit has been described that includes an audio processing unit having a voice activity detection mechanism. The voice activity detection mechanism measures energy acceleration of a signal input to the communication unit and determines whether said input signal is speech or noise based on said measurement.
In addition, a method of detecting a speech signal input to a communication unit has been described. The method includes the steps of measuring an acceleration or change in energy of an input signal to the communication unit; and determining whether said input signal is speech or noise based on said step of measuring.
Furthermore, a method of deciding whether a signal input to a communication unit is speech or noise has been described. The method includes the step of deciding whether said input signal is speech or noise based on an energy acceleration or change in energy measurement of said input signal, for example using a frame average or a rolling average of a number of input signals.
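The 'frame average or a rolling average' mentioned above could, for example, take the form of an exponentially smoothed frame energy whose per-frame change feeds the decision. A sketch under our own assumptions; the smoothing factor alpha and the function name are illustrative, not from the patent:

```python
def rolling_energy_changes(frame_energies, alpha=0.3):
    """Smooth frame energies with an exponential rolling average and
    return the per-frame change of the smoothed value (sketch only)."""
    changes, avg = [], None
    for e in frame_energies:
        prev = avg
        # First frame seeds the average; later frames blend in with weight alpha.
        avg = e if avg is None else alpha * e + (1 - alpha) * avg
        changes.append(0.0 if prev is None else avg - prev)
    return changes
```

Because the change is taken on the smoothed value, isolated single-frame glitches perturb the decision input far less than they would a raw frame-energy difference.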
Hence, it will be understood that the energy acceleration based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. As the preferred embodiment uses an energy acceleration measure, instead of an absolute measurement, the inventive concepts herein described can be applied to speech of any input level.
Whilst the specific and preferred implementations of the embodiments of the present invention are described above, it is clear that one skilled in the art could readily apply variations and modifications of such inventive concepts that would fall within the scope of the present invention.
Thus, an improved voice activity detector and validator for noisy environments have been described wherein the aforementioned disadvantages associated with prior art arrangements have been substantially alleviated.

Claims (17)

  1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), the communication unit (100) characterised in that the voice activity detection mechanism (130, 135) measures energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement.
  2. The communication unit (100) according to Claim 1, wherein the voice activity detection mechanism includes a voice activity detector function (130) that performs frame-by-frame detection of voice for signals input to the voice activity detection mechanism (130, 135).
  3. The communication unit (100) according to Claim 2, wherein said frame-by-frame detection consists of performing an energy acceleration measurement on a signal input to the voice activity detection mechanism (130, 135) for one or more of the following frequency ranges: (i) A whole spectrum; (ii) Spectral sub-bands; and (iii) Spectral variance.
  4. The communication unit (100) according to Claim 3, wherein the voice activity detection mechanism includes a voice activity decision function (135) operably coupled to the voice activity detector function (130) to decide whether said input signal is speech based on a buffering operation of one or more of said measurements.
  5. The communication unit (100) according to Claim 4, wherein the voice activity decision function (135) decides whether an input signal is speech using a frame average or a rolling average of a number of said input signals.
  6. The communication unit (100) according to any of preceding Claims 2 to 5, wherein if the energy acceleration measurement yields an energy acceleration value that is greater than an energy acceleration threshold then an input frame is assumed to be a speech frame (265).
  7. The communication unit (100) according to Claim 6, wherein a decision that an input frame is a speech frame (265) is applied retrospectively to an earlier frame in a buffer of input signals.
  8. The communication unit (100) according to Claim 6 or Claim 7, wherein if the energy acceleration measurement yields an energy acceleration value that is greater than an energy acceleration threshold over a number of contiguous frames, then an input signal is assumed to be a speech signal (370).
  9. The communication unit (100) according to any of preceding Claims 3 to 8 when dependent upon Claim 3, wherein if a sub-region of an input signal spectrum is selected, the selection is based on that sub-region most likely to contain a fundamental pitch of a voice signal.
  10. The communication unit (100) according to any preceding Claim, wherein the voice activity detection mechanism (130, 135) uses acceleration of voice-energy related features to validate a parameter initialisation of other voice or noise related metrics, for example, a spectral subtraction scheme.
  11. A method of detecting a speech signal input to a communication unit characterised by the steps of: measuring an acceleration or change in energy of an input signal to the communication unit; and determining (315, 330, 350) whether said input signal is speech (370) or noise (375) based on said step of measuring.
  12. The method of detecting a speech signal according to Claim 11, further characterised by the step of: performing frame-by-frame detection of voice for signals input to the communication unit.
  13. The method of detecting a speech signal according to Claim 12, wherein said frame-by-frame detection includes the step of: performing an energy acceleration measurement on said input signal for one or more of the following frequency ranges: (i) A whole spectrum; (ii) Spectral sub-bands; and (iii) Spectral variance.
  14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably according
    to any of preceding Claims 11 to 13, the method further characterised by the step of: deciding (315, 330, 350) whether said input signal is speech (370) or noise (375) based on an energy acceleration or change in energy measurement of said input signal, for example using a frame average or a rolling average of a number of input signals.
  15. The method of deciding whether a signal input to a communication unit is speech or noise according to Claim 14, wherein said step of deciding includes: determining, if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, that an input frame is a speech frame (265); and applying said determination retrospectively to an earlier frame in a buffer of input signals.
  16. A communication unit substantially as hereinbefore described with reference to, and/or as illustrated by, FIG. 1 of the accompanying drawings.
  17. An energy acceleration based voice activity detection and/or validation method for noisy environments substantially as hereinbefore described with reference to, and/or as illustrated by, FIG. 2 or FIG. 3 or FIG. 4 of the accompanying drawings.
GB0201585A 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments Expired - Lifetime GB2384670B (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
GB0201585A GB2384670B (en) 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments
CNB038026821A CN1307613C (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments
PCT/EP2003/000271 WO2003063138A1 (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments
KR1020097022615A KR100976082B1 (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments
JP2003562919A JP2005516247A (en) 2002-01-24 2003-01-10 Voice activity detector and enabler for noisy environments
KR10-2004-7011459A KR20040075959A (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments
FI20041013A FI124869B (en) 2002-01-24 2004-07-22 Voice activity detector and approver for noisy environments
JP2009251650A JP2010061151A (en) 2002-01-24 2009-11-02 Voice activity detector and validator for noisy environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0201585A GB2384670B (en) 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments

Publications (3)

Publication Number Publication Date
GB0201585D0 GB0201585D0 (en) 2002-03-13
GB2384670A true GB2384670A (en) 2003-07-30
GB2384670B GB2384670B (en) 2004-02-18

Family

ID=9929648

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0201585A Expired - Lifetime GB2384670B (en) 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments

Country Status (6)

Country Link
JP (2) JP2005516247A (en)
KR (2) KR100976082B1 (en)
CN (1) CN1307613C (en)
FI (1) FI124869B (en)
GB (1) GB2384670B (en)
WO (1) WO2003063138A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100657912B1 (en) * 2004-11-18 2006-12-14 삼성전자주식회사 Noise reduction method and apparatus
JP4758879B2 (en) * 2006-12-14 2011-08-31 日本電信電話株式会社 Temporary speech segment determination device, method, program and recording medium thereof, speech segment determination device, method
GB2450886B (en) 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
EP2359361B1 (en) * 2008-10-30 2018-07-04 Telefonaktiebolaget LM Ericsson (publ) Telephony content signal discrimination
CN102044241B (en) 2009-10-15 2012-04-04 华为技术有限公司 Method and device for tracking background noise in communication system
JP5575977B2 (en) * 2010-04-22 2014-08-20 クゥアルコム・インコーポレイテッド Voice activity detection
US8898058B2 (en) 2010-10-25 2014-11-25 Qualcomm Incorporated Systems, methods, and apparatus for voice activity detection
KR101196518B1 (en) 2011-04-05 2012-11-01 한국과학기술연구원 Apparatus and method for detecting voice activity in real-time
RU2544293C1 (en) * 2013-10-11 2015-03-20 Сергей Александрович Косарев Method of measuring physical quantity using mobile electronic device and external unit
US9953661B2 (en) * 2014-09-26 2018-04-24 Cirrus Logic Inc. Neural network voice activity detection employing running range normalization
CN104575498B (en) * 2015-01-30 2018-08-17 深圳市云之讯网络技术有限公司 Efficient voice recognition methods and system
JP2016167678A (en) * 2015-03-09 2016-09-15 株式会社リコー Communication device, communication system, log data storage method, and program
CN109841223B (en) * 2019-03-06 2020-11-24 深圳大学 Audio signal processing method, intelligent terminal and storage medium
US11217262B2 (en) * 2019-11-18 2022-01-04 Google Llc Adaptive energy limiting for transient noise suppression
CN112820324B (en) * 2020-12-31 2024-06-25 平安科技(深圳)有限公司 Multi-label voice activity detection method, device and storage medium
KR102453919B1 (en) 2022-05-09 2022-10-12 (주)피플리 Method, device and system for verifying of guide soundtrack related to cultural content based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6039938A (en) * 1983-07-14 1985-03-02 ジー・テイー・イー・ラボラトリーズ・インコーポレイテツド Complementary speech detector
WO1999014737A1 (en) * 1997-09-18 1999-03-25 Matra Nortel Communications Method for detecting speech activity
US5946649A (en) * 1997-04-16 1999-08-31 Technology Research Association Of Medical Welfare Apparatus Esophageal speech injection noise detection and rejection
US6009391A (en) * 1997-06-27 1999-12-28 Advanced Micro Devices, Inc. Line spectral frequencies and energy features in a robust signal recognition system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2559475B2 (en) * 1988-09-22 1996-12-04 積水化学工業株式会社 Voice detection method
JPH03114100A (en) * 1989-09-28 1991-05-15 Matsushita Electric Ind Co Ltd Voice section detecting device
JP3024447B2 (en) * 1993-07-13 2000-03-21 日本電気株式会社 Audio compression device
JP3109978B2 (en) * 1995-04-28 2000-11-20 松下電器産業株式会社 Voice section detection device
US5774849A (en) * 1996-01-22 1998-06-30 Rockwell International Corporation Method and apparatus for generating frame voicing decisions of an incoming speech signal
JPH10171497A (en) * 1996-12-12 1998-06-26 Oki Electric Ind Co Ltd Background noise removing device
JP3297346B2 (en) * 1997-04-30 2002-07-02 沖電気工業株式会社 Voice detection device
JPH10327089A (en) * 1997-05-23 1998-12-08 Matsushita Electric Ind Co Ltd Portable telephone set
JPH113091A (en) * 1997-06-13 1999-01-06 Matsushita Electric Ind Co Ltd Detection device of aural signal rise
JP4221537B2 (en) * 2000-06-02 2009-02-12 日本電気株式会社 Voice detection method and apparatus and recording medium therefor


Also Published As

Publication number Publication date
CN1307613C (en) 2007-03-28
KR20040075959A (en) 2004-08-30
WO2003063138A1 (en) 2003-07-31
CN1623186A (en) 2005-06-01
KR100976082B1 (en) 2010-08-16
KR20090127182A (en) 2009-12-09
FI124869B (en) 2015-02-27
GB0201585D0 (en) 2002-03-13
FI20041013A (en) 2004-09-22
JP2005516247A (en) 2005-06-02
GB2384670B (en) 2004-02-18
JP2010061151A (en) 2010-03-18

Similar Documents

Publication Publication Date Title
JP2010061151A (en) Voice activity detector and validator for noisy environment
US7171357B2 (en) Voice-activity detection using energy ratios and periodicity
US6810273B1 (en) Noise suppression
KR100944252B1 (en) Detection of voice activity in an audio signal
US9524735B2 (en) Threshold adaptation in two-channel noise estimation and voice activity detection
US8977556B2 (en) Voice detector and a method for suppressing sub-bands in a voice detector
JP3878482B2 (en) Voice detection apparatus and voice detection method
US20080095384A1 (en) Apparatus and method for detecting voice end point
KR20160079105A (en) Voice recognition method, voice recognition device, and electronic device
US8924199B2 (en) Voice correction device, voice correction method, and recording medium storing voice correction program
US8712768B2 (en) System and method for enhanced artificial bandwidth expansion
WO1997022116A2 (en) A noise suppressor and method for suppressing background noise in noisy speech, and a mobile station
EP3438979B1 (en) Estimation of background noise in audio signals
KR100848798B1 (en) Method for fast dynamic estimation of background noise
JPH05244105A (en) Sound detection method/device
EP1751740B1 (en) System and method for babble noise detection
EP2743923B1 (en) Voice processing device, voice processing method
KR101336203B1 (en) Apparatus and method for detecting voice activity in electronic device
WO2007040883A2 (en) Voice activity detector

Legal Events

Date Code Title Description
732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20110120 AND 20110126

732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20170831 AND 20170906

PE20 Patent expired after termination of 20 years

Expiry date: 20220123