GB2384670A - Voice activity detector and validator for noisy environments - Google Patents
Voice activity detector and validator for noisy environments
- Publication number
- GB2384670A
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- input
- frame
- communication unit
- voice activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
- G10L2025/786—Adaptive threshold
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Mobile Radio Communication Systems (AREA)
- Telephone Function (AREA)
Abstract
A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures an energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement. A method of detecting voice and a method of deciding whether an input signal is voice or noise are also described. Using an energy acceleration based voice activity detector and validator, particularly for noisy environments, provides the advantages of noise robustness, fast response and independence of the level of input speech.
Description
Voice Activity Detector And Validator For Noisy Environments
Field of the Invention
This invention relates to the detection of speech, commonly known as voice activity detection (VAD), within a noisy environment. The invention is applicable to, but not limited to, energy acceleration measurement of voice signals in a speech detection system.
Background of the Invention
Many voice communications systems, such as the global system for mobile communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech-processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analogue speech pattern into a suitable digital format for transmission. A speech decoder converts a received digital speech signal into an audible analogue speech pattern.
Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates under the assumption that speech is present only in part of the audio signal. This assumption is usually correct, since there are many audio signal intervals that exhibit only silence or background noise.
A voice activity detector can be used for many purposes. These include suppressing overall transmission activity in a transmission system when there is no speech, thus potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, it can reinitiate transmission activity.
A voice activity detector can also be used in conjunction with speech storage devices, by differentiating audio portions which include speech from those that are "speechless". The portions including speech are then stored in the storage device and the "speechless" portions are discarded.
Conventional methods for detecting voice are based, at least in part, on methods for detecting and assessing the power of a speech signal. The estimated power is compared to either a constant or an adaptive threshold, in order to make a decision on whether the signal was speech or not.
The main advantage of these methods is their low complexity, which makes them suitable for implementations with limited processing resources. The main disadvantage of such methods is that background noise can inadvertently result in "speech" being detected when no "speech" is actually present. Alternatively, "speech" that is present may not be detected because it is obscured by the background noise.
Some methods for detecting speech activity are directed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content from the signal, prior to the final decision. The frequency spectrum and noise level may vary because the method will be used for different speakers and in different environments. Hence, the input filter and thresholds are often adaptive so as to track these variations.
Examples of these methods are provided in GSM specification 06.42, Voice Activity Detector (VAD) for half rate, full rate and enhanced full rate speech traffic channels. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in ITU G.729 Annex B. These methods are more accurate in a noisy environment but are considerably more complex to implement.
All of these methods require the speech signal to be input. Some applications employing speech decompression schemes require the carrying out of speech detection during the speech decompression process.
European Patent application No. EP-A-0785419 by Benyassine et al. is directed to a method for voice activity detection that includes the following steps:
(i) Extracting a predetermined set of parameters from the incoming speech signal for each frame; and
(ii) Making a frame voicing decision of the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.
The VAD in cellular systems is biased in order to ensure that when a party speaks, the radio, including the speech codec and RF circuitry etc., will be active to convey that speech to the other party in the presence of background noise and other impairments. However, this leads to transmission of data when a party is not speaking. The cost of this is slightly shorter battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.
In these systems, there is no concept of a finite resource being available to the duplex call. It is entirely possible and consistent for the uplink and downlink, which are usually on different carriers, to be simultaneously utilising the full bandwidth.
In the field of this invention it is known that some voice activity or voice onset detectors (VADs/VODs) attempt to use characteristics of the speech such as harmonic structure (e.g., via autocorrelation) to distinguish voiced speech. However, in noise these structural indicators can fail, either due to disruption of the speech structure or due to structure in the noise. Such noise might be, for example, engine, tyre or air-conditioning noise in a car. Finally, these methods are poor at detecting unvoiced speech.
The alternative is simply to use the frame energy level to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where an arbitrary threshold above the noise level can be set to denote speech. However, this approach fails in more realistic noise conditions.
For unnormalised databases or in real applications, it is likely that noise levels in one set of examples may be greater than speech levels in another; this makes it impossible to set a single threshold value. The traditional method to overcome this is to average the first 100 ms or so of an utterance on the assumption that this is representative of noise, creating an ad hoc threshold for that utterance. Again, however, this is insufficient for non-stationary noise where the noise may rapidly diverge from the initial estimate, where the noise has high variance or where the first few frames actually contain speech rather than the presumed noise.
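The conventional ad hoc threshold described above can be sketched in Python as follows. This is a minimal illustration only, not taken from any of the cited methods: the function name `adhoc_threshold_vad`, the 10-frame lead-in (100 ms at an assumed 10 ms per frame) and the `margin` factor are all assumptions chosen for the example.

```python
def adhoc_threshold_vad(frame_energies, lead_in_frames=10, margin=2.0):
    """Flag frames whose energy exceeds a threshold derived from the lead-in.

    The first `lead_in_frames` frames are assumed to contain noise only.
    `margin` is a hypothetical safety factor, not a value from the text.
    """
    lead_in = frame_energies[:lead_in_frames]
    noise_estimate = sum(lead_in) / len(lead_in)
    threshold = noise_estimate * margin
    return [energy > threshold for energy in frame_energies]


# Stationary noise: only the loud frame is flagged.
steady = [1.0] * 10 + [5.0, 1.0]
print(adhoc_threshold_vad(steady))

# Non-stationary noise: once the noise floor rises, every later frame is
# flagged, because the threshold never adapts after the lead-in.
drifting = [1.0] * 10 + [3.0] * 5
print(adhoc_threshold_vad(drifting))
```

The second call shows the weakness the text describes: a fixed per-utterance threshold cannot follow noise that diverges from the initial estimate.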
A need therefore exists for an improved voice activity detector and validator for noisy environments wherein the abovementioned disadvantages may be alleviated.
Statement of Invention
In accordance with a first aspect of the present invention, there is provided a communication unit, as claimed in claim 1.
In accordance with a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.
In accordance with a third aspect of the present invention, there is provided a method of deciding whether a signal input to a communication unit is speech or noise, as claimed in claim 14.
Further aspects of the present invention are as claimed in the claims dependent therefrom.
In summary, the present invention aims to address the case of arbitrary amplitude, non-stationary noise, by the use of an energy acceleration measurement in preference to an energy amplitude measurement to denote the presence, or absence, of speech.
Brief Description of the Drawings
Exemplary embodiments of the present invention will now be described, with reference to the accompanying drawings, in which:
FIG. 1 illustrates a block diagram of a communication unit adapted to perform the voice activity detection and validation of the preferred embodiment of the present invention;
FIG. 2 illustrates a flowchart of an energy acceleration based voice activity detector for noisy environments in accordance with a preferred embodiment of the present invention;
FIG. 3 illustrates a flowchart of an energy acceleration based voice activity validation for noisy environments in accordance with a preferred embodiment of the present invention; and
FIG. 4 illustrates a buffer operation in accordance with a preferred embodiment of the present invention.
Description of Preferred Embodiments
Voiced speech has a comparatively high energy acceleration value, as its onset is dependent upon the activation of the vocal cords, which are either vibrating or still. Similarly, unvoiced onsets (e.g. plosives) also have high energy accelerations.
The inventors have recognised that, in a representational domain emphasising voicing such as a narrowband power spectrum or the Mel-spectrum, the resultant energy acceleration of speech is significantly higher than that of non-stationary noise. The only significant exceptions are impulsive noises (e.g. a hand clap).
Hence, in accordance with the preferred embodiments of the present invention, the inventors have appreciated that one can additionally discriminate against these noises by concentrating on energy in the frequency region that is likely to contain a fundamental pitch of the voice signal.
In particular, the inventors of the present invention propose to use an unstructured characteristic of speech, namely energy acceleration (or acceleration of some metric reflecting the speech energy or components thereof).
In particular, a preferred application for the inventive concepts described herein is the distributed speech recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.
Referring now to FIG. 1, a block diagram of an audio subscriber unit 100, adapted to support the inventive concepts of the preferred embodiments of the present invention, is shown.
The preferred embodiment of the present invention is described with respect to a wireless audio communication unit, for example one capable of operating in the 3rd generation partnership project (3GPP) standard for future cellular wireless communication systems and offering DSR capabilities. However, it is within the contemplation of the invention that the inventive concepts herein described, relating to voice activity detection and validation thereof, are equally applicable to any electronic device that responds to voice signals, and which may benefit from improved voice activity detection circuitry.
As known in the art, the audio subscriber unit 100 contains an antenna 102 preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between receive and transmit chains within the audio subscriber unit 100.
The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or base-band frequency conversion). The front-end circuit 106 is serially coupled to a signal processing function 108, generally realised by a digital signal processor (DSP). The signal processing function 108 performs signal demodulation, error correction and formatting. Recovered data from the signal processing function 108 is serially coupled to an audio processing function 109, which formats the received signal in a suitable manner to send to an audio enunciator/display 111.
In different embodiments of the invention, the signal processing function 108 and audio processing function 109 may be provided within the same physical device. A controller 114 is configured to control the information flow and operational state of the elements of the subscriber unit 100.
As regards the transmit chain, this essentially includes an audio input device 120 coupled in series through the audio processing function 109, signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, transmitter/modulation circuitry 122 and the power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104, and antenna 102 to radiate the final radio frequency signal.
In particular, audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130 operably coupled to a voice activity decision function 135. In accordance with the preferred embodiments of the present invention, the VAD function 130 and voice activity decision function 135 have been adapted to provide an improved voice detection and decision mechanism, the operation of which is further described with respect to FIG. 2 and FIG. 3. Notably, the voice activity detector function 130 includes a frame-by-frame detection stage consisting of three measurements. The three measurements are:
(i) Whole spectrum;
(ii) Spectral sub-bands; and
(iii) Spectral variance.
Subsequently, the voice activity decision function 135 performs a decision based on a buffer of measurements, which are analysed for their speech likelihood. The final decision from the decision stage is applied retrospectively to the earliest frame in the buffer.
In the preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of FIG. 2 and FIG. 3.
The signal processor function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as distinct, operably coupled, processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a yet further alternative embodiment, the aforementioned functions may be implemented as a mixture of hardware, software or firmware elements, using application specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).
Of course, the various components within the audio subscriber unit 100 can be realised in discrete or integrated component form, with an ultimate structure therefore being merely an arbitrary selection.
To this end, there are several methods by which to achieve an energy acceleration measure for use in the preferred embodiment of the present invention.
(i) The theoretically ideal method is to literally double-differentiate the energy level over successive frames of the utterance. The disadvantage of this approach is that it is likely to introduce delays, as one needs to analyse a number of frames on each side of the frame under analysis.
(ii) A zero-delay estimate of the energy acceleration can be obtained from the ratio of an instantaneous value to a short-term average, for example using a frame average:
[equation 1 is not reproduced in this text]
or using a rolling average:
[equation 2 is not reproduced in this text]
In each case the method returns values that can be interpreted as 'deceleration' < 1.0 < 'acceleration'. One can then find empirical values, including a denominator length, that best differentiate speech from noise.
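Method (i) above, literal double differentiation, can be sketched as a central second difference over per-frame energies. This is an illustrative sketch under stated assumptions: the function name `second_difference` and the three-frame stencil are choices made for the example, and the text does not specify how many frames on each side are used.

```python
def second_difference(frame_energies):
    """Central second difference of per-frame energies.

    The value for frame t needs frames t-1 and t+1, so the result for a
    frame is only available one frame late. This is the delay that the
    text identifies as the drawback of method (i).
    """
    return [
        frame_energies[t + 1] - 2.0 * frame_energies[t] + frame_energies[t - 1]
        for t in range(1, len(frame_energies) - 1)
    ]


energies = [1.0, 1.0, 1.0, 4.0, 9.0, 9.0]
print(second_difference(energies))
```

Positive values mark frames where the energy growth is speeding up, and negative values mark frames where it is slowing down.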
The inventors of the present invention have recognised that a preferred solution is to find a denominator that can track non-stationary noise quickly, but which is too long to track voice onset. A suggested value sequence for the rolling average is a = 0.2, b = 0.8a, c = 0.8b, etc., which can be simply expressed as the recursion:
$d_t = 0.2\,x_t + 0.8\,d_{t-1}$ [3]
Then:
$A = x_t / d_t$ [4]
The preferred VAD and parameter initialisation systems within the detection stage are summarised in the flowchart of FIG. 2. In non-stationary noise, long-term energy thresholds are not a reliable indicator of speech. Similarly, in high noise conditions the structure of the speech (e.g. harmonics) cannot be wholly relied upon as an indicator, as it may be corrupted by noise, or structured noises may confuse the detector. Hence, the preferred voice activity detector uses a noise-robust characteristic of the speech, namely the energy acceleration associated with voice onset.
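The recursion [3] and the ratio [4] can be sketched in a few lines of Python. This is a minimal sketch, assuming strictly positive inputs; the function name `acceleration_ratio` and the choice to seed the rolling average with the first input are assumptions, since the text does not say how $d_t$ is initialised.

```python
def acceleration_ratio(inputs, rate=0.2):
    """Return A_t = x_t / d_t, with d_t = rate * x_t + (1 - rate) * d_{t-1}.

    Inputs are assumed to be strictly positive energies. Seeding d with
    the first input is an assumption made for this sketch.
    """
    rolling = None
    ratios = []
    for x in inputs:
        if rolling is None:
            rolling = x  # assumed initialisation
        else:
            rolling = rate * x + (1.0 - rate) * rolling
        ratios.append(x / rolling)
    return ratios


# Steady noise, a sudden onset, then a return to the noise level.
print(acceleration_ratio([1.0, 1.0, 1.0, 5.0, 5.0, 1.0]))
```

Values above 1.0 indicate that the input is rising faster than the rolling average can follow, and values below 1.0 indicate the opposite. No look-ahead is required, so the estimate adds no frame delay.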
Referring now to FIG. 2, a flowchart 200 of the preferred detection process is illustrated. As indicated above, the process includes a frame-by-frame analysis. The preferred VAD mechanism relates to a 'whole spectrum' measurement process. A frame counter is initially assessed to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example of a preferred embodiment, 'N' is set to 15, assuming each frame increments by, say, 10 ms. If the frame counter is less than 'N' in step 205, then the rolling average for an initial acceleration test is updated, as in step 210. If the frame counter is not less than 'N' in step 205, then step 210 is skipped.
A determination is then made to assess whether the energy acceleration measurement is within one or more specified margin(s), as shown in step 235. If the energy acceleration measurement is within one or more specified margin(s) in step 235, then the rolling average is updated with the results of a further energy acceleration test, as in step 240. If the energy acceleration measurement is not within one or more specified margin(s) in step 235, then step 240 is skipped.
A determination is then made to assess whether the energy acceleration measurement is greater than a specified threshold, as shown in step 260. If the energy acceleration measurement is greater than a specified threshold in step 260, then the frame is assumed to be a speech frame, as in step 265. If the energy acceleration measurement is not greater than a specified threshold in step 260, then the frame is assumed to be a noise frame, as in step 270.
The frame counter is then incremented, as in step 275, and the process repeats from step 205.
As an improvement to this process, instead of, or in addition to, the whole spectrum measurement process, a sub-region measurement process shown in optional steps 215 and 245 may be performed. A particular sub-region of the spectrum is selected as that sub-region most likely to contain the fundamental pitch.
In the sub-region process, once the rolling average for the initial acceleration test is updated in step 210 in a whole spectrum measurement, a determination is made to check whether the energy acceleration measurement is greater than the threshold value, as shown in step 220.
If the energy acceleration measurement is greater than the threshold value in step 220, the process of initialising other parameters is suspended, as shown in step 225. If the energy acceleration measurement is not greater than the threshold value in step 220, the initialisation of other parameters is updated, as in step 230. The process then returns to step 235 as shown.
A further preferred determination is made after the determination to assess whether the energy acceleration measurement is within one or more specified margin(s) in step 235. The deceleration value is assessed to determine if it is 'high' in step 250 and, if so, the rolling average for the energy acceleration test is slowly updated, as shown in step 255. The process then returns to the whole spectrum method in step 260.
In this manner, the generally higher signal-to-noise ratios (SNRs) of the sub-region detector make it highly noise-robust. However, it is vulnerable to adverse microphone and speaker changes as well as band-limited noise. Thus, its measurements should not be relied upon in all circumstances. Consequently, the preferred embodiment of the present invention incorporates the sub-region detector in order to augment the whole spectrum measurement.
A further measurement process is preferably performed using the 'acceleration' of the variance of values within, for example, the lower half of the spectrum of each frame. The variance measure detects structure within the lower half of the spectrum, making it highly sensitive to voiced speech. The variance measurement follows the approach of the sub-region process, with the lower half of the spectrum being the particular sub-region selected. This variance measurement further complements the whole spectrum measurement approach, which is better able to detect unvoiced and plosive speech.
All three measurements take their raw input from the spectral representation of the filter gains generated by the first stage of a double Wiener filter, as described in US Patent Application no. US 09/427497, applicant Motorola Inc., inventor Yan-Ming Chen. As described above, each measurement uses a different aspect of this data.
In particular, the whole-spectrum detector uses the known Mel-filtered spectral representation of the filter gains generated by the first stage of the double Wiener filter.
A single input value is obtained by squaring the sum of the Mel filter banks.
The whole-spectrum detector, in the preferred embodiment of the invention, applies the following process to all frames:
Step one initialises the noise estimate Tracker in the following manner:
If Frame < 15 AND Acceleration < 2.5, then Tracker = MAX(Tracker, Input).
The energy acceleration measure prevents the Tracker being updated if speech occurs within the lead-in time of 15 frames.
Step two updates the Tracker value if the current input is similar to the noise estimate, in the following manner:
If Input < Tracker*UpperBound and Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input.
Step three provides a failsafe mechanism for those instances where there is speech or an uncharacteristically large noise content within the first few frames. This causes the resulting erroneously high noise estimate to decay. Step three preferably functions in the following manner:
If Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input.
Step four returns a 'true' speech determination if the current input is more than 165% of the Tracker, in the following manner:
If Input > Tracker*Threshold, then output TRUE else output FALSE.
The ratio of the instantaneous input to the short-term mean Tracker is a function of the energy acceleration of successive inputs.
Where, in the above:
a = 0.8 and b = 0.97;
UpperBound is 150% and LowerBound 75%;
Floor is 50%; and
Threshold is 165%.
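The four steps and the parameter values above can be sketched as a small Python class. This is a sketch under stated assumptions rather than a definitive implementation: the class name `WholeSpectrumDetector`, the zero-based frame counter and the initial Tracker value of 0.0 are not given in the text, and the energy acceleration is passed in by the caller because its exact computation is described separately.

```python
class WholeSpectrumDetector:
    """Frame-by-frame noise tracker following steps one to four above."""

    A, B = 0.8, 0.97                    # update rates a and b
    UPPER, LOWER, FLOOR = 1.50, 0.75, 0.50
    THRESHOLD = 1.65
    LEAD_IN, MAX_ACCEL = 15, 2.5

    def __init__(self):
        self.frame = 0       # assumed to count from zero
        self.tracker = 0.0   # assumed initial noise estimate

    def step(self, value, acceleration):
        # Step one: initialise the noise estimate during the lead-in,
        # unless the acceleration suggests that speech is already present.
        if self.frame < self.LEAD_IN and acceleration < self.MAX_ACCEL:
            self.tracker = max(self.tracker, value)
        # Step two: follow inputs that resemble the current noise estimate.
        if self.tracker * self.LOWER < value < self.tracker * self.UPPER:
            self.tracker = self.A * self.tracker + (1.0 - self.A) * value
        # Step three: failsafe decay of an erroneously high estimate.
        if value < self.tracker * self.FLOOR:
            self.tracker = self.B * self.tracker + (1.0 - self.B) * value
        # Step four: speech if the input clearly exceeds the estimate.
        is_speech = value > self.tracker * self.THRESHOLD
        self.frame += 1
        return is_speech


detector = WholeSpectrumDetector()
noise_flags = [detector.step(1.0, 1.0) for _ in range(20)]
onset_flag = detector.step(3.0, 3.0)
print(any(noise_flags), onset_flag)
```

In this usage, twenty frames of steady input settle the Tracker near 1.0, and the later jump to 3.0 exceeds 165% of the Tracker without changing the noise estimate.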
Notably, there is no update if the input is greater than Tracker*UpperBound, or lies between Tracker*Floor and Tracker*LowerBound. Furthermore, the energy acceleration input, as indicated above, can either be calculated by double-differentiation of successive inputs, or be estimated by tracking the ratio of two rolling averages of the inputs.
Notably, the ratio of fast and slow-adapting rolling averages reflects the energy acceleration of successive inputs.
By way of example, the contribution rates for the averages used above were: (i) 0*mean + 1*input, and (ii) ((Frame-1)*mean + 1*input)/Frame, making the energy acceleration measure increasingly sensitive over the first fifteen frames.
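One possible reading of the two contribution rates above is sketched below: average (i) is simply the current input, and average (ii) is the cumulative mean of all inputs so far. The function name `lead_in_acceleration`, the frame counter starting at 1 and the choice of fast over slow as the ratio are assumptions made for this sketch, and inputs are assumed to be strictly positive.

```python
def lead_in_acceleration(inputs):
    """Ratio of a fast average (i) to a slow average (ii) for each frame.

    (i)  fast = 0 * mean + 1 * input
    (ii) slow = ((frame - 1) * mean + 1 * input) / frame
    The frame counter is assumed to start at 1.
    """
    fast_mean = 0.0
    slow_mean = 0.0
    ratios = []
    for frame, x in enumerate(inputs, start=1):
        fast_mean = 0.0 * fast_mean + 1.0 * x
        slow_mean = ((frame - 1) * slow_mean + 1.0 * x) / frame
        ratios.append(fast_mean / slow_mean)
    return ratios


print(lead_in_acceleration([2.0, 2.0, 2.0, 8.0]))
```

Because the slow average weights each new input by 1/Frame, a given jump in the input moves it less and less as frames accumulate, so the ratio responds more strongly to later onsets.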
The sub-band detector preferably uses the average of the second, third and fourth Mel-filter banks derived for the 'whole spectrum' measurement. The detector then applies the following process to all frames:
(i) Input = p*CurrentInput + (1-p)*PreviousInput;
(ii) If Frame < 15, then Tracker = MAX(Tracker, Input);
(iii) If Input < Tracker*UpperBound and Input > Tracker*LowerBound, then Tracker = a*Tracker + (1-a)*Input;
(iv) If Input < Tracker*Floor, then Tracker = b*Tracker + (1-b)*Input; and
(v) If Input > Tracker*Threshold, then output TRUE else output FALSE.
Where, in the sub-region measurement: p = 0.75.
All other parameters are the same as for the whole spectrum measurement, except Threshold, which equals 3.25.
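The sub-band steps (i) to (v) can be sketched in the same style. Again this is a hedged sketch: the class name `SubBandDetector`, the zero-based frame counter, the initial Tracker of 0.0, and the reading of PreviousInput as the previous unsmoothed band average (with the first frame smoothed against itself) are assumptions not settled by the text.

```python
class SubBandDetector:
    """Sub-band variant of the tracker, with input smoothing."""

    P, A, B = 0.75, 0.8, 0.97
    UPPER, LOWER, FLOOR = 1.50, 0.75, 0.50
    THRESHOLD = 3.25
    LEAD_IN = 15

    def __init__(self):
        self.frame = 0         # assumed to count from zero
        self.tracker = 0.0     # assumed initial noise estimate
        self.previous = None   # previous unsmoothed band average

    def step(self, mel_bands):
        # Average of the second, third and fourth Mel filter banks.
        current = sum(mel_bands[1:4]) / 3.0
        previous = current if self.previous is None else self.previous
        # (i) Smooth the input over two frames.
        value = self.P * current + (1.0 - self.P) * previous
        self.previous = current
        # (ii) Lead-in initialisation (no acceleration gate in this variant).
        if self.frame < self.LEAD_IN:
            self.tracker = max(self.tracker, value)
        # (iii) Follow inputs that resemble the noise estimate.
        if self.tracker * self.LOWER < value < self.tracker * self.UPPER:
            self.tracker = self.A * self.tracker + (1.0 - self.A) * value
        # (iv) Failsafe decay of an over-high estimate.
        if value < self.tracker * self.FLOOR:
            self.tracker = self.B * self.tracker + (1.0 - self.B) * value
        self.frame += 1
        # (v) Speech decision.
        return value > self.tracker * self.THRESHOLD


detector = SubBandDetector()
quiet = [9.0, 1.0, 1.0, 1.0, 9.0]
loud = [9.0, 6.0, 6.0, 6.0, 9.0]
flags = [detector.step(quiet) for _ in range(20)]
print(any(flags), detector.step(loud))
```

Only the three selected bands influence the decision, so the large values in the first and last bands of the example are ignored.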
For the spectral variance measurement, the variance of the values comprising the lower frequency half of the narrowband spectral representation of the gain for each frame is used as an input. The detector then applies exactly the same process as for the whole spectrum measurement.
The variance is calculated as:
[equation is not reproduced in this text]
where N = FFT Length/4, and $w_i$ are the values of the narrowband spectral representation of the gain.
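As an illustration of the variance input, the sketch below computes a population variance over the first N = FFT Length/4 values of a gain spectrum. The exact formula is not reproduced in this text, so the population form (division by N) is an assumption, as are the function name `lower_half_variance` and the layout of the spectrum as a list ordered from low to high frequency.

```python
def lower_half_variance(gain_spectrum, fft_length):
    """Population variance of the lowest N = fft_length / 4 gain values.

    Assumes `gain_spectrum` is ordered from low to high frequency and
    holds at least N values. Division by N (rather than N - 1) is an
    assumption made for this sketch.
    """
    n = fft_length // 4
    w = gain_spectrum[:n]
    mean = sum(w) / n
    return sum((value - mean) ** 2 for value in w) / n


# A structured (alternating) lower half gives a non-zero variance,
# whereas a flat lower half gives zero.
print(lower_half_variance([1.0, 3.0, 1.0, 3.0, 9.0, 9.0, 9.0, 9.0], 16))
print(lower_half_variance([2.0, 2.0, 2.0, 2.0, 9.0, 0.0, 9.0, 0.0], 16))
```

The resulting value would then be fed to the same tracking process as the whole spectrum measurement, in place of the whole-spectrum input.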
In accordance with the preferred embodiment of the present invention, the three measures detailed above are presented to a VAD decision algorithm, as shown in the flowchart of FIG. 3. Successive inputs are presented to a buffer, which provides contextual analysis. This introduces a frame delay equal to the length of the buffer minus one frame.
Referring now to FIG. 3, a flowchart 300 of an acceleration-based voice activity validation process for noisy environments is illustrated, in accordance with a preferred embodiment of the present invention.
For an N=7 frame buffer, the most recent true/false speech input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number, and preferably each, of the following steps: Step 1:
VN = Measure 1 OR Measure 2 OR Measure 3; Input VN is defined as'true' (T) if any of the three measurements returns a true speech indication.
Step 2:
$$M = \max_{0 \le i < N} C, \qquad C \leftarrow \begin{cases} C + 1, & V_i = \text{TRUE} \\ 0, & V_i = \text{FALSE} \end{cases} \qquad [6]$$
The algorithm searches for the longest contiguous sequence of 'true' values in the buffer, as in step 310. Hence, for example, for the sequence 'T T F T T T F', M would equal 3.
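The search for the longest contiguous run can be sketched in a few lines. The function name `longest_true_run` is illustrative only, and the buffer is assumed to be a sequence of booleans.

```python
from itertools import groupby


def longest_true_run(buffer):
    """Return M, the length of the longest contiguous run of True values."""
    # groupby splits the buffer into runs of equal consecutive values;
    # only the runs whose value is True are measured.
    return max(
        (sum(1 for _ in run) for value, run in groupby(buffer) if value),
        default=0,
    )


# The sequence 'T T F T T T F' from the example above.
m = longest_true_run([True, True, False, True, True, True, False])
```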
Step 3:
If M >= SP AND T < LS, T = LS; Where SP equates to the first threshold in step 315. If the longest sequence of true (T) speech values is equal to or exceeds the first threshold in step 315, i.e. SP=3 or more contiguous 'true' values, the buffer is judged to contain 'possible' speech. A short timer T of, say, LS=5 frames (Time_1) is activated, in step 325, if it is not already present (or exceeded) from the determination in step 320.
Step 4:
If M >= SL AND F > FS, T = LM else T = LL; Where SL equates to the second threshold in step 330. If there are SL=4 or more contiguous 'true' values, the buffer is again judged to contain 'likely' speech. A medium timer T of, say, LM=22 frames is activated in step 340 if the current frame F is outside an initial lead-in safety period FS, as determined in step 335. Otherwise, a failsafe long timer T of, say, LL=40 frames is used in step 345. Such an arrangement is used as the early presence of speech in the utterance may cause the initial noise estimate of the VAD to be too high.
Step 5:
If M < SP AND T > 0, T--;
If the process determines that there are fewer than SP=3 contiguous 'true' values, in step 350, and the timer is greater than zero, in step 355, then the timer is decremented in step 360.
Step 6:
If T > 0 output TRUE else output FALSE; If the timer is greater than zero, in step 365, the process outputs a 'true' speech decision, as shown in step 370. Alternatively, if the timer is not greater than zero, in step 365, the process outputs a 'noise' decision, as shown in step 375.
Step 7:
Frame++, Shift buffer left and return to step 1.
In preparation for the next frame in step 380, the buffer is left-shifted to accommodate the next input, as shown with respect to FIG. 4. The output speech decision is applied to the frame being ejected from the buffer. The process then repeats again, at step 305, for the next true/false input to the data buffer.
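Steps 1 to 7 can be drawn together in a short sketch. It is one possible reading of the flowchart rather than a definitive implementation: the thresholds of 3 and 4 frames and the timers of 5, 22 and 40 frames follow the "say" values quoted in the steps, the lead-in safety period `fs=35` is a hypothetical placeholder (no value is given in the text), the buffer is assumed to start filled with 'false' entries, and step 4 is read as in the prose (the long timer applies only when 'likely' speech is found inside the lead-in period). The class and parameter names are illustrative.

```python
from collections import deque


class VadDecision:
    """Sketch of the buffered, timer-based speech/noise decision."""

    def __init__(self, n=7, sp=3, sl=4, ls=5, lm=22, ll=40, fs=35):
        # fs is a hypothetical placeholder; the other defaults follow the text.
        self.buffer = deque([False] * n, maxlen=n)  # assumed to start all False
        self.sp, self.sl = sp, sl
        self.ls, self.lm, self.ll = ls, lm, ll
        self.fs = fs
        self.timer = 0
        self.frame = 0

    def step(self, measure1, measure2, measure3):
        # Step 1: OR the three measurements and store the result at position N.
        # Appending to a full deque drops the oldest entry, which plays the
        # role of the left shift in step 7.
        self.buffer.append(measure1 or measure2 or measure3)
        # Step 2: longest contiguous run of True values (M).
        longest = count = 0
        for value in self.buffer:
            count = count + 1 if value else 0
            longest = max(longest, count)
        # Step 3: 'possible' speech starts the short timer if none is running.
        if longest >= self.sp and self.timer < self.ls:
            self.timer = self.ls
        # Step 4: 'likely' speech starts the medium timer, or the long
        # failsafe timer while still inside the lead-in safety period.
        if longest >= self.sl:
            self.timer = self.lm if self.frame > self.fs else self.ll
        # Step 5: no 'possible' speech in the buffer, so the hangover runs down.
        if longest < self.sp and self.timer > 0:
            self.timer -= 1
        # Step 6: the decision, which applies to the oldest frame in the buffer.
        decision = self.timer > 0
        # Step 7: advance the frame counter.
        self.frame += 1
        return decision


vad = VadDecision()
inputs = [False] * 5 + [True] * 3 + [False] * 22
decisions = [vad.step(v, False, False) for v in inputs]
```

Each returned decision belongs to the frame about to leave the buffer, so with two contiguous 'true' inputs the output stays false and the third 'true' input is the first to produce a 'true' decision.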
It is within the contemplation of the invention that alternative mechanisms for making a speech or noise decision, based on the energy acceleration process described above, can be implemented.
For example, the decision mechanism may not be based on one or more timer(s), and may make a decision purely on whether one or more energy acceleration thresholds are exceeded.
Referring now to FIG. 4, an example of the buffer operation 400 according to the preferred embodiment of the present invention is shown in greater detail. Let us assume that the first threshold is set for three contiguous 'true' values. At a time 't' 410, let us assume that only the current input (frame #7) 425 and previous input (frame #6) 420 were 'True'. Consequently, when the buffer is shifted, the first frame (frame #1) 415 will be marked as False.
At a time 't+1' 430, a third 'true' input (frame #8) 450 has been received, to supplement the earlier two 'true' inputs 440, 445. Consequently, when the buffer is shifted, the next output frame (frame #2) 435 will be marked as 'True'.
It should be noted that in the above decision process, the only constraints are: (i) Time_1 < Time_2 < Time_3, and (ii) Threshold_1 < Threshold_2.
Assuming that only these three inputs (frame #6, frame #7 and frame #8) are 'True', the full output sequence will be:
Frame | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Output | F | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | F | F | F | F | F |
Where frames #2-#5 indicate 'true' due to the buffer lead-in function. Frames #6-#8 indicate 'true' as the positions of the actual original 'true' speech inputs.
Frames #9-#12 indicate 'true' due to the buffer lead-out function. Frames #13-#18 indicate 'true' in response to the timer hangover that is used. Once all frames in the utterance have been input, the buffer shifts 'false' entries (frames #19-#LM) until empty.
It is within the contemplation of the invention that the buffer length and hangover timers can be adjusted dynamically to suit the audio communication unit's needs.
As such, the preferred embodiment's buffer length 'N' of 8 and hangover timer of five frames are used for explanatory purposes only. However, it should be noted that the buffer length 'N' should always be chosen such that N >= SL.
In addition to its use as a VAD in its own right, it is within the contemplation of the invention that the energy acceleration measure performed in the method steps of FIG. 2 can be used to validate the initialisation of other parameters. For example, a spectral subtraction scheme requires an initial estimate of the noise, based on the first ten frames (typically 100 msec) of speech. Even in stationary noise, several events may occur to invalidate the initial estimate. Examples of such events include:
(a) Ramp-up of the signal:
Due to a variety of possible causes, the very start of a recording may 'ramp-up' to full volume within the period under assessment. The reasons behind such a ramp-up include: buffer-fills in digital systems, and capacitance or tape-head engagement in analogue systems.
Such events will invalidate the estimate. Hence, an energy acceleration measure may be used to detect such a ramp-up and so prevent the error.
(b) Spikes in the initial signal: A common 'spike' occurs with a full deployment of the press-to-talk (PTT) button on a subscriber radio unit, where the electrical contact fractionally precedes the button hitting the back of the switch. An energy acceleration measure, as described above, can be used to suspend the estimation process, as shown in step 225 of FIG. 2, when such events occur.
(c) Speech in the initial signal: Another common occurrence, with PTT systems in particular, is that the user starts speaking as soon as they depress the PTT button. In this manner, electrical contact is made after the speech has commenced. The energy acceleration measure can identify this and so suspend noise-based initialisations, as shown in step 225 of FIG. 2, or force the use of default estimates.
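One way to picture this validation is the sketch below, which builds an initial noise estimate from the first ten frames while skipping any frame flagged by the energy acceleration measure, and falls back to a default estimate when every frame is flagged. The function name, the use of a simple mean, and the default value are all assumptions made for illustration; the text does not specify how the estimate is formed.

```python
def initial_noise_estimate(frame_energies, event_flags, init_frames=10,
                           default_estimate=1.0):
    """Average the unflagged frames among the first `init_frames` frames.

    event_flags[i] is True where the energy acceleration measure detected a
    ramp-up, spike or speech onset, so estimation is suspended for frame i.
    default_estimate is a hypothetical fallback value.
    """
    usable = [
        energy
        for energy, flagged in zip(frame_energies[:init_frames],
                                   event_flags[:init_frames])
        if not flagged
    ]
    if not usable:
        # Every initial frame was disturbed: force the default estimate.
        return default_estimate
    return sum(usable) / len(usable)


# A spike on the tenth frame is excluded from the estimate.
estimate = initial_noise_estimate([2.0] * 9 + [50.0], [False] * 9 + [True])
```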
It is also within the contemplation of the invention that alternative measurements may be used to determine the energy acceleration. Furthermore, although the preferred expression used throughout the description is energy acceleration, the inventive concepts are believed to be applicable to any matter or element that corresponds to, or reflects, or is representative of, energy changes. In addition, the aforementioned measurements are provided as preferred examples only, in determining the energy change in an input signal.
In summary, a communication unit has been described that includes an audio processing unit having a voice activity detection mechanism. The voice activity detection mechanism measures energy acceleration of a signal input to the communication unit and determines whether said input signal is speech or noise based on said measurement.
In addition, a method of detecting a speech signal input to a communication unit has been described. The method includes the steps of measuring an acceleration or change in energy of an input signal to the communication unit; and determining whether said input signal is speech or noise based on said step of measuring.
Furthermore, a method of deciding whether a signal input to a communication unit is speech or noise, has been described. The method includes the step of deciding whether said input signal is speech or noise based on an energy acceleration or change in energy measurement of said input signal, for example using a frame average or a rolling average of a number of input signals.
Hence, it will be understood that the energy acceleration based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. As the preferred embodiment uses an energy acceleration measure, instead of an absolute measurement, the inventive concepts herein described can be applied to speech of any input level.
Whilst the specific and preferred implementations of the embodiments of the present invention are described above, it is clear that one skilled in the art could readily apply variations and modifications of such inventive concepts that would fall within the scope of the present invention.
Thus, an improved voice activity detector and validator for noisy environments have been described wherein the aforementioned disadvantages associated with prior art arrangements have been substantially alleviated.
Claims (17)
- 1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130,135), the communication unit (100) characterised in that the voice activity detection mechanism (130,135) measures energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement.
- 2. The communication unit (100) according to Claim 1, wherein the voice activity detection mechanism includes a voice activity detector function (130) that performs frame-by-frame detection of voice for signals input to the voice activity detection mechanism (130,135).
- 3. The communication unit (100) according to Claim 2, wherein said frame-by-frame detection consists of performing an energy acceleration measurement on a signal input to the voice activity detection mechanism (130,135) for one or more of the following frequency ranges: (i) A whole spectrum; (ii) Spectral sub-bands; and (iii) Spectral variance.
- 4. The communication unit (100) according to Claim 3, wherein the voice activity detection mechanism includes a voice activity decision function (135) operably coupled to the voice activity detector function (130) to decide whether said input signal is speech based on a buffering operation of one or more of said measurements.
- 5. The communication unit (100) according to Claim 4, wherein the voice activity decision function (135) decides whether an input signal is speech using a frame average or a rolling average of a number of said input signals.
- 6. The communication unit (100) according to any of preceding Claims 2 to 5, wherein if the energy acceleration measurement yields an energy acceleration value that is greater than an energy acceleration threshold then an input frame is assumed to be a speech frame (265).
- 7. The communication unit (100) according to Claim 6, wherein a decision that an input frame is a speech frame (265) is applied retrospectively to an earlier frame in a buffer of input signals.
- 8. The communication unit (100) according to Claim 6 or Claim 7, wherein if the energy acceleration measurement yields an energy acceleration value that is greater than an energy acceleration threshold over a number of contiguous frames, then an input signal is assumed to be a speech signal (370).
- 9. The communication unit (100) according to any of preceding Claims 3 to 8 when dependent upon Claim 3, wherein if a sub-region of an input signal spectrum is selected, the selection is based on that sub-region most likely to contain a fundamental pitch of a voice signal.
- 10. The communication unit (100) according to any preceding Claim, wherein the voice activity detection mechanism (130,135) uses acceleration of voice-energy related features to validate a parameter initialisation of other voice or noise related metrics, for example, a spectral subtraction scheme.
- 11. A method of detecting a speech signal input to a communication unit characterised by the steps of: measuring an acceleration or change in energy of an input signal to the communication unit; and determining (315,330, 350) whether said input signal is speech (370) or noise (375) based on said step of measuring.
- 12. The method of detecting a speech signal according to Claim 11, further characterised by the step of: performing frame-by-frame detection of voice for signals input to the communication unit.
- 13. The method of detecting a speech signal according to Claim 12, wherein said frame-by-frame detection includes the step of: performing an energy acceleration measurement on said input signal for one or more of the following frequency ranges: (i) A whole spectrum; (ii) Spectral sub-bands; and (iii) Spectral variance.
- 14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably according to any of preceding Claims 11 to 13, the method further characterised by the step of: deciding (315,330, 350) whether said input signal is speech (370) or noise (375) based on an energy acceleration or change in energy measurement of said input signal, for example using a frame average or a rolling average of a number of input signals.
- 15. The method of deciding whether a signal input to a communication unit is speech or noise according to Claim 14, wherein said step of deciding includes: determining, if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold, that an input frame is a speech frame (265); and applying said determination retrospectively to an earlier frame in a buffer of input signals.
- 16. A communication unit substantially as hereinbefore described with reference to, and/or as illustrated by, FIG. 1 of the accompanying drawings.
- 17. An energy acceleration based voice activity detection and/or validation method for noisy environments substantially as hereinbefore described with reference to, and/or as illustrated by, FIG. 2 or FIG. 3 or FIG. 4 of the accompanying drawings.
Priority Applications (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0201585A GB2384670B (en) | 2002-01-24 | 2002-01-24 | Voice activity detector and validator for noisy environments |
CNB038026821A CN1307613C (en) | 2002-01-24 | 2003-01-10 | Voice activity detector and validator for noisy environments |
PCT/EP2003/000271 WO2003063138A1 (en) | 2002-01-24 | 2003-01-10 | Voice activity detector and validator for noisy environments |
KR1020097022615A KR100976082B1 (en) | 2002-01-24 | 2003-01-10 | Voice activity detector and validator for noisy environments |
JP2003562919A JP2005516247A (en) | 2002-01-24 | 2003-01-10 | Voice activity detector and enabler for noisy environments |
KR10-2004-7011459A KR20040075959A (en) | 2002-01-24 | 2003-01-10 | Voice activity detector and validator for noisy environments |
FI20041013A FI124869B (en) | 2002-01-24 | 2004-07-22 | Voice activity detector and approver for noisy environments |
JP2009251650A JP2010061151A (en) | 2002-01-24 | 2009-11-02 | Voice activity detector and validator for noisy environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0201585A GB2384670B (en) | 2002-01-24 | 2002-01-24 | Voice activity detector and validator for noisy environments |
Publications (3)
Publication Number | Publication Date |
---|---|
GB0201585D0 GB0201585D0 (en) | 2002-03-13 |
GB2384670A true GB2384670A (en) | 2003-07-30 |
GB2384670B GB2384670B (en) | 2004-02-18 |
Family
ID=9929648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0201585A Expired - Lifetime GB2384670B (en) | 2002-01-24 | 2002-01-24 | Voice activity detector and validator for noisy environments |
Country Status (6)
Country | Link |
---|---|
JP (2) | JP2005516247A (en) |
KR (2) | KR100976082B1 (en) |
CN (1) | CN1307613C (en) |
FI (1) | FI124869B (en) |
GB (1) | GB2384670B (en) |
WO (1) | WO2003063138A1 (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100657912B1 (en) * | 2004-11-18 | 2006-12-14 | 삼성전자주식회사 | Noise reduction method and apparatus |
JP4758879B2 (en) * | 2006-12-14 | 2011-08-31 | 日本電信電話株式会社 | Temporary speech segment determination device, method, program and recording medium thereof, speech segment determination device, method |
GB2450886B (en) | 2007-07-10 | 2009-12-16 | Motorola Inc | Voice activity detector and a method of operation |
EP2359361B1 (en) * | 2008-10-30 | 2018-07-04 | Telefonaktiebolaget LM Ericsson (publ) | Telephony content signal discrimination |
CN102044241B (en) | 2009-10-15 | 2012-04-04 | 华为技术有限公司 | Method and device for tracking background noise in communication system |
JP5575977B2 (en) * | 2010-04-22 | 2014-08-20 | クゥアルコム・インコーポレイテッド | Voice activity detection |
US8898058B2 (en) | 2010-10-25 | 2014-11-25 | Qualcomm Incorporated | Systems, methods, and apparatus for voice activity detection |
KR101196518B1 (en) | 2011-04-05 | 2012-11-01 | 한국과학기술연구원 | Apparatus and method for detecting voice activity in real-time |
RU2544293C1 (en) * | 2013-10-11 | 2015-03-20 | Сергей Александрович Косарев | Method of measuring physical quantity using mobile electronic device and external unit |
US9953661B2 (en) * | 2014-09-26 | 2018-04-24 | Cirrus Logic Inc. | Neural network voice activity detection employing running range normalization |
CN104575498B (en) * | 2015-01-30 | 2018-08-17 | 深圳市云之讯网络技术有限公司 | Efficient voice recognition methods and system |
JP2016167678A (en) * | 2015-03-09 | 2016-09-15 | 株式会社リコー | Communication device, communication system, log data storage method, and program |
CN109841223B (en) * | 2019-03-06 | 2020-11-24 | 深圳大学 | Audio signal processing method, intelligent terminal and storage medium |
US11217262B2 (en) * | 2019-11-18 | 2022-01-04 | Google Llc | Adaptive energy limiting for transient noise suppression |
CN112820324B (en) * | 2020-12-31 | 2024-06-25 | 平安科技(深圳)有限公司 | Multi-label voice activity detection method, device and storage medium |
KR102453919B1 (en) | 2022-05-09 | 2022-10-12 | (주)피플리 | Method, device and system for verifying of guide soundtrack related to cultural content based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6039938A (en) * | 1983-07-14 | 1985-03-02 | ジー・テイー・イー・ラボラトリーズ・インコーポレイテツド | Complementary speech detector |
WO1999014737A1 (en) * | 1997-09-18 | 1999-03-25 | Matra Nortel Communications | Method for detecting speech activity |
US5946649A (en) * | 1997-04-16 | 1999-08-31 | Technology Research Association Of Medical Welfare Apparatus | Esophageal speech injection noise detection and rejection |
US6009391A (en) * | 1997-06-27 | 1999-12-28 | Advanced Micro Devices, Inc. | Line spectral frequencies and energy features in a robust signal recognition system |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2559475B2 (en) * | 1988-09-22 | 1996-12-04 | 積水化学工業株式会社 | Voice detection method |
JPH03114100A (en) * | 1989-09-28 | 1991-05-15 | Matsushita Electric Ind Co Ltd | Voice section detecting device |
JP3024447B2 (en) * | 1993-07-13 | 2000-03-21 | 日本電気株式会社 | Audio compression device |
JP3109978B2 (en) * | 1995-04-28 | 2000-11-20 | 松下電器産業株式会社 | Voice section detection device |
US5774849A (en) * | 1996-01-22 | 1998-06-30 | Rockwell International Corporation | Method and apparatus for generating frame voicing decisions of an incoming speech signal |
JPH10171497A (en) * | 1996-12-12 | 1998-06-26 | Oki Electric Ind Co Ltd | Background noise removing device |
JP3297346B2 (en) * | 1997-04-30 | 2002-07-02 | 沖電気工業株式会社 | Voice detection device |
JPH10327089A (en) * | 1997-05-23 | 1998-12-08 | Matsushita Electric Ind Co Ltd | Portable telephone set |
JPH113091A (en) * | 1997-06-13 | 1999-01-06 | Matsushita Electric Ind Co Ltd | Detection device of aural signal rise |
JP4221537B2 (en) * | 2000-06-02 | 2009-02-12 | 日本電気株式会社 | Voice detection method and apparatus and recording medium therefor |
-
2002
- 2002-01-24 GB GB0201585A patent/GB2384670B/en not_active Expired - Lifetime
-
2003
- 2003-01-10 JP JP2003562919A patent/JP2005516247A/en active Pending
- 2003-01-10 KR KR1020097022615A patent/KR100976082B1/en active IP Right Grant
- 2003-01-10 KR KR10-2004-7011459A patent/KR20040075959A/en not_active Application Discontinuation
- 2003-01-10 WO PCT/EP2003/000271 patent/WO2003063138A1/en active Application Filing
- 2003-01-10 CN CNB038026821A patent/CN1307613C/en not_active Expired - Lifetime
-
2004
- 2004-07-22 FI FI20041013A patent/FI124869B/en active IP Right Grant
-
2009
- 2009-11-02 JP JP2009251650A patent/JP2010061151A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6039938A (en) * | 1983-07-14 | 1985-03-02 | ジー・テイー・イー・ラボラトリーズ・インコーポレイテツド | Complementary speech detector |
US5946649A (en) * | 1997-04-16 | 1999-08-31 | Technology Research Association Of Medical Welfare Apparatus | Esophageal speech injection noise detection and rejection |
US6009391A (en) * | 1997-06-27 | 1999-12-28 | Advanced Micro Devices, Inc. | Line spectral frequencies and energy features in a robust signal recognition system |
WO1999014737A1 (en) * | 1997-09-18 | 1999-03-25 | Matra Nortel Communications | Method for detecting speech activity |
Also Published As
Publication number | Publication date |
---|---|
CN1307613C (en) | 2007-03-28 |
KR20040075959A (en) | 2004-08-30 |
WO2003063138A1 (en) | 2003-07-31 |
CN1623186A (en) | 2005-06-01 |
KR100976082B1 (en) | 2010-08-16 |
KR20090127182A (en) | 2009-12-09 |
FI124869B (en) | 2015-02-27 |
GB0201585D0 (en) | 2002-03-13 |
FI20041013A (en) | 2004-09-22 |
JP2005516247A (en) | 2005-06-02 |
GB2384670B (en) | 2004-02-18 |
JP2010061151A (en) | 2010-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP2010061151A (en) | Voice activity detector and validator for noisy environment | |
US7171357B2 (en) | Voice-activity detection using energy ratios and periodicity | |
US6810273B1 (en) | Noise suppression | |
KR100944252B1 (en) | Detection of voice activity in an audio signal | |
US9524735B2 (en) | Threshold adaptation in two-channel noise estimation and voice activity detection | |
US8977556B2 (en) | Voice detector and a method for suppressing sub-bands in a voice detector | |
JP3878482B2 (en) | Voice detection apparatus and voice detection method | |
US20080095384A1 (en) | Apparatus and method for detecting voice end point | |
KR20160079105A (en) | Voice recognition method, voice recognition device, and electronic device | |
US8924199B2 (en) | Voice correction device, voice correction method, and recording medium storing voice correction program | |
US8712768B2 (en) | System and method for enhanced artificial bandwidth expansion | |
WO1997022116A2 (en) | A noise suppressor and method for suppressing background noise in noisy speech, and a mobile station | |
EP3438979B1 (en) | Estimation of background noise in audio signals | |
KR100848798B1 (en) | Method for fast dynamic estimation of background noise | |
JPH05244105A (en) | Sound detection method/device | |
EP1751740B1 (en) | System and method for babble noise detection | |
EP2743923B1 (en) | Voice processing device, voice processing method | |
KR101336203B1 (en) | Apparatus and method for detecting voice activity in electronic device | |
WO2007040883A2 (en) | Voice activity detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) |
Free format text: REGISTERED BETWEEN 20110120 AND 20110126 |
|
732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) |
Free format text: REGISTERED BETWEEN 20170831 AND 20170906 |
|
PE20 | Patent expired after termination of 20 years |
Expiry date: 20220123 |