WO2007030190A1 - Voice activity detector and method of operation therein - Google Patents


Info

Publication number
WO2007030190A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
state
voice activity
speech
combiner
Application number
PCT/US2006/025672
Other languages
French (fr)
Inventor
Shani Stern
Izak Avayu
Uzi Cohen
Original Assignee
Motorola, Inc.
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Publication of WO2007030190A1 publication Critical patent/WO2007030190A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04J - MULTIPLEX COMMUNICATION
    • H04J3/00 - Time-division multiplex systems
    • H04J3/17 - Time-division multiplex systems in which the transmission channel allotted to a first user may be taken away and re-allotted to a second user if the first user becomes inactive, e.g. TASI
    • H04J3/175 - Speech activity or inactivity detectors


Abstract

A voice activity detector (200) including an analyser (100) for analysing a signal (I1) representing an audio signal to determine if speech is present in the audio signal, wherein the analyser includes a plurality of voice activity sensors (101-103), each operable to analyse the signal and to provide an output signal (O1-O3) indicating whether or not speech has been detected by the sensor, and a state combiner (201) operable to receive input signals (O1-O3) produced by the voice activity sensors and to apply a combining function to the input signals to produce a state signal (O7) indicating whether or not speech has been detected. Also described is a method of operation (400) of the voice activity detector.

Description

1 CM07651EI
TITLE: VOICE ACTIVITY DETECTOR AND METHOD OF OPERATION
THEREIN
FIELD OF THE INVENTION
The present invention relates to a voice activity detector and a method of operation therein.
BACKGROUND OF THE INVENTION
A VAD (voice activity detector) is a device that analyses an electronic signal representing an audio signal to determine whether or not speech is present. Usually, a VAD delivers an output signal that takes one of two possible values, respectively indicating that speech is present or not present. The value of the output signal will change with time according to whether or not speech is present in the analysed signal.
A VAD is often incorporated in a speech communication device such as a fixed or mobile telephone, a radio or a like device. Use of a VAD is an important enabling technology for a variety of speech based applications such as speech recognition, speech encoding and hands free telephony. The primary function of a VAD is to provide an ongoing indication of speech presence as well as to identify the beginning and end of a segment of speech. Devices such as automatic gain controllers employ a VAD to detect when they should operate in a speech present mode.
While VADs operate quite effectively in a relatively quiet environment, e.g. a conference room, they tend to be less accurate in noisy environments such as in a road vehicle and in consequence may generate detection errors. These errors include 'false alarms', which produce a signal indicating speech when none is present, and 'mis-detects', which do not produce a signal to indicate speech when speech is present in noise.
There are many known types of algorithms employed in VADs to detect speech. Each of the known algorithms has advantages and disadvantages. For example, some VADs may tend to produce false alarms and others may tend to produce mis-detects. Some VADs may tend to produce both false alarms and mis-detects in noisy environments.
SUMMARY OF THE INVENTION
According to the present invention in a first aspect there is provided a voice activity detector as defined in claim 1 of the accompanying claims.
According to the present invention in a second aspect there is provided a method of operation in a communication system, the method being as defined in claim 20 of the accompanying claims.
Further features of the present invention are as defined in the accompanying dependent claims and are disclosed in the embodiments of the invention to be described.
Embodiments of the present invention provide for a voice activity detector which has a plurality of voice activity sensors, preferably using different voice activity detection algorithms, each producing a separate speech detection result (speech present or not present). The results are combined using one or more logical combining functions to give an overall speech detection result.
Beneficially, by using multiple voice activity sensors and combining the results from the different sensors, an overall result can be obtained which is more accurate than by using a single VAD operating a particular speech detection algorithm.
Embodiments of the present invention will now be described by way of example with reference to the accompanying drawings, in which:
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block schematic diagram showing individual voice activity sensors employed in a VAD arrangement embodying the present invention.
FIG. 2 is a block schematic diagram of a combined VAD embodying the present invention which operates using input signals from the individual voice activity sensors shown in FIG. 1.
FIG. 3 is a graph showing four traces related to an analysed signal including speech plotted against time.
FIG. 4 is a flow diagram illustrating a procedure of operation of a particular form of state combiner which may be used in the combined VAD shown in FIG. 2.
DESCRIPTION OF EMBODIMENTS OF THE INVENTION
In the following description the expressions speech 'segments' and speech 'frames' are used. Speech segments are the parts of a signal where speech is present. In parts where no speech is present the signal may be considered to consist of noise segments. Typically, a speech segment lasts for one to two seconds. Frames are the units into which a signal is divided in order to determine whether speech is present in the part of the signal being analysed. Typically, frames are 20 msec to 30 msec long depending on the analysis procedure being used.

FIG. 1 is a block schematic diagram showing individual voice activity sensors employed in a VAD analyser 100 embodying the present invention. The analyser 100 may be included in a VAD arrangement and is used for analysing a signal representing an audio signal to determine if speech is present in the audio signal. An input electrical signal I1, representing an audio signal to be analysed, is applied in parallel to each of a VAS (voice activity sensor) 101, a VAS 102 and a VAS 103. Although three VASs are shown in this embodiment, any plural number n of VASs could be used in practice. The VASs 101, 102 and 103 are referred to herein as 'VASs' in order to distinguish them as individual units from an overall combined VAD (voice activity detector) incorporating them. The VAS 101 includes a VAS logic unit 111 and a delay unit 121. The VAS 102 includes a VAS logic unit 112 and a delay unit 122. The VAS 103 includes a VAS logic unit 113 and a delay unit 123. Each of the VAS logic units 111, 112 and 113 operates a different voice activity sensing algorithm to analyse frames of the input signal I1 to determine if speech is present or not. Examples of particular known algorithms which may be used in practice to demonstrate the benefits of this embodiment of the present invention are given later.
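The division of a signal into analysis frames described above can be sketched as follows. This is a minimal illustration only: the 8 kHz sample rate, the function name and the use of non-overlapping frames are assumptions made here, since the text specifies only a 20 msec to 30 msec frame length.

```python
def split_into_frames(samples, sample_rate=8000, frame_ms=20):
    """Split an audio signal into fixed-length, non-overlapping analysis
    frames (hypothetical helper; the patent gives only the 20-30 msec
    frame duration, not a sample rate or hop size)."""
    frame_len = sample_rate * frame_ms // 1000   # e.g. 160 samples at 8 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```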
Each of the VAS logic units 111, 112 and 113 may produce an output signal having one of two values, for example, a value 'S' (speech detected) or 'NS' (speech not detected). The value of each output signal can change with time between frames. These output signals can be used as outputs from the VASs 101 to 103. These output signals optionally can also be applied to the delay units 121 to 123 respectively. Thus, an output signal O1 is produced by the VAS logic unit 111 and can itself be used as an output of the VAS 101 and can also be applied to the delay unit 121. An output signal O2 is produced by the VAS logic unit 112 and can itself be used as an output of the VAS 102 and can also be applied to the delay unit 122. An output signal O3 is produced by the VAS logic unit 113 and can itself be used as an output of the VAS 103 and can also be applied to the delay unit 123.

Where the output signals O1 to O3 are respectively applied as input signals to the delay units 121 to 123, each of the delay units 121 to 123 checks each frame of the input signal applied to it to detect whether a change of value of the signal applied to it has occurred. Where such a change is detected, say in the value of the output signal O1 from VAS logic unit 111, the delay unit in question, e.g. the delay unit 121, can keep the state of the corresponding VAS, e.g. the VAS 101, in its present state for a pre-determined delay period consisting of an additional number of frames. This delay period is known in the art as a 'hangover'. Where the detected transition is from speech to noise a suitable delay is 7 or 8 frames. The purpose of applying this delay is to avoid clipping at the end of a speech frame. A shorter delay, e.g. 1 to 3 frames, is suitable where the transition detected is from noise to speech. The purpose of this delay is to avoid false alarms due to transient noise blips.
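The hangover behaviour of the delay units can be sketched as a small state machine. This is a sketch under assumptions: the text does not specify how the delay counter is implemented, only that a state change is held off for the stated number of frames; the class name and the default frame counts (8 for a speech-to-noise transition, 2 for noise-to-speech) are illustrative choices within the ranges the text gives.

```python
class DelayUnit:
    """Hold a VAS state through a 'hangover' period after a transition.

    A candidate new state is accepted only once it has persisted for
    the full delay; until then the previous state is reported.
    """
    def __init__(self, speech_to_noise_delay=8, noise_to_speech_delay=2):
        self.delays = {('S', 'NS'): speech_to_noise_delay,
                       ('NS', 'S'): noise_to_speech_delay}
        self.state = 'NS'       # current held state
        self.pending = None     # candidate new state, if any
        self.count = 0          # frames the candidate has persisted

    def update(self, value):
        """Process one frame's VAS output ('S' or 'NS'); return held state."""
        if value == self.state:
            # Candidate abandoned: the raw output reverted to the held state.
            self.pending, self.count = None, 0
            return self.state
        if value != self.pending:
            # New candidate transition starts its hangover count afresh.
            self.pending, self.count = value, 0
        self.count += 1
        if self.count >= self.delays[(self.state, self.pending)]:
            # Change persisted for the whole delay, so allow it.
            self.state, self.pending, self.count = self.pending, None, 0
        return self.state
```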
If the value of the output signal from the corresponding VAS logic unit is still determined to have changed correctly after the delay period applied by a given VAS, the change of state of the VAS is allowed by the delay unit 121. The current state of each of the VASs 101 to 103 determined in this way is indicated as an output state signal at an output of each of the delay units 121, 122 and 123. Thus, the delay unit 121 produces an output state signal O4, the delay unit 122 produces an output state signal O5 and the delay unit 123 produces an output state signal O6. Thus, the delay units 121-123 produce the same state output signals as their corresponding VAS logic units delayed by a given number of frames but with changes of state allowed to take place only if the change is determined to be correct after the given number of frames delay. The output state signals from the delay units 121, 122 and 123, which are respectively signals O4, O5 and O6, can have one of two values, for example, a state value 'S' (speech state) or 'NS' (not speech state).
FIG. 2 is a block schematic diagram of an arrangement 200 embodying the invention to combine the individual results indicated by the output signals O1, O2 and O3 obtained from the VASs 101-103 shown in FIG. 1. The arrangement 200 of FIG. 2 includes a VAD state combiner 201 which receives as input signals the output signals O1, O2 and O3 from the individual VASs 101-103 (logic units 111 to 113) shown in FIG. 1. The VAD state combiner 201 applies a combining algorithm to the signals O1 to O3 and produces as an output signal an overall state signal O7. The VAD state combiner 201 may alternatively receive, or may additionally receive as optional inputs, the output state signals O4 to O6 produced by the delay units 121 to 123.
The VAD state combiner 201 may also optionally receive one or more other input signals indicated in FIG. 2 by dashed lines. These optional input signals include an input signal I2, an input signal I3 and an input signal I4. The input signal I2 indicates a value of a signal to noise ratio of the analysed input signal I1 in a current frame as measured by a known signal to noise (S/N) estimator 202. The input signal I3 may be a signal which indicates a VAD type of the VAD state combiner 201. For example, the VAD state combiner 201 may operate an algorithm so as to provide a minimum number of false alarms or alternatively to provide a minimum number of mis-detects and therefore may be classified according to the algorithm type. The signal I4 is an input signal which is the output signal O7 produced by the VAD state combiner 201 for the previous frame analysed and processed.
The VAD state combiner 201 operates an algorithm to apply a combining function to the input signals O1, O2 and O3 (and additionally or alternatively the input signals O4, O5 and O6), optionally using one or more of the inputs I2 to I4, to produce an overall output which is the state signal O7 having one of two values, for example, a value which indicates speech present ('S') and a value which indicates speech not present ('NS') for the analysed frame. The algorithm may apply one or more logical functions to combine the inputs O1, O2 and O3 (and/or O4, O5 and O6) so that the probability of correct detection of speech is improved.
In a simple form of the state combiner 201 the values ('S' or 'NS') of the input signals O1, O2 and O3 (and/or O4, O5 and O6) are compared and the number of these input signals having respectively a value 'S' and a value 'NS' is found. Where at least one of the input signals has the value 'S' the value of the output state signal O7 is set as 'S'. Where all of the input signals have the value 'NS' the value of the output state signal O7 is set at 'NS'. Such a state combiner can be implemented by applying an OR function to the input signals O1 to O3 (and/or O4 to O6) when the VAD state combiner 201 is providing (from the last frame analysed) an output state signal having a value 'S', and by applying an AND function to the input signals O1 to O3 (and/or O4 to O6) when the VAD state combiner 201 is providing (from the last frame analysed) an output state signal having a value 'NS'.

The benefit of using an OR function for the state combiner 201 when the state combiner 201 is producing an output state signal value 'S' and an AND function when the state combiner 201 is producing an output state signal value 'NS' is illustrated by FIG. 3, which shows four traces (1), (2), (3) and (4) plotted against a common time scale. Trace (1) represents a sample of the input signal I1. Two distinct speech segments 301 and 302 can be seen separated by a noise segment 303. Traces (2), (3) and (4) represent forms of the signals O1, O2 and O3 obtained by analysis of the signal I1 using the VAS logic units 111, 112 and 113 (FIG. 1), each operating a different voice activity sensing algorithm. The traces (2), (3) and (4) have either an upper value 'S' or a lower value 'NS', e.g. as indicated at the right side of trace (2). It can be seen that in trace (4) a false alarm 303 is produced and several mis-detects such as mis-detects 301 and 302 are produced.
However, if the values of the signals O1 to O3 are combined in the manner described above, namely by applying an OR function where speech is present (speech segments 301 and 302) and an AND function when speech is not present (noise segment 303), the resulting trace (not shown) has been found to give a more accurate representation of I1 than the individual traces (2) to (4).
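The simple OR/AND combiner just described can be written down directly. This is a minimal sketch: the function name and the representation of states as the strings 'S' and 'NS' are choices made here, not taken from the patent.

```python
def combine_states(sensor_values, previous_state):
    """Basic form of the state combiner 201.

    While speech is currently detected ('S'), an OR over the sensor
    outputs keeps the speech state as long as any sensor still sees
    speech, avoiding clipping at the end of a speech segment.  While
    no speech is detected ('NS'), all sensors must agree before the
    combiner switches to 'S', suppressing transient false alarms.
    """
    if previous_state == 'S':
        return 'S' if any(v == 'S' for v in sensor_values) else 'NS'
    return 'S' if all(v == 'S' for v in sensor_values) else 'NS'
```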
A more detailed alternative form of the state combiner 201 embodying the invention, which employs the input signals I2, I3 and I4 and the signals O1, O2, O3, O4, O5 and O6 shown in FIG. 2, will now be described with reference to FIG. 4, which is a flow diagram of a procedure 400 that may be operated by this particular form of the VAD state combiner 201. The object of the procedure 400 is again to combine the individual values of the signals O1 to O3 using an OR function when the last overall output state value of the signal O7 (re-applied as input I4) is 'S' and using an AND function when the last overall output state value of the signal O7 (re-applied as input I4) is 'NS'. However, in this case further logical conditions are applied to improve performance of the procedure 400 before the step of applying the OR function or the AND function (as appropriate) is reached.
The procedure 400 begins with a step 401 in which is found the value, 'S' or 'NS', of the input signal I4, which is the output state signal O7 produced by the state combiner 201 for the previous speech frame processed and is used to set a condition to be applied by the state combiner for a current speech frame of an analysed signal. Where the value found in step 401 is 'S' a step 403 is applied. Where the value found in step 401 is alternatively 'NS' a step 405 is applied. Step 405 and the steps which follow it are described further below.
In step 403 the signal I2, which is the S/N (signal to noise) value determined by the S/N estimator 202 (FIG. 2), is compared with a pre-determined threshold S/N value T1. The threshold value T1 is selected so that S/N values of greater than T1 for the input signal I2 are considered to represent very good signal quality. In a decision step 407 the result of the comparison carried out in step 403 is indicated as an output value 'Y' (yes) or 'N' (no) indicating whether or not the value of I2 is greater than T1. Where the decision step 407 produces an output value 'Y' a step 409 is applied. Where the decision step 407 produces an output value 'N' a step 411 is alternatively applied.
In each of steps 409 and 411 an input signal I3 is received indicating a type of combining condition to be set in the step 409 or 411. For example, I3 may indicate that a minimum probability of a mis-detect is required. With such a requirement, a condition is set that for n VASs a selected minimum number are required to have a state 'S' for the condition to be met. For example, n may be 3 (VASs 101 to 103 in FIG. 1) and the minimum number k may be 2. Thus, in the step 409 a condition may be set that (at least) two out of three input signals O4, O5 and O6, provided as output signals from the VASs 101 to 103 (delay units 121 to 123) in FIG. 1, must have a state value 'S'. Also, the condition set by the step 411 may be that of n VASs a minimum of m are required to have a state 'S' for the condition to be met. Here m may be less than k, e.g. one. Thus, in the step 411 a condition may be set that (at least) one out of three input signals O4, O5 and O6 from the VASs 101 to 103 in FIG. 1 must have a state value 'S' for the condition to be met. The reasons for selecting these conditions are explained further below.
In a step 413 the condition set in step 409 or alternatively in step 411 is applied to the input state signals O4 to O6 and the result is passed to a decision step 415 which produces an output 'Y' ('yes') or 'N' ('no') signal according to whether or not the condition applied in step 413 is met. If the output from decision step 415 is 'Y', a step 417 is activated. Alternatively, if the output from decision step 415 is 'N', a step 419 is activated. In step 417 an OR function is applied to the individual outputs O1, O2 and O3 produced from the VASs 101 to 103 (VAS logic units 111 to 113) in FIG. 1 to see whether any of the outputs O1 to O3 has a value 'S'. In other words, only one of the outputs O1 to O3 is needed to have a value 'S' to provide an output state signal having a value 'S' from step 417. On the other hand, if none of the outputs O1 to O3 has a value 'S' then the output from step 417 is 'NS'.

In step 419 an AND function is applied to the individual outputs O1, O2 and O3 produced from the VASs 101 to 103 in FIG. 1 to see whether all of the outputs O1 to O3 have a value 'S'. In other words, all of the outputs O1 to O3 are needed to have a value 'S' to provide an output state signal having a value 'S' from step 419. If not all of the outputs O1 to O3 have a value 'S' then the output from step 419 is 'NS'.

Finally, if step 417 or step 419 is activated, the output state signal produced is subject to a delay in a step 421 which operates in the same manner as the delay applied by the delay units 121 to 123 shown in FIG. 1. In this case however the delay period applied may suitably be shorter than the corresponding period applied in the delay units 121 to 123. Thus, for a detected speech to noise transition the delay applied in step 421 may be 2 or 3 frames and for a detected noise to speech transition the delay applied in step 421 may be 1 or 2 frames. The final output state signal O7 is the output produced by the delay step 421 for the current frame which has been analysed and has a state value 'S' or 'NS'.
As noted earlier, if the input signal I4 is determined in step 401 to have a state value 'NS', a step 405 follows. In step 405 the signal I2, which is the S/N (signal to noise) value determined by the S/N estimator 202 (FIG. 2), is compared with a pre-determined threshold S/N value T2. The threshold value T2 is selected so that values of less than T2 for the input signal I2 are considered to be 'low' and to represent poor signal quality. Thus, the value of the threshold S/N value T2 employed in step 405 is generally smaller than the threshold S/N value T1 employed in step 403. In a decision step 427 the result of the comparison carried out in step 405 is indicated as an output value 'Y' (yes) or 'N' (no) indicating whether or not the value of I2 is less than T2. Where the decision step 427 produces an output value 'Y' a step 429 is applied. Where the decision step 427 produces an output value 'N' a step 431 is alternatively applied.
In each of steps 429 and 431 an input signal I3 is received indicating a type of combining condition to be set in the step 429 or 431. For example, I3 may indicate that a minimum probability of a mis-detect is required. With such a requirement, a condition is set that for n VASs a minimum of k or m are required to have a state 'S' for the condition to be met. For example, n may be 3 as shown in FIG. 4 and m may be 1. Thus, in the step 429 a condition is set that (at least) one out of three input signals O4 to O6 from the VASs 101 to 103 (delay units 121 to 123) in FIG. 1 must have a state value 'S'. Also, for the condition set by the step 431, n may be 3 as shown in FIG. 4 and k may be 2. Thus, in the step 431 a condition is set that (at least) two out of three input signals O4 to O6 from the VASs 101 to 103 in FIG. 1 must have a state value 'S' for the condition to be met. The reasons for selecting these conditions are explained further below.
In a step 433 the condition set in step 429 or alternatively in step 431 is applied to the input signals O4 to O6 and the result is passed to a decision step 435 which produces an output signal having a value 'Y' ('yes') or 'N' ('no') according to whether or not the condition applied in step 433 is met. If the output from decision step 435 is 'Y', a step 437 is activated. Alternatively, if the output from decision step 435 is 'N', a step 439 is activated.
In the step 437 an OR function is applied to the individual outputs O1, O2 and O3 produced from the VASs 101 to 103 (VAS logic units 111 to 113) in FIG. 1 to determine whether any of the outputs O1 to O3 has a value 'S'. In other words, only one of the outputs O1 to O3 is needed to have a value 'S' to provide an output state signal having a value 'S' from step 437. On the other hand, if none of the outputs O1 to O3 has a value 'S' then the output from step 437 is 'NS'.
In step 439 an AND function is applied to the individual outputs O1, O2 and O3 produced from the VASs 101 to 103 in FIG. 1 to determine whether all of the outputs O1 to O3 have a value 'S'. In other words, all of the outputs O1 to O3 are needed to have a value 'S' to provide an output state signal having a value 'S' from step 439. If not all of the outputs O1 to O3 have a value 'S' then the output from step 439 is 'NS'. Finally, if step 437 or step 439 is activated, the output state signal produced is subject to the delay applied in step 421 referred to earlier. As noted earlier, the final output state signal O7 is the output produced by the delay step 421 and has a value 'S' or 'NS'. It is to be noted that the input to step 421, in which a delay is applied, is the output of one of the steps 417, 419, 437 and 439 according to which one of these steps is selected in the logical operations applied in the steps preceding them.
A further explanation of the selection of the conditions in steps 409 and 411, referred to as the 'left side of the procedure', and in steps 429 and 431, referred to as the 'right side of the procedure', is as follows. If the left side of the procedure is being followed, in normal circumstances there will be fair speech quality and the normal answer to the question 'S/N high?' as determined in step 407 will be 'no' ('N'). This indicates that it is enough that one of the outputs O4 to O6 produced by the VASs 101 to 103 has a state value 'S' to indicate that the presence of speech is confirmed. If, however, the answer to the question 'S/N high?' is 'yes' ('Y') as determined in step 407, then a more stringent condition can be applied, since the output state signal O4 to O6 of each VAS 101 to 103 is known to be more accurate. Thus, in step 409 the condition set is that two out of three of the output signals O4 to O6 are required to be 'S' to indicate that the presence of speech is confirmed.

On the right side of the procedure, however, the opposite rationale applies. In normal speech quality conditions the answer to the question 'S/N low?' as applied in step 427 would be 'no' ('N'). Thus, we would need two out of three VASs 101 to 103 to have a state value 'S', as indicated by the signals O4 to O6, to produce an output 'Y' in step 435 for the OR function to be applied in step 437. This means that if a 'NS' (no speech) segment of the input signal I1 is initially detected, it would take at least two of the output signals O4 to O6 to have a value 'S' for the condition applied in step 433 to change. If there is low signal quality, as indicated by 'Y' at step 427, there is a greater risk of not detecting speech present in noise, so in this case it is suitable for only one of the three input signals O4, O5 and O6 to be 'S' in step 433.
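Putting the left and right sides together, one frame of the procedure 400 might be sketched as below, omitting the final delay step 421. The threshold values T1 and T2, the fixed choice k = 2 and m = 1, and the function signature are illustrative assumptions; the patent leaves these as design parameters selected via the inputs I2 and I3.

```python
def state_combiner_400(o_raw, o_delayed, snr, prev_state,
                       t1=20.0, t2=5.0, k=2, m=1):
    """One frame of the FIG. 4 combining procedure (delay step omitted).

    o_raw      : raw VAS outputs O1-O3 ('S'/'NS')
    o_delayed  : delayed state signals O4-O6 ('S'/'NS')
    snr        : I2, estimated signal-to-noise ratio of the frame
    prev_state : I4, combiner output O7 for the previous frame
    t1, t2     : hypothetical S/N thresholds (values assumed here)
    """
    n_speech = sum(1 for v in o_delayed if v == 'S')
    if prev_state == 'S':
        # Left side: with very good S/N each sensor is trusted more,
        # so the stricter k-of-n condition is used (steps 409/411).
        needed = k if snr > t1 else m
    else:
        # Right side: with poor S/N mis-detects are the bigger risk,
        # so the looser m-of-n condition is used (steps 429/431).
        needed = m if snr < t2 else k
    if n_speech >= needed:
        # Condition met: OR over the raw outputs (steps 417/437).
        return 'S' if any(v == 'S' for v in o_raw) else 'NS'
    # Condition not met: AND over the raw outputs (steps 419/439).
    return 'S' if all(v == 'S' for v in o_raw) else 'NS'
```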
In another embodiment of the invention the functions applied in steps 417 and 437 may be, instead of OR functions, functions requiring m of n signals to be 'S' where m is a positive integer greater than one and n is the number of VASs employed. Similarly, the functions applied in steps 419 and 439 may be, instead of AND functions, functions requiring k of the n signals to be 'S' where k is a positive integer less than n.
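The generalisation in this paragraph reduces to a single m-of-n vote, of which the OR and AND functions are the two extremes. A one-line sketch (the name is chosen here for illustration):

```python
def m_of_n(values, m):
    """Return 'S' if at least m of the sensor outputs are 'S'.
    m = 1 reproduces the OR function; m = len(values) reproduces AND."""
    return 'S' if sum(v == 'S' for v in values) >= m else 'NS'
```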
Beneficially, the combined VAD embodying the invention which has been described with reference to FIGs. 1 to 4 is less sensitive to background noise and will produce fewer false alarms. Further, since the combined VAD is less sensitive to clipping at the end of a speech segment, it will produce fewer mis-detects. Overall, the combined VAD embodying the invention will produce better performance in terms of reducing the number of detection errors produced (false alarms and mis-detects) compared with the single VADs of the prior art.
As described above, an embodiment of the invention described by reference to FIG. 4 uses individual algorithms in the VAS logic units 111 to 113. Known algorithm types including, but not limited to, the following may be implemented as those individual algorithms in the VAS logic units 111 to 113: (i) a pitch detection algorithm; (ii) a Cepstral distance algorithm (that uses 'Cepstral analysis', which is a non-linear technique to detect perceptual harmonic 'Cepstral' coefficients for speech recognition); and (iii) an LPCs (linear prediction coding coefficients) distance algorithm. When compared to the performance of the individual algorithms in the detection of speech in a known audio signal, the combined methodology (procedure 400) described with reference to FIG. 4 generated fewer overall detection errors (false alarms and mis-detects) than any of the individual algorithms.
The combined VAD embodying the invention which has been described with reference to FIGs. 1 to 4 may be used advantageously in any of the known applications in which voice activity detectors are used. Thus, the combined VAD may for example be used in landline telephony applications and in mobile communications, particularly in transmitters or receivers of portable and/or mobile communications terminals.

Claims

CLAIMS
1. A voice activity detector including an analyser for analysing a signal representing an audio signal to determine if speech is present in the audio signal, wherein the analyser includes a plurality of voice activity sensors each operable to analyse the signal and to provide an output signal indicating whether or not speech has been detected by the sensor and a state combiner operable to receive input signals produced by the voice activity sensors and to apply a combining function to the input signals to produce a combined state signal indicating whether or not speech has been detected.
2. A voice activity detector according to claim 1 wherein the voice activity sensors are operable to apply different voice activity sensing algorithms to determine if speech is present.
3. A voice activity detector according to claim 2 wherein at least one of the algorithms is selected from (i) a pitch detection algorithm; (ii) a Cepstral distance algorithm; and (iii) an LPCs (linear prediction coding coefficients) distance algorithm.
4. A voice activity detector according to any one of the preceding claims wherein each of the voice activity sensors includes a logic unit, which operates a voice activity sensing algorithm, and a delay unit, wherein each of the delay units is operable to receive an output signal from a corresponding one of the logic units and to produce an output signal in which a current state of an output signal from the respective voice activity sensor is maintained until a pre-determined delay period has elapsed, the delay units being connected to the state combiner to provide input signals to the state combiner.
5. A voice activity detector according to any one of the preceding claims wherein the state combiner is operable to determine whether or not a predetermined number of input signals applied to the state combiner have a common value and to set a condition and determine a combining function employed by the state combiner to produce the state signal.
6. A voice activity detector according to any one of the preceding claims wherein the state combiner is operable to apply an OR function to input signals from the voice activity sensors when speech is currently detected to be present.
7. A voice activity detector according to any one of the preceding claims wherein the state combiner is operable to apply an AND function to input signals from the voice activity sensors when speech is currently detected not to be present.
8. A voice activity detector according to any one of the preceding claims which is operable to employ a signal indicating an algorithm type to be operated on by the state combiner to set a condition to be applied by the state combiner to determine a combining function to be applied by the state combiner.
9. A voice activity detector according to claim 8 which is operable to employ a signal indicating an algorithm type selected from an algorithm type giving a minimum number of false alarms and an algorithm type giving a minimum number of mis-detects.
10. A voice activity detector according to any one of claims 1 to 9 wherein the state combiner is operable to employ a value of the state signal produced by the state combiner for a previous frame of the analysed signal to set a condition to be applied by the state combiner for a current frame of the analysed signal.
11. A voice activity detector according to any one of the preceding claims and including an estimator for estimating a signal to noise ratio of the signal to be analysed, wherein the state combiner is operable to set a condition to be applied by the state combiner according to a value of signal to noise ratio estimated by the estimator.
12. A voice activity detector according to claim 11 wherein the state combiner is operable to employ a value of a state signal produced by the state combiner for a previous frame of the analysed signal and to determine whether a signal to noise ratio is above or below a first threshold when the state signal has a first value and to determine whether or not a signal to noise ratio is below a second threshold when the state signal has a second value and to set a condition to be applied by the state combiner according to the determination.
13. A voice activity detector according to claim 12 wherein the state combiner is operable to determine whether or not a signal to noise ratio is above a first threshold when the state signal indicates detection of speech and to determine whether or not a signal to noise ratio is below a second threshold when the state signal indicates no speech, the first threshold being at a greater signal to noise ratio than the second threshold.
14. A voice activity detector according to claim 13 wherein the state combiner is operable to set a first condition when the state signal indicates speech and the signal to noise ratio is estimated to be above the first threshold and to set a second condition when the state signal indicates speech and the signal to noise ratio is estimated to be below the first threshold, wherein the first condition requires a first predetermined number of input signals applied to the state combiner to have a common value and the second condition requires a second predetermined number of signals applied to the state combiner to have a common value, the second number being less than the first number.
15. A voice activity detector according to claim 13 or claim 14 wherein the state combiner is operable to set a first condition when the state signal indicates no speech and the signal to noise ratio is estimated to be above a second threshold smaller than the first threshold and to set a second condition when the state signal indicates no speech and the signal to noise ratio is estimated to be below the second threshold, wherein the first condition requires a first predetermined number of input signals applied to the state combiner to have a common value and the second condition requires a second predetermined number of signals applied to the state combiner to have a common value, the first number being greater than the second number.
16. A voice activity detector according to claim 14 or claim 15 wherein the input signals required to have a common value are state signals produced by the voice activity sensors indicating whether or not a segment of speech is currently detected.
17. A voice activity detector according to any one of claims 14 to 16 which is operable to apply a condition and wherein, if the applied condition is met, an OR function is applied to input signals produced by the voice activity sensors and if the applied condition is not met an AND function is applied to input signals produced by the voice activity sensors.
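As an illustrative sketch only (the threshold values, vote counts, and function names below are assumptions; the claims require only that the first threshold exceed the second and that the stricter condition demand more agreeing sensors), the condition selection of claims 12 to 15 and the condition-dependent combining of claim 17 can be put together as:

```python
def select_required_votes(speech_active, snr_db,
                          high_thresh=15.0, low_thresh=5.0,
                          strict_votes=3, relaxed_votes=2):
    """Choose how many sensors must agree for the condition to be met.

    Placeholder numbers: the claims only require high_thresh > low_thresh
    and strict_votes > relaxed_votes.
    """
    if speech_active:
        # Claim 14: speech state -- compare SNR against the higher threshold.
        return strict_votes if snr_db > high_thresh else relaxed_votes
    # Claim 15: no-speech state -- compare SNR against the lower threshold.
    return strict_votes if snr_db > low_thresh else relaxed_votes

def combine(sensor_flags, speech_active, snr_db):
    """Claim 17: OR when the agreement condition is met, AND otherwise."""
    needed = select_required_votes(speech_active, snr_db)
    agree = sum(sensor_flags)  # sensors currently reporting speech (claim 16)
    if agree >= needed:
        return any(sensor_flags)   # condition met: apply OR
    return all(sensor_flags)       # condition not met: apply AND
```

The effect is that a cleaner signal (higher SNR) demands broader sensor agreement before the permissive OR combination is used.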
18. A voice activity detector according to any one of the preceding claims wherein the state combiner includes a state change delay unit operable to receive an input signal indicating a current state determined by a combination operation of the state combiner and to produce an output signal in which a current state value of the input signal is maintained until a pre-determined delay period has elapsed.
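For illustration only (the frame-count interface is an assumption; claim 18 speaks only of a pre-determined delay period), the state change delay unit can be sketched as a unit that holds its output until the raw decision has been stable long enough:

```python
class StateChangeDelay:
    """Hold the combined state until the input has been stable for `hold` frames.

    The output only follows the combiner's raw decision after the new
    value has persisted for a pre-determined number of frames,
    suppressing brief glitches in the combined state signal.
    """
    def __init__(self, hold=3):
        self.hold = hold
        self.output = False   # current held state
        self.counter = 0      # consecutive frames the raw input has disagreed

    def update(self, raw_state):
        if raw_state == self.output:
            self.counter = 0          # input agrees: reset the disagreement count
        else:
            self.counter += 1
            if self.counter >= self.hold:
                self.output = raw_state   # change accepted after the delay
                self.counter = 0
        return self.output
```

The per-sensor delay units of claim 4 could follow the same pattern, each smoothing one sensor's output before it reaches the state combiner.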
19. A voice activity detector according to claim 1 and substantially as described herein with reference to FIG. 2 or FIG. 1 and FIG. 2 of the accompanying drawings.
20. A method of operation of a voice activity detector including analysing a signal representing an audio signal to determine if speech is present in the audio signal, wherein the analysing is carried out by each of a plurality of voice activity sensors, providing an output signal from each of the voice activity sensors indicating whether or not speech has been detected by the sensor, receiving input signals produced by the voice activity sensors in a state combiner and applying a combining function to the input signals by the state combiner to produce a state signal indicating whether or not speech has been detected.
21. A method according to claim 20 and substantially as herein described with reference to FIG. 4 of the accompanying drawings.
PCT/US2006/025672 2005-09-08 2006-06-30 Voice activity detector and method of operation therein WO2007030190A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0518213.4 2005-09-08
GB0518213A GB2430129B (en) 2005-09-08 2005-09-08 Voice activity detector and method of operation therein

Publications (1)

Publication Number Publication Date
WO2007030190A1 true WO2007030190A1 (en) 2007-03-15

Family

ID=35221019

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/025672 WO2007030190A1 (en) 2005-09-08 2006-06-30 Voice activity detector and method of operation therein

Country Status (2)

Country Link
GB (1) GB2430129B (en)
WO (1) WO2007030190A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626498B2 (en) * 2010-02-24 2014-01-07 Qualcomm Incorporated Voice activity detection based on plural voice activity detectors

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5619566A (en) * 1993-08-27 1997-04-08 Motorola, Inc. Voice activity detector for an echo suppressor and an echo suppressor
US6061647A (en) * 1993-09-14 2000-05-09 British Telecommunications Public Limited Company Voice activity detector
US20040042626A1 (en) * 2002-08-30 2004-03-04 Balan Radu Victor Multichannel voice detection in adverse environments
US20040133421A1 (en) * 2000-07-19 2004-07-08 Burnett Gregory C. Voice activity detector (VAD) -based multiple-microphone acoustic noise suppression
US6876965B2 (en) * 2001-02-28 2005-04-05 Telefonaktiebolaget Lm Ericsson (Publ) Reduced complexity voice activity detector

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5712999B2 (en) * 1974-04-08 1982-03-13
FI100840B (en) * 1995-12-12 1998-02-27 Nokia Mobile Phones Ltd Noise attenuator and method for attenuating background noise from noisy speech and a mobile station
US6453285B1 (en) * 1998-08-21 2002-09-17 Polycom, Inc. Speech activity detector for use in noise reduction system, and methods therefor
US6321194B1 (en) * 1999-04-27 2001-11-20 Brooktrout Technology, Inc. Voice detection in audio signals
US6754337B2 (en) * 2002-01-25 2004-06-22 Acoustic Technologies, Inc. Telephone having four VAD circuits

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009065223A1 (en) * 2007-11-22 2009-05-28 Tyco Safety Products Canada Ltd. Alarm system audio interface tamper and state detection
US7859406B2 (en) 2007-11-22 2010-12-28 Tyco Safety Products Canada Ltd. Alarm system audio interface tamper and state detection
JP2013508744A (en) * 2009-10-19 2013-03-07 テレフオンアクチーボラゲット エル エム エリクソン(パブル) Voice interval detector and method
CN104485118A (en) * 2009-10-19 2015-04-01 瑞典爱立信有限公司 Detector and method for voice activity detection
US9773511B2 (en) 2009-10-19 2017-09-26 Telefonaktiebolaget Lm Ericsson (Publ) Detector and method for voice activity detection
US9990938B2 (en) 2009-10-19 2018-06-05 Telefonaktiebolaget Lm Ericsson (Publ) Detector and method for voice activity detection
US11361784B2 (en) 2009-10-19 2022-06-14 Telefonaktiebolaget Lm Ericsson (Publ) Detector and method for voice activity detection

Also Published As

Publication number Publication date
GB0518213D0 (en) 2005-10-19
GB2430129B (en) 2007-10-31
GB2430129A (en) 2007-03-14

Similar Documents

Publication Publication Date Title
FI110726B (en) Detection of voice activity
JP3878482B2 (en) Voice detection apparatus and voice detection method
US20230215447A1 (en) Estimation of background noise in audio signals
US20120185248A1 (en) Voice detector and a method for suppressing sub-bands in a voice detector
EP2351020A1 (en) Methods and apparatus for noise estimation in audio signals
WO1996034382A1 (en) Methods and apparatus for distinguishing speech intervals from noise intervals in audio signals
RU2127912C1 (en) Method for detection and encoding and/or decoding of stationary background sounds and device for detection and encoding and/or decoding of stationary background sounds
JPH02267599A (en) Voice detecting device
US10229686B2 (en) Methods and apparatus for speech segmentation using multiple metadata
WO2007030190A1 (en) Voice activity detector and method of operation therein
SE470577B (en) Method and apparatus for encoding and / or decoding background noise
JPH08221097A (en) Detection method of audio component
US20010029447A1 (en) Method of estimating the pitch of a speech signal using previous estimates, use of the method, and a device adapted therefor
JPS63281200A (en) Voice section detecting system
EP3852099B1 (en) Keyword detection apparatus, keyword detection method, and program
KR101841380B1 (en) Multi-channel audio signal classifier
US20040148168A1 (en) Method and device for automatically differentiating and/or detecting acoustic signals
US11961517B2 (en) Continuous utterance estimation apparatus, continuous utterance estimation method, and program
JPH064099A (en) Speech encoding device
JPH087596B2 (en) Noise suppression type voice detector
KR100349656B1 (en) Apparatus and method for speech detection using multiple sub-detection system
EP0310636A1 (en) Distance measurement control of a multiple detector system.
JPH05165496A (en) Voice detector
GB2430853A (en) Variable state change delay for a voice activity detector
JPS62270000A (en) Voice pitch frequency detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06786014

Country of ref document: EP

Kind code of ref document: A1