GB2268669A

GB2268669A - Voice activity detector

Info

Publication number: GB2268669A
Application number: GB9214306A
Authority: GB
Inventors: Seishi Sasaki
Original assignee: Kokusai Electric Corp
Current assignee: Kokusai Electric Corp
Priority date: 1992-07-06
Filing date: 1992-07-06
Publication date: 1994-01-12
Anticipated expiration: 2012-07-06
Also published as: GB2268669B; GB9214306D0


Description

VOICE ACTIVITY DETECTOR

The present invention relates to a voice activity detector for use in a voice communication system.

Portable radio terminals, such as digital cordless telephone apparatus, employ VOX (Voice Operate Switch Exchange) control, which actuates a transmitter only during voice activity and holds it out of operation during silent periods so as to reduce power consumption during transmission; this control reduces the mean power consumption for transmission by about 15%. To perform such a VOX function, a voice activity detector for detecting the presence or absence of a voice signal needs to be provided at a stage preceding a transmitter output circuit.

The following will be described on the assumption that such a voice activity detector is applied to VOX control of a digital cordless telephone apparatus. The digital cordless telephone utilizes a 32 kb/s adaptive differential pulse code modulation (ADPCM) system as the voice coding system (CODEC), and the processing delay time in this apparatus is required to be equal to or shorter than 7 msec.

Since the processing by a conventional voice activity detector described below is executed for each 20 msec frame, a delay time of at least 20 msec is induced, making it impossible to meet a requirement that the delay time be 7 msec or less. Moreover, the conventional voice activity detector is formed independently of the voice encoder, and hence is defective in that the amount of data to be processed is inevitably large.

It is therefore an object of the present invention to provide a voice activity detector which permits the detection of voice activity or non-activity in each short period while holding the delay time to be shorter than 7 msec, through effective utilization of predictive coefficients obtainable during processing by a voice encoder having an adaptive prediction function.

According to the present invention there is provided a voice activity detector with voice encoder means having input terminal means for receiving an input voice signal, output terminal means for outputting an encoded output of the input voice signal, and adaptive predictor means for generating two predictive coefficients for successive two sampled values of the input voice signal in the voice encoder; average calculator means for producing respective average values of the two predictive coefficients for each framed period of the input voice signal; and decision means for holding respective ranges of predictive coefficient threshold values precalculated from respective distributions of the two predictive coefficients and for deciding whether said each framed period is a voice active period or a voice nonactive period as a result of comparing the average values with said respective ranges of predictive coefficient threshold values to obtain voice active/non-active flags in correspondence to said voice active period and said voice non-active period.

Embodiments of the present invention will be described in detail below in comparison with prior art with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of the voice activity detector according to an embodiment of the present invention;
Fig. 2 shows timing charts explanatory of the operation of an embodiment of the present invention;
Fig. 3 is a block diagram of an ADPCM encoder provided with the voice activity detector of an embodiment of the present invention;
Fig. 4 shows the distributions of predictive coefficients a1 and a2;
Fig. 5 shows the distributions of the predictive coefficients a1 and a2;
Fig. 6 is a block diagram of a conventional voice activity detector; and
Fig. 7 is a conventional decision logic flowchart.

To make differences between prior art and the present invention clear, an example of prior art will first be described.

Fig. 6 is a block diagram showing a conventional voice activity detector, which divides an input voice signal a, sampled at a sampling rate of 8 kHz and quantized by the use of 256 quantization levels, in units of 20 msec frames (each 160 samples), decides the voice activity or nonactivity for each frame and outputs a voice activity/nonactivity flag. The voice input signal a is applied to a direct-current suppressor 11, in which its DC component is removed by a high-pass filter and the output signal b is provided to each circuit mentioned below.

In a high level power detector 12, the 20 msec voice period is subdivided into five subframes of 4 msec (32 samples each) and, for each subframe, a short-period power Psk is computed by the following Eq. (1):

where Xi is the filter output and k is the subframe number.
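The framing arithmetic above (8 kHz sampling, 20 msec frames of 160 samples, five 4 msec subframes of 32 samples) can be sketched as follows. Eq. (1) is not reproduced in this text, so the sketch assumes that the short-period power is the mean square of the samples in a subframe; that normalization and the name `subframe_powers` are assumptions for illustration, not the formula of the publication.

```python
FS_HZ = 8000        # sampling rate stated in the text
FRAME_MS = 20       # frame length stated in the text
SUBFRAME_MS = 4     # subframe length stated in the text


def subframe_powers(frame):
    """Split one 20 msec frame into 4 msec subframes and return their powers.

    Assumption: power is the mean of the squared samples, because Eq. (1)
    is not available in the text.
    """
    frame_len = FS_HZ * FRAME_MS // 1000      # 160 samples
    sub_len = FS_HZ * SUBFRAME_MS // 1000     # 32 samples
    if len(frame) != frame_len:
        raise ValueError("expected %d samples per frame" % frame_len)
    powers = []
    for k in range(frame_len // sub_len):     # k = subframe number, 0..4
        sub = frame[k * sub_len:(k + 1) * sub_len]
        powers.append(sum(x * x for x in sub) / sub_len)
    return powers
```

Each returned value would then be compared with a power threshold, once the threshold has been converted from dBm0 into the same linear units.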

For the power Psk thus computed for each subframe, the following power detection is conducted using a power threshold value Th2 (-30 dBm0).

When Psk ≥ Th2, D2k = 1 (2)
When Psk < Th2, D2k = 0 (3)

Further, a weighted sum total D2 of the following Eq. (4) is obtained, which sum total is regarded as the result of detection for one frame, and a signal c is output accordingly.

In a low level power detector 13, for the short-period power calculated by Eq. (1), the following power detection is conducted using a power threshold value Th1 (50 dBm0).

When Psk ≥ Th1, D1k = 1 (5)
When Psk < Th1, D1k = 0 (6)

Similarly, the following weighted sum total D1 is obtained, which is regarded as the result of detection for one frame, and a signal d is output accordingly.

At the same time, the value of the following equation is calculated.

In a zero crossing number detector 14, Zsk is calculated by the following Eq. (9) for each subframe so as to count the zero crossing number of the signal b (the number of different sign bits of voice signals of two successive samples).

For each Zsk thus computed, the zero crossing number is detected using a zero crossing threshold value Th3 (24) as follows:

When Zsk ≥ Th3, DZsk = 1 (10)
When Zsk < Th3, DZsk = 0 (11)

Likewise, the following weighted sum total D3 is calculated and a signal e is output as indicative of the result of detection for one frame.
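The zero crossing count described above (the number of successive sample pairs whose sign bits differ) can be illustrated with a short sketch. Eq. (9) is not reproduced in this text, so the code follows the verbal definition only. Treating a zero sample as positive, reading the garbled comparison in (10) as "greater than or equal to", and the function names are all assumptions.

```python
def zero_crossings(subframe):
    """Count successive sample pairs whose sign bits differ.

    Assumption: a sample equal to zero has a cleared sign bit (positive).
    """
    sign_bits = [x < 0 for x in subframe]
    return sum(1 for prev, cur in zip(sign_bits, sign_bits[1:]) if prev != cur)


def zero_crossing_flag(subframe, threshold=24):
    """Return 1 when the count reaches the threshold Th3, otherwise 0."""
    return 1 if zero_crossings(subframe) >= threshold else 0
```

A 32-sample subframe that alternates in sign on every sample has 31 crossings and is flagged, while a constant subframe has none.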

In an inter-frame power-increment comparator 15, the power PTn of one frame is obtained by the following Eq. (13):

Further, the power thus obtained is compared with the power PT(n-1) of the preceding frame to detect the power increment D4, and the result is output as a signal f.

When PTn ≥ 4PT(n-1), D4 = 1 (14)
When PTn < 4PT(n-1), D4 = 0 (15)

A decision circuit 16 receives the signals c, d, e and f and outputs a voice active/non-active flag indicating the result of detection of the voice activity in accordance with a decision logic flow depicted in Fig. 7. In Fig. 7, HOT means a hang-over timer (a function by which, when the decision changes from voice activity to voice non-activity, the subsequent several frames are still set voice-active so that the voice active period is not ended prematurely), and SP flag means a voice active/non-active flag.
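The inter-frame comparison of (14) and (15) can be sketched as below. Eq. (13) is not reproduced in this text; the sketch assumes that the frame power is the sum of squared samples, which does not affect the comparison as long as all frames have the same length. Reading the garbled symbol in (14) as "greater than or equal to", returning 0 for the first frame (which has no predecessor), and the function names are assumptions.

```python
def frame_power(frame):
    """Frame power, assumed here to be the sum of squared samples."""
    return sum(x * x for x in frame)


def power_increment_flags(frames, ratio=4.0):
    """Return D4 per frame: 1 when PTn >= ratio * PT(n-1), otherwise 0."""
    flags = []
    previous = None
    for frame in frames:
        power = frame_power(frame)
        if previous is None:
            flags.append(0)           # assumption: no decision without history
        else:
            flags.append(1 if power >= ratio * previous else 0)
        previous = power
    return flags
```

Doubling the amplitude from one frame to the next quadruples the power, which is exactly the point at which the flag is raised under this reading.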

The present invention will hereinafter be described as being applied to a 32 kb/s (kilobit/sec) ADPCM voice encoder for the digital cordless telephone.

Fig. 3 is a block diagram of the ADPCM voice encoder equipped with a voice activity detecting function to which the present invention is applied, and Fig. 1 is a block diagram illustrating an embodiment of the voice activity detector according to the present invention.

A description will be given first of the ADPCM encoder depicted in Fig. 3. Reference numeral 21 indicates a uniform PCM converter whereby a 64 kb/s μ-law PCM input signal is converted to a linear 13-bit PCM signal.

Reference numeral 22 denotes a subtractor whereby a prediction signal i, which is the output from an adaptive predictor 23, is subtracted from the output of the uniform PCM converter 21 to obtain a difference signal g. The difference signal g is quantized by an adaptive quantizer 24, and voice data of 32 kb/s are provided as the output of the ADPCM voice encoder on the transmission line.

An inverse adaptive quantizer 26 performs inverse adaptive quantization of the 32 kb/s voice data to obtain a quantized difference signal m. An adder 25 adds the quantized difference signal m and the prediction signal i to obtain a reproduced signal n.
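The signal flow through the subtractor 22, quantizer 24, inverse quantizer 26 and adder 25 can be illustrated with a deliberately simplified sketch. It is not the encoder of Fig. 3: the quantizer here has a fixed step instead of being adaptive, and the predictor simply reuses the previous reproduced sample. The 4-bit code range follows from 32 kb/s at 8 kHz sampling; the step size and the function name are assumptions.

```python
def encode(samples, step=16.0):
    """Toy differential encoder showing the loop structure only.

    Assumptions: fixed-step uniform quantizer and a one-tap predictor that
    repeats the previous reproduced sample (neither one is adaptive).
    """
    codes = []
    reproduced = 0.0                       # signal n of the previous sample
    for x in samples:
        prediction = reproduced            # signal i (toy predictor 23)
        difference = x - prediction        # signal g (subtractor 22)
        # 32 kb/s at 8 kHz leaves 4 bits per sample: codes -8..7
        code = max(-8, min(7, round(difference / step)))   # quantizer 24
        codes.append(code)
        quantized_diff = code * step       # signal m (inverse quantizer 26)
        reproduced = quantized_diff + prediction           # signal n (adder 25)
    return codes
```

Because the prediction is built from the reproduced signal rather than from the input, a decoder that receives only the codes can form the same prediction.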

The adaptive predictor 23 produces the prediction signal i by the use of predictive coefficients ai (i = 1, 2) and bi (i = 1, ..., 6) under the principle defined by the following equations (16) and (17).

where
Se(h): prediction signal i
Sr(h-i): reproduced signal
dq: quantized difference signal m
h: instant sampling point

The predictive coefficients ai (i = 1, 2) and bi (i = 1, ..., 6) are successively renewed in the adaptive predictor 23 under a simplified process of the gradient projection method.
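Equations (16) and (17) are not reproduced in this text. For orientation only, a predictor with two coefficients ai acting on the reproduced signal and six coefficients bi acting on the quantized difference signal is written in ITU-T Recommendation G.726 in the form below, shown here with the symbols defined above. The intermediate term S_ez and the use of coefficients from the preceding sampling point (h-1) follow that Recommendation's convention and are assumptions as far as this publication is concerned.

```latex
\begin{align}
S_e(h)    &= \sum_{i=1}^{2} a_i(h-1)\, S_r(h-i) + S_{ez}(h) \\
S_{ez}(h) &= \sum_{i=1}^{6} b_i(h-1)\, d_q(h-i)
\end{align}
```

In this form the coefficients ai weight past reproduced samples, which is consistent with the statement that they carry spectrum-envelope information of the input signal.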

The predictive coefficients ai (i = 1, 2) and bi (i = 1, ..., 6) carry spectrum-envelope information of the input signal, and their values are distributed differently for a voice signal, which has high auto-correlation, and for background noise, which has low auto-correlation. Accordingly, the input signal can be decided for each framed period to be a voice signal or background noise in accordance with the values of the predictive coefficients ai and bi. In the present invention, only the coefficients ai (i = 1, 2), and not the coefficients bi, are employed for detecting voice activity, and they are applied to the voice activity detector 27.

To prove the above, examples of measured distributions of the two predictive coefficients a1 and a2 are shown in Figs. 4(A), 4(B) and Figs. 5(A), 5(B). Fig. 4(A) shows voice signals (male voices), Fig. 4(B) voice signals (female voices), Fig. 5(A) white noise and Fig. 5(B) filtered noise (-6 dB/oct).

In Figs. 4 and 5, each sample point (white, black or double circle) represents a range of the two predictive coefficients a1 and a2 that is greater than -0.05 and smaller than +0.05 with respect to that sample point as the origin. The sample point with the maximum frequency of occurrence is indicated by the double circle, and a sample point whose frequency, normalized by the maximum frequency of occurrence, is greater than 0.1 is indicated by the black circle.

From Figs. 4 and 5 it is understood that the voice active period and the background noise period (i.e. the voice non-active period) can be decided using proper threshold values for the predictive coefficients a1 and a2.

When the predictive coefficients a1 and a2 assume values in the ranges (1) to (5) shown below, the voice activity detector 27 decides that such periods are background noise periods, on the basis of the distribution diagrams of the predictive coefficients depicted in Figs. 4 and 5, and when the coefficients assume other values, such periods are decided to be voice active periods. Thus the voice activity detector outputs a voice detection flag indicated by the L or H level accordingly.

(1) (0.70 < a1 ≤ 1.00) and (-0.45 < a2 ≤ -0.35)
(2) (0.75 < a1 ≤ 1.10) and (-0.55 < a2 < -0.45)
(3) (0.85 < a1 ≤ 1.20) and (-0.65 < a2 ≤ -0.55)
(4) (0.95 < a1 ≤ 1.20) and (-0.70 < a2 ≤ -0.65)
(5) (a1 ≤ 0.75) and (a2 ≤ 0)

Fig. 1 is a block diagram illustrating an example of the construction of the voice activity detector according to an embodiment of the present invention. The contents of processing of each block in Fig. 1 will be described. The predictive coefficients a1 and a2 are input into framing circuits 31 and 32, respectively, wherein they are framed at 5 msec intervals, and the framed outputs are applied to average calculators 33 and 34. The average calculators 33 and 34 each calculate the average value of the predictive coefficient for one frame and apply the calculated output to a voice active/non-active detector 35. The detector 35 sets the voice detection flag u to the state of voice non-active (L) or voice-active (H), depending on whether or not the average values of the predictive coefficients a1 and a2 fall inside the ranges of the threshold values (1) to (5) referred to above. The output of the detector 35 is provided to a hang-over processor 36, wherein it is subjected to hang-over processing of 100 msec to obtain a final voice detected output v.
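The processing chain of Fig. 1 (framing, averaging, range decision and hang-over) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the publication: the comparison symbols in ranges (1) to (5) are partly illegible in this text, so every range is treated as "low < value ≤ high"; one coefficient value per 8 kHz sample is assumed, giving 40 values per 5 msec frame; and the hang-over is assumed to hold the flag active for 100 msec (20 frames) after the last active frame. All names are hypothetical.

```python
FS_HZ = 8000         # assumption: one coefficient value per 8 kHz sample
FRAME_MS = 5         # framing interval stated in the text
HANGOVER_MS = 100    # hang-over duration stated in the text

# Background-noise ranges (1) to (5) as (a1_low, a1_high, a2_low, a2_high).
# Assumption: each bound pair means low < value <= high; None = unbounded.
NOISE_RANGES = [
    (0.70, 1.00, -0.45, -0.35),
    (0.75, 1.10, -0.55, -0.45),
    (0.85, 1.20, -0.65, -0.55),
    (0.95, 1.20, -0.70, -0.65),
    (None, 0.75, None, 0.0),
]


def in_range(value, low, high):
    return (low is None or value > low) and value <= high


def is_noise(a1_avg, a2_avg):
    """True when the averaged pair falls inside any background-noise range."""
    return any(in_range(a1_avg, lo1, hi1) and in_range(a2_avg, lo2, hi2)
               for lo1, hi1, lo2, hi2 in NOISE_RANGES)


def detect(a1, a2):
    """Return one flag per 5 msec frame (True = voice active) after hang-over."""
    frame_len = FS_HZ * FRAME_MS // 1000      # 40 values per frame
    hang_frames = HANGOVER_MS // FRAME_MS     # 20 frames
    usable = min(len(a1), len(a2))
    flags, hang = [], 0
    for start in range(0, usable - frame_len + 1, frame_len):
        avg1 = sum(a1[start:start + frame_len]) / frame_len   # calculator 33
        avg2 = sum(a2[start:start + frame_len]) / frame_len   # calculator 34
        active = not is_noise(avg1, avg2)                     # flag u
        if active:
            hang = hang_frames        # restart the hang-over on voice
        elif hang > 0:
            hang -= 1                 # still inside the hang-over period
            active = True
        flags.append(active)                                  # output v
    return flags
```

With this reading, a single active frame followed by noise keeps the output active for that frame plus the next 20 frames, after which it returns to non-active.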

Fig. 2 shows timing charts illustrating the results of confirmation of the voice activity detecting operation by computer simulation. The input signal was superimposed on filtered noise (-6 dB/oct). Fig. 2(A) shows the input signal and Fig. 2(B) the results of voice active/non-active decision after the hang-over processing. From the results shown it is seen that the system of the present invention is not likely to malfunction in response to background noise and provides good results. Figs. 2(C) and (D) show temporal changes of the predictive coefficients a1 and a2, respectively. From Figs. 2(C) and (D) it can be confirmed that the predictive coefficients a1 and a2 assume different values for the voice active period and the background noise period.

As described above in detail, according to the present invention, the processing time necessary for the detection of voice activity is reduced to about 5 msec, and the voice activity detector can be implemented with a small amount of hardware (the amount of data processing being 15% of that in the ADPCM system) because of efficient utilization of coefficients obtainable in the ADPCM processing. Hence the present invention is of great utility in practical use.

Claims

1. A voice activity detector comprising: voice encoder means having input terminal means for receiving an input voice signal, output terminal means for outputting an encoded output of the input voice signal, and adaptive predictor means for generating two predictive coefficients for successive two sampled values of the input voice signal in the voice encoder; average calculator means for producing respective average values of the two predictive coefficients for each framed period of the input voice signal; and decision means for holding respective ranges of predictive coefficient threshold values precalculated from respective distributions of the two predictive coefficients and for deciding whether said each framed period is a voice active period or a voice non-active period as a result of comparing the average values with said respective ranges of predictive coefficient threshold values to obtain voice active/non-active flags in correspondence to said voice active period and said voice non-active period.

2. A voice activity detector according to claim 1, in which said respective ranges of predictive coefficient threshold values are precalculated to be greater than -0.05 and smaller than +0.05 with respect to each sample point as the origin.

3. A voice activity detector substantially as herein described with reference to Figures 1 or 3 with or without reference to any of Figures 2, 4 and 5 of the accompanying drawings.