WO1991006944A1

WO1991006944A1 - Speech waveform compression technique

Info

Publication number: WO1991006944A1
Application number: PCT/US1990/005884
Authority: WO
Inventors: Mark Richard Poulin; Daehyoung Hong; Anthony P. Van Den Heuvel
Original assignee: Motorola, Inc.
Priority date: 1989-10-25
Filing date: 1990-10-15
Publication date: 1991-05-16
Also published as: AU6877791A

Abstract

A time waveform compressor (200) is provided that performs one of three possible operations on a given segment of an input speech waveform (201). The decision logic (235) uses the pitch value (223), periodicity value (225) and silence value (227) to control the switch (209). Depending on the switch's (209) location (A, B or C), the stored signal (203) is applied to either a Time Domain Harmonic Scaling Compressor (212), a short circuit (213), or an open circuit (215). The output of the compressor (219) is a continuous compressed speech waveform that can be sent to the speech coder.

Description

"SPEECH WAVEFORM COMPRESSION TECHNIQUE"

Technical Field

This application relates generally to signal compressors and more particularly to signal compressors for use with speech waveforms.

Background of the Invention

When speech is processed to achieve a low bit rate (or a smaller bandwidth), improved spectrum efficiency can be obtained. However, this causes the quality of the processed speech signal to be degraded. As a result, maintaining speech quality while preserving spectrum efficiency is a key issue for the success of any speech coding scheme.

In the past, low bit-rate voice coders have been used to reduce the amount of information required for transmission or storage. One such voice coder is a digital sub-band coder, which operates on speech segments to partition a voice signal into multiple frequency sub-bands to determine where significant signal energy resides. Typically, a preset number of digital bits are allocated among these significant sub-bands to encode the spectral information for transmission.

In subband-type processing, the speech signal is divided into sub-band signals by a filter bank. Each subband signal is processed according to the frequency spectrum of the input speech. For example, digital subband coding (SBC) allocates available bits to subbands according to the computed energy distribution. More bits are allocated for subbands with higher energy. Fewer bits (or even zero bits) are allocated for sub-bands with lower energy. Multilevel subband coding (MSBC) transmits speech samples for only a set of subbands. The rest of the subband signals are not transmitted. This allows MSBC to achieve good audio quality at low bit rates.

In order to preserve audio quality at even lower bit rates, alternate speech coding methods can be combined with sub-band type coders. One possible method is to compress the speech waveform in time (without changing its bandwidth) before it goes to the sub-band coder. See, for example, R.V. Cox et al., "An Implementation of Time Domain Harmonic Scaling with Application to Speech Coding", Proc. ICASSP, May 1982. In this method, Time Domain Harmonic Scaling (TDHS) is used to compress the speech waveform in time by a factor of two before the compressed waveform is sent to a digital sub-band coder. This means that 32msec of speech, for example, is converted into 16msec of compressed speech before sub-band coding takes place. The bit rate of the original sub-band coder is effectively divided by two with the addition of TDHS. Unfortunately, TDHS inevitably degrades the audio quality somewhat because it is a process that works well only when the speech is voiced (or periodic). When the speech waveform is not very periodic, the audio quality is noticeably degraded after TDHS processing.

Summary of the Invention

Accordingly, it is an object of the invention to provide an improved method for compressing the speech waveform in time without changing its bandwidth. Briefly, an improved speech waveform compressor is provided that performs the following operations on a given segment of the input speech waveform. If the speech waveform is periodic, it performs TDHS; if the waveform comprises silence, it removes the silence; and if the waveform is neither periodic nor silent, it transmits the signal without compression or alteration. The improved speech waveform compressor, according to the invention, provides a first signal output comprising a continuous compressed speech waveform and a second auxiliary output comprising compression information. The compressed signal output may then be input to a speech coder to achieve a lower bit rate without reducing audio quality.

Ultimately, both the output of the speech coder and the auxiliary information may be multiplexed for transmission on a common channel.

Brief Description of the Drawings

Fig. 1 is a block diagram showing a speech coder using MSBC in combination with the preferred embodiment of an improved speech waveform compressor, according to the invention.

Fig. 2 is a more detailed block diagram of the preferred embodiment.

Figs. 3A and 3B show a flow diagram for the preferred embodiment.

Fig. 4 depicts the speech coder of Fig. 1 with the speech waveform compressor implemented via a digital signal processor (DSP).

Detailed Description of the Invention

Referring now to Fig. 1, there is shown a speech coder 100 using an MSBC coder 113 in combination with the preferred embodiment 200 of an improved speech waveform compressor, according to the invention. The MSBC coder 113 is of a type known in the art. (The coder 113, however, is not essential to the understanding of the present invention.) An acoustic input signal to be transmitted is applied at microphone 101. The input signal 103 is then applied to filter 105, which is generally a bandpass-type. The filtered signal 107 is then represented by a digital code in analog-to-digital converter 109, as known in the art. The sampling rate is 8.0 kHz in the preferred embodiment. The digital output 201 of A/D 109, which may be represented as input speech sequence s(j), is then applied to the present invention 200. This input speech sequence s(j) is repetitively obtained in separate frames, i.e., blocks of time. In the preferred embodiment, input speech sequence s(j), 1 ≤ j ≤ J, represents a 32msec frame containing J=256 samples.
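The frame size stated above follows directly from the sampling rate and frame duration. A minimal Python sketch of that arithmetic and of splitting a sample stream into frames is given below; the function names are illustrative only and do not appear in the specification.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated for the preferred embodiment
FRAME_MS = 32           # frame duration stated for the preferred embodiment


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Return J, the number of samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive full frames.

    A trailing partial frame is dropped in this sketch (an assumption;
    the specification does not discuss partial frames).
    """
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]
```

With the stated values, `samples_per_frame()` evaluates to 8000 x 32 / 1000 = 256, matching J=256.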

The compressor 200 has one input 201 and two outputs, 219 and 237. A first signal output 219 comprises the compressed speech waveform, and is applied to MSBC coder 113 and ultimately to the multiplexer 137. As each compressed frame of speech is MSBC-encoded and multiplexed at block 137, a second compressor output 237 is also provided. This auxiliary output 237 is the compressed speech frame's corresponding compression information. As shown, this signal 237 is also multiplexed at block 137 to create a block of information at output 139 which completely describes the original input speech waveform 103. In the preferred embodiment, for example, the block of information can represent 32msec, 48msec, or 64msec of original uncompressed speech preceded by anywhere from 0 to 7 frames (at 32msec per frame) of silence. The amount of speech that each information block 139 represents depends on the actual value of the compression information 237 that is contained in the block.
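One way to read the 32msec, 48msec, and 64msec figures is that each 32msec compressed frame is assembled from two half-frames, each of which is either passed unchanged (16msec of original speech) or produced by 2-to-1 TDHS (32msec of original speech). This reading is an inference from the half-frame description of Figs. 3A and 3B and is not stated explicitly in the text. Under that assumption, the arithmetic can be sketched in Python as follows; the mode labels and function names are hypothetical.

```python
FRAME_MS = 32
HALF_FRAME_MS = FRAME_MS // 2


def original_speech_ms(half_frame_modes):
    """Original speech duration represented by one compressed frame.

    Each mode is "tdhs" (the half-frame came from a full input frame)
    or "pass" (the half-frame was transmitted unchanged).
    """
    return sum(FRAME_MS if mode == "tdhs" else HALF_FRAME_MS
               for mode in half_frame_modes)


def block_duration_ms(silence_frames, half_frame_modes):
    """Total original duration of one information block, including removed silence."""
    if not 0 <= silence_frames <= 7:
        raise ValueError("silence count is limited to 0..7 frames")
    return silence_frames * FRAME_MS + original_speech_ms(half_frame_modes)
```

For example, `original_speech_ms(("pass", "tdhs"))` gives 48, and a block preceded by the maximum of 7 silence frames with both halves TDHS-compressed would represent 7 x 32 + 64 = 288msec of input.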

The compressor 200's input 201 and two outputs 219 and 237 will be explained further in Figures 2 and 3.

Referring now to Fig. 2, there is shown a more detailed block diagram of the preferred embodiment of the improved speech waveform compressor 200. As shown, the input speech sequence 201 is stored in dynamic storage 203 for a length of time determined by decision logic 235. As each new sequence 201 leaves storage 203, it is also applied as input signal 207 to a pitch detector 229, to a periodicity detector 231, and to a silence detector 233. As shown, switch 209 is capable of residing in three positions designated A, B, and C, and explained as follows:

In position A, the signal stored in 203 is ultimately applied or transmitted to a first circuit path comprising lead 211, TDHS compressor 212, lead 224, and summer or combiner 217. As will be described below, this path acts to compress the signal.

In position B, the signal is effectively directed or applied to a second circuit path comprising short circuit 213 and summer or combiner 217. As will be explained below, this path acts to pass or transmit the signal without compression or alteration.

In position C, the signal is connected to an open circuit 215. As will be disclosed below, this path acts to discard, delete, or block the signal.

It will be appreciated that decision logic 235 is arranged to selectively control the position of switch 209 via any convenient means such as, for instance, the depicted control path or channel 221.
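The three switch positions described above can be modelled as a simple dispatch. The following Python sketch is illustrative only: the `SwitchPosition` and `route` names are hypothetical, and the TDHS compressor is passed in as a callable so that the sketch does not assume any particular compressor implementation.

```python
from enum import Enum


class SwitchPosition(Enum):
    A = "tdhs"      # first path: TDHS compressor 212
    B = "pass"      # second path: short circuit 213
    C = "discard"   # open circuit 215


def route(segment, position, tdhs_compress=None):
    """Return the samples that reach summer 217 for one stored segment."""
    if position is SwitchPosition.A:
        return tdhs_compress(segment)   # compressed samples
    if position is SwitchPosition.B:
        return list(segment)            # unchanged samples
    if position is SwitchPosition.C:
        return []                       # nothing reaches the summer
    raise ValueError("unknown switch position")
```

Concatenating the results of successive `route` calls corresponds to the continuous output at the summer, since a discarded segment contributes no samples and therefore leaves no gap.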

The silence detector 233 may be any typical detector as known in the art. Such a suitable silence detector is shown, for example, by L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall 1978, (hereinafter "Rabiner and Schafer"), at page 130. It will be appreciated that the silence detector 233 outputs a value or set of values that is a measure of how likely it is that the input signal is silence. The output of silence detector 233 is compared to a variable silence threshold by decision logic 235. If the output is below the threshold, the input signal is considered to be silence.

The pitch detector 229 may be any pitch detector as known in the art. Such a suitable pitch detector is shown, for example, by Rabiner and Schafer at page 150. It will be appreciated that the pitch detector 229 outputs the fundamental frequency (pitch) of the input signal and applies it to decision logic 235.

The periodicity detector 231 may be any periodicity detector as known in the art. Such a suitable periodicity detector is shown, for example, by Rabiner and Schafer at page 150. It will be appreciated that the periodicity detector 231 outputs a value which is a measurement of the periodicity of the input waveform.

(In the preferred embodiment 200, a modified autocorrelation pitch detector may be used to implement both the pitch detector 229 and periodicity detector 231. Such a modified autocorrelation pitch detector is well known in the art, Rabiner and Schafer disclosing such a device at p. 130. Along with the output pitch value, the detector outputs an autocorrelation value as a periodicity measurement which shows how likely it is that the detected pitch value is correct. The value ranges between +1 and -1. As it approaches +1, the detected pitch becomes more accurate because the input signal is more periodic.) As shown, the periodicity measurement 225 is compared to a variable periodicity threshold by decision logic 235. If the measurement is above the threshold, the input signal 207 is assumed to be periodic (voiced) and the detected pitch value 223 is assumed to be correct. If the measurement is below the threshold, the input signal 207 is assumed to be not periodic (unvoiced) so the detected pitch value 223 is useless.
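A combined pitch and periodicity measurement of this general kind can be sketched in Python as shown below. The sketch uses a plain normalized autocorrelation, which is an assumption made for brevity and is not necessarily the modified autocorrelation method of Rabiner and Schafer. The function name and the lag-range parameters are hypothetical. The pitch is returned as a period in samples; the fundamental frequency would be the sampling rate divided by that period.

```python
import math


def detect_pitch(frame, min_lag, max_lag):
    """Return (lag, periodicity) for one frame.

    lag is the candidate pitch period in samples; periodicity is the
    normalized autocorrelation at that lag, which lies in [-1, +1].
    max_lag is assumed to be smaller than len(frame).
    """
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        a = frame[:len(frame) - lag]   # frame without its last `lag` samples
        b = frame[lag:]                # frame delayed by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        corr = num / den if den > 0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```

For a strongly periodic frame the returned periodicity approaches +1 at the true period, which is the condition under which the decision logic treats the pitch value as reliable.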

Decision logic 235 uses pitch value 223 and periodicity value 225 to control TDHS block 212. If the detected pitch value 223 is accurate (periodicity value 225 is above the threshold) then TDHS can be performed on the input signal to compress the signal by a desired ratio. In the preferred embodiment, this ratio is 2 to 1. If the detected pitch value 223 is not accurate (periodicity value 225 is below threshold) then TDHS is not performed on the input signal because it would degrade the speech quality. The input signal is therefore left unchanged in this case. The preferred embodiment 200 refers to this procedure of only using TDHS when the pitch is accurate as partial TDHS. It gives a lower overall compression ratio than using TDHS all of the time but it adds much less degradation.
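The partial TDHS rule described above can be illustrated with a minimal Python sketch. The cross-fade below is a simplified time-domain stand-in for a 2-to-1 TDHS compressor and is not the implementation of Cox et al.; the function names and the `threshold` argument are hypothetical.

```python
def tdhs_compress_2to1(samples, pitch_period):
    """Cross-fade each pair of adjacent pitch periods into a single period.

    Leftover samples that do not fill a whole pair of periods are dropped
    in this sketch; a practical compressor would carry them forward.
    """
    p = pitch_period
    out = []
    for start in range(0, len(samples) - 2 * p + 1, 2 * p):
        for n in range(p):
            w = 1.0 - n / p   # weight falls linearly from 1 toward 0
            out.append(w * samples[start + n]
                       + (1.0 - w) * samples[start + p + n])
    return out


def partial_tdhs(samples, pitch_period, periodicity, threshold):
    """Apply TDHS only when the periodicity measurement exceeds the threshold."""
    if periodicity > threshold:
        return tdhs_compress_2to1(samples, pitch_period)
    return list(samples)   # unvoiced: leave the signal unchanged
```

When the input is exactly periodic with the given period, the two periods in each pair are identical, so the cross-fade reproduces a single period without distortion. This is consistent with the observation that TDHS works well only on periodic speech.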

The objective of compressor 200 is to compress the speech waveform at a predetermined ratio. In the preferred embodiment, for example, the compression ratio is 1.8 to 1. This means that for every 9 frames of speech (1 frame = 32msec in the preferred embodiment) that enter the compressor at the dynamic storage 203, 5 frames of compressed speech leave the compressor out of summer 217. In order to allow the compressor to track the burst nature of speech, the dynamic storage 203 is used to buffer a variable amount of input speech. In the preferred embodiment, for example, it holds a maximum of 3 frames of input speech. It can be holding anywhere from 0 (empty) to 3 (full) frames of input speech at any given time. This allows the compressor to adjust its instantaneous compression rate according to the current characteristics of the speech waveform. For example, if the dynamic storage is full (contains 3 frames) and the speech contains a great deal of silence or is very periodic, the compressor can compress 12 frames of speech (9+3) into 5 frames of compressed speech by clearing out the 3 frames in dynamic storage. The instantaneous compression ratio in this case would be 12 to 5. If, in the other extreme, the dynamic storage was empty and the speech had no silence and was not very periodic, the compressor could choose to compress only 6 frames of speech into 5 frames of compressed speech and store the other 3 frames of input speech (9-6) in the dynamic storage. The instantaneous compression ratio in this case would be 6 to 5. As can be seen, the instantaneous compression ratio varies to accommodate the speech waveform at any given time while the predetermined average compression ratio is maintained over the long term. A variable instantaneous compression ratio is necessary for the time compressor to track the burst nature of speech. Speech occurs in bursts as does silence.
When the amount of silence present in the speech waveform is high, the decision logic can raise the compression ratio and remove as much silence as possible. Also, when the speech waveform is very periodic, TDHS works very well and the compression ratio can also be raised. When the amount of silence is low or the speech waveform consists mostly of unvoiced portions, the time compression ratio can be low because neither TDHS nor silence removal is applied.
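The ratio figures above follow from simple bookkeeping on the dynamic storage. The Python sketch below reproduces that arithmetic; the function name is illustrative, and the calculation assumes one cycle in which 9 frames arrive and 5 compressed frames leave, as in the preferred embodiment.

```python
from fractions import Fraction

FRAMES_IN, FRAMES_OUT = 9, 5   # stated average: 9 input frames per 5 output frames
STORAGE_MAX_FRAMES = 3         # stated maximum size of dynamic storage 203


def instantaneous_ratio(storage_before, storage_after):
    """Input frames consumed per output frame over one 9-in, 5-out cycle."""
    for level in (storage_before, storage_after):
        if not 0 <= level <= STORAGE_MAX_FRAMES:
            raise ValueError("storage level must be between 0 and 3 frames")
    consumed = FRAMES_IN + storage_before - storage_after
    return Fraction(consumed, FRAMES_OUT)
```

Emptying a full storage gives `instantaneous_ratio(3, 0)`, which is 12/5; filling an empty storage gives `instantaneous_ratio(0, 3)`, which is 6/5; and an unchanged storage level gives the long-term average of 9/5, i.e. 1.8 to 1.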

In order to maintain a continuous compressed waveform (no gaps where silence has been removed or TDHS compression has taken place) at summer 217, a predetermined amount of initial delay must be incorporated into the compressor at dynamic storage 203. In the preferred embodiment, this delay would be equal to the sum of 2 different delays. The first is equal in length to 3 frames of input speech (which is the maximum size of the dynamic storage). The second is the difference between 9 speech frames (input) and 5 compressed speech frames (output). This adds up to 7 speech frames (224msec) of delay. The inherent delay limits the maximum instantaneous compression ratio to 12 to 5. This gives a limit to the amount of silence that can be removed from the speech waveform at any given time. If the dynamic storage is full, no more than 7 speech frames can be removed as silence. If 7 frames are removed, then the 5 compressed frames are actually generated from 5 original speech frames. This implies that no TDHS compression had to be performed while the instantaneous compression ratio was 12 to 5. In order for the time compression ratio to be above 9 to 5 at any given time, the dynamic storage must contain some speech frames that are in the process of being removed. In this situation, the decision logic tries to lower the instantaneous compression ratio down to 9 to 5. It does this in two ways. First, it raises the TDHS correlation threshold, which makes TDHS compression occur less often. Second, it changes the silence detection threshold to give less silence removal. As the instantaneous compression ratio falls below 9 to 5, the dynamic storage fills up. The decision logic tries to raise the instantaneous compression ratio back up to 9 to 5. It lowers the TDHS correlation threshold to allow for more TDHS compression and it changes the silence threshold to allow for more silence removal. 
In some instances, the compression ratio may drop low enough that the decision logic must force compression to avoid overflow of the dynamic storage. In this case, TDHS compression is used regardless of the periodicity value if no silence is present.
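The delay arithmetic and the direction of the threshold adjustments can be sketched in Python as follows. The specification gives no numeric threshold values or adjustment law, so the linear interpolation and the default ranges below are purely hypothetical; only the direction of each adjustment is taken from the text (a fuller storage calls for more TDHS and more silence removal).

```python
def initial_delay_frames(storage_max=3, frames_in=9, frames_out=5):
    """Initial delay: storage size plus the input/output frame difference."""
    return storage_max + (frames_in - frames_out)


def adjust_thresholds(fill_frames, storage_max=3.0,
                      periodicity_range=(0.5, 0.9),    # hypothetical values
                      silence_range=(0.01, 0.05)):     # hypothetical values
    """Return (periodicity_threshold, silence_threshold) for a storage level.

    A fuller storage lowers the periodicity threshold (TDHS occurs more
    often) and raises the silence threshold (more frames fall below it
    and are removed). An emptier storage does the opposite.
    """
    fill = min(max(fill_frames / storage_max, 0.0), 1.0)
    low_p, high_p = periodicity_range
    low_s, high_s = silence_range
    periodicity_threshold = high_p - fill * (high_p - low_p)
    silence_threshold = low_s + fill * (high_s - low_s)
    return periodicity_threshold, silence_threshold
```

With the preferred-embodiment values, `initial_delay_frames()` gives 3 + (9 - 5) = 7 frames, or 7 x 32 = 224msec.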

As can be seen from Fig. 2, for every frame of compressed waveform 219 that leaves summer 217, the decision logic 235 generates compression information 237. This information is necessary for future (downstream) correct expansion of the compressed waveform if it is desired. Therefore, the compressor 200 has two outputs. The first signal output 219 is a waveform that is compressed at a specified ratio. The second ("auxiliary") output 237 is an information stream that describes how to expand the compressed waveform to reconstruct the original input waveform.

Referring now to Figs. 3A and 3B, there is shown a flow diagram for the compressor 200. This flow diagram may be viewed as determining the operation of decision logic 235 and its interaction with dynamic storage 203 and switch 209.

The process begins at step 301 with a silence counter being set equal to zero (0) at step 303 to denote that no silence has been removed yet. The dynamic storage 203 waits for a new frame (32msec in the preferred embodiment) of input speech at step 305. Once it has arrived, the process goes to step 307 and determination step 309 where the decision logic 235 counts the number of samples that are in the dynamic storage 203 and determines if there are enough to remove silence. If the determination at step 309 is that silence can be removed (the determination is affirmative), the process proceeds to step 317 where it points to the oldest frame in the dynamic storage and to step 319 where it receives the output 227 from the silence detector 233 as determined for this frame.

The process then proceeds to step 321 and determination step 323 where the decision logic compares the output 227 of the silence detector 233 to the current silence threshold to determine if the frame is silence. If determination step 323 determines the frame is silence (the determination is affirmative), then the process proceeds (via reference letter "A") to step 335 where the silence and periodicity thresholds are adjusted to a level that shows that the dynamic storage is not near being filled up. The decision logic 235 then sets switch 209 to position C at step 337 via control 221 and instructs the dynamic storage 203 at step 339 to transmit the frame of speech through the switch to open circuit 215 where the frame is removed. After incrementing the silence counter at step 341 to denote that an additional frame of silence has been removed, the process returns (via reference letter "E") to step 307.

At step 307 the decision logic 235 once again checks to see if there are enough frames left in the dynamic storage 203 to remove silence. The process then continues as described above.

Returning now to determination step 323, if the process determines the frame is not silence (the determination is negative), it proceeds to step 324 where decision logic 235 outputs the current value of the silence counter as compression information 237. It will be appreciated that this is necessary so that a waveform expander located downstream can subsequently insert the proper amount of silence at the correct place in the compressed waveform to reconstruct the original input waveform. The process then goes (via reference letter "B") to steps 325 and 327 where it checks to see if the dynamic storage is so full that TDHS compression is required or mandatory. If determination step 327 determines it is mandatory (determination is affirmative), the process proceeds to step 355 (which will be described later) to perform TDHS compression. If not (determination is negative), the process goes to step 333 where it adjusts the silence and periodicity thresholds to show that the dynamic storage is not near being overflowed or emptied out. The process then proceeds (via reference letter "F") to step 329 (which will be described later) to decide if TDHS compression will be performed.

Returning now to determination step 309 and assuming that it is determined that silence can not be removed because the dynamic storage is near empty (determination is negative), the process thereupon proceeds to step 310. Here decision logic 235 outputs the current value of the silence counter as compression information 237. The process goes then to step 311 where it sets the silence and periodicity thresholds to a value which shows that the dynamic storage is near empty. The process then goes to determination step 313 where it determines if there are enough samples in the dynamic storage 203 to perform TDHS compression. If determination step 313 determines that there are not enough samples (determination is negative), then no compression can take place. As a result, TDHS and silence removal are prohibited from occurring by the process going to step 315. The process thereupon goes (via reference letter "D") to step 347. Here switch 209 is set to position B via control path or signal 221. The process then proceeds (via reference letter "G") to step 349, whereupon it instructs the dynamic storage to send one-half (1/2 or 0.5) of a frame of input speech through switch 209 and short circuit 213 to summer 217, thus without any frame compression or frame deletion taking place. Only one-half of a frame is sent so that the amount of speech at line 213 is equal to the amount at line 224. This is half of a frame because TDHS compression with a ratio of 2 to 1 converts one frame into one-half (or 50%) of a frame. Before returning to step 301, decision logic 235 sends a value for compression information 237 at step 351 which denotes that no TDHS compression has been performed so no pitch is necessary. The process then returns to step 301.

Returning now to determination step 313 and assuming it is determined silence can not be removed but there are enough samples in dynamic storage to perform TDHS compression (determination is affirmative), the process proceeds to step 329. Here the decision logic 235 first points to the oldest frame in the dynamic storage and then looks at the pitch and periodicity values 223 and 225 for this frame at step 331. The process then proceeds (via reference letter "C") to determination step 345. Here (determination step 345) the decision logic compares the periodicity value to the current periodicity threshold. If determination step 345 determines the value is above the threshold (determination is affirmative), then TDHS compression takes place. The process goes to step 355 where the silence and periodicity thresholds are adjusted and then to steps 357, 359, and 361 where pitch 223 is sent to TDHS compression block 212 and a frame of speech is removed from dynamic storage 203 and sent through switch 209 (which is now in position A) and lead 211 to the TDHS compression block 212. As a result, one-half (1/2 or 0.5) of a frame of compressed speech arrives at summer 217 via line 224. Before returning to step 301, the decision logic 235 outputs pitch 223 as compression information 237 at step 363 to denote that TDHS compression has been performed using pitch 223. The process then returns to step 301.
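The branching of Figs. 3A and 3B, as described above, can be condensed into a single decision function. The Python sketch below is a simplified reading of the flow and omits threshold adjustment, the silence counter, and the compression information output. The sample-count limits are hypothetical, since the specification does not state how many stored samples are "enough" at steps 309, 313, and 327. The negative branch of step 345 is assumed to lead to the pass-through path (position B), consistent with the statement that the signal is left unchanged when the pitch is not reliable.

```python
from enum import Enum


class Action(Enum):
    TDHS = "A"      # switch position A: compress via TDHS block 212
    PASS = "B"      # switch position B: send half a frame unchanged
    DISCARD = "C"   # switch position C: remove a frame of silence


def decide(stored_samples, is_silence, periodicity, periodicity_threshold,
           silence_min=512,       # hypothetical: enough samples to remove silence
           tdhs_min=256,          # hypothetical: enough samples for TDHS
           mandatory_min=768):    # hypothetical: storage so full TDHS is forced
    """Choose the switch position for the oldest stored frame."""
    if stored_samples >= silence_min:            # step 309
        if is_silence:                           # step 323
            return Action.DISCARD
        if stored_samples >= mandatory_min:      # step 327
            return Action.TDHS
    elif stored_samples < tdhs_min:              # step 313
        return Action.PASS
    if periodicity > periodicity_threshold:      # step 345
        return Action.TDHS
    return Action.PASS                           # assumed negative branch of 345
```

Note that in this reading a silent frame is not discarded when the storage is nearly empty; it is treated like any other frame and is either compressed or passed through.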

The improved speech waveform compressor, according to the invention, may be implemented by means of a suitably-programmed digital signal processor (DSP). The DSP56000, available from Motorola, Inc., 1301 East Algonquin Road, Schaumburg, Illinois, 60196, is such a suitable DSP. This DSP may be programmed in accordance with user's manual #DSP56000UM/AD, also available from Motorola, Inc. This implementation is depicted in Fig. 4, which is identical to Fig. 1 except for the element 200' depicting a waveform compressor, according to the invention, implemented via a DSP.

While various embodiments of the improved speech waveform compressor, according to the invention, have been described herein, the scope of the invention is defined by the following claims.

Claims

What is claimed is:

1. A speech compressor comprising: input means for receiving samples that may include voice samples; determining means responsive to said input means for determining when to compress said received samples; compressing means for compressing said samples; output signal means responsive to said compressing means and said input means for transmitting an output signal based at least in part on said received samples; auxiliary signal means responsive to said determining means for transmitting an auxiliary signal indicating compression information about the output signal.

2. A speech compressor comprising: input means for receiving input samples that may contain voice information; pitch detector means for determining the pitch of said input samples; periodicity detector means for determining the periodicity of said input samples; silence detector means for determining the silence of said input samples; compression means for selectively compressing said input samples; discarding means for selectively discarding said input samples; output signal means for selectively forming a continuous output signal based at least in part on the input samples, the compressed input samples, and the discarded input samples; decision logic means for controlling said output signal means responsive to said pitch, said periodicity, and said silence of said input signals to achieve an overall fixed average compression ratio; auxiliary signal means responsive to said decision logic means for generating an auxiliary signal based on said silence and said pitch.

3. A method for compressing an input signal that may contain voice information, comprising:

(a) storing a portion of the input signal;

(b) determining the pitch of said stored signal;

(c) determining the periodicity of said stored signal;

(d) determining the silence of said stored signal;

(e) determining when to compress at least a part of said stored signal based on said pitch, periodicity, and silence;

(f) determining when to transmit at least a part of said stored signal based on said pitch, periodicity, and silence;

(g) determining when to discard at least a part of said stored signal based on said pitch, periodicity, and silence;

(h) forming a continuous output signal based at least in part on said compressed signal, said transmitted signal, or said discarded signal;

(i) forming overall compression at a fixed average compression ratio;

(j) forming an auxiliary output signal based on said pitch, periodicity, and silence.

4. A compressor having means for compressing an input signal that may contain voice information, the compressor comprising: means for storing a portion of the input signal; means for determining the pitch of said stored signal; means for determining the periodicity of said stored signal; means for determining the silence of said stored signal; means for determining when to compress at least a part of said stored signal based on said pitch, periodicity, and silence; means for determining when to transmit at least a part of said stored signal based on said pitch, periodicity, and silence; means for determining when to discard at least a part of said stored signal based on said pitch, periodicity, and silence; means for forming a continuous output signal based at least in part on said compressed signal, said transmitted signal, or said discarded signal; means for forming overall compression at a fixed average compression ratio; means for forming an auxiliary output signal based on said pitch, periodicity, and silence.

5. A digital signal processor programmed for compressing an input signal that may contain voice information, the program comprising the steps of:

(c) determining the periodicity of said stored signal;

(d) determining the silence of said stored signal;

(g) determining when to discard at least a part of said stored signal based on said pitch, periodicity, and silence;

(i) forming overall compression at a fixed average compression ratio;

(j) forming an auxiliary output signal based on said pitch, periodicity, and silence.

6. A speech coder comprising a microphone coupled to a voice-band filter, said filter coupled to an analog to digital converter (A/D), said A/D coupled to the input of a waveform compressor having a signal output and an auxiliary output, the waveform compressor signal output coupled to the input of a multilevel subband coder (MSBC), the MSBC output coupled to a first input of a multiplexer/channel formatter, the waveform compressor auxiliary output coupled to a second input of said multiplexer/channel formatter, said multiplexer/channel formatter output coupled to a channel.

7. The speech coder of claim 6 wherein said waveform compressor comprises: means for storing a portion of the input signal; means for determining the pitch of said stored signal; means for determining the periodicity of said stored signal; means for determining the silence of said stored signal; means for determining when to compress at least a part of said stored signal based on said pitch, periodicity, and silence; means for determining when to transmit at least a part of said stored signal based on said pitch, periodicity, and silence; means for determining when to discard at least a part of said stored signal based on said pitch, periodicity, and silence; means for forming a continuous output signal based at least in part on said compressed signal, said transmitted signal, or said discarded signal; means for forming overall compression at a fixed average compression ratio; means for forming an auxiliary output signal based on said pitch, periodicity, and silence.

8. The speech coder of claim 6 wherein said waveform compressor comprises a digital signal processor programmed for compressing an input signal that may contain voice information, the program comprising the steps of:

(a) storing a portion of the input signal;

(b) determining the pitch of said stored signal;

(c) determining the periodicity of said stored signal;

(d) determining the silence of said stored signal;

(e) determining when to compress at least a part of said stored signal based on said pitch, periodicity, and silence;

(f) determining when to transmit at least a part of said stored signal based on said pitch, periodicity, and silence;

(g) determining when to discard at least a part of said stored signal based on said pitch, periodicity, and silence;

(h) forming a continuous output signal based at least in part on said compressed signal, said transmitted signal, or said discarded signal;

(i) forming overall compression at a fixed average compression ratio;

(j) forming an auxiliary output signal based on said pitch, periodicity, and silence.

9. An improved compressor including means for compressing a speech waveform in time without changing its bandwidth, comprising: means for providing an input speech waveform; first signal output means including: means for performing TDHS if the speech waveform is periodic; means for removing the silence if the waveform comprises silence; and, means for transmitting the signal without alteration if the waveform is neither periodic nor silent; and, means for generating a continuous output signal; and, means for performing overall compression at a fixed average compression ratio; and, second auxiliary output means for providing compression information.

10. An improved compression method, comprising:

(a) providing an input waveform;

(b) performing TDHS if the waveform is periodic;

(c) removing the silence if the waveform includes silence;

(d) transmitting the signal without alteration if the waveform is neither periodic nor silent;

(e) providing the continuous output signal based on (b), (c), and (d) at a predetermined average compression ratio; and,

(f) providing an auxiliary output comprising compression information.