CA2218605C - Method and apparatus for data compression and decompression in speech recognition - Google Patents


Info

Publication number
CA2218605C
CA2218605C · CA2218605A · CA002218605A
Authority
CA
Canada
Prior art keywords
sub
vectors
vector
elements
grouping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA002218605A
Other languages
French (fr)
Other versions
CA2218605A1 (en)
Inventor
Hung Shun Ma
Current Assignee
Nortel Networks Ltd
Original Assignee
Nortel Networks Ltd
Nortel Networks Corp
Priority date
Filing date
Publication date
Application filed by Nortel Networks Ltd, Nortel Networks Corp filed Critical Nortel Networks Ltd
Priority to CA002218605A
Publication of CA2218605A1
Application granted
Publication of CA2218605C
Anticipated expiration
Legal status: Expired - Fee Related

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method and apparatus for compressing and decompressing an audio signal is provided. The apparatus comprises an input for receiving an audio signal derived from a spoken utterance, the audio signal being contained in a plurality of successive data frames. A data frame holding a certain portion of the audio signal is processed to generate a feature vector including a plurality of discrete elements characterizing at least in part the portion of the audio signal encompassed by the frame, the elements being organized in a certain sequence. The apparatus makes use of a compressor unit having a grouping processor for grouping elements of the feature vector into a plurality of sub-vectors on the basis of a certain grouping scheme, at least one of the sub-vectors including a plurality of elements from the feature vector, the plurality of elements being out of sequence relative to the certain sequence. The plurality of sub-vectors is then quantized by applying a vector quantization method.

Description

Title of the invention: Method and Apparatus for Data Compression and Decompression in Speech Recognition.
Field of the invention

This invention relates to a method and an apparatus for automatically performing desired actions in response to spoken requests. It is applicable to speech recognition systems, specifically to speech recognition systems using feature vectors to represent speech utterances, and can be used to reduce the storage space required for the speech recognition dictionary and to speed up the upload/download operations required in such systems.
Background of the invention

In addition to providing printed telephone directories, telephone companies provide information services to their subscribers. The services may include stock quotes, directory assistance and many others. In most of these applications, when the information requested can be expressed as a number or number sequence, the user is required to enter his request via a touch tone telephone. This is often aggravating for the user, since he is usually obliged to make repetitive entries in order to obtain a single answer. The situation becomes even more difficult when the input information is a word or phrase. In these situations, the involvement of a human operator may be required to complete the desired task.

Because telephone companies are likely to handle a very large number of calls per year, the associated labour costs are very significant. Consequently, telephone companies and telephone equipment manufacturers have devoted considerable efforts to the development of systems which reduce the labour costs associated with providing information services on the telephone network.
These efforts comprise the development of sophisticated speech processing and recognition systems that can be used in the context of telephone networks.
In a typical speech recognition system the user enters his request using isolated word, connected word or continuous speech via a microphone or telephone set. The request may be a name, a city or any other type of information for which either a function is to be performed or information is to be supplied. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. The speech recognition process can be split into two steps namely a pre-processing step and a search step. The pre-processing step, also called the acoustic processor, performs the segmentation, the normalisation and the parameterisation of the input signal waveform.
Its purpose is traditionally to transform the incoming utterance into a form that facilitates speech recognition. Typically feature vectors are generated at this step. Feature vectors are used to identify speech characteristics such as formant frequencies, fricative, silence, voicing and so on. Therefore, these feature vectors can be used to identify a spoken utterance. The second step in the speech recognition process, the search step, includes a speech recognition dictionary that is
scored in order to find possible matches to the spoken utterance based on the feature vectors generated in the pre-processing step. The search may be done in several steps in order to maximise the probability of obtaining the correct result in the shortest possible time, most preferably in real-time. Typically, in a first pass search, a fast match algorithm is used to select the top N orthographies from a speech recognition dictionary. In a second pass search the individual orthographies are re-scored using more precise likelihood calculations. The top two orthographies in the re-scored group are then processed by a rejection algorithm that evaluates whether they are sufficiently distinctive from one another so that the top choice candidate can be considered a valid recognition.
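The two-pass search described above can be sketched as follows. This is a hedged illustration only: the scoring functions, the margin value and the dictionary layout are toy stand-ins, not the patent's actual fast-match or likelihood computations, and at least two dictionary entries are assumed.

```python
import math

# Toy stand-ins for the real scoring functions (illustrative only):
# a cheap first-pass score and a more precise second-pass score.
def fast_score(utt, orth):
    return abs(len(utt) - len(orth["template"]))

def precise_score(utt, orth):
    n = min(len(utt), len(orth["template"]))
    return math.sqrt(sum((utt[i] - orth["template"][i]) ** 2 for i in range(n)))

def two_pass_search(utt, dictionary, n_best=10, margin=5.0):
    """First pass: keep the top-N orthographies by the fast score.
    Second pass: re-score them precisely, then apply a rejection test
    requiring a clear gap between the top two candidates."""
    shortlist = sorted(dictionary, key=lambda o: fast_score(utt, o))[:n_best]
    rescored = sorted(shortlist, key=lambda o: precise_score(utt, o))
    best, runner_up = rescored[0], rescored[1]
    if precise_score(utt, best) + margin < precise_score(utt, runner_up):
        return best          # sufficiently distinctive: valid recognition
    return None              # too close to the runner-up: reject
```

When two entries score almost identically, the rejection step returns no match rather than guessing, mirroring the distinctiveness test described above.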
Voice activated dialling (VAD) systems are often based on speaker trained technology. This allows the user of the service to enter by voice a series of names for which he wishes to use VAD. Each of the names is associated with a phone number that is dialled when the user utters the name. The names and phone numbers are stored in a "client dictionary" situated in the central repository of the VAD system. Each subscriber of the service has an associated client dictionary. Since the number of subscribers is substantial and the number of entries in each client dictionary can be quite large, the storage requirements for the central repository are very high. Furthermore, each user request requires his respective client dictionary to be downloaded to a temporary storage location in the speech recognition unit, which puts a further load on the system.
Compression/decompression techniques are required to allow the system to support such a load. However, prior art techniques that achieve high compression factors are either not real-time or significantly degrade the recognition accuracy of the speech recogniser.
Thus, there exists a need in the industry for a real-time compression/decompression method that minimises the storage requirements of a speech recognition dictionary while maintaining a high recognition accuracy.
Objects and Statement of the Invention

An object of the invention is to provide a method and apparatus for performing compression and/or decompression of an audio signal that offers real-time performance, particularly well suited to the field of speech recognition.
Another object of this invention is to provide a method and apparatus for adding entries to a speech recognition client dictionary, particularly well suited for use in training a voice activated dialing system.
As embodied and broadly described herein the invention provides an apparatus for compressing an audio signal, said apparatus comprising:
- means for receiving an audio signal;
- means for processing said audio signal to generate at least one feature vector, said feature vector including a plurality of elements;
- means for grouping elements of said feature vector into a plurality of sub-vectors; and
- means for quantizing said plurality of sub-vectors.
For the purpose of this specification, the expression "feature vector" designates a data element that can be used to describe the characteristics of a frame of speech. The elements of the feature vectors are parameters describing different components of the speech signal such as formants, energy, voiced speech and so on.
Examples of parameters are LPC parameters and mel-based cepstral coefficients.
For the purpose of this specification, the expression "speech recognition client dictionary" designates a data structure containing orthographies that can be mapped onto a spoken utterance on the basis of acoustic characteristics and, optionally, a-priori probabilities or another rule, such as a linguistic or grammar model. Each of these data structures is associated with a user of a speech recognition system.
For the purpose of this specification, the expression "orthography" designates a data element that can be mapped onto a spoken utterance, which can form a single word or a combination of words.
For the purpose of this specification, the expressions "quantizing", "quantize" and "quantization" are used to designate the process of approximating a value by another in order to reduce the memory space required for the representation of the latter. Devices designated herein as "quantizors" perform this process.
In a most preferred embodiment of this invention the compression apparatus is integrated into a speech recognition system, such as a voice activated dialing
system, of the type that one could use in a telephone network, that enables users to add names to a directory in order to train the system. Preferably, each user has his own directory, herein referred to as a client dictionary, where he may enter a plurality of entries. In a typical training interaction, once the voice activated system receives a request from the user, it will first download from a central database, herein referred to as the database repository, the client dictionary associated with the user. As a second step the system will issue a prompt over the telephone network requesting the user to specify a name he wishes to add. If valid speech is detected the speech utterance is transformed into a series of feature vectors by a preprocessing module. Each feature vector describes the characteristics of one frame of speech.
The preferred embodiment of this invention uses mel-based vectors, each vector being composed of twenty-three (23) coefficients. The first eleven (11) elements of the feature vector, often referred to as the static parameters, are the cepstral coefficients c1, ..., c11. The remaining twelve (12) elements, often called dynamic parameters, are often designated as δc0, ..., δc11 and are estimated from the first derivative of each cepstral coefficient. In the preferred embodiment the cepstral coefficient c0, which represents the total energy in a given frame, is not used. The feature vectors are then transferred to the speech recognition search unit. The speech training unit of this module performs confusability testing between the newly entered utterance and each entry in the client dictionary. This confusability testing includes three parts. Firstly, each entry in the client dictionary is processed by the decompression unit, which generates a series of feature vectors. In the preferred embodiment entries are
processed one at a time. Each entry in the client dictionary is then scored for similarity against the new entry. The preferred embodiment of this invention uses the fast-match algorithm. Examples of fast match algorithms can be found in Gupta V. N., Lennig M., Mermelstein P., "A fast search strategy in a large vocabulary word recogniser", INRS-Telecommunications, J. Acoust. Soc. Am. 84 (6), December 1988, p. 2007, and in U.S. patent No. 5,515,475 by inventors Gupta V. N. and Lennig M. Once all the entries in the client dictionary have been scored, the best score is compared to a threshold indicating the minimum level for confusability. If the score exceeds this threshold, implying that there is an entry in the client dictionary very similar to the one being added, the system returns an error indicating the problem to the user. On the other hand, if the score is below the threshold and no entry in the client dictionary is similar enough, the system requests the user to enter the same name a second time. Again the confusability test is performed with each entry in the client dictionary and the same threshold test is applied. This time, if the threshold is satisfied, a second test, herein referred to as the consistency test, is performed between the first input utterance and the second input utterance.
The preferred embodiment of this invention uses a Dynamic Time Warping (DTW) computation here to compute the distance between the two utterances. If the comparison is successful, a model is created which is a combination of the two utterances, again using DTW techniques to compute the distance between them. Dynamic Time Warping is well known in the field of speech processing. The reader is invited to consult speech recognition books such as "Discrete-Time Processing of Speech Signals" by Deller et al., Macmillan Publishing, New York, and O'Shaughnessy, D. (1990), "Speech Communication: Human and Machine", Addison-Wesley (Addison-Wesley series in Electrical Engineering: Digital Signal Processing). If the test is successful a model representing a Dynamic Time Warping average is created. Following this, the model is processed by the compressor module, which generates a compressed version of the feature vector, herein designated as a quantizor codebook entry index vector. Once the quantizor codebook entry index vectors have been generated for the utterance, they are added to the client dictionary along with useful information such as a telephone number in the case of voice activated dialing. Once the user terminates his operation and hangs up, the client dictionary is uploaded to the database repository. The compression/decompression operations allow the system to reduce the storage space required, since the quantizor codebook entry index vectors require less space than the feature vectors. Furthermore, by virtue of their reduced size, the processor overhead required to download and upload the client dictionary is reduced since less data must be transferred.
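As a point of reference, the DTW distance mentioned above can be computed with the classic dynamic-programming recurrence. This minimal sketch uses a scalar per-frame distance and no path constraints, both of which a production system would typically refine; it is not the patent's specific DTW formulation.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic Time Warping distance between two sequences of frames.
    D[i][j] holds the cost of the best alignment of a[:i] with b[:j]."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Allow a match, an insertion, or a deletion step.
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

Because the warp absorbs timing differences, two utterances of the same name spoken at different speeds can still yield a small distance, which is what makes DTW suitable for the consistency test and for averaging the two training utterances.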
In the preferred embodiment of this invention the compressor module uses a two-step vector quantization method. The basic approach of the compressor is to group the 23-dimensional cepstral vector into a number of pre-defined sub-vectors of various sizes and to encode them separately using a set of variable-bit quantizors which are optimized for these sub-vectors respectively. The feature vector coefficients are first grouped into sub-vectors by a grouping processor. The sub-vector compositions are defined by a priori assignments stored in a grouping table. These a priori assignments are
defined by off-line iterative computations in order to form groups that maximize the accuracy of the system.
Secondly, each sub-vector is processed by an Nk-bit quantizor, Nk being the number of bits assigned to the kth sub-vector. The quantizors generate an index corresponding to an entry in the codebook; each sub-vector is assigned a certain number of bits to index entries in the quantizor codebooks. These indices form the compressed version of the feature vector. In the preferred embodiment each quantizor is associated with a quantizor codebook. The codebooks are created off-line using vector quantization methods: namely, iterative clustering is performed on the entries of the sub-vectors assigned to a particular codebook. Essentially, clustering involves selecting the best set of M codebook vectors based on L test vectors (L ≥ M). The clustering procedure is well known in the field to which this invention pertains. More information about this topic can be found in "Fundamentals of Speech Recognition" by L. Rabiner and B.-H. Juang, Prentice Hall, 1993, pp. 122-132.
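A minimal sketch of this grouping-then-quantizing step follows. The grouping table and per-group codebooks shown are small illustrative values over a 5-element vector, not the patent's off-line-trained tables for the 23-dimensional case; a real Nk-bit quantizor's codebook entries would come from the clustering described above.

```python
# Illustrative a priori grouping table: which feature-vector positions form
# each sub-vector. Note a group may gather out-of-sequence elements.
GROUPING_TABLE = [(0, 4), (1, 2), (3,)]

# One codebook per sub-vector; an Nk-bit quantizor indexes 2**Nk entries.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],                          # 1-bit, group 0
    [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (0.0, 1.0)],  # 2-bit, group 1
    [(0.0,), (1.0,)],                                  # 1-bit, group 2
]

def compress(feature_vector):
    """Group the elements into sub-vectors, then emit the index of the
    nearest codebook entry (squared Euclidean distance) for each group."""
    indices = []
    for positions, codebook in zip(GROUPING_TABLE, CODEBOOKS):
        sub = tuple(feature_vector[p] for p in positions)
        best = min(range(len(codebook)),
                   key=lambda i: sum((s - c) ** 2
                                     for s, c in zip(sub, codebook[i])))
        indices.append(best)
    return indices  # the quantizor codebook entry index vector
```

The compressed form is just one small index per sub-vector (here 1 + 2 + 1 = 4 bits total), which is why the index vectors take far less space than the raw feature vectors.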
In another preferred embodiment of this invention the speech recognition system is integrated into a voice activated dialing system, such as one that could be used in a telephone network, that enables users to dial a number simply by uttering the name of the entity he wishes to reach. Once the voice-activated system receives a request from the user, it will first issue a prompt over the telephone network requesting the user to
specify the name of the person he wishes to call. If valid speech is detected the speech utterance is transformed into a series of feature vectors by a preprocessing module. Each feature vector describes the characteristics of one frame of speech. The preferred embodiment of this invention uses mel-based vectors, each vector being composed of twenty-three (23) coefficients.
The cepstral coefficient c0 is not used since it represents the total energy in a given frame. The feature vectors are then transferred to the speech recognition search unit. The speech recognition unit of this module performs a best match testing operation between the newly entered utterance and each entry in the client dictionary. This best match testing includes two parts.
Firstly, each entry in the client dictionary is processed by the decompression unit, which generates a series of feature vectors. In the preferred embodiment entries are processed one at a time. Each entry in the client dictionary is then scored for similarity against the new entry. The preferred embodiment of this invention uses a fast-match algorithm. Once all the entries in the client dictionary have been scored, the best score is compared to a threshold indicating the minimum level for a match.
If the score exceeds this threshold, implying that there is an entry in the client dictionary very similar to the utterance, the system returns the number associated with the best match. On the other hand, if the score is below the threshold and no entry in the client dictionary is similar enough, the system returns an error or requests the user to enter the name a second time.
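The dialling-mode decision above reduces to a threshold test on the best similarity score. In this hedged sketch the scoring function and the threshold value are illustrative placeholders for the fast-match computation, and the dictionary entry layout is assumed.

```python
def lookup_number(score_fn, client_dictionary, threshold=0.8):
    """Score every dictionary entry against the utterance, then return
    the phone number of the best match only if its score clears the
    minimum level for a match; None signals an error / re-prompt."""
    best_entry, best_score = None, float("-inf")
    for entry in client_dictionary:
        s = score_fn(entry)
        if s > best_score:
            best_entry, best_score = entry, s
    return best_entry["number"] if best_score >= threshold else None
```

Returning None rather than the weak best match keeps the system from dialling a wrong number on a poor recognition; the caller can then emit an error prompt or ask for the name again.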
In the preferred embodiment of this invention the decompressor module uses a two-step vector quantization method. The dictionary entries are stored in the form of quantizor codebook entry index vectors. The quantizor codebook entry index vectors are used to look up in the codebooks the associated quantized coefficients. The codebooks are the same as those of the compressor module and are created off-line using vector quantization methods. Once the sub-vectors have been recovered, a feature vector is generated by the grouping processor.
Using the a priori assignments stored in a grouping table, the sub-vectors are placed in the appropriate positions in the vector.
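Decompression can be sketched as the mirror of that description: look up each index in its codebook, then scatter the recovered sub-vector elements back to their a priori positions. The grouping table and codebooks below are illustrative stand-ins over a 5-element vector, not trained values.

```python
GROUPING_TABLE = [(0, 4), (1, 2), (3,)]  # illustrative a priori assignments
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (0.0, 1.0)],
    [(0.0,), (1.0,)],
]

def decompress(index_vector, dim=5):
    """Rebuild an approximate feature vector from a quantizor codebook
    entry index vector, using the grouping table to place each element
    back at its original (possibly out-of-sequence) position."""
    feature_vector = [0.0] * dim
    for positions, codebook, idx in zip(GROUPING_TABLE, CODEBOOKS,
                                        index_vector):
        for pos, value in zip(positions, codebook[idx]):
            feature_vector[pos] = value
    return feature_vector
```

The reconstruction is lossy by design: each element is the codebook approximation of the original coefficient, which is the trade-off that buys the storage and transfer savings.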
As embodied and broadly described herein the invention also provides an apparatus for compressing a speech signal, said apparatus including:
- means for receiving said speech signal;
- means for processing a frame of said signal to generate at least one feature vector, said feature vector including a plurality of elements describing a characteristic of said frame;
- means for grouping elements of said feature vector into a plurality of sub-vectors, one sub-vector of said plurality of sub-vectors including at least one of said elements; and
- means for associating each sub-vector with an entry in a codebook, said entry being an approximation of the sub-vector.
As embodied and broadly described herein the invention also provides a machine readable storage medium including:
- a codebook including a plurality of entries, each said entry being an approximation of a sub-vector of a feature vector, said feature vector including a plurality of elements describing a characteristic of a speech signal frame, at least one sub-vector including at least one of said elements;
- a codebook entry index vector including a plurality of indices, each index constituting a pointer to said codebook permitting retrieval of a corresponding entry that constitutes an approximation of a sub-vector of a feature vector.
As embodied and broadly described herein, the invention also provides an apparatus for decompressing an audio signal, said apparatus comprising:
- means for receiving a frame of an audio signal in a compressed format, said frame including a plurality of indices;
- a codebook including a plurality of entries, each entry being associated with a given index;
- means for extracting from said codebook entries corresponding to said indices;
- means for separating at least one of said entries corresponding to said indices into individual elements of a feature vector that describes a characteristic of the audio signal.
As embodied and broadly described herein, the invention further provides a speech recognition apparatus, comprising:
- a database repository containing a plurality of client dictionaries, each dictionary including a plurality of orthographies;
- means for receiving a spoken utterance to be stored as an orthography in a selected one of said plurality of client dictionaries;

- means for assessing a degree of confusability between said spoken utterance and orthographies in said selected one of said plurality of client dictionaries;
- means for entering in said selected one of said plurality of client dictionaries an orthography representative of said spoken utterance if the degree of confusability between said spoken utterance and other orthographies in said selected one of said plurality of client dictionaries is below a certain level.
As embodied and broadly described herein, the invention also provides a method for compressing an audio signal, comprising:
- processing said audio signal to generate at least one feature vector, said feature vector including a plurality of elements;
- grouping elements of said feature vector into a plurality of sub-vectors; and
- quantizing said plurality of sub-vectors.
As embodied and broadly described herein the invention also provides a method for compressing a speech signal, said method comprising the steps of:
- receiving said speech signal;
- processing a frame of said signal to generate at least one feature vector, said feature vector including a plurality of elements describing a characteristic of said frame;
- grouping elements of said feature vector into a plurality of sub-vectors, one sub-vector of said plurality of sub-vectors including at least one of said elements; and
- associating each sub-vector with an entry in a codebook, said entry being an approximation of the sub-vector.
Brief description of the drawings

These and other features of the present invention will become apparent from the following detailed description considered in connection with the accompanying drawings. It is to be understood, however, that the drawings are designed for purposes of illustration only and not as a definition of the limits of the invention, for which reference should be made to the appended claims.
Fig. 1 shows a system in which the preferred embodiment of this invention operates;
Fig. 2a shows a detailed block diagram of a Speaker Trained Recognition Technology unit in training mode that uses a preferred embodiment of the invention;
Fig. 2b shows a detailed block diagram of a Speaker Trained Recognition Technology unit in dialling mode that uses a preferred embodiment of the invention;
Fig. 3 shows a detailed block diagram of the compressor module consistent with the invention;
Fig. 4 shows a detailed block diagram of the decompressor module consistent with the invention;
Fig. 5 shows the trajectory of a client dictionary being transferred to the Speaker Trained Recognition Technology unit;
Fig. 6 shows a flow chart of the off-line grouping table and quantizor codebook generation;

Fig. 7 is a flow chart showing the sequence of operations to generate hypotheses to re-group the sub-vectors;
Figs. 8a and 8b show the distribution and centroid location of coefficients;
Fig. 9 shows the centroid location obtained by combining two groups using a rough computation;
Description of a preferred embodiment

In the preferred embodiment of this invention the speech recognition apparatus comprising the compression/decompression module is integrated into a telephone network of the type shown in figure 1, where it can be used for various services. In its preferred embodiment the apparatus is used to allow voice-activated dialing (VAD). The user of the system enters his requests using a device such as a microphone or telephone set 100 that converts the spoken utterance into an electric signal. The signal is then transmitted through a telephone switch 102 to a speech-processing centre 104.
The speech processing centre 104 is composed of five (5) major blocks inter-connected through a bus 116 and a dedicated data channel 118. The five major blocks are the system CPU 112, the database repository 114, the real-time kernel 110, the pre-processing unit 106 and the Speaker-Trained recognition technology unit 108. The speech-processing centre 104 can function in one of two modes, namely a training mode and a dialling mode. It functions in training mode when the user wishes to add new entries to his respective client dictionary and in dialling mode when the user wishes to dial a number.
System Overview

In a typical training interaction, once the voice activated system receives a request from the user, it will first download from the database repository 114 the client dictionary associated with the user. The database repository 114 is a non-volatile memory unit where all the client dictionaries are stored. In the preferred embodiment of this invention, the telephone number of the subscriber indexes the client dictionaries in the database repository. Other indexing methods are possible, such as indexing via a user code. The organisation and searching methods may vary across implementations, and organisations and methods differing from those mentioned above do not detract from the spirit of the invention.
Once the client dictionary is located, it is transferred to the Speaker Trained Recognition Technology unit. The transfer process is done through the system CPU 112 and the real-time kernel 110, to the Speaker Trained Recognition technology unit 108 as shown in figure 5. The system CPU 112 is the computer that executes computations. The real-time kernel 110 can be any real time operating system kernel such as the Motorola PSOS.
As a second step the system will issue a prompt over the telephone network requesting the user to specify a name he wishes to add. When the speech-processing centre 104 first receives an utterance, the pre-processing unit 106 translates the incoming signal into a form that will facilitate speech recognition. The preferred embodiment of this invention uses a pre-processing unit that generates a sequence of feature parameters such as mel-based cepstral parameters. Feature parameters for one frame are then combined into what is commonly known as a feature vector. Feature vectors may be composed of any number of entries. The preferred embodiment of this invention uses mel-based cepstral vectors, each vector being composed of twenty-three (23) coefficients. The first eleven (11) elements of the feature vector, often referred to as the static parameters, are the cepstral coefficients c1, ..., c11. The remaining twelve (12) elements, often called dynamic parameters, are often designated as δc0, ..., δc11 and are estimated from the first derivative of each cepstral coefficient. In the preferred embodiment the cepstral coefficient c0, which represents the total energy in a given frame, is not used except to compute the dynamic feature δc0. The pre-processing operation and mel-based cepstral parameters are well known in the art to which this invention pertains.
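One common way to estimate such dynamic (delta) parameters is a finite difference over neighbouring frames. The two-frame central difference below is an illustrative choice, not necessarily the regression window the patent's pre-processing unit uses.

```python
def delta_coefficients(cepstra):
    """cepstra: list of frames, each a list of cepstral coefficients.
    Returns one delta frame per input frame, estimating the first
    derivative by a central difference (edge frames are clamped)."""
    n = len(cepstra)
    deltas = []
    for t in range(n):
        prev = cepstra[max(t - 1, 0)]
        nxt = cepstra[min(t + 1, n - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return deltas
```

Concatenating the 11 static coefficients with the 12 delta coefficients computed this way yields the 23-element feature vector described above.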
In the third processing step, the feature vectors are transferred to the Speaker Trained Recognition unit 108. The Speaker Trained Recognition technology unit 108 performs the detailed speech processing of the signal and is shown in figure 2a. In figure 2a, the blocks denoted by dotted lines are inactive during the training mode and those denoted by continuous lines are active. The speech-training unit 200 receives the feature vectors generated by the pre-processing unit 106 and performs a series of tests between the newly entered utterance and each entry in the client dictionary. The client dictionary is stored in its compressed form, herein referred to as quantizor codebook entry index vectors, in a temporary storage unit 206. This storage unit 206 is accessible by the database repository 114 to upload and download client dictionaries. The first test, designated as confusability testing, includes three parts. Firstly, the decompression unit 208 processes each entry in the client dictionary and generates a series of feature vectors. In the preferred embodiment entries are processed one at a time. Each entry in the client dictionary is then scored in the speech-training unit 200 for similarity against the new entry. The preferred embodiment of this invention uses a fast-match algorithm. Once all the entries in the client dictionary have been scored, the best score is compared to a threshold indicating the level of confusability accepted.
If the score exceeds this threshold, implying that there is an entry in the client dictionary very similar to the one being added, the system returns an error indicating the problem to the user. On the other hand if the score is below the threshold and no entry in the client dictionary is similar enough, the system requests the user to enter the same name a second time. Again the confusability test is performed with each entry in the client dictionary and the same threshold test is applied.
This time, if the threshold is satisfied, a second test, herein referred to as the consistency test, is performed between the first input utterance and the second input utterance. The preferred embodiment of this invention uses a Dynamic Time Warping (DTW) computation here to compute the distance between the two utterances. If the comparison is successful, a model is created which is a combination of the two utterances, again using DTW techniques to compute the distance between them. Dynamic Time Warping is well known in the field of speech processing. The reader is invited to consult speech recognition books such as "Discrete-Time Processing of Speech Signals" by Deller et al., Macmillan Publishing, New York, and O'Shaughnessy, D. (1990), "Speech Communication: Human and Machine", Addison-Wesley (Addison-Wesley series in Electrical Engineering: Digital Signal Processing). If the test is successful a model representing a Dynamic Time Warping average is created and placed in a temporary storage location 202. The model is another feature vector. Following this, the model is processed by the compressor unit 204, which generates a compressed version of the feature vector, herein designated as a quantizor codebook entry index vector. Once the quantizor codebook entry index vectors have been generated for the utterance, they are added to the client dictionary 206 along with useful information such as a telephone number in the case of voice activated dialing. Once the user terminates his operation and hangs up, the client dictionary is uploaded to the database repository 114.
Compression Unit

The compression unit 204 is shown in detail in figure 3. The basic approach of the compression unit is to group the 23-dimensional cepstral vector into a number of pre-defined sub-vectors of various sizes and encode them separately using a set of variable-bit quantizors which are optimized for these sub-vectors respectively.
In the preferred embodiment of this invention the compressor module uses a two step vector quantization method. The input of this unit is the feature vector.
The feature vector coefficients 300 are first grouped into sub-vectors 306 by a grouping processor 302. The sub-vectors are designated by Gx, where x is the sub-vector index. The sub-vector compositions are defined by a priori assignments stored in a grouping table 304. These a priori assignments are defined by off-line iterative computations in order to form groups that maximize the accuracy of the system, and are generated at the same time as the quantizor codebooks database 308.
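The grouping operation itself is a simple re-indexing of the feature vector according to the grouping table. A sketch, where the grouping table is represented as a list of element positions per sub-vector (the table shown in the usage example is made up for illustration, not a trained assignment):

```python
def group(feature, grouping):
    """Split a feature vector into sub-vectors G_1..G_p.

    `grouping[k]` lists the feature-vector positions assigned to
    sub-vector k (the a priori assignments of the grouping table).
    Elements may land out of their original sequence, as the patent
    allows.
    """
    return [[feature[i] for i in positions] for positions in grouping]
```

For example, `group([10, 20, 30], [[2, 0], [1]])` produces the sub-vectors `[[30, 10], [20]]`.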
Secondly, each sub-vector is processed by an Nk-bit quantizor 312, Nk being the number of bits assigned to the kth sub-vector. Preferably, each quantizor is associated to a quantizor codebook 310 located in the quantizor codebook database 308. The codebooks can be created off-line using vector quantization methods, namely iterative clustering performed on the entries of the sub-vectors assigned to a particular codebook. Essentially, clustering involves selecting the best set of M codebook vectors based on L test vectors (L ≥ M). The clustering procedure is well known in the field to which this invention pertains. More information about this topic can be found in L. Rabiner and B.-H. Juang (1993) "Fundamentals of Speech Recognition", Prentice Hall, pp. 122-132. In the preferred embodiment, the codebooks were generated at the same time as the grouping table 304 and the method is described further on in the specification. Each codebook comprises a number of entries, herein designated as clusters, accessible via indices. A quantizor 312 generates an index corresponding to the entry in a given codebook that is the most likely to match a given sub-vector. In order to find the best match in the codebook, the following expression is used:

Equation 1

j = argmin over m of ‖Gk − Vkm‖, for all m in the codebook

where ‖·‖ is a distance computation, k is the index of the sub-vector, m is the index of the entry in the codebook, Vkm is the mth entry of the kth codebook, and j is the codebook entry index of the nearest neighbor. This computation is performed for each sub-vector. The combination of the indices of the nearest codebook entries over the various sub-vectors forms the compressed version of the feature vector, or quantizor codebook entry index vector 314.
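Equation 1's nearest-neighbour search can be sketched directly; the Euclidean distance used here is one common choice for the unspecified distance computation, and all names are illustrative:

```python
import math

def quantize(sub_vectors, codebooks):
    """Map each sub-vector G_k to the index j of its nearest codebook
    entry (Equation 1, assuming a Euclidean distance).

    Returns the quantizor codebook entry index vector: one index per
    sub-vector.
    """
    indices = []
    for g, book in zip(sub_vectors, codebooks):
        # argmin over all entries m of the kth codebook
        j = min(range(len(book)), key=lambda m: math.dist(g, book[m]))
        indices.append(j)
    return indices
```

With two one-element codebooks `[[0.0], [1.0]]`, the sub-vectors `[0.9]` and `[0.1]` compress to the index vector `[1, 0]`.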
Decompression Unit

The decompression unit is shown in detail in figure 4. In the preferred embodiment of this invention the decompression module uses a two step vector quantization method and uses components similar to those of the compression unit, namely a grouping processor 302, a grouping table 304, a codebook database 308 and N-bit quantizors 312. The client dictionary entries are received by a temporary storage unit 314. As mentioned above, the client dictionary entries are stored in the form of quantizor codebook entry index vectors. The quantizor codebook entry index vectors are used to look up, in the codebook database 308, the associated quantizor codebook 310 to find the corresponding quantized coefficients. This is a simple look-up/search operation and the manner in which the search is performed does not detract from the spirit of the invention. Preferably, the codebooks in the decompression module are the same as those of the compression module and are created off-line using vector quantization methods. The recovered quantized coefficients are stored in 306 and then rearranged using the grouping processor 302 in order to recover a feature vector 300. The grouping processor 302 uses the a priori assignments stored in the grouping table to place the sub-vector entries in the appropriate positions in the feature vector 300.
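The decompression path described above — codebook lookup followed by rearrangement — can be sketched as a single function. The representation of the grouping table as position lists is an assumption carried over for illustration:

```python
def decompress(index_vector, codebooks, grouping):
    """Rebuild a feature vector from a quantizor codebook entry index
    vector.

    `grouping[k]` lists the feature-vector positions of the elements of
    sub-vector k (the a priori assignments of the grouping table).
    Names and data layout are illustrative, not the patent's.
    """
    feature = [0.0] * sum(len(g) for g in grouping)
    for k, j in enumerate(index_vector):
        entry = codebooks[k][j]            # recovered quantized coefficients
        for pos, value in zip(grouping[k], entry):
            feature[pos] = value           # rearrange into original order
    return feature
```

Note that this function inverts the grouping step as well as the quantization step, so the output is a (quantized) feature vector in the original coefficient order.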
Grouping Table & Quantizor Codebook Generation

In the preferred embodiment of this invention, the grouping table and quantizor codebooks are generated together such that their composition minimizes the distortion of the feature vector coefficients. The inputs to the generation of the a priori assignments are the number of cepstral coefficients (23 in the preferred embodiment), a set of training cepstral vectors, say M vectors, the number of sub-vectors to form and the total number of bits in the compressed vector, herein referred to as the quantizor codebook entry index vector. The generation of the a priori assignments comprises two major phases, namely an initial estimate and an iterative refinement. Figure 6 shows a flow chart of the generation of the grouping table 304 and the codebook database 308. In the preferred embodiment the initial estimate step initializes the system to prepare for the iteration to follow. The first stage in this phase is the Initial Coefficient Group Formation 600. In this stage, each coefficient defines a single-element group. Three operations are performed at this stage. The first involves assigning a number of bits to each group/coefficient. This can be done using the following mathematical operation:

Equation 2

n_i = N_T × SF[V_i] / Σ (k = 1 … 23) SF[V_k],  i = 1, …, 23

where N_T is the total number of bits in the compressed vector, n_i is the number of bits assigned to the ith coefficient, V_i is the variance of the ith coefficient and SF[·] is a function assigning the relative importance of the ith cepstral coefficient to the recognition. The function SF[·] could be a simple reciprocal function. For example:
Equation 3

SF[V_i] = 1 / V_i

Therefore, for coefficients with high variance values, less weight is assigned. The variance values are estimated from the M training vectors. Preferably a large number of training vectors is used in order to obtain a good representation of the spectra. Computing the variance of a variable is a well-known operation in the field to which the invention pertains. The number of quantization codes of the ith coefficient is:
Equation 4

q_i = 2^n_i,  i = 1, …, 23

The second operation involves generating a training set comprising the 1-D coefficient sub-vectors. For a training set comprising M feature vectors, the training set of 1-D coefficients is generated from the projection of the original vector space onto this coefficient and is defined as:

Equation 5

X_i = { x_ij | 1 ≤ j ≤ M },  i = 1, …, 23

where X_i is the training set for the ith coefficient and x_ij is the ith coefficient of the jth training vector.
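Equations 2 through 5 amount to a variance-weighted bit allocation followed by a per-coefficient projection of the training set. A sketch under the stated SF[V] = 1/V choice; the rounding of n_i to an integer is an added simplification the patent does not spell out:

```python
def allocate_bits(variances, total_bits):
    """Share N_T bits among coefficients in proportion to
    SF[V_i] = 1 / V_i (Equations 2-3).  Integer rounding is an
    assumption: the patent leaves the rounding rule unspecified."""
    sf = [1.0 / v for v in variances]
    total_sf = sum(sf)
    return [round(total_bits * s / total_sf) for s in sf]

def project(training_vectors, i):
    """Training set X_i: projection of the M training vectors onto
    the ith coefficient (Equation 5)."""
    return [vec[i] for vec in training_vectors]
```

With equal variances the bits split evenly; a coefficient with half the variance of another receives twice the weight, matching the reciprocal scale factor.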
The third operation in the Initial Coefficient Group Formation step 600 involves the clustering of the sub-vectors and the distortion calculation. Algorithms such as the K-means algorithm and the ISODATA algorithm, two clustering algorithms, can be used here to create the optimal sub-groups. Clustering algorithms are well known in the art to which this invention pertains. A brief history and a detailed description of the K-means algorithm, also known as the LBG algorithm, is given in Makhoul et al., "Vector quantization in speech coding," Proceedings of the IEEE, vol. 73, pp. 1551-1588, Nov. 1985. The reader is also invited to consult Rabiner et al. (1979), "Applications of clustering techniques to speaker-trained isolated word recognition," Bell System Technical Journal, vol. 58, pp. 2217-2231. Many clustering algorithms can be used in the context of this invention and algorithms different from the ones mentioned above do not detract from the spirit of this invention. As was previously noted, each coefficient is assigned (based on Equation 2) a certain number of bits n_i, which defines a certain number of codes q_i (as defined in Equation 4). Therefore it is necessary to cluster the training set entries X_i such that they comprise q_i regions, which are also known as clusters.
In the clustering approach, the multiple entries obtained from the training vectors, M in this case, are reduced to q_i ≤ M clusters, each cluster being represented by one feature vector. Mathematically this can be expressed as:

Equation 6

Z_i = { z_ij | 1 ≤ j ≤ q_i }

where z_ij is the centroid of the jth cluster of the ith coefficient. The centroid can be generated with the K-means algorithm mentioned above. The quantization distortion can be expressed as:
Equation 7

D_i = Σ (j = 1 … q_i) Σ (x ∈ X_i,z_ij) D(x, z_ij)

where X_i,z_ij is the set of all samples which belong to cluster z_ij and D(·) is a distance computation, for example the Euclidean distance.
The second step in the generation of the grouping table is referred to as the Initial Group Formation 602. In this step the cepstral coefficients are split into p equal-size groups. If the groups cannot be of equal size, then either the remaining coefficients are added to one of the groups or the remaining coefficients are put in their own smaller group. The preferred embodiment of this invention leaves the remaining coefficients in their own group. For each of these groups the same sequence of three operations is performed, with the modification that the operations are performed on multi-dimensional vectors instead of one-dimensional vectors. The first operation on these p groups involves assigning a number of bits to each group.
This can be done using the following mathematical operation:

Equation 8

N_k = Σ (over j) n_kj,  k = 1, …, p

where N_k is the number of bits assigned to the kth sub-vector, k is the index of the sub-vector, p is the number of sub-vectors and j is the index of the coefficient within a given sub-vector. The number of quantization codes of the kth sub-vector is:
Equation 9

Q_k = 2^N_k,  k = 1, …, p

The second operation involves generating a training set comprising the multidimensional sub-vectors. Again the training set is generated from the projection of the original vector space onto the sub-vector space.
The third operation in the Initial Group Formation step 602 involves the clustering of the sub-vectors and the distortion calculation. Here the same clustering algorithms as for the previous initial coefficient group formation step 600 can be used. Mathematically this can be expressed as:
Equation 10

Z_k = { z_kj | 1 ≤ j ≤ Q_k },  k = 1, …, p

where z_kj is the centroid of the jth cluster of the kth sub-vector. The centroid can be generated with the K-means algorithm mentioned above. The quantization distortion can be expressed as:
Equation 11

D_k = Σ (j = 1 … Q_k) Σ (x ∈ X_k,z_kj) D(x, z_kj)

where X_k,z_kj is the set of all samples which belong to cluster z_kj and D(·) is a distance computation, for example the Euclidean distance.

The third step in the generation of the grouping table involves selecting the next coefficient 604 for the re-grouping process. Any method may be used to choose the next coefficient, such as random choice, round-robin and others. The preferred embodiment of this invention uses the round-robin selection method.
The following step involves generating a series of hypotheses 606 for rearranging the groups. The idea is to choose the arrangement that minimizes the quantization distortion D_k for each group. Two possibilities may occur when rearranging groups: a coefficient may be moved to another group, herein referred to as coefficient migration, or coefficients may be exchanged between groups, herein referred to as coefficient exchange. The main task in this process is to compute the clustering distortions for both the source group and all destination group candidates under coefficient migration as well as under an exchange. The change that provides the best distortion reduction is then chosen. Within the process, the time needed to cluster the source group and the (p-1) destination groups under coefficient migration, and the (p-1) source and (p-1) destination groups under coefficient exchange, is large even for small values of p.
Preferably, in order to reduce the computations, only a rough computation is performed on all the groups, followed by the actual clustering and distortion computations only for a small number of candidates for which the best distortion reduction was obtained by the rough calculation. In the preferred embodiment this computation is performed on the top 10% of the candidates. The computations are done using the coefficient selected in step 604.

Each computation involves the same series of operations, shown in figure 7. These computations are performed once under coefficient exchange and another time under coefficient migration, preferably for all the possibilities. The first step 700 is computing the number of bits remaining/added in a given group after each re-ordering. The second step 702 involves computing the number of clusters remaining/added in a given group.

For a migration the first two computations are:

Equation 12

N_source^migration = N_source − n_i
N_destination^migration = N_destination + n_i

Equation 13

Q_source^migration = 2^(N_source^migration)
Q_destination^migration = 2^(N_destination^migration)

where i is the index of the coefficient chosen at step 604 and source/destination are the indices of the source and destination groups respectively. The second re-ordering possibility is to exchange two coefficients. In this case the computations are:

Equation 14

N_source^exchange = N_source − n_i + n_l
N_destination^exchange = N_destination + n_i − n_l

Equation 15

Q_source^exchange = 2^(N_source^exchange)
Q_destination^exchange = 2^(N_destination^exchange)

where i is the index of the coefficient chosen at step 604, source/destination are the indices of the source and destination groups respectively and l is the index of a coefficient chosen from any group of which the coefficient with index i is not part.
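The bookkeeping of Equations 12 through 15 is simple enough to state directly in code; the tuple return layout is an illustrative choice:

```python
def migrate_counts(n_src, n_dst, n_i):
    """Bit and cluster counts after moving coefficient i from the
    source to the destination group (Equations 12-13).
    Returns (N_source, N_destination, Q_source, Q_destination)."""
    n_src, n_dst = n_src - n_i, n_dst + n_i
    return (n_src, n_dst, 2 ** n_src, 2 ** n_dst)

def exchange_counts(n_src, n_dst, n_i, n_l):
    """Bit and cluster counts after swapping coefficient i (source)
    with coefficient l (destination) (Equations 14-15)."""
    n_src, n_dst = n_src - n_i + n_l, n_dst + n_i - n_l
    return (n_src, n_dst, 2 ** n_src, 2 ** n_dst)
```

Note that an exchange between coefficients with equal bit counts leaves both group sizes, and therefore both codebook sizes, unchanged.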
The third step 704 involves generating a training set comprising the multi-dimensional sub-vectors based on the new coefficient composition.

Following this, 706, a rough computation of the centroid and distortion is done for each source and destination group, and the same operations apply for both migration and exchange. As previously described, this rough computation is done in order to speed up the process. In the preferred embodiment this is done by taking the cross product of the sets of 1-D clusters of all the cepstral coefficients in each group. This is defined as:

Equation 16

Z_source^rough = cross product (over i ∈ G_source) of Z_i = { z_source,k^rough | 1 ≤ k ≤ Q_source }
Z_destination^rough = cross product (over i ∈ G_destination) of Z_i = { z_destination,k^rough | 1 ≤ k ≤ Q_destination }

and the corresponding equations for distortion are:

Equation 17

D_source^rough = Σ (j = 1 … Q_source) Σ (x ∈ X_source,j) D(x, z_source,j^rough)
D_destination^rough = Σ (j = 1 … Q_destination) Σ (x ∈ X_destination,j) D(x, z_destination,j^rough)

In the above equations, source/destination are the indices of the source and destination groups respectively and the z^rough are the rough estimates of the centroids of the clusters. The following example will describe this computation. Let us say we have two coefficient groups, say Gx and Gy, defined by two clusters each. Their distribution may look like that shown in figures 8a and 8b, with centroids at Z1, Z2 for Gx and Z3, Z4 for Gy. When combining Gx and Gy we find the intersection of the two spaces as shown in figure 9. The intersections are taken to be the most likely locations of the centroids of the new grouping. The distortion is then computed by taking all the closest neighbors to each centroid and computing the distance.
Once the above computations have been performed for all possibilities, the fourth step 708 involves selecting the top B0 candidates, which have the lowest rough distortion values, and on which the clustering and distortion computations 710 are performed using more exact computations such as the K-means algorithm.
The grouping which reduces the distortion the most is then selected as a candidate 712. In the preferred embodiment, two candidates are selected, one for a migration and one for an exchange. The distortions for these grouping hypotheses for the source and destination groups are designated as D_source^exchange, D_destination^exchange, D_source^migration and D_destination^migration. Following this, in the next step, based on the hypotheses generated at step 606, a re-ordering decision is made 608. Essentially the procedure involves evaluating whether performing a change will result in further reducing the distortion or whether an optimum level has been reached. In the first case, step 610 is executed by choosing the hypothesis that reduces the distortion best. The new clusters are then entered into the quantizor codebook associated with each group involved in the re-grouping, namely the source and destination groups. The preferred embodiment of this invention computes the benefit of exchange and migration using the following equations:

Equation 18

Δ^migration = [D_source − D_source^migration] + [D_destination − D_destination^migration]
Δ^exchange = [D_source − D_source^exchange] + [D_destination − D_destination^exchange]

The decision is then divided into four possible cases, summarised in the table below.
CASE A: (Δ^migration > 0) AND (Δ^exchange ≤ 0) → Migration
CASE B: (Δ^migration ≤ 0) AND (Δ^exchange > 0) → Exchange
CASE C: (Δ^migration > 0) AND (Δ^exchange > 0) → Migration if Δ^migration > a × Δ^exchange, else Exchange
CASE D: Else → No re-grouping

The bias factor "a" (a > 1) in case C is used to favor the exchange regrouping over the migration in order to maintain uniformity in the number of coefficients among the groups.
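The four-case decision on the distortion reductions of Equation 18 can be sketched as follows; the default value of the bias factor a is an assumption, since the patent only requires a > 1:

```python
def regroup_decision(d_migration, d_exchange, a=1.5):
    """Four-way re-grouping decision on the distortion reductions
    Delta^migration and Delta^exchange of Equation 18.  The bias
    a > 1 favours an exchange when both changes help; a = 1.5 is an
    illustrative value, not the patent's."""
    if d_migration > 0 and d_exchange <= 0:
        return "migration"                      # case A
    if d_migration <= 0 and d_exchange > 0:
        return "exchange"                       # case B
    if d_migration > 0 and d_exchange > 0:      # case C
        return "migration" if d_migration > a * d_exchange else "exchange"
    return "no re-grouping"                     # case D
```

With a = 1.5, a migration that reduces distortion by 1.2 loses to an exchange that reduces it by 1.0, illustrating how the bias keeps coefficient counts even across the groups.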
The re-grouping is terminated 612 when no re-grouping is encountered for each and every one of the coefficients, or when a pre-defined number of re-formation iterations is reached. At the end of the iterations, the optimal grouping table and the individual sub-vector quantization codebooks are output for use in the compression and decompression units.

Claims (26)

I Claim:
1. An apparatus for compressing an audio signal, said apparatus comprising:
- an input for receiving an audio signal derived from a spoken utterance, said audio signal being separable into a plurality of successive frames;
- a processing unit for processing a frame of said audio signal to generate a feature vector, said feature vector including a plurality of discrete elements characterizing at least in part the portion of the audio signal encompassed by the frame, said elements being organized in a certain sequence, said feature vector being suitable for processing by a speech recognition processing unit;
- a compressor unit comprising:
a) a grouping processor for grouping elements of said feature vector into a plurality of sub-vectors on the basis of a certain grouping scheme, at least one of the sub-vectors including a plurality of elements from said feature vector, said plurality of elements being out of sequence relative to said certain sequence;
b) a quantizer for the plurality of sub-vectors.
2. An apparatus as defined in claim 1, wherein said compressor includes a grouping table indicative of an order of association for elements of said feature vector from said certain sequence to form said plurality of sub-vectors, said order of association defining the certain grouping scheme.
3. An apparatus as defined in claim 2, wherein at least two of said plurality of sub-vectors comprise a different number of elements.
4. An apparatus as defined in claim 1, wherein said quantizer includes a codebook containing a set of entries, said entries being either exact matches of values that a sub-vector can acquire or approximations of the values that a sub-vector can acquire, said codebook further including a set of indices associated with respective entries.
5. An apparatus as defined in claim 4, wherein said quantizer comprises a plurality of codebooks, each codebook being associated with a given sub-vector to allow quantization of said given sub-vector.
6. An apparatus as defined in claim 5, wherein said quantizer is operative to process a sub-vector characterized by a certain value and entries in the codebook associated with the sub-vector to identify the entry that best matches the certain value, and output the index associated with the entry that best matches the certain value.
7. An apparatus as defined in claim 6, wherein said quantizer performs similarity measurements to identify the entry that best matches the certain value.
8. An apparatus for decompressing an audio signal, said apparatus comprising:
- an input for receiving a frame of an audio signal in a compressed format, said frame including a plurality of indices;

- a plurality of codebooks, each codebook being associated with a certain index of said frame, each codebook mapping index values to respective entries in said codebook, said entries being indicative of sub-vector values, said sub-vector values being indicative of feature vector elements in a certain sequence;
- a first processing unit for extracting from said plurality of codebooks a set of entries corresponding to said plurality of indices of said frame;
- a second processing unit for separating said sub-vector values into individual feature vector elements on the basis of an ungrouping scheme such that the feature vector elements of a resulting feature vector are out-of-sequence relative to said certain sequence, said feature vector being suitable for processing by a speech recognition processing unit.
9. A method for compressing an audio signal, said method comprising the steps of:
- receiving an audio signal derived from a spoken utterance, said audio signal being separable into a plurality of successive frames;
- processing a frame of said audio signal to generate a feature vector, said feature vector including a plurality of discrete elements characterizing at least in part the portion of the audio signal encompassed by the frame, said elements being organized in a certain sequence, said feature vector being suitable for processing by a speech recognition processing unit;

- grouping elements of said feature vector into a plurality of sub-vectors on the basis of a certain grouping scheme, at least one of the sub-vectors including a plurality of elements from said feature vector, said plurality of elements being out of sequence relative to said certain sequence;
- quantizing the plurality of sub-vectors.
10. A method as defined in claim 9, further comprising the step of providing a grouping table indicative of an order of association for elements of said feature vector from said certain sequence to form said plurality of sub-vectors, said order of association defining the certain grouping scheme.
11. A method as defined in claim 10, wherein at least two of said plurality of sub-vectors comprise a different number of elements.
12. A method as defined in claim 9, further comprising the step of providing a codebook containing a set of entries, the entries being either exact matches of values that a sub-vector can acquire or approximations of the values that a sub-vector can acquire, said codebook further including a set of indices associated with respective entries.
13. A method as defined in claim 12, further comprising the step of providing a plurality of codebooks, each codebook being associated with a given sub-vector to allow quantization of said given sub-vector.
14. A method as defined in claim 13, wherein said quantizing step comprises the step of:

- processing a sub-vector characterized by a certain value and entries in the codebook associated with the sub-vector to identify the entry that best matches the certain value;
- outputting the index associated with the entry that best matches the certain value.
15. A method as defined in claim 14, wherein said processing step performs similarity measurements to identify the entry that best matches the certain value.
16. A computer readable storage medium including a program element for processing by a computing apparatus including a processor and a memory for implementing an apparatus for compressing an audio signal derived from a spoken utterance, said audio signal being separable into a plurality of successive frames, said program element implementing functional blocks including:
- a processing unit for processing a frame of said audio signal to generate a feature vector, said feature vector including a plurality of discrete elements characterizing at least in part the portion of the audio signal encompassed by the frame, said elements being organized in a certain sequence, said feature vector being suitable for processing by a speech recognition processing unit;
- a compressor unit comprising:
a) a grouping unit for grouping elements of said feature vector into a plurality of sub-vectors on the basis of a certain grouping scheme, at least one of the sub-vectors including a plurality of elements from said feature vector, said plurality of elements being out of sequence relative to said certain sequence;
b) a quantizer for quantizing the plurality of sub-vectors.
17. A computer readable storage medium as defined in claim 16, wherein said compressor unit includes a first memory unit containing a grouping table indicative of an order of association for elements of said feature vector from said certain sequence to form said plurality of sub-vectors, said order of association defining the certain grouping scheme.
18. A computer readable storage medium as defined in claim 17, wherein at least two of said plurality of sub-vectors comprise a different number of elements.
19. A computer readable storage medium as defined in claim 16, wherein said quantizer includes a second memory unit containing a codebook containing a set of entries, said entries being either exact matches of values that a sub-vector can acquire or approximations of the values that a sub-vector can acquire, said codebook further including a set of indices associated with respective entries.
20. A computer readable storage medium as defined in claim 19, wherein said quantizer comprises a second memory unit containing a plurality of codebooks, each codebook being associated with a given sub-vector to allow quantization of said given sub-vector.
21. A computer readable storage medium as defined in claim 20, wherein said quantizer is operative to process a sub-vector characterized by a certain value and entries in the codebook associated with the sub-vector to identify the entry that best matches the certain value, and output the index associated with the entry that best matches the certain value.
22. A computer readable storage medium as defined in claim 21, wherein said quantizer performs similarity measurements to identify the entry that best matches the certain value.
23. A machine readable storage medium including a program element for use by a computing apparatus for generating a plurality of codebooks and a grouping scheme, the computing apparatus comprising:
- first memory means comprising a plurality of data items representative of a plurality of audio frames ;
- a second memory means for storing a plurality of codebooks;
- a third memory means for storing a grouping table;
- processing means comprising means for:
a) processing said plurality of audio frames in order to derive a plurality of feature vectors, each of said plurality of feature vectors being associated to respective ones of said plurality of audio frames, each of said feature vectors including a plurality of elements, said elements describing a characteristic of a given audio frame;

b) grouping the elements of said feature vectors into initial sub-vectors, the composition of the initial sub-vectors being predetermined;

c) iteratively re-grouping elements of the sub-vectors and generating the codebooks associated to the sub-vectors until a pre-determined condition is reached, said regrouping being effected at least in part on a basis of a distortion measurement, said grouping scheme being derived at least in part on the basis of the order of grouping of said elements when the pre-determined condition is reached.
24. A machine readable storage medium as defined in claim 23, wherein said pre-determined condition arises when a predetermined number of iterations has been effected.
25. An apparatus for generating data indicative of a grouping scheme, said data indicative of a grouping scheme being suitable for use in compressing feature vectors, said apparatus comprising:

- an input for receiving a plurality of feature vectors representative of audio information, each of said feature vectors including a plurality of discrete elements;

- means for processing said plurality of feature vectors to generate corresponding sets of sub-vectors, said processing means including means for iteratively altering an order of grouping of discrete elements of at least one feature vector into the corresponding sets of sub-vectors until a pre-determined condition is reached, said order of grouping being altered at least in part on a basis of a distortion measurement on the discrete elements in the set of sub-vectors;

- an output for outputting data indicative of a grouping scheme, said grouping scheme being derived at least in part on the basis of the order of grouping of discrete elements when the pre-determined condition is reached.
26. An apparatus as defined in claim 25, wherein said pre-determined condition arises when a predetermined number of iterations have been effected.
CA002218605A 1997-10-20 1997-10-20 Method and apparatus for data compression and decompression in speech recognition Expired - Fee Related CA2218605C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA002218605A CA2218605C (en) 1997-10-20 1997-10-20 Method and apparatus for data compression and decompression in speech recognition


Publications (2)

Publication Number Publication Date
CA2218605A1 CA2218605A1 (en) 1999-04-20
CA2218605C true CA2218605C (en) 2005-04-26

Family

ID=4161651

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002218605A Expired - Fee Related CA2218605C (en) 1997-10-20 1997-10-20 Method and apparatus for data compression and decompression in speech recognition

Country Status (1)

Country Link
CA (1) CA2218605C (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373630B2 (en) * 2017-03-31 2019-08-06 Intel Corporation Systems and methods for energy efficient and low power distributed automatic speech recognition on wearable devices



Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed