EP1119844A1

EP1119844A1 - A method of speech processing and an apparatus for processing of speech

Info

Publication number: EP1119844A1
Application number: EP99956421A
Authority: EP
Inventors: Fisseha Mekuria
Original assignee: Telefonaktiebolaget LM Ericsson AB
Current assignee: Telefonaktiebolaget LM Ericsson AB
Priority date: 1998-10-09
Filing date: 1999-10-05
Publication date: 2001-08-01
Also published as: TR200101881T2; CN1322346A; JP2002527796A; SE9803458D0; WO2000022608A1; SE9803458L; AU1303800A

Abstract

In a method of speech processing digital signals representative of the speech are provided for speech encoding, the digital signals including a first set of coefficients. A second set of coefficients is calculated for speech recognition. The first set of coefficients is used in the calculation of the second set of coefficients. A corresponding apparatus comprises speech encoding means (2; 41) for providing digital signals representative of the speech and including a first set of coefficients, and speech recognition means (3; 42) in which a second set of coefficients is calculated. The speech recognition means is adapted to use the first set of coefficients in the calculation of the second set of coefficients. In this way the memory requirement for a device including speech encoding as well as speech recognition is reduced considerably.

Description

A METHOD OF SPEECH PROCESSING AND AN APPARATUS FOR PROCESSING OF SPEECH

The invention relates to a method of speech processing, wherein digital signals representative of said speech are provided for speech encoding, said digital signals including a first set of coefficients, and a second set of coefficients is calculated for speech recognition. The invention further relates to a corresponding apparatus.

In devices for speech processing such as modern digital portable telephones (e.g. for the GSM system or similar systems) speech encoders are used for compressing speech signal information and removing redundant information in order to increase the capacity of the digital telephone channel through which the speech information is to be transmitted. Such speech encoders use signal analysis, and the speech encoding algorithms are normally based on linear prediction analysis modelling of the speech. The use of Linear Predictive Coding involves the calculation of a number of model filter coefficients called Linear Prediction Coefficients or Reflection Coefficients.

Further, there is a wish to combine such devices with a voice actuated function for controlling the use of the device . In digital telephones this could be in the form of a so-called voice dialling function for making telephone calls and accessing information from a database. Such systems are normally based on speech recognition algorithms which are basically composed of a pre-processing signal analysis algorithm (extraction of a set of feature vectors) , a pattern matching algorithm, and reference word lists (feature vector codebook) .

In the state of the art the speech recognition algorithm (for voice dialling) is performed separately from the basic speech encoding algorithms of the device. GB 2 290 437 discloses a digital portable telephone in which a single digital processor is used to perform voice encoding processing on transmitted voice data (and decoding processing on received voice data) and voice recognition on voice commands for dialling and other telephone functions. The two functions (or algorithms) can be handled by the same processor on a time-share basis because they do not normally occur simultaneously, e.g. the computational resources of the processor can be utilized to perform voice dialling algorithms before call start-up and speech encoding algorithms when a call is established. By using the same processor to perform both algorithms the amount of hardware, and consequently the cost, size and weight of the telephone, is reduced.

Even though the two algorithms share the same processor in the device of GB 2 290 437 they are still performed as separate algorithms, each of them having a considerable memory requirement for program code storage and storage of calculation results and references . Especially the feature extraction part of the voice dialling consumes a large amount of memory and computational power for signal buffering, calculation of routines and coefficients, and storage. This memory requirement imposes a limitation in implementing the voice dialling function with a sufficient number of vocabularies.

Therefore, it is an object of the invention to provide a method of the above-mentioned type which can perform both the speech encoding algorithm and the speech recognition algorithm with a considerably reduced memory requirement.

In accordance with the invention, this object is accomplished in that said first set of coefficients is used in the calculation of said second set of coefficients .

When the coefficients of the speech recognition algorithm (i.e. the feature extraction) are calculated from the coefficients calculated in the speech encoding algorithm, then the coefficient calculation of the speech recognition uses the code already available for speech encoding, or in other words, it can be integrated in the speech encoding block with only a little extra processing. A considerable amount of memory and computational power that would otherwise be needed for e.g. the feature extraction can be saved. The utilization of the code and signal processing already available effectively reduces the power consumption and the size of future mobile terminals with various speech processing functions integrated in the product.

As stated in claim 2, said digital signals can expediently be provided by a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.

According to a first embodiment of the invention, which is stated in claim 3, said first set of coefficients is used as said second set of coefficients. Using the first set of coefficients directly as a substitute for the second set of coefficients provides a very simple method, which will reduce the memory requirement further. However, this embodiment has the drawback that the performance of the speech recognition algorithm is at a lower level compared to what can be achieved in prior art speech recognition. Therefore, this embodiment is preferably used in simple devices with lower quality requirements. However, in such devices reduction of the amount of needed memory is often very important. According to an alternative embodiment of the invention, which is stated in claim 4, said second set of coefficients is calculated using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients. Calculating the second set of coefficients in this way provides coefficients which are suitable for robust speech recognition, and further, the coefficient parameters are compact, and utilize code that is already existing. Therefore, this embodiment results in a considerable reduction of the memory requirement while maintaining the performance level of the prior art. When the second set of coefficients comprises cepstral coefficients, as stated in claim 5, these can expediently be calculated using the recursive equation

" ι=l where C_n is the nth cepstral coefficient and a._L is the ith linear prediction coefficient.

As stated in claim 6, the cepstral coefficients can expediently be used as feature vectors to generate a reference list for speech recognition with a view to controlling a device by voice commands. In this way, for instance voice dialling in a portable telephone can be achieved.

As mentioned, the invention also relates to a corresponding apparatus for processing of speech and comprising speech encoding means for providing digital signals representative of said speech, said digital signals including a first set of coefficients, and speech recognition means in which a second set of coefficients is calculated. When said speech recognition means is adapted to use said first set of coefficients in the calculation of said second set of coefficients, the above-mentioned advantages are achieved.

As stated in claim 8 , the apparatus can expediently be a digital portable telephone, and, as stated in claim 9, the speech encoding means can expediently comprise a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.

As stated in claim 10, the apparatus can be a GSM telephone and said linear prediction coefficients be calculated using an Algebraic Code Excited Linear

Prediction (ACELP) algorithm in case of GSM Enhanced Full

Rate (EFR) . Other possibilities are GSM Full Rate (FR) using a Regular Pulse Excitation - Long Term Prediction

(RPE-LTP) algorithm and GSM Half Rate (HR) using a Vector Sum Excited Linear Prediction (VSELP) algorithm. According to an alternative embodiment, which is stated in claim 11, the apparatus can be a WCDMA (Wideband Code Division Multiple Access) telephone and said linear prediction coefficients be calculated using a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm. Other names for WCDMA are UMTS (Universal Mobile Telephony System - used by ETSI) and IMT 2000 (used by ITU) . A further possibility is the US system IS- 95 using a Quadrature Code Excited Linear Prediction (QCELP) algorithm.

The speech recognition means can either be adapted to use said first set of coefficients as said second set of coefficients, as stated in claim 12, or, as stated in claim 13, be adapted to calculate said second set of coefficients using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients. The advantages of the two embodiments are as described above for the corresponding two embodiments of the method. When, as stated in claim 14, the second set of coefficients in the last-mentioned embodiment comprises cepstral coefficients, the speech recognition means can expediently be adapted to calculate said cepstral coefficients using the recursive equation

where C_n is the nth cepstral coefficient and a_x is the ith linear prediction coefficient.

As stated in claim 15 , the speech recognition means can expediently comprise a pattern matching block adapted to use said cepstral coefficients as feature vectors to generate a reference list for speech recognition, and be adapted to control the apparatus by voice commands. This enables the apparatus to be e.g. a portable telephone provided with the feature of voice dialling.

The invention will now be described more fully below with reference to the drawing, in which

figure 1 shows a schematic diagram of a state of the art digital mobile telephone with a voice dialling function,

figure 2 shows a block diagram of an encoder for use in the telephone of figure 1,

figure 3 shows an example of the implementation of a filter for the encoder shown in figure 2, figure 4 shows an example of the implementation of a feature extraction block in the telephone of figure 1, and

figure 5 shows a schematic diagram of a digital mobile telephone with a voice dialling function according to the invention.

Figure 1 shows a schematic diagram of a part of a digital mobile telephone 1 with a voice dialling function according to the state of the art. The telephone can e.g. be a GSM telephone adapted for Full Rate traffic. The standard telephone functional parts 2 are shown at the upper part of the figure, while the lower part shows the voice dialling part 3.

Speech pronounced by a user is received by a microphone 4 and fed as an analogue electrical signal to the audio part 5 comprising a sample-hold device and an analogue- to-digital converter. The sampling rate is 8000 sample/s and the digital output signal is a 13 bit uniform PCM signal .

The speech encoder 6 takes its input as the 13 bit uniform PCM signal from the audio part 5 and the encoded speech at the output of the speech encoder is delivered to a channel encoder unit 7, and therefrom to a radio part 8 and an antenna 9. The channel encoder unit 7 , the radio part 8 and the antenna 9 are not described in further detail here, because they have no relevance for the invention. The telephone will normally have a corresponding receiving part, and also this part is irrelevant to the invention and therefore not described here. The speech encoder 6 performs signal analysis for compressing the speech signal information and removing redundant information, thereby increasing the capacity of the digital telephone channel . The speech encoding algorithm used in the speech encoder 6 is based on linear prediction analysis modelling of the speech production process. Prediction constitutes a form of estimation, i.e. a finite set of present and past samples of a stationary process (i.e. the speech signal) is used to predict a future sample of the process. A part of the algorithm depends on having information about parts of the sound that have not been examined yet . These values are then predicted according to the trends of the past values. The prediction is called linear if it is a linear combination of the given samples of the process.

In case of Full Rate GSM the speech encoding algorithm defines a mapping from input blocks of 160 speech samples in the 13 bit uniform PCM format to encoded blocks of 260 bits. The sampling rate of 8000 sample/s leads to an average bit rate for the encoded bit stream of 13 kbit/s. The coding scheme is the so-called Regular Pulse Excitation - Long Term Prediction - Linear Predictive Coder (RPE-LTP) , and the speech encoder 6 is therefore referred to as an RPE-LTP encoder.

A block diagram of an RPE-LTP encoder is shown in figure 2. The input speech frame, consisting of 160 signal samples (uniform 13 bit PCM samples) , is first pre- processed in the pre-processing section 20. The 160 samples obtained are then analyzed in the Linear Predictive Coding (LPC) analysis block 21 to determine the coefficients for the short term analysis filter 22, in which these coefficients or parameters are then used for the filtering of the same 160 samples. The result is 160 samples of a short term residual signal. The filter parameters, termed linear prediction coefficients or reflection coefficients, are transformed into log. area ratios, LARs, before they are output to the channel encoder unit 7.

The LPC analysis block 21 and the linear prediction coefficients are the most relevant part of the circuit in relation to the invention and will therefore be described in further detail below. The remaining parts of the speech encoder 6 are less relevant for the invention and will only be described briefly. The samples of the short term residual signal are fed from the filter 22 to the RPE (Regular Pulse Excitation) and LTP(Long Term Prediction) blocks 23, 24 of the encoder for generation of RPE and LTP parameters, respectively.

Fig. 3 shows how the filter 22 can be realized. The incoming samples s (n) are taken through a number of delay elements 30, 31, 32, and the outputs from the delay elements are multiplied by coefficients a., a₂, . . . , a_p in the multiplying elements 33, 34, 35 and then added to each other. The coefficients a., a₂, . . . , a_p are the above-mentioned linear prediction coefficients. The result is subtracted from the incoming signal in the summation point 36 and the resulting signal e(n) is the short term residual signal. This filter is an all -pole filter model of the vocal tract and the filter function is given by:

1

H(z) .

1 + ∑-V

where a._, are the above-mentioned coefficients, P is the prediction order or the number of poles of the filter. For the RPE-LTP algorithm described here, P=8. The time delay T in figure 3 corresponds to z^"1. As mentioned above, the linear prediction coefficients a_lf a₂, . . . , a_p are determined in the LPC analysis block 21. They are calculated using autocorrelation and a Schur recursion algorithm which is well known and described in the art. The details of the calculations are therefore not repeated here.

In case of Enhanced Full Rate GSM (EFR) , which is based on an Algebraic Code Excited Linear Prediction (ACELP) algorithm, the prediction order P is 10 and the linear prediction coefficients a._{l t} a₂, . . . , a_p are calculated using autocorrelation and a Levinson-Durbin algorithm, but the principles are exactly the same as for Full Rate GSM. Also Half Rate GSM (HR) , which is based on a Vector

Sum Excited Linear Prediction (VSELP) algorithm, uses the same principles for the calculation of the linear prediction coefficients a_lf a₂, . . . , a_p. Also the WCDMA

(Wideband Code Division Multiple Access) system, which is based on a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm, uses the same principles for the calculation of the linear prediction coefficients . Other names for WCDMA are UMTS (Universal Mobile Telephony System - used by ETSI) and IMT 2000 (used by ITU) . A further possibility is the US system IS- 95 based on a Quadrature Code Excited Linear Prediction (QCELP) algorithm.

Returning now to figure 1, it was noted that the lower part of the figure shows the voice dialling part 3 of the telephone 1. The voice dialling system is based on speech recognition algorithms. As will be seen from the figure, the speech recognition system is basically composed of a signal analysis (feature extraction) block 10, a pattern matching block 11 and a reference word list 12. The voice dialling system 3 works in parallel to the standard telephone functional parts 2 in time-sharing mode decided by an MMI (Man Machine Interface) control 13. This means that the computational resources of the telephone is utilized to perform the computation of voice dialling algorithms before call start-up and to perform the speech encoding during a call .

The aim of speech recognition is to assign a label, i.e. a word, to an observed acoustic signal. This means that the algorithm searches for segments of the speech signal that represent a hypothesized word. The signal analysis block 10, which could also be called a preprocessor, transforms the raw acoustic wave form into an intermediate compressed representation that is used for the subsequent processing. Typically, the signal analysis is capable of compressing the speech data by a factor of ten by extracting a set of feature vectors from the speech signal that preserves information about the uttered word.

In speech recognition the speech signal is assumed piecewise stationary and the preprocessor typically yields a feature vector every 10-20 ms calculated from a 20-30 ms window of speech. The result of the preprocessing signal analysis is thus a sequence of feature vectors (or speech frames) at 10 ms intervals with 10-30 coefficients per frame. Cepstral coefficients obtained by fourier transforming the log-magnitude spectrum of a signal have been found to be efficient feature vector representation for generating reference lists in voice dialling applications.

Figure 4 shows how the coefficients can be calculated in the signal analysis block 10. The first FFT (Fast Fourier Transform) block 37 performs a frequency transformation of the sampled input x(n) . In the next block 38 the logarithm of the magnitude spectrum x(ω) is calculated, and finally in the second FFT block 39 the FFT transform of the log-magnitude spectrum is calculated, thus arriving at the cepstral coefficients .

The pattern matching block 11 uses the information from the reference list 12 to assign words to the sequence of feature vectors received from the feature extraction block 10, and the assigned words are in turn used to control the mobile phone by voice commands or to generate the reference list. The assignment of words to the feature vectors can be done by a so-called template-based approach in which the reference list is a collection of pre-recorded word templates. The templates typically consist of a representative sequence of feature vectors for the corresponding words. The basic idea is to compare the utterance to each of the template words and then select the word that obtains the best match.

Even though the computational resources of the telephone are time-shared between speech encoding and voice dialling as described above, the voice dialling algorithms require a considerable amount of extra memory. ROM memory for program code storage typically amounts to 4.8 kbytes and the requirement of ROM/RAM memory for storage of references (i.e. the reference list 12) will typically be 1 kbyte for each word, i.e. a vocabulary of just 10 words needs 10 kbytes of memory. This imposes a limitation in implementing the voice dialling function with a sufficient vocabulary, because the available amount of memory in a portable telephone is rather limited. Further, the feature extraction block 10 of voice dialling is typically based on frequency domain parameters and consumes a large amount of memory and computational power for signal buffering, calculation of FFT routines and coefficients, log-frequency cepstral coefficients and storage.

Figure 5 shows a schematic diagram of a part of a digital mobile telephone 40 modified according to the invention. Again, the standard telephone functional parts 41 are shown at the upper part of the figure, while the lower part shows the voice dialling part 42. As can be seen, the speech encoding block 6 and the feature extraction block 10 of the telephone of figure 1 are here combined in a common block 43. It has been found that, with little or no modification, the feature extraction algorithm in the speech recognition part can use the existing signal analysis of the GSM speech encoding algorithms. Utilizing the existing speech encoding algorithm for feature extraction reduces the requirement of memory and computational resources . This means that the total memory of the telephone can be reduced or the vocabulary of words in the reference list can be increased with the existing amount of memory. Alone the 4.8 kbytes ROM that was earlier used for program code storage allows for about 5 extra words .

The idea is that the linear prediction coefficients a., a₂, . . . , a_p, which are computed in the speech encoding block, are used to derive the cepstral coefficients that are used as feature vectors instead of obtaining them by fourier transforming the signal spectrum, as described above .

An efficient computation of the cepstral coefficients based on the ai values is performed using the following simple recursive equation:

C_n = -a_n + ~∑(n - i)a,C_n_, n ,₌₁ where C_n is the nth cepstral coefficient and a. is the ith linear prediction coefficient, given a window of speech samples {x_n, n=l,N, N=160} corresponding to the input speech frame consisting of 160 signal samples that was described for the RPE-LTP encoder above. These coefficients obtained from the above equation are taken as feature vectors and used by the pattern matching block 11 to generate the reference command and word lists and control the mobile phone by voice commands. This implies that the feature extraction block can be integrated in the speech encoding block with only the extra processing given in the above equation. This will result in reduced memory (Code ROM) and computational requirement for implementing the voice dialling function.

Alternatively to the above equation the linear prediction coefficients a_1# a_2/ . . . , a_p can also be used directly as the cepstral coefficients, i.e. C_n=a_n. This provides a very simple method, which will reduce the memory requirement further. However, this embodiment has the drawback that the performance of the speech recognition algorithm is at a lower level compared to what can be achieved in prior art speech recognition. Therefore, this embodiment is preferably used in simple devices with lower quality requirements. However, in such devices reduction of the amount of needed memory is often very important .

Although a preferred embodiment of the present invention has been described and shown, the invention is not restricted to it, but may also be embodied in other ways within the scope of the subject-matter defined in the following claims .

Claims

C L A I M S

1. A method of speech processing, wherein

• digital signals representative of said speech are provided for speech encoding, said digital signals including a first set of coefficients,

• and a second set of coefficients is calculated for speech recognition, c h a r a c t e r i z e d in that said first set of coefficients is used in the calculation of said second set of coefficients.

2. A method according to claim 1, c h a r a c t e r i z e d in that said digital signals are provided by a linear prediction algorithm, and that said first set of coefficients comprises linear prediction coefficients.

3. A method according to claim 1 or 2 , c h a r a c t e r i z e d in that said first set of coefficients is used as said second set of coefficients.

4. A method according to claim 1 or 2 , c h a r a c t e r i z e d in that said second set of coefficients is calculated using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients.

5. A method according to claim 4, c h a r a c t e r i z e d in that said second set of coefficients comprises cepstral coefficients which are calculated using the recursive equation C_H=-a„+-∑(n-i)a,C_n_ n 1=1 where C_n is the nth cepstral coefficient and , is the ith linear prediction coefficient.

6. A method according to claim 5, c h a r a c t e r i z e d in that said cepstral coefficients are used as feature vectors to generate a reference list for speech recognition with a view to controlling a device by voice commands.

7. An apparatus for processing of speech, said apparatus comprising:

• speech encoding means (2; 41) for providing digital signals representative of said speech, said digital signals including a first set of coefficients,

• and speech recognition means (3; 42) in which a second set of coefficients is calculated, c h a r a c t e r i z e d in that said speech recognition means is adapted to use said first set of coefficients in the calculation of said second set of coefficients .

8. An apparatus according to claim 7 , c h a r a c t e r i z e d in that the apparatus is a digital portable telephone.

9. An apparatus according to claim 7 or 8, c h a r a c t e r i z e d in that said speech encoding means (2; 41) comprises a linear prediction algorithm and that said first set of coefficients comprises linear prediction coefficients.

10. An apparatus according to claim 9, c h a r a c t e r i z e d in that the apparatus is a GSM telephone, and that said linear prediction coefficients are calculated using an Algebraic Code Excited Linear Prediction (ACELP) algorithm.

11. An apparatus according to claim 9 , c h a r a c t e r i z e d in that the apparatus is a WCDMA telephone, and that said linear prediction coefficients are calculated using a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm.

12. An apparatus according to claims 7-11, c h a r a c t e r i z e d in that said speech recognition means (3; 42) is adapted to use said first set of coefficients as said second set of coefficients.

13. An apparatus according to claims 7-11, c h a r a c t e r i z e d in that said speech recognition means (3; 42) is adapted to calculate said second set of coefficients using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients .

14. An apparatus according to claim 13 , c h a r a c t e r i z e d in that said second set of coefficients comprises cepstral coefficients and that said speech recognition means (3; 42) is adapted to calculate said cepstral coefficients using the recursive equation

where C_n is the nth cepstral coefficient and a. is the ith linear prediction coefficient.

15. An apparatus according to claim 14, c h a r a c t e r i z e d in that said speech recognition means (3; 42) comprises a pattern matching block (11) adapted to use said cepstral coefficients as feature vectors to generate a reference list (12) for speech recognition, and is adapted to control the apparatus by voice commands .