EP1119844A1 - A method of speech processing and an apparatus for processing of speech - Google Patents

A method of speech processing and an apparatus for processing of speech

Info

Publication number
EP1119844A1
Authority
EP
European Patent Office
Prior art keywords
coefficients
speech
linear prediction
speech recognition
calculated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99956421A
Other languages
German (de)
French (fr)
Inventor
Fisseha Mekuria
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB filed Critical Telefonaktiebolaget LM Ericsson AB
Publication of EP1119844A1 publication Critical patent/EP1119844A1/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the invention relates to a method of speech processing, wherein digital signals representative of said speech are provided for speech encoding, said digital signals including a first set of coefficients, and a second set of coefficients is calculated for speech recognition.
  • the invention further relates to a corresponding apparatus.
  • speech encoders are used for compressing speech signal information and removing redundant information in order to increase the capacity of the digital telephone channel through which the speech information is to be transmitted.
  • speech encoders use signal analysis, and the speech encoding algorithms are normally based on linear prediction analysis modelling of the speech.
  • Linear Predictive Coding involves the calculation of a number of model filter coefficients called Linear Prediction Coefficients or Reflection Coefficients.
  • Such systems are normally based on speech recognition algorithms which are basically composed of a pre-processing signal analysis algorithm (extraction of a set of feature vectors), a pattern matching algorithm, and reference word lists (feature vector codebook).
  • GB 2 290 437 discloses a digital portable telephone in which a single digital processor is used to perform voice encoding processing on transmitted voice data (and decoding processing on received voice data) and voice recognition on voice commands for dialling and other telephone functions.
  • the two functions (or algorithms) can be handled by the same processor on a time-share basis because they do not normally occur simultaneously, e.g. the computational resources of the processor can be utilized to perform voice dialling algorithms before call start-up and speech encoding algorithms when a call is established.
  • this object is accomplished in that said first set of coefficients is used in the calculation of said second set of coefficients.
  • the coefficient calculation of the speech recognition uses the code already available for speech encoding, or in other words, it can be integrated in the speech encoding block with only a little extra processing. A considerable amount of memory and computational power that would otherwise be needed for e.g. the feature extraction can be saved.
  • the utilization of the code and signal processing already available effectively reduces the power consumption and the size of future mobile terminals with various speech processing functions integrated in the product.
  • said digital signals can expediently be provided by a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.
  • said first set of coefficients is used as said second set of coefficients.
  • Using the first set of coefficients directly as a substitute for the second set of coefficients provides a very simple method, which will reduce the memory requirement further.
  • this embodiment has the drawback that the performance of the speech recognition algorithm is at a lower level compared to what can be achieved in prior art speech recognition. Therefore, this embodiment is preferably used in simple devices with lower quality requirements. However, in such devices reduction of the amount of needed memory is often very important.
  • said second set of coefficients is calculated using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients.
  • cepstral coefficients can expediently be used as feature vectors to generate a reference list for speech recognition with a view to controlling a device by voice commands. In this way, for instance voice dialling in a portable telephone can be achieved.
  • the invention also relates to a corresponding apparatus for processing of speech and comprising speech encoding means for providing digital signals representative of said speech, said digital signals including a first set of coefficients, and speech recognition means in which a second set of coefficients is calculated.
  • when said speech recognition means is adapted to use said first set of coefficients in the calculation of said second set of coefficients, the above-mentioned advantages are achieved.
  • the apparatus can expediently be a digital portable telephone, and, as stated in claim 9, the speech encoding means can expediently comprise a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.
  • the apparatus can be a GSM telephone and said linear prediction coefficients be calculated using an Algebraic Code Excited Linear Prediction (ACELP) algorithm in case of GSM Enhanced Full Rate (EFR). Other possibilities are GSM Full Rate (FR) using a Regular Pulse Excitation - Long Term Prediction (RPE-LTP) algorithm and GSM Half Rate (HR) using a Vector Sum Excited Linear Prediction (VSELP) algorithm.
  • the apparatus can be a WCDMA (Wideband Code Division Multiple Access) telephone and said linear prediction coefficients be calculated using a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm.
  • Other names for WCDMA are UMTS (Universal Mobile Telephony System, used by ETSI) and IMT 2000 (used by ITU).
  • a further possibility is the US system IS-95 using a Quadrature Code Excited Linear Prediction (QCELP) algorithm.
  • the speech recognition means can either be adapted to use said first set of coefficients as said second set of coefficients, as stated in claim 12, or, as stated in claim 13, be adapted to calculate said second set of coefficients using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients.
  • the advantages of the two embodiments are as described above for the corresponding two embodiments of the method.
  • the speech recognition means can expediently be adapted to calculate said cepstral coefficients using the recursive equation
    C_n = a_n + Σ_{i=1}^{n-1} (i/n) · C_i · a_{n-i}
  • C_n is the nth cepstral coefficient and a_i is the ith linear prediction coefficient.
  • the speech recognition means can expediently comprise a pattern matching block adapted to use said cepstral coefficients as feature vectors to generate a reference list for speech recognition, and be adapted to control the apparatus by voice commands. This enables the apparatus to be e.g. a portable telephone provided with the feature of voice dialling.
  • figure 1 shows a schematic diagram of a state of the art digital mobile telephone with a voice dialling function
  • figure 2 shows a block diagram of an encoder for use in the telephone of figure 1
  • figure 3 shows an example of the implementation of a filter for the encoder shown in figure 2
  • figure 4 shows an example of the implementation of a feature extraction block in the telephone of figure 1
  • figure 5 shows a schematic diagram of a digital mobile telephone with a voice dialling function according to the invention.
  • Figure 1 shows a schematic diagram of a part of a digital mobile telephone 1 with a voice dialling function according to the state of the art.
  • the telephone can e.g. be a GSM telephone adapted for Full Rate traffic.
  • the standard telephone functional parts 2 are shown at the upper part of the figure, while the lower part shows the voice dialling part 3.
  • Speech pronounced by a user is received by a microphone 4 and fed as an analogue electrical signal to the audio part 5 comprising a sample-hold device and an analogue-to-digital converter.
  • the sampling rate is 8000 samples/s and the digital output signal is a 13 bit uniform PCM signal.
  • the speech encoder 6 takes its input as the 13 bit uniform PCM signal from the audio part 5 and the encoded speech at the output of the speech encoder is delivered to a channel encoder unit 7, and therefrom to a radio part 8 and an antenna 9.
  • the channel encoder unit 7, the radio part 8 and the antenna 9 are not described in further detail here, because they have no relevance for the invention.
  • the telephone will normally have a corresponding receiving part, and also this part is irrelevant to the invention and therefore not described here.
  • the speech encoder 6 performs signal analysis for compressing the speech signal information and removing redundant information, thereby increasing the capacity of the digital telephone channel.
  • the speech encoding algorithm used in the speech encoder 6 is based on linear prediction analysis modelling of the speech production process.
  • Prediction constitutes a form of estimation, i.e. a finite set of present and past samples of a stationary process (i.e. the speech signal) is used to predict a future sample of the process.
  • a part of the algorithm depends on having information about parts of the sound that have not been examined yet. These values are then predicted according to the trends of the past values.
  • the prediction is called linear if it is a linear combination of the given samples of the process.
  • the speech encoding algorithm defines a mapping from input blocks of 160 speech samples in the 13 bit uniform PCM format to encoded blocks of 260 bits.
  • the sampling rate of 8000 sample/s leads to an average bit rate for the encoded bit stream of 13 kbit/s.
  • the coding scheme is the so-called Regular Pulse Excitation - Long Term Prediction - Linear Predictive Coder (RPE-LTP), and the speech encoder 6 is therefore referred to as an RPE-LTP encoder.
  • A block diagram of an RPE-LTP encoder is shown in figure 2.
  • the input speech frame, consisting of 160 signal samples (uniform 13 bit PCM samples), is first pre-processed in the pre-processing section 20.
  • the 160 samples obtained are then analyzed in the Linear Predictive Coding (LPC) analysis block 21 to determine the coefficients for the short term analysis filter 22, in which these coefficients or parameters are then used for the filtering of the same 160 samples.
  • the filter parameters, termed linear prediction coefficients or reflection coefficients, are transformed into log area ratios (LARs) before they are output to the channel encoder unit 7.
  • the LPC analysis block 21 and the linear prediction coefficients are the most relevant part of the circuit in relation to the invention and will therefore be described in further detail below.
  • the remaining parts of the speech encoder 6 are less relevant for the invention and will only be described briefly.
  • the samples of the short term residual signal are fed from the filter 22 to the RPE (Regular Pulse Excitation) and LTP (Long Term Prediction) blocks 23, 24 of the encoder for generation of RPE and LTP parameters, respectively.
  • Fig. 3 shows how the filter 22 can be realized.
  • the incoming samples s(n) are taken through a number of delay elements 30, 31, 32, and the outputs from the delay elements are multiplied by coefficients a_1, a_2, ..., a_P in the multiplying elements 33, 34, 35 and then added to each other.
  • the coefficients a_1, a_2, ..., a_P are the above-mentioned linear prediction coefficients.
  • the result is subtracted from the incoming signal in the summation point 36 and the resulting signal e(n) is the short term residual signal.
  • This filter is an all-pole filter model of the vocal tract and the filter function is given by:
    H(z) = 1 / (1 - Σ_{i=1}^{P} a_i · z^{-i})
  • a_i are the above-mentioned coefficients
  • P is the prediction order or the number of poles of the filter.
  • here, P = 8.
  • the time delay T in figure 3 corresponds to z^{-1}.
  • the linear prediction coefficients a_1, a_2, ..., a_P are determined in the LPC analysis block 21. They are calculated using autocorrelation and a Schur recursion algorithm which is well known and described in the art. The details of the calculations are therefore not repeated here.
  • Other speech encoding algorithms can be used as well: GSM Enhanced Full Rate (EFR) uses an Algebraic Code Excited Linear Prediction (ACELP) algorithm, and GSM Half Rate (HR) uses a Vector Sum Excited Linear Prediction (VSELP) algorithm.
  • WCDMA (Wideband Code Division Multiple Access) telephones use a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm. Other names for WCDMA are UMTS (Universal Mobile Telephony System, used by ETSI) and IMT 2000 (used by ITU).
  • a further possibility is the US system IS-95 based on a Quadrature Code Excited Linear Prediction (QCELP) algorithm.
  • the voice dialling system is based on speech recognition algorithms.
  • the speech recognition system is basically composed of a signal analysis (feature extraction) block 10, a pattern matching block 11 and a reference word list 12.
  • the voice dialling system 3 works in parallel to the standard telephone functional parts 2 in a time-sharing mode decided by an MMI (Man Machine Interface) control 13. This means that the computational resources of the telephone are utilized to perform the computation of voice dialling algorithms before call start-up and to perform the speech encoding during a call.
  • the aim of speech recognition is to assign a label, i.e. a word, to an observed acoustic signal.
  • the signal analysis block 10, which could also be called a preprocessor, transforms the raw acoustic waveform into an intermediate compressed representation that is used for the subsequent processing.
  • the signal analysis is capable of compressing the speech data by a factor of ten by extracting a set of feature vectors from the speech signal that preserves information about the uttered word.
  • the speech signal is assumed piecewise stationary and the preprocessor typically yields a feature vector every 10-20 ms calculated from a 20-30 ms window of speech.
  • the result of the preprocessing signal analysis is thus a sequence of feature vectors (or speech frames) at 10 ms intervals with 10-30 coefficients per frame.
  • Cepstral coefficients obtained by Fourier transforming the log-magnitude spectrum of a signal have been found to be an efficient feature vector representation for generating reference lists in voice dialling applications.
  • Figure 4 shows how the coefficients can be calculated in the signal analysis block 10.
  • the first FFT (Fast Fourier Transform) block 37 performs a frequency transformation of the sampled input x(n) .
  • the logarithm of the magnitude spectrum X(ω) is then calculated, and finally in the second FFT block 39 the FFT of the log-magnitude spectrum is calculated, thus arriving at the cepstral coefficients.
  • the pattern matching block 11 uses the information from the reference list 12 to assign words to the sequence of feature vectors received from the feature extraction block 10, and the assigned words are in turn used to control the mobile phone by voice commands or to generate the reference list.
  • the assignment of words to the feature vectors can be done by a so-called template-based approach in which the reference list is a collection of pre-recorded word templates.
  • the templates typically consist of a representative sequence of feature vectors for the corresponding words. The basic idea is to compare the utterance to each of the template words and then select the word that obtains the best match.
  • ROM memory for program code storage typically amounts to 4.8 kbytes, and the requirement of ROM/RAM memory for storage of references (i.e. the reference list 12) will typically be 1 kbyte for each word, i.e. a vocabulary of just 10 words needs 10 kbytes of memory.
  • the feature extraction block 10 of voice dialling is typically based on frequency domain parameters and consumes a large amount of memory and computational power for signal buffering, calculation of FFT routines and coefficients, log-frequency cepstral coefficients and storage.
  • Figure 5 shows a schematic diagram of a part of a digital mobile telephone 40 modified according to the invention. Again, the standard telephone functional parts 41 are shown at the upper part of the figure, while the lower part shows the voice dialling part 42.
  • the speech encoding block 6 and the feature extraction block 10 of the telephone of figure 1 are here combined in a common block 43. It has been found that, with little or no modification, the feature extraction algorithm in the speech recognition part can use the existing signal analysis of the GSM speech encoding algorithms. Utilizing the existing speech encoding algorithm for feature extraction reduces the requirement of memory and computational resources. This means that the total memory of the telephone can be reduced, or the vocabulary of words in the reference list can be increased with the existing amount of memory. The 4.8 kbytes of ROM that were earlier used for program code storage alone allow for about 5 extra words.
  • the linear prediction coefficients a_1, a_2, ..., a_P, which are computed in the speech encoding block, are used to derive the cepstral coefficients that are used as feature vectors, instead of obtaining them by Fourier transforming the signal spectrum as described above.
  • These coefficients obtained from the above equation are taken as feature vectors and used by the pattern matching block 11 to generate the reference command and word lists and to control the mobile phone by voice commands. This implies that the feature extraction block can be integrated in the speech encoding block with only the extra processing given in the above equation. This will result in reduced memory (code ROM) and computational requirements for implementing the voice dialling function.
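For comparison with the LPC-derived features described above, the conventional feature extraction of figure 4 (FFT, logarithm of the magnitude spectrum, second transform) can be sketched as below. This is an illustrative approximation, not the patent's implementation: an inverse FFT is used for the final stage, which for a real log-magnitude spectrum differs from the forward FFT shown in figure 4 only by a scale factor.

```python
import numpy as np

def fft_cepstrum(x, n_ceps=12):
    """Conventional cepstrum sketch (figure 4 of the patent):
    FFT of the frame, log of the magnitude spectrum, then a
    second transform, keeping the first few coefficients as
    the feature vector."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real        # real cepstrum of the frame
    return cepstrum[:n_ceps]
```

The memory cost this incurs (signal buffering plus two transforms per frame) is exactly what the invention avoids by reusing the encoder's LPC coefficients.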

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

In a method of speech processing digital signals representative of the speech are provided for speech encoding, the digital signals including a first set of coefficients. A second set of coefficients is calculated for speech recognition. The first set of coefficients is used in the calculation of the second set of coefficients. A corresponding apparatus comprises speech encoding means (2; 41) for providing digital signals representative of the speech and including a first set of coefficients, and speech recognition means (3; 42) in which a second set of coefficients is calculated. The speech recognition means is adapted to use the first set of coefficients in the calculation of the second set of coefficients. In this way the memory requirement for a device including speech encoding as well as speech recognition is reduced considerably.

Description

A METHOD OF SPEECH PROCESSING AND AN APPARATUS FOR PROCESSING OF SPEECH
The invention relates to a method of speech processing, wherein digital signals representative of said speech are provided for speech encoding, said digital signals including a first set of coefficients, and a second set of coefficients is calculated for speech recognition. The invention further relates to a corresponding apparatus.
In devices for speech processing such as modern digital portable telephones (e.g. for the GSM system or similar systems) speech encoders are used for compressing speech signal information and removing redundant information in order to increase the capacity of the digital telephone channel through which the speech information is to be transmitted. Such speech encoders use signal analysis, and the speech encoding algorithms are normally based on linear prediction analysis modelling of the speech. The use of Linear Predictive Coding involves the calculation of a number of model filter coefficients called Linear Prediction Coefficients or Reflection Coefficients.
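The calculation of linear prediction coefficients from a speech frame can be sketched as follows. This is a minimal illustration using the autocorrelation method and a Levinson-Durbin recursion; the GSM encoder described later in this document uses a Schur recursion instead, but both solve the same linear prediction problem.

```python
import numpy as np

def lpc_coefficients(frame, order=8):
    """Sketch: estimate linear prediction coefficients a_1..a_P for
    one speech frame via the autocorrelation method and a
    Levinson-Durbin recursion (illustrative, not the GSM algorithm)."""
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)  # a[0] is implicitly 1 and unused
    e = r[0]                 # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)
    return a[1:]  # a_1 .. a_P
```

For a frame that is (nearly) a first-order autoregressive process, the recursion recovers the generating coefficient, which is the sense in which these coefficients model the speech production filter.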
Further, there is a wish to combine such devices with a voice actuated function for controlling the use of the device. In digital telephones this could be in the form of a so-called voice dialling function for making telephone calls and accessing information from a database. Such systems are normally based on speech recognition algorithms which are basically composed of a pre-processing signal analysis algorithm (extraction of a set of feature vectors), a pattern matching algorithm, and reference word lists (feature vector codebook).
In the state of the art the speech recognition algorithm (for voice dialling) is performed separately from the basic speech encoding algorithms of the device. GB 2 290 437 discloses a digital portable telephone in which a single digital processor is used to perform voice encoding processing on transmitted voice data (and decoding processing on received voice data) and voice recognition on voice commands for dialling and other telephone functions. The two functions (or algorithms) can be handled by the same processor on a time-share basis because they do not normally occur simultaneously, e.g. the computational resources of the processor can be utilized to perform voice dialling algorithms before call start-up and speech encoding algorithms when a call is established. By using the same processor to perform both algorithms the amount of hardware, and consequently the cost, size and weight of the telephone, is reduced.
Even though the two algorithms share the same processor in the device of GB 2 290 437, they are still performed as separate algorithms, each of them having a considerable memory requirement for program code storage and storage of calculation results and references. Especially the feature extraction part of the voice dialling consumes a large amount of memory and computational power for signal buffering, calculation of routines and coefficients, and storage. This memory requirement imposes a limitation on implementing the voice dialling function with a sufficiently large vocabulary.
Therefore, it is an object of the invention to provide a method of the above-mentioned type which can perform both the speech encoding algorithm and the speech recognition algorithm with a considerably reduced memory requirement.
In accordance with the invention, this object is accomplished in that said first set of coefficients is used in the calculation of said second set of coefficients.
When the coefficients of the speech recognition algorithm (i.e. the feature extraction) are calculated from the coefficients calculated in the speech encoding algorithm, then the coefficient calculation of the speech recognition uses the code already available for speech encoding, or in other words, it can be integrated in the speech encoding block with only a little extra processing. A considerable amount of memory and computational power that would otherwise be needed for e.g. the feature extraction can be saved. The utilization of the code and signal processing already available effectively reduces the power consumption and the size of future mobile terminals with various speech processing functions integrated in the product.
As stated in claim 2, said digital signals can expediently be provided by a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.
According to a first embodiment of the invention, which is stated in claim 3, said first set of coefficients is used as said second set of coefficients. Using the first set of coefficients directly as a substitute for the second set of coefficients provides a very simple method, which will reduce the memory requirement further. However, this embodiment has the drawback that the performance of the speech recognition algorithm is at a lower level compared to what can be achieved in prior art speech recognition. Therefore, this embodiment is preferably used in simple devices with lower quality requirements. However, in such devices reduction of the amount of needed memory is often very important.
According to an alternative embodiment of the invention, which is stated in claim 4, said second set of coefficients is calculated using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients. Calculating the second set of coefficients in this way provides coefficients which are suitable for robust speech recognition, and further, the coefficient parameters are compact and utilize code that already exists. Therefore, this embodiment results in a considerable reduction of the memory requirement while maintaining the performance level of the prior art. When the second set of coefficients comprises cepstral coefficients, as stated in claim 5, these can expediently be calculated using the recursive equation
C_n = a_n + Σ_{i=1}^{n-1} (i/n) · C_i · a_{n-i}
where C_n is the nth cepstral coefficient and a_i is the ith linear prediction coefficient.
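The equation as reproduced in this copy is garbled; the recursion below is the standard conversion from linear prediction coefficients to cepstral coefficients, assumed here because it matches the surrounding description (each C_n depends on the corresponding a_n and on previously calculated coefficients from both sets):

```python
def lpc_to_cepstrum(a, n_ceps=None):
    """Sketch of the standard LPC-to-cepstrum recursion:
        C_n = a_n + sum_{i=1}^{n-1} (i/n) * C_i * a_{n-i}
    a is the list [a_1, ..., a_P] of linear prediction coefficients."""
    p = len(a)
    if n_ceps is None:
        n_ceps = p
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0          # the a_n term
        for i in range(1, n):
            a_ni = a[n - i - 1] if (n - i) <= p else 0.0  # a_{n-i}
            acc += (i / n) * c[i - 1] * a_ni       # previously calculated C_i
        c.append(acc)
    return c
```

Note how little extra processing is involved: no buffering and no transforms, just a short loop over coefficients that the speech encoder has already computed.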
As stated in claim 6, the cepstral coefficients can expediently be used as feature vectors to generate a reference list for speech recognition with a view to controlling a device by voice commands. In this way, for instance voice dialling in a portable telephone can be achieved.
As mentioned, the invention also relates to a corresponding apparatus for processing of speech and comprising speech encoding means for providing digital signals representative of said speech, said digital signals including a first set of coefficients, and speech recognition means in which a second set of coefficients is calculated. When said speech recognition means is adapted to use said first set of coefficients in the calculation of said second set of coefficients, the above-mentioned advantages are achieved.
As stated in claim 8, the apparatus can expediently be a digital portable telephone, and, as stated in claim 9, the speech encoding means can expediently comprise a linear prediction algorithm so that said first set of coefficients comprises linear prediction coefficients.
As stated in claim 10, the apparatus can be a GSM telephone and said linear prediction coefficients be calculated using an Algebraic Code Excited Linear Prediction (ACELP) algorithm in case of GSM Enhanced Full Rate (EFR). Other possibilities are GSM Full Rate (FR) using a Regular Pulse Excitation - Long Term Prediction (RPE-LTP) algorithm and GSM Half Rate (HR) using a Vector Sum Excited Linear Prediction (VSELP) algorithm. According to an alternative embodiment, which is stated in claim 11, the apparatus can be a WCDMA (Wideband Code Division Multiple Access) telephone and said linear prediction coefficients be calculated using a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm. Other names for WCDMA are UMTS (Universal Mobile Telephony System, used by ETSI) and IMT 2000 (used by ITU). A further possibility is the US system IS-95 using a Quadrature Code Excited Linear Prediction (QCELP) algorithm.
The speech recognition means can either be adapted to use said first set of coefficients as said second set of coefficients, as stated in claim 12, or, as stated in claim 13, be adapted to calculate said second set of coefficients using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients. The advantages of the two embodiments are as described above for the corresponding two embodiments of the method. When, as stated in claim 14, the second set of coefficients in the last-mentioned embodiment comprises cepstral coefficients, the speech recognition means can expediently be adapted to calculate said cepstral coefficients using the recursive equation
Cn = -an - (1/n) Σ(i=1 to n-1) (n-i) ai Cn-i

where Cn is the nth cepstral coefficient and ai is the ith linear prediction coefficient.
As stated in claim 15, the speech recognition means can expediently comprise a pattern matching block adapted to use said cepstral coefficients as feature vectors to generate a reference list for speech recognition, and be adapted to control the apparatus by voice commands. This enables the apparatus to be e.g. a portable telephone provided with the feature of voice dialling.
The invention will now be described more fully below with reference to the drawing, in which
figure 1 shows a schematic diagram of a state of the art digital mobile telephone with a voice dialling function,
figure 2 shows a block diagram of an encoder for use in the telephone of figure 1,
figure 3 shows an example of the implementation of a filter for the encoder shown in figure 2,
figure 4 shows an example of the implementation of a feature extraction block in the telephone of figure 1, and
figure 5 shows a schematic diagram of a digital mobile telephone with a voice dialling function according to the invention.
Figure 1 shows a schematic diagram of a part of a digital mobile telephone 1 with a voice dialling function according to the state of the art. The telephone can e.g. be a GSM telephone adapted for Full Rate traffic. The standard telephone functional parts 2 are shown at the upper part of the figure, while the lower part shows the voice dialling part 3.
Speech pronounced by a user is received by a microphone 4 and fed as an analogue electrical signal to the audio part 5 comprising a sample-hold device and an analogue-to-digital converter. The sampling rate is 8000 samples/s and the digital output signal is a 13 bit uniform PCM signal.
The speech encoder 6 takes its input as the 13 bit uniform PCM signal from the audio part 5, and the encoded speech at the output of the speech encoder is delivered to a channel encoder unit 7, and therefrom to a radio part 8 and an antenna 9. The channel encoder unit 7, the radio part 8 and the antenna 9 are not described in further detail here, because they have no relevance for the invention. The telephone will normally have a corresponding receiving part, and also this part is irrelevant to the invention and therefore not described here. The speech encoder 6 performs signal analysis for compressing the speech signal information and removing redundant information, thereby increasing the capacity of the digital telephone channel. The speech encoding algorithm used in the speech encoder 6 is based on linear prediction analysis modelling of the speech production process. Prediction constitutes a form of estimation, i.e. a finite set of present and past samples of a stationary process (i.e. the speech signal) is used to predict a future sample of the process. A part of the algorithm depends on having information about parts of the sound that have not been examined yet. These values are then predicted according to the trends of the past values. The prediction is called linear if it is a linear combination of the given samples of the process.
In case of Full Rate GSM the speech encoding algorithm defines a mapping from input blocks of 160 speech samples in the 13 bit uniform PCM format to encoded blocks of 260 bits. The sampling rate of 8000 samples/s leads to an average bit rate for the encoded bit stream of 13 kbit/s. The coding scheme is the so-called Regular Pulse Excitation - Long Term Prediction - Linear Predictive Coder (RPE-LTP), and the speech encoder 6 is therefore referred to as an RPE-LTP encoder.
A block diagram of an RPE-LTP encoder is shown in figure 2. The input speech frame, consisting of 160 signal samples (uniform 13 bit PCM samples), is first pre-processed in the pre-processing section 20. The 160 samples obtained are then analyzed in the Linear Predictive Coding (LPC) analysis block 21 to determine the coefficients for the short term analysis filter 22, in which these coefficients or parameters are then used for the filtering of the same 160 samples. The result is 160 samples of a short term residual signal. The filter parameters, termed linear prediction coefficients or reflection coefficients, are transformed into log area ratios (LARs) before they are output to the channel encoder unit 7.
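By way of illustration only, the transformation of reflection coefficients into log area ratios can be sketched as follows; the function name is illustrative, and the piecewise-linear quantised approximation actually specified for the GSM standard is omitted from this sketch:

```python
import numpy as np

def log_area_ratios(refl):
    """Transform reflection coefficients r_i into log area ratios,
    LAR_i = log((1 + r_i) / (1 - r_i)).  Valid for |r_i| < 1, which
    holds for a stable short term filter."""
    refl = np.asarray(refl, dtype=float)
    return np.log((1.0 + refl) / (1.0 - refl))
```

The LAR representation spreads the sensitive region near |r| = 1 over a wider range, which makes the parameters better suited for quantisation and transmission.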
The LPC analysis block 21 and the linear prediction coefficients are the most relevant part of the circuit in relation to the invention and will therefore be described in further detail below. The remaining parts of the speech encoder 6 are less relevant for the invention and will only be described briefly. The samples of the short term residual signal are fed from the filter 22 to the RPE (Regular Pulse Excitation) and LTP (Long Term Prediction) blocks 23, 24 of the encoder for generation of RPE and LTP parameters, respectively.
Fig. 3 shows how the filter 22 can be realized. The incoming samples s(n) are taken through a number of delay elements 30, 31, 32, and the outputs from the delay elements are multiplied by coefficients a1, a2, ..., aP in the multiplying elements 33, 34, 35 and then added to each other. The coefficients a1, a2, ..., aP are the above-mentioned linear prediction coefficients. The result is subtracted from the incoming signal in the summation point 36 and the resulting signal e(n) is the short term residual signal. This filter is an all-pole filter model of the vocal tract and the filter function is given by:
H(z) = 1 / (1 + Σ(i=1 to P) ai z^-i)

where ai are the above-mentioned coefficients and P is the prediction order, or the number of poles of the filter. For the RPE-LTP algorithm described here, P=8. The time delay T in figure 3 corresponds to z^-1. As mentioned above, the linear prediction coefficients a1, a2, ..., aP are determined in the LPC analysis block 21. They are calculated using autocorrelation and a Schur recursion algorithm, which is well known and described in the art. The details of the calculations are therefore not repeated here.
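By way of illustration only, the short term analysis filtering of figure 3 can be sketched as follows. Function and variable names are illustrative; the sign convention follows the filter function H(z) given above, so the delayed, weighted samples are added to s(n) (the subtraction described for figure 3 corresponds to coefficients of the opposite sign):

```python
import numpy as np

def short_term_analysis_filter(s, a):
    """Apply the analysis filter A(z) = 1 + sum_i a_i * z^-i to a
    speech frame s, yielding the short term residual signal e(n).
    a holds the linear prediction coefficients a_1..a_P
    (P = 8 for the RPE-LTP algorithm)."""
    P = len(a)
    e = np.array(s, dtype=float)
    for n in range(len(s)):
        for i in range(1, P + 1):
            if n - i >= 0:
                # delayed sample weighted by a_i, per the delay line of figure 3
                e[n] += a[i - 1] * s[n - i]
    return e
```

When the coefficients model the frame well, the residual e(n) carries much less energy than s(n), which is what makes the subsequent RPE/LTP coding efficient.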
In case of Enhanced Full Rate GSM (EFR), which is based on an Algebraic Code Excited Linear Prediction (ACELP) algorithm, the prediction order P is 10 and the linear prediction coefficients a1, a2, ..., aP are calculated using autocorrelation and a Levinson-Durbin algorithm, but the principles are exactly the same as for Full Rate GSM. Also Half Rate GSM (HR), which is based on a Vector Sum Excited Linear Prediction (VSELP) algorithm, uses the same principles for the calculation of the linear prediction coefficients a1, a2, ..., aP. Also the WCDMA (Wideband Code Division Multiple Access) system, which is based on a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm, uses the same principles for the calculation of the linear prediction coefficients. Other names for WCDMA are UMTS (Universal Mobile Telephony System - used by ETSI) and IMT 2000 (used by ITU). A further possibility is the US system IS-95 based on a Quadrature Code Excited Linear Prediction (QCELP) algorithm.
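By way of illustration only, a generic Levinson-Durbin recursion of the kind referred to above can be sketched as follows. The sketch solves the normal equations for the convention A(z) = 1 + Σ ai z^-i used in the filter function earlier; the standardised routines differ in details such as fixed-point arithmetic and windowing of the autocorrelation values:

```python
import numpy as np

def levinson_durbin(r, P):
    """Compute prediction coefficients a_1..a_P from autocorrelation
    values r[0..P] by the Levinson-Durbin recursion
    (convention A(z) = 1 + sum_i a_i * z^-i)."""
    a = np.zeros(P)
    err = float(r[0])                       # prediction error energy
    for m in range(1, P + 1):
        # reflection coefficient for order m
        acc = r[m] + np.dot(a[:m - 1], r[m - 1:0:-1])
        k = -acc / err
        a_new = np.copy(a)
        a_new[m - 1] = k
        for i in range(m - 1):              # update lower-order coefficients
            a_new[i] = a[i] + k * a[m - 2 - i]
        a = a_new
        err *= (1.0 - k * k)                # error shrinks at each order
    return a
```

For an AR(1) process with r(k) = 0.5^k the recursion yields a1 = -0.5 and a2 = 0, i.e. the predictor x(n) ≈ 0.5 x(n-1), as expected.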
Returning now to figure 1, it was noted that the lower part of the figure shows the voice dialling part 3 of the telephone 1. The voice dialling system is based on speech recognition algorithms. As will be seen from the figure, the speech recognition system is basically composed of a signal analysis (feature extraction) block 10, a pattern matching block 11 and a reference word list 12. The voice dialling system 3 works in parallel to the standard telephone functional parts 2 in a time-sharing mode decided by an MMI (Man Machine Interface) control 13. This means that the computational resources of the telephone are utilized to perform the computation of the voice dialling algorithms before call start-up and to perform the speech encoding during a call.
The aim of speech recognition is to assign a label, i.e. a word, to an observed acoustic signal. This means that the algorithm searches for segments of the speech signal that represent a hypothesized word. The signal analysis block 10, which could also be called a preprocessor, transforms the raw acoustic waveform into an intermediate compressed representation that is used for the subsequent processing. Typically, the signal analysis is capable of compressing the speech data by a factor of ten by extracting a set of feature vectors from the speech signal that preserves information about the uttered word.
In speech recognition the speech signal is assumed piecewise stationary, and the preprocessor typically yields a feature vector every 10-20 ms calculated from a 20-30 ms window of speech. The result of the preprocessing signal analysis is thus a sequence of feature vectors (or speech frames) at 10 ms intervals with 10-30 coefficients per frame. Cepstral coefficients, obtained by Fourier transforming the log-magnitude spectrum of a signal, have been found to be an efficient feature vector representation for generating reference lists in voice dialling applications.
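By way of illustration only, the division of the sampled speech into overlapping analysis windows described above can be sketched as follows. A 25 ms window every 10 ms at 8000 samples/s is assumed; the exact values are illustrative within the 20-30 ms / 10-20 ms ranges given above:

```python
def frame_signal(x, fs=8000, frame_ms=25, step_ms=10):
    """Split a sampled signal x into overlapping analysis windows:
    one frame every step_ms, each frame_ms long."""
    frame_len = int(fs * frame_ms / 1000)   # 200 samples at 8000 samples/s
    step = int(fs * step_ms / 1000)         # 80 samples -> 10 ms hop
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, step)]
```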
Figure 4 shows how the coefficients can be calculated in the signal analysis block 10. The first FFT (Fast Fourier Transform) block 37 performs a frequency transformation of the sampled input x(n). In the next block 38 the logarithm of the magnitude spectrum X(ω) is calculated, and finally in the second FFT block 39 the FFT transform of the log-magnitude spectrum is calculated, thus arriving at the cepstral coefficients.
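By way of illustration only, the computation of figure 4 can be sketched as follows. The second transform is taken here as an inverse FFT, which for the real and even log-magnitude spectrum agrees with a forward FFT up to scaling; the function name and the number of retained coefficients are illustrative:

```python
import numpy as np

def fft_cepstral_coefficients(x, n_coeffs=12):
    """Cepstrum of a frame x via FFT -> log magnitude -> inverse FFT,
    following the block structure of figure 4."""
    spectrum = np.fft.fft(x)                     # block 37: frequency transform
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # block 38: log magnitude
    cepstrum = np.fft.ifft(log_mag).real         # block 39: second transform
    return cepstrum[:n_coeffs]
```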
The pattern matching block 11 uses the information from the reference list 12 to assign words to the sequence of feature vectors received from the feature extraction block 10, and the assigned words are in turn used to control the mobile phone by voice commands or to generate the reference list. The assignment of words to the feature vectors can be done by a so-called template-based approach, in which the reference list is a collection of pre-recorded word templates. The templates typically consist of a representative sequence of feature vectors for the corresponding words. The basic idea is to compare the utterance to each of the template words and then select the word that obtains the best match.
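By way of illustration only, the template-based matching described above can be sketched as follows. Feature-vector sequences of equal length are assumed; a practical recogniser would additionally align the sequences, e.g. by dynamic time warping:

```python
import numpy as np

def match_word(utterance, templates):
    """Return the word whose stored feature-vector sequence is closest
    (in summed squared distance) to the observed utterance.
    templates maps word -> sequence of feature vectors."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        dist = np.sum((np.asarray(utterance) - np.asarray(template)) ** 2)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```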
Even though the computational resources of the telephone are time-shared between speech encoding and voice dialling as described above, the voice dialling algorithms require a considerable amount of extra memory. ROM memory for program code storage typically amounts to 4.8 kbytes, and the requirement of ROM/RAM memory for storage of references (i.e. the reference list 12) will typically be 1 kbyte for each word, i.e. a vocabulary of just 10 words needs 10 kbytes of memory. This imposes a limitation on implementing the voice dialling function with a sufficient vocabulary, because the available amount of memory in a portable telephone is rather limited. Further, the feature extraction block 10 of voice dialling is typically based on frequency domain parameters and consumes a large amount of memory and computational power for signal buffering, calculation of FFT routines and coefficients, log-frequency cepstral coefficients and storage.
Figure 5 shows a schematic diagram of a part of a digital mobile telephone 40 modified according to the invention. Again, the standard telephone functional parts 41 are shown at the upper part of the figure, while the lower part shows the voice dialling part 42. As can be seen, the speech encoding block 6 and the feature extraction block 10 of the telephone of figure 1 are here combined in a common block 43. It has been found that, with little or no modification, the feature extraction algorithm in the speech recognition part can use the existing signal analysis of the GSM speech encoding algorithms. Utilizing the existing speech encoding algorithm for feature extraction reduces the requirement of memory and computational resources. This means that the total memory of the telephone can be reduced, or the vocabulary of words in the reference list can be increased with the existing amount of memory. The 4.8 kbytes of ROM that was earlier used for program code storage alone allows for about 5 extra words.
The idea is that the linear prediction coefficients a1, a2, ..., aP, which are computed in the speech encoding block, are used to derive the cepstral coefficients that are used as feature vectors, instead of obtaining them by Fourier transforming the signal spectrum as described above.
An efficient computation of the cepstral coefficients based on the ai values is performed using the following simple recursive equation:
Cn = -an - (1/n) Σ(i=1 to n-1) (n-i) ai Cn-i

where Cn is the nth cepstral coefficient and ai is the ith linear prediction coefficient, given a window of speech samples {xn, n=1,...,N, N=160} corresponding to the input speech frame consisting of 160 signal samples that was described for the RPE-LTP encoder above. The coefficients obtained from the above equation are taken as feature vectors and used by the pattern matching block 11 to generate the reference command and word lists and to control the mobile phone by voice commands. This implies that the feature extraction block can be integrated in the speech encoding block with only the extra processing given in the above equation. This results in reduced memory (code ROM) and computational requirements for implementing the voice dialling function.
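By way of illustration only, the recursive computation given above can be sketched as follows. The function name is illustrative, and the convention A(z) = 1 + Σ ai z^-i of the filter function given earlier is assumed:

```python
def lpc_to_cepstrum(a, n_coeffs=None):
    """Convert linear prediction coefficients a_1..a_P to cepstral
    coefficients C_1..C_n via the recursion
    C_n = -a_n - (1/n) * sum_{i=1..n-1} (n-i) * a_i * C_{n-i},
    taking a_i = 0 for i > P."""
    P = len(a)
    if n_coeffs is None:
        n_coeffs = P
    c = []
    for n in range(1, n_coeffs + 1):
        an = a[n - 1] if n <= P else 0.0
        # sum over previously calculated cepstral coefficients
        acc = sum((n - i) * a[i - 1] * c[n - i - 1]
                  for i in range(1, min(n, P + 1)))
        c.append(-an - acc / n)
    return c
```

No transform of the signal itself is needed: each Cn follows from the current an and the already-computed lower-order coefficients, which is the source of the memory and computation savings described above.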
Alternatively to the above equation, the linear prediction coefficients a1, a2, ..., aP can also be used directly as the cepstral coefficients, i.e. Cn = an. This provides a very simple method, which will reduce the memory requirement further. However, this embodiment has the drawback that the performance of the speech recognition algorithm is at a lower level compared to what can be achieved in prior art speech recognition. Therefore, this embodiment is preferably used in simple devices with lower quality requirements. However, in such devices reduction of the amount of needed memory is often very important.
Although a preferred embodiment of the present invention has been described and shown, the invention is not restricted to it, but may also be embodied in other ways within the scope of the subject-matter defined in the following claims.

C L A I M S
1. A method of speech processing, wherein
• digital signals representative of said speech are provided for speech encoding, said digital signals including a first set of coefficients,
• and a second set of coefficients is calculated for speech recognition, c h a r a c t e r i z e d in that said first set of coefficients is used in the calculation of said second set of coefficients.
2. A method according to claim 1, c h a r a c t e r i z e d in that said digital signals are provided by a linear prediction algorithm, and that said first set of coefficients comprises linear prediction coefficients.
3. A method according to claim 1 or 2, c h a r a c t e r i z e d in that said first set of coefficients is used as said second set of coefficients.
4. A method according to claim 1 or 2, c h a r a c t e r i z e d in that said second set of coefficients is calculated using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients.
5. A method according to claim 4, c h a r a c t e r i z e d in that said second set of coefficients comprises cepstral coefficients which are calculated using the recursive equation

Cn = -an - (1/n) Σ(i=1 to n-1) (n-i) ai Cn-i

where Cn is the nth cepstral coefficient and ai is the ith linear prediction coefficient.
6. A method according to claim 5, c h a r a c t e r i z e d in that said cepstral coefficients are used as feature vectors to generate a reference list for speech recognition with a view to controlling a device by voice commands.
7. An apparatus for processing of speech, said apparatus comprising:
• speech encoding means (2; 41) for providing digital signals representative of said speech, said digital signals including a first set of coefficients,
• and speech recognition means (3; 42) in which a second set of coefficients is calculated, c h a r a c t e r i z e d in that said speech recognition means is adapted to use said first set of coefficients in the calculation of said second set of coefficients .
8. An apparatus according to claim 7, c h a r a c t e r i z e d in that the apparatus is a digital portable telephone.
9. An apparatus according to claim 7 or 8, c h a r a c t e r i z e d in that said speech encoding means (2; 41) comprises a linear prediction algorithm and that said first set of coefficients comprises linear prediction coefficients.
10. An apparatus according to claim 9, c h a r a c t e r i z e d in that the apparatus is a GSM telephone, and that said linear prediction coefficients are calculated using an Algebraic Code Excited Linear Prediction (ACELP) algorithm.
11. An apparatus according to claim 9, c h a r a c t e r i z e d in that the apparatus is a WCDMA telephone, and that said linear prediction coefficients are calculated using a Conjugate Structure - Code Excited Linear Prediction (CS-CELP) algorithm.
12. An apparatus according to claims 7-11, c h a r a c t e r i z e d in that said speech recognition means (3; 42) is adapted to use said first set of coefficients as said second set of coefficients.
13. An apparatus according to claims 7-11, c h a r a c t e r i z e d in that said speech recognition means (3; 42) is adapted to calculate said second set of coefficients using a recursive equation so that each coefficient in said second set of coefficients depends on a corresponding coefficient in said first set of coefficients and on previously calculated coefficients from said first set and/or said second set of coefficients.
14. An apparatus according to claim 13, c h a r a c t e r i z e d in that said second set of coefficients comprises cepstral coefficients and that said speech recognition means (3; 42) is adapted to calculate said cepstral coefficients using the recursive equation

Cn = -an - (1/n) Σ(i=1 to n-1) (n-i) ai Cn-i

where Cn is the nth cepstral coefficient and ai is the ith linear prediction coefficient.
15. An apparatus according to claim 14, c h a r a c t e r i z e d in that said speech recognition means (3; 42) comprises a pattern matching block (11) adapted to use said cepstral coefficients as feature vectors to generate a reference list (12) for speech recognition, and is adapted to control the apparatus by voice commands.
EP99956421A 1998-10-09 1999-10-05 A method of speech processing and an apparatus for processing of speech Withdrawn EP1119844A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
SE9803458 1998-10-09
SE9803458A SE9803458L (en) 1998-10-09 1998-10-09 Method of speech processing and speech processing apparatus
PCT/SE1999/001763 WO2000022608A1 (en) 1998-10-09 1999-10-05 A method of speech processing and an apparatus for processing of speech

Publications (1)

Publication Number Publication Date
EP1119844A1 true EP1119844A1 (en) 2001-08-01

Family

ID=20412901

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99956421A Withdrawn EP1119844A1 (en) 1998-10-09 1999-10-05 A method of speech processing and an apparatus for processing of speech

Country Status (7)

Country Link
EP (1) EP1119844A1 (en)
JP (1) JP2002527796A (en)
CN (1) CN1322346A (en)
AU (1) AU1303800A (en)
SE (1) SE9803458L (en)
TR (1) TR200101881T2 (en)
WO (1) WO2000022608A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5704004A (en) * 1993-12-01 1997-12-30 Industrial Technology Research Institute Apparatus and method for normalizing and categorizing linear prediction code vectors using Bayesian categorization technique
JP2606142B2 (en) * 1994-06-15 1997-04-30 日本電気株式会社 Digital mobile phone
WO1996008005A1 (en) * 1994-09-07 1996-03-14 Motorola Inc. System for recognizing spoken sounds from continuous speech and method of using same
DE4433366A1 (en) * 1994-09-20 1996-03-21 Sel Alcatel Ag Method and device for determining a measure of the correspondence between two patterns and speech recognition device with it and program module therefor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0022608A1 *

Also Published As

Publication number Publication date
TR200101881T2 (en) 2001-10-22
CN1322346A (en) 2001-11-14
JP2002527796A (en) 2002-08-27
SE9803458D0 (en) 1998-10-09
WO2000022608A1 (en) 2000-04-20
SE9803458L (en) 2000-04-10
AU1303800A (en) 2000-05-01

Similar Documents

Publication Publication Date Title
KR100391287B1 (en) Speech recognition method and system using compressed speech data, and digital cellular telephone using the system
US5305421A (en) Low bit rate speech coding system and compression
CN1120471C (en) Speech coding
JP4607334B2 (en) Distributed speech recognition system
RU2366007C2 (en) Method and device for speech restoration in system of distributed speech recognition
US6182036B1 (en) Method of extracting features in a voice recognition system
US5680506A (en) Apparatus and method for speech signal analysis
US5884251A (en) Voice coding and decoding method and device therefor
US20040148160A1 (en) Method and apparatus for noise suppression within a distributed speech recognition system
US6728669B1 (en) Relative pulse position in celp vocoding
JP2645465B2 (en) Low delay low bit rate speech coder
JP2006171751A (en) Speech coding apparatus and method therefor
KR100463559B1 (en) Method for searching codebook in CELP Vocoder using algebraic codebook
EP1119844A1 (en) A method of speech processing and an apparatus for processing of speech
KR100794140B1 (en) Apparatus and Method for extracting noise-robust the speech recognition vector sharing the preprocessing step used in speech coding
US6385574B1 (en) Reusing invalid pulse positions in CELP vocoding
Chazan et al. Low bit rate speech compression for playback in speech recognition systems
Gersho Concepts and paradigms in speech coding
Kaleka Effectiveness of Linear Predictive Coding in Telephony based applications of Speech Recognition
WO2001031636A2 (en) Speech recognition on gsm encoded data
CA2297191A1 (en) A vocoder-based voice recognizer

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010405

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)

17Q First examination report despatched

Effective date: 20050318

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20050729