WO2008001991A1 - Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding - Google Patents


Info

Publication number
WO2008001991A1
WO2008001991A1 (PCT/KR2006/005831)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
signals
estimation value
channel
noise
Prior art date
Application number
PCT/KR2006/005831
Other languages
French (fr)
Inventor
Chang-Sun Ryu
Jae-In Kim
Hong Kook Kim
Jae Sam Yoon
Yoo Rhee Oh
Original Assignee
Kt Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kt Corporation
Publication of WO2008001991A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • The present invention relates to an apparatus and method for extracting a speech feature vector in a distributed speech recognition terminal; more particularly, to an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing step used for speech coding with the pre-processing step used for speech feature vector extraction.
  • Distributed speech recognition (DSR) implements speech recognition with a simple-structured terminal such as a mobile phone: the terminal extracts characteristics of the speech signals, and a high-performance speech recognition server performs speech recognition based on the characteristics received from the terminal. That is, DSR is a dual processing system.
  • A Mel-frequency cepstral coefficient (MFCC) is generally used for speech recognition. The MFCC represents the frequency spectrum, expressed on the Mel scale, as sinusoidal components, and serves as a speech feature vector, i.e., a speech recognition parameter representing the speech received from a user.
  • The terminal extracts the speech feature vector of the speech received from the user based on the MFCC, loads the speech feature vector into a bit stream so that it can be transmitted through a communication network, and transmits the bit stream to the speech recognition server. That is, the MFCCs extracted from the user's speech are mapped to the nearest vectors in a codebook having a predetermined number of codewords, and the mapped vectors are selected and transmitted as a bit stream.
  • The codebook has a codeword for each group of similar values corresponding to the speech spoken by the user. Generally, a codeword is determined by extracting training data from a large amount of speech data and selecting a representative value from the extracted training data, as illustrated by the sketch below.
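As a rough illustration of this nearest-codeword mapping, a quantizer might look like the following Python sketch; all names and sizes here are hypothetical and not taken from the patent:

```python
import numpy as np

def quantize_mfcc(mfcc_vector, codebook):
    # Map an MFCC vector to the index of the nearest codeword
    # (minimum Euclidean distance), mirroring the codebook mapping above.
    distances = np.linalg.norm(codebook - mfcc_vector, axis=1)
    return int(np.argmin(distances))

# Hypothetical sizes: a 256-codeword codebook of 13-dimensional vectors.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 13))   # stands in for a trained codebook
mfcc = rng.standard_normal(13)              # one extracted feature vector
index = quantize_mfcc(mfcc, codebook)       # this index goes into the bit stream
```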
  • The speech recognition server dequantizes the speech feature vector loaded in the bit stream received from the terminal and recognizes the word corresponding to the speech based on a hidden Markov model (HMM) as the speech model.
  • The HMM models a phoneme, i.e., the unit for recognizing speech, and completes words and sentences by matching the phonemes inputted to the speech recognition engine against the phonemes stored in the engine's database.
  • Recently, the mobile phone has been highlighted as a distributed speech recognition terminal in line with the digital convergence trend, and a module for speech signal processing, i.e., a speech coding module, is embedded in the mobile phone.
  • When the speech feature vector corresponding to the user's speech is extracted, pre-processing of the speech signals, specifically noise attenuation, is needed.
  • However, the pre-processing step for speech coding and the pre-processing step for speech recognition are performed individually in general mobile phones. That is, the pre-processing of the user's speech is the same for speech coding and for speech recognition, but it is performed separately.
  • Since the pre-processing is performed in different pre-processing apparatuses, additional memory and operations are needed in a simple-structured terminal, which wastes resources.
  • In addition, the speech pre-processing for speech coding incurs internal delay in the terminal, which causes a switching delay between the speech coding process and the speech recognition process. For example, when the user is using the speech recognition function of the terminal and a call arrives, answering the incoming call is delayed.
  • Hereinafter, the pre-processing for speech coding and the pre-processing for speech recognition in a conventional terminal will be described.
  • a conventional terminal includes a speech coding module and a distributed speech recognition front-end module.
  • the speech coding module includes a pre-processing unit for speech coding, a model parameter estimation unit, a first compression unit and a first bit stream transmitting unit.
  • the distributed speech recognition front-end module includes a pre-processing unit for speech recognition, an MFCC front-end unit, a second compression unit and a second bit stream transmitting unit.
  • In the conventional terminal, the speech coding module and the distributed speech recognition front-end module each attenuate the noise mixed with the user's speech separately, because the pre-processed signals for speech coding and speech recognition are handled independently. Since the speech coding module and the distributed speech recognition front-end module perform the same function, a method for integrating speech coding and speech recognition by sharing the pre-processing steps is needed.
  • An embodiment of the present invention is directed to providing an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing steps used in speech coding with the pre-processing steps used for extracting the speech recognition feature vector.
  • In one aspect, an apparatus for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal includes: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the high-pass filtered signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise estimation unit; the background noise estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding an update of the noise estimation value based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR based on the sum of voice metrics; a channel gain computation unit for acquiring a linear channel gain based on the modified channel SNR and the background noise energy estimation value; a frequency domain filter for applying the linear channel gain to the spectral signals; and a time domain conversion unit for converting the gain-applied spectral signals into speech signals in a time domain.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating noise in the low frequency signals passed by the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, a method for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal includes the steps of: eliminating low frequency signals of speech signals received from the outside; converting the filtered signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; issuing a noise estimation value updating command based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value; updating the background noise energy estimation value when the noise estimation value updating command is received; estimating a channel SNR of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the gain-applied spectral signals into time domain speech signals.
  • The present invention requires a small amount of memory, requires little computation, and improves speech recognition performance by sharing the pre-processing between speech coding and speech recognition.
  • The present invention can also prevent the delay caused by switching between the speech coding process and the speech recognition process, which is otherwise incurred by separate speech coding and speech feature vector extraction pre-processing steps.
  • In addition, the present invention can attenuate the noise mixed in the user's speech signal during both speech coding and speech feature vector extraction.
  • Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing preprocessing steps used in a speech coding in accordance with an embodiment of the present invention
  • Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention
  • Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing 11 kHz speech signal in accordance with an embodiment of the present invention
  • Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention
  • Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention
  • Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention
  • Fig. 7 is a graph showing speech recognition performance by using a speech feature vector in accordance with an embodiment of the present invention.
  • Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing pre- processing steps at a speech coding in a distributed speech recognition terminal in accordance with an embodiment of the present invention.
  • the distributed speech recognition terminal e.g., mobile phone, having the apparatus for extracting a noise-robust speech feature vector includes a speech coding module 150 and a distributed speech recognition front-end module 100, but as shown in Fig. 1, a speech coding/recognition pre-processing block 11 is shared by a pre-processing step of the speech coding module 150 and a pre-processing step of the distributed speech recognition front-end module 100.
  • the distributed speech recognition front- end module 100 includes the speech coding/recognition pre-processing block 11, a speech feature vector extraction block, e.g., MFCC front-end block, 12, a first speech compression block 13 and a first bit stream transmission block 14.
  • the speech coding module 150 includes the speech coding/recognition preprocessing block 11, a speech coding block 15, a second speech compression block 16 and a second bit stream transmission block 17.
  • the terminal includes a switch 50 for shifting between a speech coding mode and a speech recognition mode.
  • coded signals of speech spoken by the user are transmitted to a mobile communication system through a voice traffic channel in the speech coding mode; and extracted speech feature vectors of speech spoken by the user are transmitted to the speech recognition server through a packet data channel in the speech recognition mode .
  • The speech coding/recognition pre-processing block 11 attenuates noise in the 8 kHz input speech spoken by the user.
  • a separate noise attenuation block is not used in the distributed speech recognition front-end module 100 and the speech coding/recognition pre-processing block 11 is used as a noise attenuation block.
  • noise attenuation function is performed in the speech coding/recognition pre-processing block 11 for extracting a noise-robust speech feature vector (MFCCs) in the distributed speech recognition front-end module 100.
  • the speech coding/recognition preprocessing block 11 attenuates noise to extract speech feature vectors (MFCCs) which are robust to noise in the speech feature extraction block 12.
  • the speech coding/recognition pre-processing block 11 is realized in a specification capable of performing both pre-processing for speech coding and pre-processing for speech recognition.
  • The speech coding/recognition pre-processing block 11 in accordance with an embodiment of the present invention will be described in detail referring to Fig. 2. Since the constituent elements 12, 13, 14, 15, 16 and 17 of Fig. 1 are well known, their detailed description is omitted.
  • Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention.
  • the speech coding/recognition pre-processing block 11 in accordance with the present invention includes a high pass filter 21, a frequency domain conversion unit 22, a channel energy estimation unit 23, a channel SNR estimation unit 24, a voice metric calculation unit 25, a spectral deviation estimation unit 26, a noise update decision unit 27, a channel SNR modifying unit 28, a channel gain computation unit 29, a background noise estimation unit 30, a frequency domain filter 31 and a time domain conversion unit 32.
  • The speech coding/recognition pre-processing block 11 may be implemented based on the IS-127 Enhanced Variable Rate Codec (EVRC) used in CDMA, whose specification is suitable both for the speech coding pre-processing for speech communication and for the speech feature pre-processing for speech recognition.
  • The input speech signal s_LFB(n) spoken by the user and inputted into the speech coding/recognition pre-processing block 11 is 16-bit uniform pulse code modulation (PCM) data with an 8 kHz sampling frequency.
  • the speech coding/recognition pre-processing block 11 of the present invention mainly performs noise attenuation. Therefore, noise attenuated signal s 1 (n) is outputted when the input speech signal S LFB ( ⁇ ) is inputted as shown in Fig. 2.
  • noise attenuated signal s 1 (n) is outputted when the input speech signal S LFB ( ⁇ ) is inputted as shown in Fig. 2.
  • The high pass filter 21 eliminates the low frequency band of the input speech signal s_LFB(n) inputted through a microphone; the cutoff frequency of the high pass filter 21 is 120 Hz.
  • The signal filtered by the high pass filter 21 is defined as s_hp(n), the noise attenuation object signal.
  • The frame size of the noise attenuation object signal is 10 ms, and the current frame is denoted m.
  • The frequency domain conversion unit 22 converts the filtered signal s_hp(n) from the high pass filter 21 into a frequency domain signal using a smoothed trapezoidal window, i.e., windowing. The frequency domain conversion steps are described in detail below.
  • In the smoothed trapezoidal window, the first D samples of the input frame buffer d(m,n) of the m-th frame are overlapped with the last D samples of the previous frame: d(m,n) = d(m-1, L+n), 0 ≤ n < D (Eq. 1). Here, m is the current frame; n is a sample index of the input buffer d(m,n); L is the frame length, e.g., 80; and D is the overlap (delay) of samples, e.g., 24.
  • The remaining samples of the input buffer are pre-emphasized as d(m, D+n) = s_hp(n) + ξ_p·s_hp(n-1), 0 ≤ n < L (Eq. 2), where ξ_p is a pre-emphasis coefficient, e.g., -0.8.
  • By Eq. 1, the input buffer has L+D samples, e.g., 104; the first D samples are the pre-emphasized, overlapped part carried over from the previous frame, and the samples after the first D samples are the pre-emphasized input of the current frame. The windowed signal is acquired by applying the smoothed trapezoidal window to the input buffer as in Eq. 3:
  • g(n) = d(m,n)·sin²(π(n+0.5)/2D) for 0 ≤ n < D; g(n) = d(m,n) for D ≤ n < L; g(n) = d(m,n)·sin²(π(n-L+D+0.5)/2D) for L ≤ n < D+L; g(n) = 0 for D+L ≤ n < M (Eq. 3).
  • Here, M is the length of the discrete Fourier transform (DFT), e.g., 128, and the spectral signal G(k) is acquired by the M-point DFT: G(k) = (2/M)·Σ_{n=0}^{M-1} g(n)·e^{-j2πnk/M}, 0 ≤ k < M (Eq. 4).
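To make the framing pipeline of Eqs. 1-4 concrete, here is a minimal Python sketch; the constants follow the example values above, and the buffer handling is an assumption based on this description rather than on the IS-127 reference code:

```python
import numpy as np

L, D, M = 80, 24, 128   # frame length, overlap, DFT length (example values above)
XI_P = -0.8             # pre-emphasis coefficient

def frame_to_spectrum(d_prev, s_hp, s_hp_last):
    # d_prev:    previous frame's input buffer of L+D samples
    # s_hp:      current high-pass filtered frame, L samples
    # s_hp_last: last s_hp sample of the previous frame (for n = 0 in Eq. 2)
    d = np.empty(L + D)
    d[:D] = d_prev[L:]                                   # Eq. 1: carry over D samples
    shifted = np.concatenate(([s_hp_last], s_hp[:-1]))   # s_hp(n - 1)
    d[D:] = s_hp + XI_P * shifted                        # Eq. 2: pre-emphasis

    g = np.zeros(M)
    n = np.arange(D)
    g[:D] = d[:D] * np.sin(np.pi * (n + 0.5) / (2 * D)) ** 2           # Eq. 3, rise
    g[D:L] = d[D:L]                                                    # flat section
    k = np.arange(L, L + D)
    g[L:L + D] = d[L:L + D] * np.sin(np.pi * (k - L + D + 0.5) / (2 * D)) ** 2  # fall

    G = (2.0 / M) * np.fft.fft(g)                        # Eq. 4: M-point DFT
    return d, G
```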
  • the spectral signal G(k) transformed into the frequency domain signal in the frequency domain conversion unit 22 is used as an input signal of the channel energy estimation unit 23.
  • The channel energy estimation unit 23 acquires the channel energy estimation value of the following Eq. 5 for the current frame m of the spectral signal G(k) inputted from the frequency domain conversion unit 22.
  • Here, E_min is the minimum permissible channel energy, e.g., 0.0625; α_ch(m) is the channel energy smoothing factor, expressed as the following Eq. 6; and N_c is the number of combined channels, e.g., 16.
  • f_L(i) and f_H(i) are the low frequency DFT bin and the high frequency DFT bin of the i-th channel, respectively.
  • When the channel energy estimation value is obtained by Eq. 5, if the channel energy smoothing factor α_ch(m) of the first frame is 0, the channel energy estimation value is initialized to the unfiltered channel energy of the first frame (see the sketch below).
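Eq. 5 itself is not reproduced in this text, so the following Python sketch assumes the usual floored exponential smoothing of the per-channel mean of |G(k)|², which is consistent with the definitions and the initialization rule above; the band edges are the f_L and f_H values listed later in the text:

```python
import numpy as np

E_MIN, N_C = 0.0625, 16
F_L = np.array([2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56])
F_H = np.array([3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63])

def channel_energy(G, E_ch_prev, alpha_ch):
    # Smoothed per-channel energy estimate; alpha_ch = 0 on the first
    # frame, which initializes the estimate to the unfiltered energy.
    E_ch = np.empty(N_C)
    for i in range(N_C):
        band = np.abs(G[F_L[i]:F_H[i] + 1]) ** 2
        E_ch[i] = max(E_MIN, alpha_ch * E_ch_prev[i] + (1 - alpha_ch) * band.mean())
    return E_ch
```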
  • The channel SNR estimation unit 24 estimates the signal-to-noise ratio (SNR) existing in each channel.
  • the channel SNR estimation unit 24 acquires quantized channel SNR indices as the following Eq. 7 based on the channel energy estimation value obtained in the channel energy estimation unit 23 and a background noise energy estimation value obtained in the background noise estimation unit 30.
  • Here, E_n(m,i), obtained in the background noise estimation unit 30, is the noise energy estimation value of the current channel, and the quantized channel SNR index σ_q(i) obtained from it ranges from 0 to 89.
  • The voice metric calculation unit 25 acquires the sum of voice metrics in the current channel as the following Eq. 8, based on the quantized channel SNR indices σ_q(i) estimated in the channel SNR estimation unit 24.
  • V(k) is a voice metric having 90 elements as follows:
  • V(k) = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
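The SNR quantization and voice metric summation (Eqs. 7 and 8) can be sketched as follows. The 0.375 dB quantization step is an assumption inferred from the 0 to 89 index range, and the table is padded with trailing 50s to reach the 90 elements the text states, since the printed list is a few entries short:

```python
import numpy as np

V = np.array(
    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6,
     7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17,
     18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33,
     34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
    + [50] * 14)  # padded to the stated 90 entries; the table saturates at 50

def quantized_snr_indices(E_ch, E_n, step_db=0.375):
    # Eq. 7 sketch: per-channel SNR in dB quantized to indices 0..89.
    snr_db = 10.0 * np.log10(np.maximum(E_ch, 1e-12) / np.maximum(E_n, 1e-12))
    return np.clip(np.round(snr_db / step_db), 0, 89).astype(int)

def voice_metric_sum(sigma_q):
    # Eq. 8 sketch: sum of the voice metrics over all channels.
    return int(V[sigma_q].sum())
```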
  • The spectral deviation estimation unit 26 estimates the spectral deviation of the current channel signal based on the channel energy estimation value E_ch(m,i) obtained in the channel energy estimation unit 23. The estimation process of the spectral deviation is described below.
  • E_dB(m) is the average long-term power spectrum estimation value obtained in the previous frame.
  • An initial value of the average long-term power spectrum estimation value is set to the log power spectrum estimation value of the first frame, as the following Eq. 11.
  • A total energy estimation value of the m-th frame is obtained from the channel energy estimation value E_ch(m) as the following Eq. 12.
  • The total energy estimation value E_tot(m) and the difference value Δ_E(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value are inputted into the noise update decision unit 27 in order to update the background noise estimation value.
  • An exponential window function factor α(m) is a function of the total energy estimation value E_tot(m) and is obtained by the following Eq. 13.
  • The exponential window function factor α(m) obtained by Eq. 13 is limited to the range from α_L to α_H: α(m) = max{α_L, min{α_H, α(m)}} (Eq. 14).
  • E_H and E_L are the dB-scale boundary energies corresponding to the linear interpolation values of E_tot(m) expressed by α(m), when α(m) is limited from α_L to α_H.
  • The exponential window function factor α(m) is determined as 0.745 for a signal having a relative energy of 40 dB.
  • The average long-term power spectrum estimation value of the next frame is updated based on the exponential window function factor α(m) and the initial value of E_dB(m), as the following Eq. 15.
  • The noise update decision unit 27 issues a command, i.e., update_flag, ordering an update of the noise estimation value obtained in the background noise estimation unit 30, based on the total channel energy estimation value E_tot(m) and the difference value Δ_E(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value obtained in the spectral deviation estimation unit 26, following the logic expressed below in pseudo code.
  • The channel SNR modifying unit 28 modifies the values of the quantized channel SNR indices σ_q(i) estimated in the channel SNR estimation unit 24 based on v(m), the sum of voice metrics in the current channel calculated in the voice metric calculation unit 25.
  • The modified channel SNR indices σ_q''(i) are used as an input parameter of the channel gain computation unit 29.
  • The following logic, expressed in pseudo code, shows the modification of the SNR estimation value.
  • The channel gain computation unit 29 calculates a linear channel gain γ_ch based on the channel SNR indices σ_q''(i) modified in the channel SNR modifying unit 28 and the background noise energy estimation value E_n(m) estimated in the background noise estimation unit 30. The linear channel gain calculation is described in detail below.
  • γ_min is the minimum overall gain, e.g., -13; E_floor is the noise floor energy, e.g., 1; and the background noise energy estimation value E_n(m) is the estimate from the background noise estimation unit 30. A channel gain in dB is then acquired as the following Eq. 17.
  • μ_g is the slope of the gain, e.g., 0.39. It is desirable that the channel gain be converted into a linear channel gain as the following Eq. 18.
  • The frequency domain filter 31 applies the linear channel gain γ_ch calculated in the channel gain computation unit 29 to the spectral signal G(k) transformed in the frequency domain conversion unit 22, as the following Eq. 19.
  • H(k) = γ_ch(i)·G(k), f_L(i) ≤ k ≤ f_H(i), 0 ≤ i < N_c (Eq. 19)
  • H(M-k) = H*(k), 0 < k < M/2 (Eq. 20)
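In code, the gain application and symmetry restoration of Eqs. 19 and 20 might look like this sketch; the function and parameter names are hypothetical:

```python
import numpy as np

def apply_channel_gains(G, gamma_ch, f_L, f_H, M=128):
    # Eq. 19: scale each channel's DFT bins by its linear gain.
    # Eq. 20: restore Hermitian symmetry so the inverse DFT is real.
    H = G.astype(complex)
    for i, g in enumerate(gamma_ch):
        H[f_L[i]:f_H[i] + 1] *= g            # Eq. 19
    for k in range(1, M // 2):
        H[M - k] = np.conj(H[k])             # Eq. 20
    return H
```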
  • The background noise estimation unit 30 estimates the noise energy estimation value E_n(m) of the noise signals existing in the current channel and updates it based on the command, i.e., update_flag, received from the noise update decision unit 27.
  • The background noise estimation unit 30 updates the channel noise estimation value of the next frame as the following Eq. 21.
  • E_min is the minimum channel energy, e.g., 0.0625, and α_n is the channel noise smoothing factor, e.g., 0.9. Meanwhile, the noise estimation values of the first 4 frames are initialized with the channel energy estimation values, respectively:
  • E_n(m,i) = max{E_init, E_ch(m,i)}, 1 ≤ m ≤ 4, 0 ≤ i < N_c (Eq. 22)
  • E_init is the minimum channel noise initial energy, e.g., 16.
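A sketch of the noise update follows; Eq. 21's body is not reproduced in the text, so the smoothing form below is an assumption built from the factors just defined:

```python
import numpy as np

E_MIN, E_INIT, ALPHA_N = 0.0625, 16.0, 0.9

def update_background_noise(E_n, E_ch, m, update_flag):
    # Eq. 22: the first 4 frames initialize the noise estimate from the
    # channel energy. Eq. 21 is assumed to be a floored exponential
    # smoothing with the factor ALPHA_N, as its body is not reproduced.
    if 1 <= m <= 4:
        return np.maximum(E_INIT, E_ch)                                   # Eq. 22
    if update_flag:
        return np.maximum(E_MIN, ALPHA_N * E_n + (1.0 - ALPHA_N) * E_ch)  # Eq. 21
    return E_n
```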
  • the time domain conversion unit 32 converts noise attenuated speech signals, i.e., speech signals in the frequency domain, inputted through the frequency domain filter 31 into speech signals in the time domain.
  • a time domain conversion process will be described in detail.
  • filtered signals in the frequency domain filter 31 are transformed into time domain signals based on inverse DFT as the following Eq. 23.
  • ⁇ d is a de-emphasis factor, e.g., 0.8; and s' (n) is an output buffer which can accommodate 320 samples.
  • noise-attenuated speech signal S' (n) can be obtained in the speech coding/recognition pre-processing block 11.
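The synthesis side (Eq. 23 plus de-emphasis) can be sketched as follows; the overlap handling of the 320-sample output buffer is simplified to a single frame, and the de-emphasis recursion is an assumption that inverts the Eq. 2 pre-emphasis:

```python
import numpy as np

L, D, M = 80, 24, 128
ZETA_D = 0.8   # de-emphasis factor (from the text)

def spectrum_to_frame(H, s_prev_last):
    # Eq. 23 sketch: inverse M-point DFT of the filtered spectrum, then
    # de-emphasis s'(n) = h(n) + ZETA_D * s'(n-1); s_prev_last is the
    # last output sample of the previous frame.
    h = np.real(np.fft.ifft(H)) * (M / 2.0)   # undo the 2/M DFT scaling
    s_out = np.empty(L)
    prev = s_prev_last
    for n in range(L):
        s_out[n] = h[D + n] + ZETA_D * prev
        prev = s_out[n]
    return s_out
```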
  • The noise attenuated speech signal s'(n) is inputted into the speech feature vector extraction block 12 of the distributed speech recognition front-end module 100 or into the speech coding block 15 of the speech coding module 150, according to the speech recognition mode or the speech coding mode, respectively.
  • Since the frame size of the noise attenuation object signal is 10 ms, as described above for the speech coding/recognition pre-processing block 11, the noise attenuation is performed once every 10 ms. Therefore, the noise attenuated speech signal outputted from the speech coding/recognition pre-processing block 11 is s'(n), 240 ≤ n < 320.
  • The noise attenuated speech signal may be outputted differently according to the frame size of the noise attenuation object signal.
  • The method corresponding to the speech coding/recognition pre-processing block 11 for the speech feature vector extracting module and the speech coding module consists of time-series steps that are well known in the speech signal processing field; therefore, a detailed description of the method is omitted.
  • Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing an 11 kHz speech signal in accordance with an embodiment of the present invention.
  • Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention.
  • An 8 kHz user speech signal is the noise attenuation object signal in the speech coding/recognition pre-processing block 11 of Fig. 2.
  • Accordingly, a speech coding/recognition pre-processing block for processing an 11 kHz speech signal is presented in Fig. 3, and a speech coding/recognition pre-processing block for processing a 16 kHz speech signal is presented in Fig. 4.
  • The first expanded speech coding/recognition pre-processing block, for 11 kHz, further includes a frequency down-sampler 41 for converting the 11 kHz speech signal into an 8 kHz speech signal in front of the speech coding/recognition pre-processing block of Fig. 2.
  • The speech signal down-sampled in the frequency down-sampler 41 is inputted into the speech coding/recognition pre-processing block 11.
  • The second expanded speech coding/recognition pre-processing block, for 16 kHz, further includes a low pass quadrature mirror filter (QMF LP, decimation by 2) 46 and a high pass quadrature mirror filter (QMF HP, decimation by 2 and spectral inversion) 47 in front of the speech coding/recognition pre-processing block of Fig. 2.
  • The QMF LP 46 receives the inputted 16 kHz speech signal and outputs the 0 to 4 kHz low frequency band signal, and the QMF HP 47 receives the inputted 16 kHz speech signal and outputs the 4 to 8 kHz high frequency band signal.
  • The low frequency signal outputted from the QMF LP 46 is inputted into the speech coding/recognition pre-processing block, and the high frequency signal outputted from the QMF HP 47 is inputted into the speech feature vector extraction block 12, i.e., the MFCC front-end, of the distributed speech recognition front-end module 100.
  • 26 Mel-filter banks are used to extract the speech feature vectors, e.g., MFCCs.
  • The low frequency signal outputted from the QMF LP 46 reaches the speech feature vector extraction block 12 through the speech coding/recognition pre-processing block. Then the low frequency signal and the high frequency signal outputted from the QMF HP 47 are combined into one signal in the speech feature vector extraction block 12: before the log filter bank energies are converted into cepstrum coefficients, the high frequency and low frequency components are added, and the log parameters (log-energy) for all frequency bands are obtained from the high frequency and low frequency signals.
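The band recombination described here might be sketched as follows; the split of the 26 Mel filters between the two bands, the shapes and all names are assumptions for illustration:

```python
import numpy as np

def mfcc_from_split_bands(fb_low, fb_high, n_ceps=13):
    # Sketch of the combination above: low-band and high-band Mel filter
    # bank outputs are joined before the log and cepstrum (DCT) stages.
    fb = np.concatenate([fb_low, fb_high])     # combined filter bank vector
    log_fb = np.log(np.maximum(fb, 1e-10))     # log filter bank energies
    n_fb = fb.size
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_fb)[None, :]
    dct = np.cos(np.pi * k * (j + 0.5) / n_fb) # DCT-II basis
    return dct @ log_fb                        # MFCCs of the combined bands
```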
  • The expanded speech coding/recognition pre-processing blocks of Figs. 3 and 4 can be implemented according to the frequency extension specification of the European Telecommunications Standards Institute (ETSI) DSR standard (ETSI ES 202 050 v1.1.3) in order to use 11 kHz or 16 kHz sampling frequency signals.
  • Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention
  • Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention
  • Fig. 7 is a graph showing a speech recognition performance based on a speech feature vector in accordance with an embodiment of the present invention.
  • The present invention can be applied to the distributed speech recognition terminal, e.g., a mobile phone, and its effect on the speech recognition performance needs to be verified.
  • Fig. 5 shows the speech recognition process based on a hidden Markov model (HMM).
  • Speech features are extracted from the speech spoken by the user at step 301, and then pattern matching 302 is performed by searching an acoustic model 303, a language model 304 and a pronunciation dictionary 305 according to the extracted speech features.
  • a word or a sentence is recognized in response to the speech.
  • A method suggested in the ETSI standard "ETSI ES 201 108" is used for the extraction of the speech features 301.
  • The speech features are extracted from the speech signal through the MFCC, the speech feature vector is formed as high-order coefficients, and the word stream having the maximum probability is searched through pattern matching based on the acoustic model 303, the language model 304 and the pronunciation dictionary 305 in response to the speech feature vector.
  • In one comparison case, a noise attenuated signal produced by the pre-processing defined in the ETSI DSR standard, i.e., ETSI ES 202 050 v1.1.3, is used as the speech signal for extracting the speech characteristics.
  • In another case, a noise attenuated signal produced by the pre-processing steps defined in IS-127 is used as the speech signal for extracting the speech characteristics.
  • In the present invention, the noise attenuated signal outputted from the speech coding/recognition pre-processing block 11 is used as the speech signal for extracting the speech characteristics.
  • 13th-order MFCCs and log-energy are extracted using the MFCC front-end module.
  • The 12th-order MFCCs (c_0, ..., c_12), log-energy, and their deltas and delta-deltas are used as the parameters for acoustic model training and speech recognition.
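For illustration, the delta and delta-delta parameters can be computed with the standard regression formula below; the window width and the array shapes are assumptions, since the text only names the parameter set:

```python
import numpy as np

def deltas(feats, w=2):
    # Regression deltas over a +/- w frame window (w = 2 is an assumption).
    T = feats.shape[0]
    denom = 2.0 * sum(t * t for t in range(1, w + 1))
    padded = np.pad(feats, ((w, w), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for t in range(T):
        for k in range(1, w + 1):
            out[t] += k * (padded[t + w + k] - padded[t + w - k])
    return out / denom

static = np.random.randn(100, 14)       # hypothetical: MFCCs + log-energy per frame
d = deltas(static)                      # delta features
dd = deltas(d)                          # delta-delta features
features = np.hstack([static, d, dd])   # final observation vectors
```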
  • HMM is used for the acoustic model 303.
  • a phone model in accordance with the language is used as the acoustic model.
  • the training process for generating the context independent phone model will be described referring to Fig. 6.
  • a monophone-based model as a context independent phone model is generated based on the speech feature vector extracted from training data at step S401.
  • a triphone-based model as a context dependent phone model is generated by expanding the monophone-based model at step S403. Then, a state-tying is performed considering that the training data for the triphone-based model is small at step S404.
  • a final acoustic model is generated by increasing the number of mixture densities of a result acoustic model acquired by performing the state tying at step S405.
  • The language model 304 shown in Fig. 5 adopts a statistical estimation method.
  • The statistical estimation method statistically estimates the probability of available word sequences from the speech database in a predetermined environment.
  • A language model adopting the statistical estimation method is the n-gram.
  • The probability of a word sequence is approximated by multiplying the preceding n conditional probabilities.
  • a bigram language model is used.
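As a toy illustration of the bigram approximation, with hypothetical counts and an add-k smoothing that the text does not specify:

```python
import math
from collections import Counter

def bigram_logprob(sentence, unigrams, bigrams, vocab_size, k=1.0):
    # Bigram model: P(w_1..w_n) is approximated by the product of the
    # conditional probabilities P(w_i | w_{i-1}), here with add-k smoothing.
    words = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + k) /
                         (unigrams[prev] + k * vocab_size))
    return logp

# usage with toy counts collected from a hypothetical training text
unigrams = Counter({"<s>": 2, "the": 2, "cat": 1})
bigrams = Counter({("<s>", "the"): 2, ("the", "cat"): 1})
print(bigram_logprob("the cat", unigrams, bigrams, vocab_size=3))
```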
  • The pronunciation dictionary provided by "CleanSent01" of SiTEC is used for Korean, and the "CMU dictionary v0.6" provided by Carnegie Mellon University is used for English.
  • Pronunciations of phrasal words that are not covered by "CleanSent01" are supplied by a pronunciation converter produced for this purpose based on the "standard pronunciation method of the standard language rule."
  • A phrasal word is composed of a word and an auxiliary word.
  • The total number of phrasal words in the pronunciation dictionary provided by "CleanSent01" is 36,104, and the total number of phrasal words in the pronunciation dictionary for speech recognition is 223,857.
  • A sentence speech DB (e.g., CleanSent01) is used for Korean, and the AURORA 4 DB (e.g., Wall Street Journal) is used for English.
  • 5,000 sentences among the text data used in training and 3,000 sentences among the 'speech recognition language model usage text DB' may be used for generating the language model.
  • A hidden Markov model toolkit (HTK) v3.1 is used to generate the language model, and the final language model includes 31,582 words.
  • The finally acquired model includes a network of 31,582 words.
  • Referring to Fig. 7, the word recognition rate using the conventional noise attenuated speech signal is 68.61%, and the word recognition rate using the noise attenuated speech signal in accordance with the present invention is 69.31%. That is, the speech recognition performance of the present invention is improved over that of the conventional method.
  • As described above, a noise-robust speech feature vector can be extracted by sharing the speech coding pre-processing and the speech feature vector extraction pre-processing in a simple-structured terminal. Therefore, the speech recognition performance is improved with a small amount of memory and operations in the simple-structured terminal.
  • the above described method according to the present invention can be embodied as a program and be stored on a computer readable recording medium.
  • the computer readable recording medium is any data storage device that can store data which can be read by the computer system.
  • The computer readable recording medium includes a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a floppy disk, a hard disk and an optical magnetic disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided are an apparatus for extracting a speech feature vector in a distributed speech recognition terminal and a method thereof. The apparatus includes a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.

Description

DESCRIPTION
APPARATUS AND METHOD FOR EXTRACTING NOISE-ROBUST SPEECH RECOGNITION VECTOR BY SHARING PREPROCESSING STEP
USED IN SPEECH CODING
TECHNICAL FIELD
The present invention relates to an apparatus for extracting a speech feature vector in a distributed speech recognition terminal and a method thereof; and, more particularly, to an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing step used for speech coding with the pre-processing step used for speech feature vector extraction.
BACKGROUND ART
Distributed speech recognition (DSR) implements speech recognition with a simple-structured terminal such as a mobile phone: the simple-structured terminal extracts characteristics of the speech signals, and a high-performance speech recognition server performs speech recognition based on the characteristics received from the simple-structured terminal. That is, DSR is a dual processing system.
Generally, a Mel-frequency cepstral coefficient (MFCC) is used for speech recognition. The MFCC represents the frequency spectrum, expressed on the Mel scale, as sinusoidal components, and the MFCC is a speech feature vector, i.e., a speech recognition parameter representing the speech received from a user.
The terminal extracts the speech feature vector of the speech received from the user based on the MFCC, loads the speech feature vector into a bit stream so that it can be transmitted through a communication network, and transmits the bit stream to the speech recognition server. That is, the MFCCs extracted from the user's speech are mapped to the nearest vectors in a codebook having a predetermined number of codewords, and the mapped vectors are selected and transmitted as a bit stream. The codebook has a codeword for each group of similar values corresponding to the speech spoken by the user. Generally, a codeword is determined by extracting training data from a large amount of speech data and selecting a representative value from the extracted training data.
The speech recognition server dequantizes the speech feature vector loaded in the bit stream received from the terminal and recognizes the word corresponding to the speech based on a hidden Markov model (HMM) as the speech model. Herein, the HMM models a phoneme, i.e., the unit for recognizing speech, and completes words and sentences by matching the phonemes inputted to the speech recognition engine against the phonemes stored in the engine's database.
Recently, the mobile phone has been highlighted as a distributed speech recognition terminal in line with the digital convergence trend, and a module for speech signal processing, i.e., a speech coding module, is embedded in the mobile phone.
As described above, when the speech feature vector corresponding to the user's speech is extracted, pre-processing of the speech signals, specifically noise attenuation, is needed. However, the pre-processing step for speech coding and the pre-processing step for speech recognition are performed individually in general mobile phones. That is, the pre-processing of the user's speech is the same for speech coding and for speech recognition, but it is performed separately. Especially, since the pre-processing is performed in different pre-processing apparatuses, additional memory and operations are needed in a simple-structured terminal, which wastes resources.
In addition, the speech pre-processing for speech coding incurs internal delay in the terminal, which causes a switching delay between the speech coding process and the speech recognition process. For example, when the user is using the speech recognition function of the terminal and a call arrives, answering the incoming call is delayed. Hereinafter, the pre-processing for speech coding and the pre-processing for speech recognition in a conventional terminal will be described.
A conventional terminal includes a speech coding module and a distributed speech recognition front-end module.
The speech coding module includes a pre-processing unit for speech coding, a model parameter estimation unit, a first compression unit and a first bit stream transmitting unit. The distributed speech recognition front-end module includes a pre-processing unit for speech recognition, an MFCC front-end unit, a second compression unit and a second bit stream transmitting unit.
According to the conventional method, the terminal's speech coding module and distributed speech recognition front-end module each attenuate the noise mixed with the user's speech separately, because the pre-processed signals for speech coding and speech recognition are handled independently. Since the speech coding module and the distributed speech recognition front-end module perform the same function, a method for integrating speech coding and speech recognition by sharing the pre-processing steps is needed.
DISCLOSURE
TECHNICAL PROBLEM
An embodiment of the present invention is directed to providing an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing steps used in speech coding with the pre-processing steps used for extracting the speech recognition feature vector.
TECHNICAL SOLUTION
In accordance with an aspect of the present invention, there is provided an apparatus for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal, including: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the high-pass filtered signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise estimation unit; the background noise estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding an update of the noise estimation value based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR estimated in the channel SNR estimation unit based on the sum of voice metrics acquired in the voice metric calculation unit; a channel gain computation unit for acquiring a linear channel gain based on the channel SNR modified in the channel SNR modifying unit and the background noise energy estimation value obtained in the background noise estimation unit; a frequency domain filter for applying the linear channel gain to the spectral signals converted in the frequency domain conversion unit; and a time domain conversion unit for converting the gain-applied spectral signals into speech signals in a time domain.
In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.

In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating noise in the low frequency signals passed by the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
In accordance with another aspect of the present invention, there is provided a method for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal, including the steps of: eliminating low frequency signals of speech signals received from outside; converting the filtered signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; issuing a noise estimation value updating command based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value; updating the background noise energy estimation value when the noise estimation value updating command is received; estimating a channel SNR of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the gain-applied spectral signals into time domain speech signals.
ADVANTAGEOUS EFFECTS
The present invention requires a small amount of memory, requires little computation, and improves the performance of speech recognition by sharing the pre-processing between speech coding and speech recognition.

Also, the present invention can prevent the delay caused by switching between the speech coding process and the speech recognition process, which is otherwise incurred by separate speech coding and speech feature vector extraction pre-processing steps.

In addition, the present invention can attenuate the noise mixed in the user's speech signal during both speech coding and speech feature vector extraction.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing preprocessing steps used in a speech coding in accordance with an embodiment of the present invention;
Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention;
Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing 11 kHz speech signal in accordance with an embodiment of the present invention; Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention;
Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention;
Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention; and Fig. 7 is a graph showing speech recognition performance by using a speech feature vector in accordance with an embodiment of the present invention.
BEST MODE FOR THE INVENTION
The advantages, features and aspects of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter, so that the invention can easily be carried out by those skilled in the art to which the invention pertains. Also, when it is considered that a detailed description of a related art may unnecessarily obscure the points of the present invention, that description is not provided herein. Hereinafter, specific embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing pre- processing steps at a speech coding in a distributed speech recognition terminal in accordance with an embodiment of the present invention.
The distributed speech recognition terminal, e.g., mobile phone, having the apparatus for extracting a noise-robust speech feature vector includes a speech coding module 150 and a distributed speech recognition front-end module 100, but as shown in Fig. 1, a speech coding/recognition pre-processing block 11 is shared by a pre-processing step of the speech coding module 150 and a pre-processing step of the distributed speech recognition front-end module 100.
That is, the distributed speech recognition front- end module 100 includes the speech coding/recognition pre-processing block 11, a speech feature vector extraction block, e.g., MFCC front-end block, 12, a first speech compression block 13 and a first bit stream transmission block 14. In addition, the speech coding module 150 includes the speech coding/recognition preprocessing block 11, a speech coding block 15, a second speech compression block 16 and a second bit stream transmission block 17.
Of course, the terminal includes a switch 50 for shifting between a speech coding mode and a speech recognition mode. According to the action of the switch 50, coded signals of the speech spoken by the user are transmitted to a mobile communication system through a voice traffic channel in the speech coding mode, and extracted speech feature vectors of the speech spoken by the user are transmitted to the speech recognition server through a packet data channel in the speech recognition mode.
Especially, the speech coding/recognition pre-processing block 11 attenuates noise in the 8 kHz input speech spoken by the user. In the present invention, a separate noise attenuation block is not used in the distributed speech recognition front-end module 100; the speech coding/recognition pre-processing block 11 is used as the noise attenuation block.
That is, the noise attenuation function for extracting noise-robust speech feature vectors (MFCCs) in the distributed speech recognition front-end module 100 is performed in the speech coding/recognition pre-processing block 11. Here, the speech coding/recognition pre-processing block 11 attenuates noise so that speech feature vectors (MFCCs) which are robust to noise can be extracted in the speech feature extraction block 12. The speech coding/recognition pre-processing block 11 is realized in a specification capable of performing both the pre-processing for speech coding and the pre-processing for speech recognition. The speech coding/recognition pre-processing block 11 in accordance with an embodiment of the present invention will be described in detail referring to Fig. 2. Since the constituent elements 12, 13, 14, 15, 16 and 17 of Fig. 1 are well known, their detailed description is omitted.
Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention. As shown in Fig. 2, the speech coding/recognition pre-processing block 11 in accordance with the present invention includes a high pass filter 21, a frequency domain conversion unit 22, a channel energy estimation unit 23, a channel SNR estimation unit 24, a voice metric calculation unit 25, a spectral deviation estimation unit 26, a noise update decision unit 27, a channel SNR modifying unit 28, a channel gain computation unit 29, a background noise estimation unit 30, a frequency domain filter 31 and a time domain conversion unit 32. In the present invention, the speech coding/recognition pre-processing block 11 may be implemented based on the IS-127 Enhanced Variable Rate Codec (EVRC) used in CDMA, whose specification is suitable both for the speech coding pre-processing for speech communication and for the speech feature pre-processing for speech recognition.
Meanwhile, the input speech signal sLFB(n), spoken by the user and inputted into the speech coding/recognition pre-processing block 11, is 16-bit uniform pulse code modulation (PCM) data with an 8-kHz sampling frequency.
Generally, before the speech coding and the speech feature vector extraction, noise mixed in the input speech signal has to be attenuated to improve the quality of the speech signal. That is, the speech coding/recognition pre-processing block 11 of the present invention mainly performs noise attenuation. Therefore, a noise attenuated signal s'(n) is outputted when the input speech signal sLFB(n) is inputted, as shown in Fig. 2. Below, each constituent element of the speech coding/recognition pre-processing block 11 will be described in detail.
The high pass filter 21 eliminates low frequency band signals of the input speech signal sLFB(n) inputted through a microphone; the cutoff frequency of the high pass filter 21 is 120 Hz.
The signal filtered by the high pass filter 21 is defined as shp(n), and shp(n) is the noise attenuation object signal. The frame size of the noise attenuation object signal is 10 ms, and the current frame is denoted 'm'. The frequency domain conversion unit 22 converts the filtered signal shp(n) of the high pass filter 21 into a frequency domain signal based on a smoothed trapezoidal window, i.e., windowing. The frequency domain conversion steps will be described in detail.
In the smoothed trapezoidal window, the first D samples of the input frame buffer d(m,n) of the m-th frame are overlapped with the last D samples of the previous frame. This overlapping is expressed as the following Eq. 1.

d(m,n) = d(m-1, L+n); 0 ≤ n < D    Eq. 1

Here, m is the current frame; n is a sample index of the input buffer d(m); L is the frame length, e.g., 80; and D is the overlap (delay) length in samples, e.g., 24. The remaining samples of the input buffer are pre-emphasized as the following Eq. 2.

d(m, D+n) = shp(n) + ξp·shp(n-1); 0 ≤ n < L    Eq. 2

Here, ξp is a pre-emphasis coefficient, e.g., -0.8. By Eq. 1, the input buffer has L+D samples, e.g., 104; the first D samples are the pre-emphasized overlap carried over from the previous frame, and the samples after the first D samples are the pre-emphasized input of the current frame. On this buffer, windowed signals are acquired using the smoothed trapezoidal window as the following Eq. 3.

g(n) = d(m,n)·sin²(π(n+0.5)/2D),       0 ≤ n < D
g(n) = d(m,n),                         D ≤ n < L
g(n) = d(m,n)·sin²(π(n-L+D+0.5)/2D),   L ≤ n < D+L
g(n) = 0,                              D+L ≤ n < M    Eq. 3

Here, M is the length of the discrete Fourier transform (DFT), e.g., 128. A spectral signal G(k) can then be acquired by the M-point DFT as the following Eq. 4.

G(k) = (2/M)·Σ_{n=0}^{M-1} g(n)·e^{-j2πnk/M}; 0 ≤ k < M    Eq. 4
The spectral signal G(k) transformed into the frequency domain signal in the frequency domain conversion unit 22 is used as an input signal of the channel energy estimation unit 23.
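For illustration only, the framing, pre-emphasis, windowing and transform of Eqs. 1 to 4 can be sketched in Python as follows; this is a minimal reading of the equations, not the IS-127 reference implementation, and the handling of the sample preceding the current frame (s_last) is a simplifying assumption.

    import numpy as np

    L, D, M = 80, 24, 128     # frame length, overlap length, DFT length (as in the text)
    XI_P = -0.8               # pre-emphasis coefficient

    def to_spectrum(s_hp, d_prev, s_last=0.0):
        """Eqs. 1-4: build the (L+D)-sample buffer, window it, take the scaled M-point DFT.
        s_hp: current L-sample high-pass-filtered frame; d_prev: previous (L+D)-sample buffer;
        s_last: last input sample of the previous frame (used by the pre-emphasis at n = 0)."""
        d = np.empty(L + D)
        d[:D] = d_prev[L:L + D]                        # Eq. 1: carry over the last D samples
        prev = np.concatenate(([s_last], s_hp[:-1]))
        d[D:] = s_hp + XI_P * prev                     # Eq. 2: pre-emphasis
        g = np.zeros(M)
        n = np.arange(D)
        g[:D] = d[:D] * np.sin(np.pi * (n + 0.5) / (2 * D)) ** 2               # rising edge
        g[D:L] = d[D:L]                                                        # flat top
        g[L:L + D] = d[L:L + D] * np.sin(np.pi * (n + D + 0.5) / (2 * D)) ** 2 # falling edge
        G = (2.0 / M) * np.fft.fft(g, M)               # Eq. 4
        return G, d

The returned buffer d is carried into the next call as d_prev, which realizes the D-sample overlap of Eq. 1.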
The channel energy estimation unit 23 acquires a channel energy estimation value corresponding to the current frame m of the spectral signal G(k) inputted from the frequency domain conversion unit 22 as the following Eq. 5.

Ech(m,i) = max{Emin, αch(m)·Ech(m-1,i) + (1-αch(m))·(1/(fH(i)-fL(i)+1))·Σ_{k=fL(i)}^{fH(i)} |G(k)|²}, 0 ≤ i < Nc    Eq. 5

Here, Emin is a minimum permissible channel energy value, e.g., 0.0625; αch(m) is a channel energy smoothing factor expressed as the following Eq. 6; and Nc is the number of combined channels, e.g., 16. In addition, fL(i) and fH(i) are the low frequency DFT bin and the high frequency DFT bin of the i-th channel, respectively, expressed as follows:
fL = {2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56},
fH = {3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63}

αch(m) = 0 for m ≤ 1, and αch(m) = 0.45 otherwise    Eq. 6

Since the channel energy smoothing factor αch(m) of the first frame is 0, the channel energy estimation value is initialized to the un-filtered channel energy value of the first frame.
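A minimal sketch of Eqs. 5 and 6 (the value 0.45 in Eq. 6 follows the IS-127 convention; the function and variable names are illustrative):

    import numpy as np

    F_L = np.array([2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56])
    F_H = np.array([3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63])

    def channel_energy(G, E_prev, m, e_min=0.0625):
        """Eq. 5: smoothed band energies of the spectrum G for the Nc = 16 channels."""
        alpha = 0.0 if m <= 1 else 0.45       # Eq. 6 (0.45 assumed per IS-127)
        E = np.empty(len(F_L))
        for i in range(len(F_L)):
            band = np.abs(G[F_L[i]:F_H[i] + 1]) ** 2
            E[i] = max(e_min, alpha * E_prev[i] + (1.0 - alpha) * band.mean())
        return E

Because alpha is 0 for the first frame, the initialization to the un-filtered channel energy described above falls out of the same formula.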
The channel SNR estimation unit 24 estimates the signal-to-noise ratio (SNR) existing in the channel.
That is, the channel SNR estimation unit 24 acquires quantized channel SNR indices as the following Eq. 7 based on the channel energy estimation value obtained in the channel energy estimation unit 23 and a background noise energy estimation value obtained in the background noise estimation unit 30.
σq(i) = max{0, min{89, round(10·log10(Ech(m,i)/En(m,i))/0.375)}}, 0 ≤ i < Nc    Eq. 7

Here, En(m,i), obtained in the background noise estimation unit 30, is the noise energy estimation value of the current channel, and the resulting quantized channel SNR index σq(i) ranges from 0 to 89.
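Eq. 7 reduces to a few lines; the 0.375-dB quantization step and the rounding follow the IS-127 convention assumed above:

    import numpy as np

    def quantized_snr_indices(E_ch, E_n):
        """Eq. 7: per-channel SNR in dB, quantized to indices 0..89."""
        snr_db = 10.0 * np.log10(E_ch / E_n)
        return np.clip(np.round(snr_db / 0.375), 0, 89).astype(int)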
The voice metric calculation unit 25 acquires a sum of voice metrics in the current channel as the following Eq. 8 based on the SNR, e.g., the quantized channel SNR indices, σq(i), estimated in the channel SNR estimation unit 24.
v(m) = Σ_{i=0}^{Nc-1} V(σq(i))    Eq. 8
Here, V(k) is a voice metric having 90 elements as follows:
V(k) = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
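Given the table V(k) above as an array, Eq. 8 is a single indexed sum; a sketch:

    import numpy as np

    def voice_metric_sum(sigma_q, V):
        """Eq. 8: v(m) = sum over channels of V(sigma_q(i));
        V is the 90-element table above, sigma_q the quantized SNR indices of Eq. 7."""
        return int(np.sum(np.asarray(V)[sigma_q]))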
The spectral deviation estimation unit 26 estimates a spectral deviation corresponding to the current channel signal based on the channel energy estimation value Ech(m,i) obtained in the channel energy estimation unit 23. The estimation process of the spectral deviation will be described.
First, a log power spectrum of the current channel is acquired based on the channel energy estimation value Ech(m,i) as the following Eq. 9.

EdB(m,i) = 10·log10(Ech(m,i)), 0 ≤ i < Nc    Eq. 9

Then, a difference value between the current power spectrum estimation value obtained by Eq. 9 and the average long-term power spectrum estimation value is acquired as the following Eq. 10.

ΔE(m) = Σ_{i=0}^{Nc-1} |ĒdB(m,i) - EdB(m,i)|    Eq. 10

Here, ĒdB(m,i) is the average long-term power spectrum estimation value obtained in the previous frame. Before the average long-term power spectrum estimation value is updated in the course of the spectral deviation estimation, its initial value is set to the log power spectrum estimation value of the first frame as the following Eq. 11.

ĒdB(m,i) = EdB(m,i) for the first frame, 0 ≤ i < Nc    Eq. 11

In addition, a total energy estimation value of the m-th frame is obtained based on the channel energy estimation value Ech(m,i) as the following Eq. 12. The total energy estimation value Etot(m) and the difference value ΔE(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value are inputted into the noise update decision unit 27 in order to update the background noise estimation value.

Etot(m) = 10·log10(Σ_{i=0}^{Nc-1} Ech(m,i))    Eq. 12
Also, an exponential window function factor α(m) is a function of the total energy estimation value Etot(m) and is obtained based on the following Eq. 13.

α(m) = αH - (αH - αL)·(EH - Etot(m))/(EH - EL)    Eq. 13

Here, the exponential window function factor α(m) obtained by Eq. 13 is limited between αL and αH as the following Eq. 14.

α(m) = max{αL, min{αH, α(m)}}    Eq. 14

Also, EH and EL are the dB-scale boundary energies of the linear interpolation of Etot(m) expressed by α(m), when α(m) is limited between αL and αH. Here, EH = 50, EL = 30, αH = 0.99 and αL = 0.50. For example, the exponential window function factor α(m) is determined as 0.745 for a signal having a relative energy of 40 dB. Finally, the average long-term power spectrum estimation value of the next frame is updated based on the exponential window function factor α(m) and EdB(m,i) as the following Eq. 15.

ĒdB(m+1,i) = α(m)·ĒdB(m,i) + (1-α(m))·EdB(m,i), 0 ≤ i < Nc    Eq. 15
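The bookkeeping of Eqs. 9 to 15 can be collected into one update step; a sketch with the constants given above (names are illustrative):

    import numpy as np

    A_H, A_L, E_H, E_L = 0.99, 0.50, 50.0, 30.0

    def spectral_deviation_update(E_ch, E_dB_bar):
        """Eqs. 9-15: log spectrum, deviation, total energy, long-term spectrum update.
        E_dB_bar is the running average long-term log power spectrum."""
        E_dB = 10.0 * np.log10(E_ch)                                # Eq. 9
        delta_E = np.abs(E_dB_bar - E_dB).sum()                     # Eq. 10
        E_tot = 10.0 * np.log10(E_ch.sum())                         # Eq. 12
        alpha = A_H - (A_H - A_L) * (E_H - E_tot) / (E_H - E_L)     # Eq. 13
        alpha = max(A_L, min(A_H, alpha))                           # Eq. 14
        E_dB_bar_next = alpha * E_dB_bar + (1.0 - alpha) * E_dB     # Eq. 15
        return delta_E, E_tot, E_dB_bar_next

For Etot(m) = 40 dB this sketch yields α(m) = 0.745, matching the example above.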
The noise update decision unit 27 issues a command, e.g., update_flag, ordering the background noise estimation unit 30 to update its noise estimation value, based on the sum of voice metrics v(m), the total channel energy estimation value Etot(m) and the difference value ΔE(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value obtained in the spectral deviation estimation unit 26, according to the following logic expressed in pseudo code.
/* Normal update logic */
update_flag = FALSE
if (v(m) < UPDATE_THLD) {
    update_flag = TRUE
    update_cnt = 0
}
/* Forced update logic */
else if ((Etot(m) > NOISE_FLOOR_DB) and (ΔE(m) < DEV_THLD)) {
    update_cnt = update_cnt + 1
    if (update_cnt > UPDATE_CNT_THLD)
        update_flag = TRUE
}
/* "Hysteresis" logic to prevent long-term creeping of update_cnt */
if (update_cnt == last_update_cnt)
    hyster_cnt = hyster_cnt + 1
else
    hyster_cnt = 0
last_update_cnt = update_cnt
if (hyster_cnt > HYSTER_CNT_THLD)
    update_cnt = 0
Here, the constants of the logic expressed in the pseudo code are UPDATE_THLD = 35, NOISE_FLOOR_DB = 10·log10(Efloor), DEV_THLD = 28, UPDATE_CNT_THLD = 50 and HYSTER_CNT_THLD = 6.
The channel SNR modifying unit 28 modifies the values of the quantized channel SNR indices {σq} estimated in the channel SNR estimation unit 24 based on v(m), the sum of voice metrics in the current channel calculated in the voice metric calculation unit 25. The modified channel SNR indices σ''q are used as an input parameter of the channel gain computation unit 29. The following logic expressed in pseudo code shows the modification of the SNR estimation value.

/* Set or reset modify flag */
index_cnt = 0
for (i = NM to Nc-1 step 1) {
    if (σq(i) >= INDEX_THLD)
        index_cnt = index_cnt + 1
}
if (index_cnt < INDEX_CNT_THLD)
    modify_flag = TRUE
else
    modify_flag = FALSE

/* Modify the SNR indices to get {σ'q} */
if (modify_flag == TRUE)
    for (i = 0 to Nc-1 step 1)
        if ((v(m) <= METRIC_THLD) or (σq(i) <= SETBACK_THLD))
            σ'q(i) = 1
        else
            σ'q(i) = σq(i)
else
    {σ'q} = {σq}

/* Limit {σ'q} to the SNR threshold σth to get {σ''q} */
for (i = 0 to Nc-1 step 1)
    if (σ'q(i) < σth)
        σ''q(i) = σth
    else
        σ''q(i) = σ'q(i)

Here, the constants and threshold values of the logic expressed in the pseudo code are as follows:
NM = 5, INDEX_THLD = 12, INDEX_CNT_THLD = 5, METRIC_THLD = 45, SETBACK_THLD = 12, σth = 6
The channel gain computation unit 29 calculates a linear channel gain γch based on the modified channel SNR indices σ''q from the channel SNR modifying unit 28 and the background noise energy estimation value En(m) estimated in the background noise estimation unit 30. The process of the linear channel gain calculation will be described in detail.
First, a total gain of the current frame is acquired based on the following Eq. 16.

γn = max{γmin, -10·log10((1/Efloor)·Σ_{i=0}^{Nc-1} En(m,i))}    Eq. 16

Here, γmin is a minimum total gain, e.g., -13; Efloor is a noise floor energy, e.g., 1; and the background noise energy estimation value En(m,i) is the estimation value obtained in the background noise estimation unit 30. Then, a channel gain in dB is acquired based on the following Eq. 17.

γdB(i) = μg·(σ''q(i) - σth) + γn, 0 ≤ i < Nc    Eq. 17

Here, μg is the gain slope, e.g., 0.39. The channel gain in dB is then converted into a linear channel gain as the following Eq. 18.

γch(i) = min{1, 10^(γdB(i)/20)}, 0 ≤ i < Nc    Eq. 18
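Eqs. 16 to 18 combine into a short gain computation; a sketch with the constants given above (the names are illustrative):

    import numpy as np

    def channel_gains(sigma_q_mod, E_n, gamma_min=-13.0, e_floor=1.0, mu_g=0.39, sigma_th=6):
        """Eqs. 16-18: overall gain, per-channel dB gain, and linear channel gain.
        sigma_q_mod: modified SNR indices from the channel SNR modifying unit."""
        gamma_n = max(gamma_min, -10.0 * np.log10(E_n.sum() / e_floor))  # Eq. 16
        gamma_db = mu_g * (sigma_q_mod - sigma_th) + gamma_n             # Eq. 17
        return np.minimum(1.0, 10.0 ** (gamma_db / 20.0))                # Eq. 18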
The frequency domain filter 31 applies the linear channel gain γch calculated in the channel gain computation unit 29 to the spectral signal G(k) transformed in the frequency domain conversion unit 22 as the following Eq. 19.
H(k) = γch(i)·G(k), fL(i) ≤ k ≤ fH(i), 0 ≤ i < Nc
H(k) = G(k), 0 ≤ k < fL(0) or fH(Nc-1) < k ≤ M/2    Eq. 19

In Eq. 19, the spectral signal G(k) within the channels is scaled by the linear channel gain to produce H(k), while the bins outside the channels are left unmodified, i.e., H(k) = G(k). The upper half of the spectrum is then completed as the following Eq. 20, so that the magnitude of H(k) is even and its phase is odd.

H(M-k) = H*(k); 0 < k < M/2    Eq. 20

Here, the complex conjugate symmetry is needed so that the inverse DFT of H(k) yields a real-valued signal.
As described above, the background noise estimation unit 30 estimates the noise energy estimation value En (m) of noise signals existing in the current channel and updates the corresponding noise energy estimation value based on the command, i.e., update_flag, received from the noise update decision unit 27.
That is, if the update_flag is true, the background noise estimation unit 30 updates the channel noise estimation value of the next frame as the following Eq. 21.

En(m+1,i) = max{Emin, αn·En(m,i) + (1-αn)·Ech(m,i)}, 0 ≤ i < Nc    Eq. 21

Here, Emin is the minimum channel energy, e.g., 0.0625; and αn is a channel noise smoothing factor, e.g., 0.9. Meanwhile, the noise estimation values of the first 4 frames are initialized by the channel energy estimation values as the following Eq. 22.

En(m,i) = max{Einit, Ech(m,i)}, 1 ≤ m ≤ 4, 0 ≤ i < Nc    Eq. 22
Here, Einit is the minimum initial channel noise energy, e.g., 16.
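A direct reading of Eqs. 21 and 22 (in this sketch the gating by update_flag is left to the caller):

    import numpy as np

    def update_noise_estimate(E_n, E_ch, m, e_min=0.0625, e_init=16.0, alpha_n=0.9):
        """Eq. 22 for the first four frames, Eq. 21 thereafter."""
        if 1 <= m <= 4:
            return np.maximum(e_init, E_ch)                               # Eq. 22
        return np.maximum(e_min, alpha_n * E_n + (1.0 - alpha_n) * E_ch)  # Eq. 21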
The time domain conversion unit 32 converts the noise attenuated speech signals, i.e., the speech signals in the frequency domain inputted through the frequency domain filter 31, into speech signals in the time domain. The time domain conversion process will be described in detail. First, the signals filtered in the frequency domain filter 31 are transformed into time domain signals by the inverse DFT as the following Eq. 23.

h(m,n) = (1/2)·Σ_{k=0}^{M-1} H(k)·e^{j2πnk/M}; 0 ≤ n < M    Eq. 23

Then, overlap-and-add is applied to the result of Eq. 23 as the following Eq. 24.

h'(n) = h(m,n) + h(m-1, n+L), 0 ≤ n < M-L
h'(n) = h(m,n), M-L ≤ n < L    Eq. 24

Finally, de-emphasis is applied to Eq. 24, and the time domain speech signals are outputted as the following Eq. 25.

s'(n+240) = h'(n) + ζd·s'(n+239); 0 ≤ n < L    Eq. 25
Here, ζd is a de-emphasis factor, e.g., 0.8; and s'(n) is an output buffer which can accommodate 320 samples.
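The synthesis chain of Eqs. 23 to 25 can be sketched as follows; the (1/2)·Σ scale of Eq. 23, which compensates the 2/M scale of Eq. 4, is implemented as 0.5·M times numpy's normalized inverse FFT, and the frame-to-frame buffer management is simplified:

    import numpy as np

    L, M = 80, 128

    def to_time_domain(H, h_prev, s_buf, zeta_d=0.8):
        """Eqs. 23-25: inverse DFT, overlap-and-add, recursive de-emphasis.
        h_prev: h(m-1, .) from the previous frame; s_buf: 320-sample output buffer s'(n)."""
        h = 0.5 * M * np.real(np.fft.ifft(H, M))      # Eq. 23 (H is conjugate-symmetric)
        h_ola = h[:L].copy()
        h_ola[:M - L] += h_prev[L:M]                  # Eq. 24: overlap-and-add of first M-L samples
        for n in range(L):                            # Eq. 25: de-emphasis into s'(240..319)
            s_buf[n + 240] = h_ola[n] + zeta_d * s_buf[n + 239]
        return h, s_buf

The returned h is carried into the next call as h_prev, realizing the overlap-and-add of Eq. 24.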
As described above, the noise-attenuated speech signal s'(n) can be obtained in the speech coding/recognition pre-processing block 11. The noise attenuated speech signals s'(n) are inputted into the speech feature vector extraction block 12 of the distributed speech recognition front-end module 100 or the speech coding block 15 of the speech coding module 150, in the speech recognition mode or the speech coding mode, respectively.
Since the frame size of the noise attenuation object signal is 10 ms, as described above for the speech coding/recognition pre-processing block 11, the noise attenuation is performed once every 10 ms. Therefore, the noise attenuated speech signal, the output signal of the speech coding/recognition pre-processing block 11, is s'(n), 240 ≤ n < 320. Of course, it is well-known that the noise attenuated speech signal may be outputted differently according to the frame size of the noise attenuation object signal.
Meanwhile, referring to Fig. 2, the method corresponding to the speech coding/recognition pre-processing block 11 for the speech feature vector extracting module and the speech coding module consists of time-series processes that are well known in the speech signal processing field. Therefore, a detailed description of the method is omitted.
Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing an 11-kHz speech signal in accordance with an embodiment of the present invention; and Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing a 16-kHz speech signal in accordance with an embodiment of the present invention.
Referring to Fig. 2, an 8-kHz user speech signal is the noise attenuation object signal of the speech coding/recognition pre-processing block 11. The present invention further presents a speech coding/recognition pre-processing block for processing an 11-kHz speech signal in Fig. 3 and a speech coding/recognition pre-processing block for processing a 16-kHz speech signal in Fig. 4.
In Fig. 3, the first expanded speech coding/recognition pre-processing block for processing 11 kHz further includes a frequency down sampler 41, placed in front of the speech coding/recognition pre-processing block of Fig. 2, for converting the 11-kHz speech signal into an 8-kHz speech signal. The speech signal down-sampled in the frequency down sampler 41 is inputted into the speech coding/recognition pre-processing block 11.
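A sketch of the frequency down sampler 41; an 11.025-kHz source rate is assumed (common for "11 kHz" audio), for which the 8-kHz target corresponds to the exact rational ratio 320/441; for a literal 11-kHz rate the ratio would be 8/11:

    from scipy.signal import resample_poly

    def downsample_to_8k(x_11k):
        """Rational resampling 11025 Hz -> 8000 Hz (11025 * 320 / 441 = 8000)."""
        return resample_poly(x_11k, up=320, down=441)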
In Fig. 4, the second expanded speech coding/recognition pre-processing block for processing 16 kHz further includes a low pass quadrature-mirror filter (QMF LP) [DEC by 2] 46 and a high pass quadrature-mirror filter (QMF HP) [DEC by 2 and SI] 47 in front of the speech coding/recognition pre-processing block of Fig. 2.
The QMF LP 46 receives the inputted 16-kHz speech signals and outputs 0- to 4-kHz low frequency band signals, and the QMF HP 47 receives the inputted 16-kHz speech signals and outputs 4- to 8-kHz high frequency band signals.
In particular, the low frequency signals outputted from the QMF LP 46 are inputted into the speech coding/recognition pre-processing block, and the high frequency signals outputted from the QMF HP 47 are inputted into the speech feature vector extraction block 12, i.e., the MFCC front-end, of the distributed speech recognition front-end module 100. In the speech feature vector extraction block 12, speech feature vectors, e.g., MFCCs, are extracted from the inputted high frequency signals by using 26 Mel-filter banks.
The low frequency signals outputted from the QMF LP 46 are inputted into the speech feature vector extraction block 12 through the speech coding/recognition pre-processing block. Then, the low frequency signals and the high frequency signals outputted from the QMF HP 47 are combined into one signal in the speech feature vector extraction block 12. That is, before the log filter bank energy is converted into cepstrum coefficients, the high frequency signals and the low frequency signals are added. Moreover, log parameters (log-energy) for all frequency bands are obtained based on the high frequency signals and the low frequency signals.
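A sketch of the QMF analysis pair 46 and 47; the prototype filter design and the post-decimation spectral inversion (suggested by the "SI" label in Fig. 4) are assumptions for illustration, not taken from the patent:

    import numpy as np
    from scipy.signal import firwin, lfilter

    def qmf_split(x_16k, num_taps=64):
        """Split a 16-kHz signal into 0-4 kHz and 4-8 kHz branches, each decimated to 8 kHz."""
        h_lp = firwin(num_taps, 0.5)                   # half-band low-pass prototype
        h_hp = h_lp * (-1.0) ** np.arange(num_taps)    # quadrature mirror: H_hp(z) = H_lp(-z)
        low = lfilter(h_lp, 1.0, x_16k)[::2]           # QMF LP 46 output
        high = lfilter(h_hp, 1.0, x_16k)[::2]          # QMF HP 47 output
        high *= (-1.0) ** np.arange(len(high))         # spectral inversion ("SI") of the high band
        return low, high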
In addition, the expanded speech coding/recognition pre-processing blocks of Figs. 3 and 4 can be implemented according to the frequency expansion specification of the European Telecommunications Standards Institute (ETSI) DSR standard (ETSI ES 202 050 v1.1.3) in order to use 11-kHz or 16-kHz sampling frequency signals.
Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention; Fig. 6 is a flowchart illustrating a training process for generating an acoustic model in accordance with an embodiment of the present invention; and Fig. 7 is a graph showing speech recognition performance based on a speech feature vector in accordance with an embodiment of the present invention.
The present invention can be applied to the distributed speech recognition terminal, e.g., a mobile phone, and its effect on the speech recognition performance needs to be verified.
Below, the speech recognition performance of the present invention will be examined based on the speech recognition process and the training process for generating the acoustic model. Fig. 5 shows the speech recognition process based on a hidden Markov model (HMM). Speech features are extracted from the speech spoken by the user 301, and then pattern matching 302 is performed by searching an acoustic model 303, a language model 304 and a pronunciation dictionary 305 according to the extracted speech features. A word or a sentence is recognized in response to the speech. A method suggested in the ETSI standard "ETSI ES 201 108" is used for the speech feature extraction 301. That is, the speech features are extracted from the speech signal through MFCC, the speech feature vector is formed as high-order coefficients, and a word stream having maximum probability is searched through pattern matching based on the acoustic model 303, the language model 304 and the pronunciation dictionary 305 in response to the speech feature vector. Here, for verifying the performance of the conventional speech recognition, a signal noise-attenuated by the pre-processing defined in the ETSI DSR standard, i.e., ETSI ES 202 050 v1.1.3, or a signal noise-attenuated by the pre-processing steps defined in IS-127 is used as the speech signal for extracting the speech characteristics.
Meanwhile, for verifying the performance of the present speech recognition, the noise attenuated signal outputted from the speech coding/recognition pre-processing block 11 is used as the speech signal for extracting the speech characteristics.
In the extraction of the speech characteristics, 13 MFCCs (c0, ..., c12) and log-energy are extracted by using the MFCC front-end module. The MFCCs, the log-energy, and their delta and delta-delta coefficients are used as parameters for the training of the acoustic model and for speech recognition.
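The delta and delta-delta appending can be sketched with the standard regression formula; the window size K = 2 is an assumption, as the patent does not specify it:

    import numpy as np

    def add_deltas(static, K=2):
        """static: (T, 13) array of MFCCs + log-energy per frame; returns (T, 39) features."""
        T = len(static)
        denom = 2.0 * sum(k * k for k in range(1, K + 1))
        def regress(x):
            pad = np.pad(x, ((K, K), (0, 0)), mode='edge')
            return sum(k * (pad[K + k:K + k + T] - pad[K - k:K - k + T])
                       for k in range(1, K + 1)) / denom
        delta = regress(static)
        return np.hstack([static, delta, regress(delta)])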
Moreover, HMM is used for the acoustic model 303.
In the present invention, a phone model in accordance with the language is used as the acoustic model. The training process for generating the context independent phone model will be described with reference to Fig. 6. First, a monophone-based model, as a context independent phone model, is generated based on the speech feature vectors extracted from training data at step S401.
Subsequently, forced alignment is performed based on the monophone-based model, so that a phone label file is newly generated at step S402.
Meanwhile, a triphone-based model, as a context dependent phone model, is generated by expanding the monophone-based model at step S403. Then, state-tying is performed, considering that the training data for the triphone-based model are small, at step S404.
Then, a final acoustic model is generated by increasing the number of mixture densities of a result acoustic model acquired by performing the state tying at step S405.
The language model 304 shown in Fig. 5 adopts a statistical estimation method. Here, the statistical estimation method statistically estimates the probability of available word sequences from a speech database in a predetermined environment. A representative language model adopting the statistical estimation method is the n-gram. In the n-gram, the probability of a word sequence is approximated by the product of conditional probabilities, each conditioned on the preceding words. In Fig. 5, a bigram language model is used.
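As a toy illustration of the bigram approximation (the add-k smoothing is an assumption; the patent specifies only that a bigram model is used):

    import math

    def bigram_logprob(sentence, unigram_counts, bigram_counts, vocab_size, k=1.0):
        """log P(w1..wn) ~ sum_i log P(w_i | w_{i-1}), with add-k smoothed bigram estimates."""
        words = ["<s>"] + sentence.split()
        logp = 0.0
        for prev, cur in zip(words, words[1:]):
            logp += math.log((bigram_counts.get((prev, cur), 0) + k) /
                             (unigram_counts.get(prev, 0) + k * vocab_size))
        return logp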
With respect to the pronunciation dictionary 305, the pronunciation dictionary provided by "CleanSent01" of SiTEC is used for Korean, and the "CMU dictionary V.0.6" provided by Carnegie Mellon University is used for English. In addition, pronunciations of phrasal words that are not supported by "CleanSent01" are supported by a pronunciation converter produced for this purpose based on the "standard pronunciation method of the standard language rule." Here, a phrasal word is composed of a word and an auxiliary word. The total number of phrasal words of the pronunciation dictionary provided by "CleanSent01" is 36,104, and the total number of phrasal words of the pronunciation dictionary for speech recognition is 223,857.
With respect to the speech DB, a sentence speech DB (e.g., CleanSent01) is used in the case of Korean, and the AURORA 4 DB (e.g., Wall Street Journal) is used in the case of English.
5000 sentences among the text data used in training and 3000 sentences among the 'speech recognition language model usage text DB' may be used for generating the language model. The Hidden Markov Model Toolkit (HTK) v3.1 is used to generate the language model, and the finally acquired language model includes a network of 31,582 words. In the speech recognition process, the word recognition rate using the conventional noise attenuated speech signal is 68.61%, and the word recognition rate using the noise attenuated speech signal in accordance with the present invention is 69.31%, referring to Fig. 7. That is, the speech recognition performance of the present invention is improved over that of the conventional method.
In short, a noise-robust speech feature vector can be extracted by sharing the speech coding pre-processing and the speech feature vector extracting pre-processing in a simple-structured terminal. Therefore, the speech recognition performance is improved with a small amount of memory and computation in the simple-structured terminal.
The above described method according to the present invention can be embodied as a program and stored on a computer readable recording medium. The computer readable recording medium is any data storage device that can store data which can be read by the computer system. The computer readable recording medium includes a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a floppy disk, a hard disk and an optical magnetic disk.
While the present invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

WHAT IS CLAIMED IS:
1. An apparatus for extracting a noise-robust speech feature vector by sharing a pre-processing step of speech coding in a distributed speech coding/recognition terminal, comprising: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the signals without the low frequency signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise energy estimation unit; the background noise energy estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding to update the noise estimation value based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR estimated in the channel SNR estimation unit based on the sum of voice metrics acquired in the voice metric calculation unit; a channel gain computation unit for acquiring a linear channel gain based on the modified channel SNR from the channel SNR modifying unit and the background noise energy estimation value obtained in the background noise energy estimation unit; a frequency domain filter for applying the linear channel gain to the spectral signals converted in the frequency domain conversion unit; and a time domain conversion unit for converting the linear-channel-gain-applied spectral signals into speech signals in a time domain.
2. The apparatus of claim 1, wherein the speech signals outputted from the time domain conversion unit are noise-attenuated, and inputted into a speech feature vector extraction block of a speech feature vector extracting module or a speech coding block of a speech coding module.
3. The apparatus of claim 1, wherein a frame size of the filtered signals outputted from the high pass filter is 10 ms .
4. The apparatus of claim 1, wherein the frequency domain conversion unit converts the inputted signals into the spectral signals in the frequency domain by using a smoothed trapezoidal window.
5. The apparatus of claim 1, wherein an initialization of the channel energy estimation value with an un-filtered channel energy value of a first frame is permitted when a channel energy smoothing factor of the first frame is 0.
6. The apparatus of claim 1, wherein the SNR of the speech signals includes quantized channel SNR indices as the following Equation:

σq(i) = max{0, min{89, round(10·log10(Ech(m,i)/En(m,i))/0.375)}}, 0 ≤ i < Nc
7. A method for extracting a noise-robust speech feature vector by sharing a pre-processing step of speech coding in a distributed speech coding/recognition terminal, comprising the steps of: eliminating low frequency signals of speech signals received from outside; converting the signals without the low frequency signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; making a noise estimation value updating command based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value; when the noise estimation value updating command is received, updating a background noise energy estimation value; estimating a channel signal-to-noise ratio (SNR) of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the linear-channel-gain-applied spectral signals into time domain speech signals.
8. The method of claim 7, wherein the spectral deviation estimation step includes the steps of: calculating a log power spectrum estimation value of the speech signals in a current channel based on the channel energy estimation value; calculating the difference value between the current power spectrum estimation value and the average long-term power spectrum estimation value; calculating the total channel energy estimation value of a current frame based on the channel energy estimation value; calculating an exponential window function factor based on the total channel energy estimation value; and updating the average long-term power spectrum estimation value of the next frame based on the exponential window function factor and an initial value of the power spectrum estimation value.
9. The method of claim 8, wherein the average long-term power spectrum estimation value is initialized by the estimation value of the log power spectrum of a first frame.
10. The method of claim 7, wherein a background noise estimation value updating parameter of the noise estimation value updating step includes the total channel energy estimation value and the difference value between the current power spectrum estimation value and the average long-term power spectrum estimation value calculated in the spectral deviation estimating step.
11. The method of claim 7, wherein the linear channel gain obtaining step includes the steps of: calculating a total gain factor of the current frame in the current channel of the speech signals; and calculating a channel gain of the current channel of the speech signals.
12. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating a noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
13. The terminal of claim 12, wherein the frequency of the speech signals inputted into the speech coding/recognition pre-processing block is 8 kHz.
14. The terminal of claim 12, wherein the speech feature vector extracting module includes: a speech feature vector extraction unit for extracting speech feature vectors of the speech signals pre-processed in the speech coding/recognition pre-processing block; a first compression unit for compressing the speech feature vectors extracted in the speech feature vector extraction unit; and a first bit stream transmission unit for transmitting bit stream data loading the speech feature vectors compressed in the first compression unit to the outside.
15. The terminal of claim 12, wherein the speech coding module includes: a speech coding unit for coding the speech signals pre-processed in the speech coding/recognition pre-processing block; a second compression unit for compressing the speech signals coded in the speech coding unit; and a second bit stream transmission unit for transmitting bit stream data loading the speech signals compressed in the second compression unit to the outside.
16. The terminal of claim 12, further comprising: a switch for shifting between a speech coding mode and a speech recognition mode.
17. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating a noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
18. The terminal of claim 17, wherein the frequency of the speech signals inputted into the frequency down-sampler is 11 kHz.
19. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating a noise in the low frequency signals passed through the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
20. The terminal of claim 19, wherein the frequency of the speech signals inputted into the low pass quadrature mirror filter and the high pass quadrature mirror filter is 16 kHz.
21. The terminal of claim 20, wherein the low pass quadrature mirror filter receives 16-kHz speech signals and outputs 0- to 4-kHz signals.
22. The terminal of claim 20, wherein the high pass quadrature mirror filter receives 16-kHz speech signals and outputs 4- to 8-kHz signals.
23. The terminal of claim 19, wherein the low frequency signals outputted from the low pass quadrature mirror filter are inputted into the speech coding/recognition pre-processing block, and the high frequency signals outputted from the high pass quadrature mirror filter are inputted into the speech feature vector extracting module.
24. The terminal of claim 23, wherein the low frequency signals and the high frequency signals are combined in the speech feature vector extracting module before a log filter bank energy is transformed into a cepstrum coefficient.
25. A computer-readable recording medium for recording a program that implements a method in a terminal having a processor, the method comprising the steps of: eliminating low frequency signals of speech signals received from outside; converting the signals without low frequency signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; making a noise estimation value updating command based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value; when the noise estimation value updating command is received, updating the background noise energy estimation value; estimating a channel signal-to-noise ratio (SNR) of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the linear channel gain applied spectral signals into time domain speech signals.
PCT/KR2006/005831 2006-06-30 2006-12-28 Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding WO2008001991A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0061150 2006-06-30
KR1020060061150A KR100794140B1 (en) 2006-06-30 2006-06-30 Apparatus and Method for extracting noise-robust the speech recognition vector sharing the preprocessing step used in speech coding

Publications (1)

Publication Number Publication Date
WO2008001991A1 true WO2008001991A1 (en) 2008-01-03

Family

ID=38845730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/005831 WO2008001991A1 (en) 2006-06-30 2006-12-28 Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding

Country Status (2)

Country Link
KR (1) KR100794140B1 (en)
WO (1) WO2008001991A1 (en)


Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
KR101684554B1 (en) * 2015-08-20 2016-12-08 현대자동차 주식회사 Voice dialing system and method

Citations (4)

Publication number Priority date Publication date Assignee Title
JPH1097296A (en) * 1996-09-20 1998-04-14 Sony Corp Method and device for voice coding, and method and device for voice decoding
US5956683A (en) * 1993-12-22 1999-09-21 Qualcomm Incorporated Distributed voice recognition system
WO2000046794A1 (en) * 1999-02-08 2000-08-10 Qualcomm Incorporated Distributed voice recognition system
WO2003094152A1 (en) * 2002-04-30 2003-11-13 Qualcomm Incorporated Distributed voice recognition system utilizing multistream feature processing

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
KR100841096B1 (en) * 2002-10-14 2008-06-25 리얼네트웍스아시아퍼시픽 주식회사 Preprocessing of digital audio data for mobile speech codecs
KR100754439B1 (en) * 2003-01-09 2007-08-31 와이더댄 주식회사 Preprocessing of Digital Audio data for Improving Perceptual Sound Quality on a Mobile Phone
US20040260540A1 (en) 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
KR100636317B1 (en) * 2004-09-06 2006-10-18 삼성전자주식회사 Distributed Speech Recognition System and method
KR100592926B1 (en) * 2004-12-08 2006-06-26 주식회사 라이브젠 digital audio signal preprocessing method for mobile telecommunication terminal
JP2007097070A (en) * 2005-09-30 2007-04-12 Fujitsu Ten Ltd Structure for attaching speaker unit

Cited By (2)

Publication number Priority date Publication date Assignee Title
US20150154964A1 (en) * 2013-12-03 2015-06-04 Google Inc. Multi-path audio processing
US9449602B2 (en) * 2013-12-03 2016-09-20 Google Inc. Dual uplink pre-processing paths for machine and human listening

Also Published As

Publication number Publication date
KR20080002359A (en) 2008-01-04
KR100794140B1 (en) 2008-01-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06835531

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1), EPO FORM 1205A SENT ON 15/04/09.

122 Ep: pct application non-entry in european phase

Ref document number: 06835531

Country of ref document: EP

Kind code of ref document: A1