WO2008001991A1 - Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding - Google Patents


Info

Publication number
WO2008001991A1
WO2008001991A1 (PCT/KR2006/005831)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
signals
estimation value
channel
noise
Prior art date
Application number
PCT/KR2006/005831
Other languages
French (fr)
Inventor
Chang-Sun Ryu
Jae-In Kim
Hong Kook Kim
Jae Sam Yoon
Yoo Rhee Oh
Original Assignee
Kt Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kt Corporation
Publication of WO2008001991A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • The present invention relates to an apparatus and method for extracting a speech feature vector in a distributed speech recognition terminal; more particularly, to an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing step used for speech coding with the pre-processing step used for speech feature vector extraction.
  • Distributed speech recognition (DSR) implements speech recognition with a simple-structured terminal such as a mobile phone: the terminal extracts characteristics of the speech signals, and a high-performance speech recognition server performs speech recognition based on the characteristics received from the terminal. That is, DSR is a dual processing system.
  • A Mel-frequency cepstral coefficient (MFCC) is generally used for speech recognition. The MFCC represents the frequency spectrum, expressed on the Mel scale, as sinusoidal components, and serves as a speech feature vector, i.e., a speech recognition parameter representing the speech received from a user.
  • The terminal extracts the speech feature vector of the speech received from the user based on the MFCC, loads the speech feature vector into a bit stream so that it can be transmitted through a communication network, and transmits the bit stream to the speech recognition server. That is, the MFCCs extracted from the user's speech are mapped to the nearest vectors in a codebook having a predetermined number of codewords, and the mapped vectors are selected and transmitted as a bit stream.
  • The codebook has a codeword for each group of similar values corresponding to the speech spoken by the user. Generally, a codeword is determined by extracting training data from a large amount of speech data and selecting a representative value from the extracted training data, as illustrated by the sketch below.
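As a rough illustration of this nearest-codeword mapping, a quantizer might look like the following Python sketch; all names and sizes here are hypothetical and not taken from the patent:

```python
import numpy as np

def quantize_mfcc(mfcc_vector, codebook):
    # Map an MFCC vector to the index of the nearest codeword
    # (minimum Euclidean distance), mirroring the codebook mapping above.
    distances = np.linalg.norm(codebook - mfcc_vector, axis=1)
    return int(np.argmin(distances))

# Hypothetical sizes: a 256-codeword codebook of 13-dimensional vectors.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 13))   # stands in for a trained codebook
mfcc = rng.standard_normal(13)              # one extracted feature vector
index = quantize_mfcc(mfcc, codebook)       # this index goes into the bit stream
```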
  • The speech recognition server dequantizes the speech feature vector loaded in the bit stream received from the terminal and recognizes the word corresponding to the speech based on a hidden Markov model (HMM) as the speech model.
  • The HMM models a phoneme, i.e., the unit for recognizing speech, and completes words and sentences by matching the phonemes inputted to the speech recognition engine against the phonemes stored in the engine's database.
  • Recently, the mobile phone has been highlighted as a distributed speech recognition terminal in line with the digital convergence trend, and a module for speech signal processing, i.e., a speech coding module, is embedded in the mobile phone.
  • When the speech feature vector corresponding to the user's speech is extracted, pre-processing of the speech signals, specifically noise attenuation, is needed.
  • However, the pre-processing step for speech coding and the pre-processing step for speech recognition are performed individually in general mobile phones. That is, the pre-processing of the user's speech is the same for speech coding and for speech recognition, but it is performed separately.
  • Since the pre-processing is performed in different pre-processing apparatuses, additional memory and operations are needed in a simple-structured terminal, which wastes resources.
  • In addition, the speech pre-processing for speech coding incurs internal delay in the terminal, which causes a switching delay between the speech coding process and the speech recognition process. For example, when the user is using the speech recognition function of the terminal and a call arrives, answering the incoming call is delayed.
  • Hereinafter, the pre-processing for speech coding and the pre-processing for speech recognition in a conventional terminal will be described.
  • a conventional terminal includes a speech coding module and a distributed speech recognition front-end module.
  • the speech coding module includes a pre-processing unit for speech coding, a model parameter estimation unit, a first compression unit and a first bit stream transmitting unit.
  • the distributed speech recognition front-end module includes a pre-processing unit for speech recognition, an MFCC front-end unit, a second compression unit and a second bit stream transmitting unit.
  • In the conventional terminal, the speech coding module and the distributed speech recognition front-end module each attenuate the noise mixed with the user's speech separately, because the pre-processed signals for speech coding and speech recognition are handled independently. Since the speech coding module and the distributed speech recognition front-end module perform the same function, a method for integrating speech coding and speech recognition by sharing the pre-processing steps is needed.
  • An embodiment of the present invention is directed to providing an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing steps used in speech coding with the pre-processing steps used for extracting the speech recognition feature vector.
  • In one aspect, an apparatus for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal includes: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the high-pass filtered signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise estimation unit; the background noise estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding an update of the noise estimation value based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR based on the sum of voice metrics; a channel gain computation unit for acquiring a linear channel gain based on the modified channel SNR and the background noise energy estimation value; a frequency domain filter for applying the linear channel gain to the spectral signals; and a time domain conversion unit for converting the gain-applied spectral signals into speech signals in a time domain.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, there is provided a distributed speech recognition terminal including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating noise in the low frequency signals passed by the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
  • In another aspect, a method for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal includes the steps of: eliminating low frequency signals of speech signals received from the outside; converting the filtered signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; issuing a noise estimation value updating command based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value; updating the background noise energy estimation value when the noise estimation value updating command is received; estimating a channel SNR of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the gain-applied spectral signals into time domain speech signals.
  • The present invention requires a small amount of memory, requires little computation, and improves speech recognition performance by sharing the pre-processing between speech coding and speech recognition.
  • The present invention can also prevent the delay caused by switching between the speech coding process and the speech recognition process, which is otherwise incurred by separate speech coding and speech feature vector extraction pre-processing steps.
  • In addition, the present invention can attenuate the noise mixed in the user's speech signal during both speech coding and speech feature vector extraction.
  • Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing preprocessing steps used in a speech coding in accordance with an embodiment of the present invention
  • Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention
  • Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing 11 kHz speech signal in accordance with an embodiment of the present invention
  • Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention
  • Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention
  • Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention
  • Fig. 7 is a graph showing speech recognition performance by using a speech feature vector in accordance with an embodiment of the present invention.
  • Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing pre- processing steps at a speech coding in a distributed speech recognition terminal in accordance with an embodiment of the present invention.
  • the distributed speech recognition terminal e.g., mobile phone, having the apparatus for extracting a noise-robust speech feature vector includes a speech coding module 150 and a distributed speech recognition front-end module 100, but as shown in Fig. 1, a speech coding/recognition pre-processing block 11 is shared by a pre-processing step of the speech coding module 150 and a pre-processing step of the distributed speech recognition front-end module 100.
  • the distributed speech recognition front- end module 100 includes the speech coding/recognition pre-processing block 11, a speech feature vector extraction block, e.g., MFCC front-end block, 12, a first speech compression block 13 and a first bit stream transmission block 14.
  • the speech coding module 150 includes the speech coding/recognition preprocessing block 11, a speech coding block 15, a second speech compression block 16 and a second bit stream transmission block 17.
  • the terminal includes a switch 50 for shifting between a speech coding mode and a speech recognition mode.
  • coded signals of speech spoken by the user are transmitted to a mobile communication system through a voice traffic channel in the speech coding mode; and extracted speech feature vectors of speech spoken by the user are transmitted to the speech recognition server through a packet data channel in the speech recognition mode .
  • The speech coding/recognition pre-processing block 11 attenuates noise in the 8 kHz input speech spoken by the user.
  • a separate noise attenuation block is not used in the distributed speech recognition front-end module 100 and the speech coding/recognition pre-processing block 11 is used as a noise attenuation block.
  • noise attenuation function is performed in the speech coding/recognition pre-processing block 11 for extracting a noise-robust speech feature vector (MFCCs) in the distributed speech recognition front-end module 100.
  • the speech coding/recognition preprocessing block 11 attenuates noise to extract speech feature vectors (MFCCs) which are robust to noise in the speech feature extraction block 12.
  • the speech coding/recognition pre-processing block 11 is realized in a specification capable of performing both pre-processing for speech coding and pre-processing for speech recognition.
  • The speech coding/recognition pre-processing block 11 in accordance with an embodiment of the present invention will be described in detail referring to Fig. 2. Since the constituent elements 12, 13, 14, 15, 16 and 17 of Fig. 1 are well known, their detailed description is omitted.
  • Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention.
  • the speech coding/recognition pre-processing block 11 in accordance with the present invention includes a high pass filter 21, a frequency domain conversion unit 22, a channel energy estimation unit 23, a channel SNR estimation unit 24, a voice metric calculation unit 25, a spectral deviation estimation unit 26, a noise update decision unit 27, a channel SNR modifying unit 28, a channel gain computation unit 29, a background noise estimation unit 30, a frequency domain filter 31 and a time domain conversion unit 32.
  • The speech coding/recognition pre-processing block 11 may be implemented based on the IS-127 Enhanced Variable Rate Codec (EVRC) used in CDMA, whose specification is suitable both for the speech coding pre-processing for speech communication and for the speech feature pre-processing for speech recognition.
  • The input speech signal s_LFB(n) spoken by the user and inputted into the speech coding/recognition pre-processing block 11 is 16-bit uniform pulse code modulation (PCM) data with an 8 kHz sampling frequency.
  • the speech coding/recognition pre-processing block 11 of the present invention mainly performs noise attenuation. Therefore, noise attenuated signal s 1 (n) is outputted when the input speech signal S LFB ( ⁇ ) is inputted as shown in Fig. 2.
  • noise attenuated signal s 1 (n) is outputted when the input speech signal S LFB ( ⁇ ) is inputted as shown in Fig. 2.
  • The high pass filter 21 eliminates the low frequency band of the input speech signal s_LFB(n) inputted through a microphone; the cutoff frequency of the high pass filter 21 is 120 Hz.
  • The signal filtered by the high pass filter 21 is defined as s_hp(n), the noise attenuation object signal.
  • The frame size of the noise attenuation object signal is 10 ms, and the current frame is denoted m.
  • The frequency domain conversion unit 22 converts the filtered signal s_hp(n) from the high pass filter 21 into a frequency domain signal using a smoothed trapezoidal window, i.e., windowing. The frequency domain conversion steps are described in detail below.
  • In the smoothed trapezoidal window, the first D samples of the input frame buffer d(m,n) of the m-th frame are overlapped with the last D samples of the previous frame: d(m,n) = d(m-1, L+n), 0 ≤ n < D (Eq. 1). Here, m is the current frame; n is a sample index of the input buffer d(m,n); L is the frame length, e.g., 80; and D is the overlap (delay) of samples, e.g., 24.
  • The remaining samples of the input buffer are pre-emphasized as d(m, D+n) = s_hp(n) + ξ_p·s_hp(n-1), 0 ≤ n < L (Eq. 2), where ξ_p is a pre-emphasis coefficient, e.g., -0.8.
  • By Eq. 1, the input buffer has L+D samples, e.g., 104; the first D samples are the pre-emphasized, overlapped part carried over from the previous frame, and the samples after the first D samples are the pre-emphasized input of the current frame. The windowed signal is acquired by applying the smoothed trapezoidal window to the input buffer as in Eq. 3:
  • g(n) = d(m,n)·sin²(π(n+0.5)/2D) for 0 ≤ n < D; g(n) = d(m,n) for D ≤ n < L; g(n) = d(m,n)·sin²(π(n-L+D+0.5)/2D) for L ≤ n < D+L; g(n) = 0 for D+L ≤ n < M (Eq. 3).
  • Here, M is the length of the discrete Fourier transform (DFT), e.g., 128, and the spectral signal G(k) is acquired by the M-point DFT: G(k) = (2/M)·Σ_{n=0}^{M-1} g(n)·e^{-j2πnk/M}, 0 ≤ k < M (Eq. 4).
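To make the framing pipeline of Eqs. 1-4 concrete, here is a minimal Python sketch; the constants follow the example values above, and the buffer handling is an assumption based on this description rather than on the IS-127 reference code:

```python
import numpy as np

L, D, M = 80, 24, 128   # frame length, overlap, DFT length (example values above)
XI_P = -0.8             # pre-emphasis coefficient

def frame_to_spectrum(d_prev, s_hp, s_hp_last):
    # d_prev:    previous frame's input buffer of L+D samples
    # s_hp:      current high-pass filtered frame, L samples
    # s_hp_last: last s_hp sample of the previous frame (for n = 0 in Eq. 2)
    d = np.empty(L + D)
    d[:D] = d_prev[L:]                                   # Eq. 1: carry over D samples
    shifted = np.concatenate(([s_hp_last], s_hp[:-1]))   # s_hp(n - 1)
    d[D:] = s_hp + XI_P * shifted                        # Eq. 2: pre-emphasis

    g = np.zeros(M)
    n = np.arange(D)
    g[:D] = d[:D] * np.sin(np.pi * (n + 0.5) / (2 * D)) ** 2           # Eq. 3, rise
    g[D:L] = d[D:L]                                                    # flat section
    k = np.arange(L, L + D)
    g[L:L + D] = d[L:L + D] * np.sin(np.pi * (k - L + D + 0.5) / (2 * D)) ** 2  # fall

    G = (2.0 / M) * np.fft.fft(g)                        # Eq. 4: M-point DFT
    return d, G
```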
  • the spectral signal G(k) transformed into the frequency domain signal in the frequency domain conversion unit 22 is used as an input signal of the channel energy estimation unit 23.
  • The channel energy estimation unit 23 acquires the channel energy estimation value of the following Eq. 5 for the current frame m of the spectral signal G(k) inputted from the frequency domain conversion unit 22.
  • Here, E_min is the minimum permissible channel energy, e.g., 0.0625; α_ch(m) is the channel energy smoothing factor, expressed as the following Eq. 6; and N_c is the number of combined channels, e.g., 16.
  • f_L(i) and f_H(i) are the low frequency DFT bin and the high frequency DFT bin of the i-th channel, respectively.
  • When the channel energy estimation value is obtained by Eq. 5, if the channel energy smoothing factor α_ch(m) of the first frame is 0, the channel energy estimation value is initialized to the unfiltered channel energy of the first frame (see the sketch below).
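Eq. 5 itself is not reproduced in this text, so the following Python sketch assumes the usual floored exponential smoothing of the per-channel mean of |G(k)|², which is consistent with the definitions and the initialization rule above; the band edges are the f_L and f_H values listed later in the text:

```python
import numpy as np

E_MIN, N_C = 0.0625, 16
F_L = np.array([2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56])
F_H = np.array([3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63])

def channel_energy(G, E_ch_prev, alpha_ch):
    # Smoothed per-channel energy estimate; alpha_ch = 0 on the first
    # frame, which initializes the estimate to the unfiltered energy.
    E_ch = np.empty(N_C)
    for i in range(N_C):
        band = np.abs(G[F_L[i]:F_H[i] + 1]) ** 2
        E_ch[i] = max(E_MIN, alpha_ch * E_ch_prev[i] + (1 - alpha_ch) * band.mean())
    return E_ch
```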
  • The channel SNR estimation unit 24 estimates the signal-to-noise ratio (SNR) existing in each channel.
  • the channel SNR estimation unit 24 acquires quantized channel SNR indices as the following Eq. 7 based on the channel energy estimation value obtained in the channel energy estimation unit 23 and a background noise energy estimation value obtained in the background noise estimation unit 30.
  • Here, E_n(m,i), obtained in the background noise estimation unit 30, is the noise energy estimation value of the current channel, and the quantized channel SNR index σ_q(i) obtained from it ranges from 0 to 89.
  • The voice metric calculation unit 25 acquires the sum of voice metrics in the current channel as the following Eq. 8, based on the quantized channel SNR indices σ_q(i) estimated in the channel SNR estimation unit 24.
  • V(k) is a voice metric having 90 elements as follows:
  • V(k) = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
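The SNR quantization and voice metric summation (Eqs. 7 and 8) can be sketched as follows. The 0.375 dB quantization step is an assumption inferred from the 0 to 89 index range, and the table is padded with trailing 50s to reach the 90 elements the text states, since the printed list is a few entries short:

```python
import numpy as np

V = np.array(
    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6,
     7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17,
     18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33,
     34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
    + [50] * 14)  # padded to the stated 90 entries; the table saturates at 50

def quantized_snr_indices(E_ch, E_n, step_db=0.375):
    # Eq. 7 sketch: per-channel SNR in dB quantized to indices 0..89.
    snr_db = 10.0 * np.log10(np.maximum(E_ch, 1e-12) / np.maximum(E_n, 1e-12))
    return np.clip(np.round(snr_db / step_db), 0, 89).astype(int)

def voice_metric_sum(sigma_q):
    # Eq. 8 sketch: sum of the voice metrics over all channels.
    return int(V[sigma_q].sum())
```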
  • The spectral deviation estimation unit 26 estimates the spectral deviation of the current channel signal based on the channel energy estimation value E_ch(m,i) obtained in the channel energy estimation unit 23. The estimation process of the spectral deviation is described below.
  • E_dB(m) is the average long-term power spectrum estimation value obtained in the previous frame.
  • An initial value of the average long-term power spectrum estimation value is set to the log power spectrum estimation value of the first frame, as the following Eq. 11.
  • A total energy estimation value of the m-th frame is obtained from the channel energy estimation value E_ch(m) as the following Eq. 12.
  • The total energy estimation value E_tot(m) and the difference value Δ_E(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value are inputted into the noise update decision unit 27 in order to update the background noise estimation value.
  • An exponential window function factor α(m) is a function of the total energy estimation value E_tot(m) and is obtained by the following Eq. 13.
  • The exponential window function factor α(m) obtained by Eq. 13 is limited to the range from α_L to α_H: α(m) = max{α_L, min{α_H, α(m)}} (Eq. 14).
  • E_H and E_L are the dB-scale boundary energies corresponding to the linear interpolation values of E_tot(m) expressed by α(m), when α(m) is limited from α_L to α_H.
  • The exponential window function factor α(m) is determined as 0.745 for a signal having a relative energy of 40 dB.
  • The average long-term power spectrum estimation value of the next frame is updated based on the exponential window function factor α(m) and the initial value of E_dB(m), as the following Eq. 15.
  • The noise update decision unit 27 issues a command, i.e., update_flag, ordering an update of the noise estimation value obtained in the background noise estimation unit 30, based on the total channel energy estimation value E_tot(m) and the difference value Δ_E(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value obtained in the spectral deviation estimation unit 26, following the logic expressed below in pseudo code.
  • The channel SNR modifying unit 28 modifies the values of the quantized channel SNR indices σ_q(i) estimated in the channel SNR estimation unit 24 based on v(m), the sum of voice metrics in the current channel calculated in the voice metric calculation unit 25.
  • The modified channel SNR indices σ_q''(i) are used as an input parameter of the channel gain computation unit 29.
  • The following logic, expressed in pseudo code, shows the modification of the SNR estimation value.
  • The channel gain computation unit 29 calculates a linear channel gain γ_ch based on the channel SNR indices σ_q''(i) modified in the channel SNR modifying unit 28 and the background noise energy estimation value E_n(m) estimated in the background noise estimation unit 30. The linear channel gain calculation is described in detail below.
  • γ_min is the minimum overall gain, e.g., -13; E_floor is the noise floor energy, e.g., 1; and the background noise energy estimation value E_n(m) is the estimate from the background noise estimation unit 30. A channel gain in dB is then acquired as the following Eq. 17.
  • μ_g is the slope of the gain, e.g., 0.39. It is desirable that the channel gain be converted into a linear channel gain as the following Eq. 18.
  • The frequency domain filter 31 applies the linear channel gain γ_ch calculated in the channel gain computation unit 29 to the spectral signal G(k) transformed in the frequency domain conversion unit 22, as the following Eq. 19.
  • H(k) = γ_ch(i)·G(k), f_L(i) ≤ k ≤ f_H(i), 0 ≤ i < N_c (Eq. 19)
  • H(M-k) = H*(k), 0 < k < M/2 (Eq. 20)
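In code, the gain application and symmetry restoration of Eqs. 19 and 20 might look like this sketch; the function and parameter names are hypothetical:

```python
import numpy as np

def apply_channel_gains(G, gamma_ch, f_L, f_H, M=128):
    # Eq. 19: scale each channel's DFT bins by its linear gain.
    # Eq. 20: restore Hermitian symmetry so the inverse DFT is real.
    H = G.astype(complex)
    for i, g in enumerate(gamma_ch):
        H[f_L[i]:f_H[i] + 1] *= g            # Eq. 19
    for k in range(1, M // 2):
        H[M - k] = np.conj(H[k])             # Eq. 20
    return H
```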
  • The background noise estimation unit 30 estimates the noise energy estimation value E_n(m) of the noise signals existing in the current channel and updates it based on the command, i.e., update_flag, received from the noise update decision unit 27.
  • The background noise estimation unit 30 updates the channel noise estimation value of the next frame as the following Eq. 21.
  • E_min is the minimum channel energy, e.g., 0.0625, and α_n is the channel noise smoothing factor, e.g., 0.9. Meanwhile, the noise estimation values of the first 4 frames are initialized with the channel energy estimation values, respectively:
  • E_n(m,i) = max{E_init, E_ch(m,i)}, 1 ≤ m ≤ 4, 0 ≤ i < N_c (Eq. 22)
  • E_init is the minimum channel noise initial energy, e.g., 16.
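A sketch of the noise update follows; Eq. 21's body is not reproduced in the text, so the smoothing form below is an assumption built from the factors just defined:

```python
import numpy as np

E_MIN, E_INIT, ALPHA_N = 0.0625, 16.0, 0.9

def update_background_noise(E_n, E_ch, m, update_flag):
    # Eq. 22: the first 4 frames initialize the noise estimate from the
    # channel energy. Eq. 21 is assumed to be a floored exponential
    # smoothing with the factor ALPHA_N, as its body is not reproduced.
    if 1 <= m <= 4:
        return np.maximum(E_INIT, E_ch)                                   # Eq. 22
    if update_flag:
        return np.maximum(E_MIN, ALPHA_N * E_n + (1.0 - ALPHA_N) * E_ch)  # Eq. 21
    return E_n
```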
  • the time domain conversion unit 32 converts noise attenuated speech signals, i.e., speech signals in the frequency domain, inputted through the frequency domain filter 31 into speech signals in the time domain.
  • a time domain conversion process will be described in detail.
  • filtered signals in the frequency domain filter 31 are transformed into time domain signals based on inverse DFT as the following Eq. 23.
  • ⁇ d is a de-emphasis factor, e.g., 0.8; and s' (n) is an output buffer which can accommodate 320 samples.
  • noise-attenuated speech signal S' (n) can be obtained in the speech coding/recognition pre-processing block 11.
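The synthesis side (Eq. 23 plus de-emphasis) can be sketched as follows; the overlap handling of the 320-sample output buffer is simplified to a single frame, and the de-emphasis recursion is an assumption that inverts the Eq. 2 pre-emphasis:

```python
import numpy as np

L, D, M = 80, 24, 128
ZETA_D = 0.8   # de-emphasis factor (from the text)

def spectrum_to_frame(H, s_prev_last):
    # Eq. 23 sketch: inverse M-point DFT of the filtered spectrum, then
    # de-emphasis s'(n) = h(n) + ZETA_D * s'(n-1); s_prev_last is the
    # last output sample of the previous frame.
    h = np.real(np.fft.ifft(H)) * (M / 2.0)   # undo the 2/M DFT scaling
    s_out = np.empty(L)
    prev = s_prev_last
    for n in range(L):
        s_out[n] = h[D + n] + ZETA_D * prev
        prev = s_out[n]
    return s_out
```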
  • The noise attenuated speech signal s'(n) is inputted into the speech feature vector extraction block 12 of the distributed speech recognition front-end module 100 or into the speech coding block 15 of the speech coding module 150, according to the speech recognition mode or the speech coding mode, respectively.
  • Since the frame size of the noise attenuation object signal is 10 ms, as described above for the speech coding/recognition pre-processing block 11, the noise attenuation is performed once every 10 ms. Therefore, the noise attenuated speech signal outputted from the speech coding/recognition pre-processing block 11 is s'(n), 240 ≤ n < 320.
  • The noise attenuated speech signal may be outputted differently according to the frame size of the noise attenuation object signal.
  • The method corresponding to the speech coding/recognition pre-processing block 11 for the speech feature vector extracting module and the speech coding module consists of time-series steps that are well known in the speech signal processing field; therefore, a detailed description of the method is omitted.
  • Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing an 11 kHz speech signal in accordance with an embodiment of the present invention.
  • Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention.
  • An 8 kHz user speech signal is the noise attenuation object signal in the speech coding/recognition pre-processing block 11 of Fig. 2.
  • Accordingly, a speech coding/recognition pre-processing block for processing an 11 kHz speech signal is presented in Fig. 3, and a speech coding/recognition pre-processing block for processing a 16 kHz speech signal is presented in Fig. 4.
  • The first expanded speech coding/recognition pre-processing block, for 11 kHz, further includes a frequency down-sampler 41 for converting the 11 kHz speech signal into an 8 kHz speech signal in front of the speech coding/recognition pre-processing block of Fig. 2.
  • The speech signal down-sampled in the frequency down-sampler 41 is inputted into the speech coding/recognition pre-processing block 11.
  • The second expanded speech coding/recognition pre-processing block, for 16 kHz, further includes a low pass quadrature mirror filter (QMF LP, decimation by 2) 46 and a high pass quadrature mirror filter (QMF HP, decimation by 2 and spectral inversion) 47 in front of the speech coding/recognition pre-processing block of Fig. 2.
  • The QMF LP 46 receives the inputted 16 kHz speech signal and outputs the 0 to 4 kHz low frequency band signal, and the QMF HP 47 receives the inputted 16 kHz speech signal and outputs the 4 to 8 kHz high frequency band signal.
  • The low frequency signal outputted from the QMF LP 46 is inputted into the speech coding/recognition pre-processing block, and the high frequency signal outputted from the QMF HP 47 is inputted into the speech feature vector extraction block 12, i.e., the MFCC front-end, of the distributed speech recognition front-end module 100.
  • 26 Mel-filter banks are used to extract the speech feature vectors, e.g., MFCCs.
  • The low frequency signal outputted from the QMF LP 46 reaches the speech feature vector extraction block 12 through the speech coding/recognition pre-processing block. Then the low frequency signal and the high frequency signal outputted from the QMF HP 47 are combined into one signal in the speech feature vector extraction block 12: before the log filter bank energies are converted into cepstrum coefficients, the high frequency and low frequency components are added, and the log parameters (log-energy) for all frequency bands are obtained from the high frequency and low frequency signals.
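The band recombination described here might be sketched as follows; the split of the 26 Mel filters between the two bands, the shapes and all names are assumptions for illustration:

```python
import numpy as np

def mfcc_from_split_bands(fb_low, fb_high, n_ceps=13):
    # Sketch of the combination above: low-band and high-band Mel filter
    # bank outputs are joined before the log and cepstrum (DCT) stages.
    fb = np.concatenate([fb_low, fb_high])     # combined filter bank vector
    log_fb = np.log(np.maximum(fb, 1e-10))     # log filter bank energies
    n_fb = fb.size
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_fb)[None, :]
    dct = np.cos(np.pi * k * (j + 0.5) / n_fb) # DCT-II basis
    return dct @ log_fb                        # MFCCs of the combined bands
```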
  • The expanded speech coding/recognition pre-processing blocks of Figs. 3 and 4 can be implemented according to the frequency extension specification of the European Telecommunications Standards Institute (ETSI) DSR standard (ETSI ES 202 050 v1.1.3) in order to use 11 kHz or 16 kHz sampling frequency signals.
  • Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention
  • Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention
  • Fig. 7 is a graph showing a speech recognition performance based on a speech feature vector in accordance with an embodiment of the present invention.
  • The present invention can be applied to the distributed speech recognition terminal, e.g., a mobile phone, and its effect on the speech recognition performance needs to be verified.
  • Fig. 5 shows the speech recognition process based on a hidden Markov model (HMM).
  • Speech features are extracted from the speech spoken by the user at step 301, and then pattern matching 302 is performed by searching an acoustic model 303, a language model 304 and a pronunciation dictionary 305 according to the extracted speech features.
  • a word or a sentence is recognized in response to the speech.
  • A method suggested in the ETSI standard "ETSI ES 201 108" is used for the extraction of the speech features 301.
  • The speech features are extracted from the speech signal through the MFCC, the speech feature vector is formed as high-order coefficients, and the word stream having the maximum probability is searched through pattern matching based on the acoustic model 303, the language model 304 and the pronunciation dictionary 305 in response to the speech feature vector.
  • In one comparison case, a noise attenuated signal produced by the pre-processing defined in the ETSI DSR standard, i.e., ETSI ES 202 050 v1.1.3, is used as the speech signal for extracting the speech characteristics.
  • In another case, a noise attenuated signal produced by the pre-processing steps defined in IS-127 is used as the speech signal for extracting the speech characteristics.
  • In the present invention, the noise attenuated signal outputted from the speech coding/recognition pre-processing block 11 is used as the speech signal for extracting the speech characteristics.
  • 13th-order MFCCs and log-energy are extracted using the MFCC front-end module.
  • The 12th-order MFCCs (c_0, ..., c_12), log-energy, and their deltas and delta-deltas are used as the parameters for acoustic model training and speech recognition.
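For illustration, the delta and delta-delta parameters can be computed with the standard regression formula below; the window width and the array shapes are assumptions, since the text only names the parameter set:

```python
import numpy as np

def deltas(feats, w=2):
    # Regression deltas over a +/- w frame window (w = 2 is an assumption).
    T = feats.shape[0]
    denom = 2.0 * sum(t * t for t in range(1, w + 1))
    padded = np.pad(feats, ((w, w), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for t in range(T):
        for k in range(1, w + 1):
            out[t] += k * (padded[t + w + k] - padded[t + w - k])
    return out / denom

static = np.random.randn(100, 14)       # hypothetical: MFCCs + log-energy per frame
d = deltas(static)                      # delta features
dd = deltas(d)                          # delta-delta features
features = np.hstack([static, d, dd])   # final observation vectors
```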
  • HMM is used for the acoustic model 303.
  • a phone model in accordance with the language is used as the acoustic model.
  • the training process for generating the context independent phone model will be described referring to Fig. 6.
  • a monophone-based model as a context independent phone model is generated based on the speech feature vector extracted from training data at step S401.
  • a triphone-based model as a context dependent phone model is generated by expanding the monophone-based model at step S403. Then, a state-tying is performed considering that the training data for the triphone-based model is small at step S404.
  • a final acoustic model is generated by increasing the number of mixture densities of a result acoustic model acquired by performing the state tying at step S405.
  • The language model 304 shown in Fig. 5 adopts a statistical estimation method.
  • The statistical estimation method statistically estimates the probability of available word sequences from the speech database in a predetermined environment.
  • A language model adopting the statistical estimation method is the n-gram.
  • The probability of a word sequence is approximated by multiplying the preceding n conditional probabilities.
  • a bigram language model is used.
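As a toy illustration of the bigram approximation, with hypothetical counts and an add-k smoothing that the text does not specify:

```python
import math
from collections import Counter

def bigram_logprob(sentence, unigrams, bigrams, vocab_size, k=1.0):
    # Bigram model: P(w_1..w_n) is approximated by the product of the
    # conditional probabilities P(w_i | w_{i-1}), here with add-k smoothing.
    words = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + k) /
                         (unigrams[prev] + k * vocab_size))
    return logp

# usage with toy counts collected from a hypothetical training text
unigrams = Counter({"<s>": 2, "the": 2, "cat": 1})
bigrams = Counter({("<s>", "the"): 2, ("the", "cat"): 1})
print(bigram_logprob("the cat", unigrams, bigrams, vocab_size=3))
```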
  • The pronunciation dictionary provided by "CleanSent01" of SiTEC is used for Korean, and the "CMU dictionary v0.6" provided by Carnegie Mellon University is used for English.
  • Pronunciations of phrasal words that are not covered by "CleanSent01" are supplied by a pronunciation converter produced for this purpose based on the "standard pronunciation method of the standard language rule."
  • A phrasal word is composed of a word and an auxiliary word.
  • The total number of phrasal words in the pronunciation dictionary provided by "CleanSent01" is 36,104, and the total number of phrasal words in the pronunciation dictionary for speech recognition is 223,857.
  • A sentence speech DB (e.g., CleanSent01) is used for Korean, and the AURORA 4 DB (e.g., Wall Street Journal) is used for English.
  • 5,000 sentences among the text data used in training and 3,000 sentences among the 'speech recognition language model usage text DB' may be used for generating the language model.
  • A hidden Markov model toolkit (HTK) v3.1 is used to generate the language model, and the final language model includes 31,582 words.
  • The finally acquired model includes a network of 31,582 words.
  • Referring to Fig. 7, the word recognition rate using the conventional noise attenuated speech signal is 68.61%, and the word recognition rate using the noise attenuated speech signal in accordance with the present invention is 69.31%. That is, the speech recognition performance of the present invention is improved over that of the conventional method.
  • As described above, a noise-robust speech feature vector can be extracted by sharing the speech coding pre-processing and the speech feature vector extraction pre-processing in a simple-structured terminal. Therefore, the speech recognition performance is improved with a small amount of memory and operations in the simple-structured terminal.
  • the above described method according to the present invention can be embodied as a program and be stored on a computer readable recording medium.
  • the computer readable recording medium is any data storage device that can store data which can be read by the computer system.
  • The computer readable recording medium includes a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a floppy disk, a hard disk and an optical magnetic disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided are an apparatus for extracting a speech feature vector in a distributed speech recognition terminal and a method thereof. The apparatus includes a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.

Description

DESCRIPTION
APPARATUS AND METHOD FOR EXTRACTING NOISE-ROBUST SPEECH RECOGNITION VECTOR BY SHARING PREPROCESSING STEP
USED IN SPEECH CODING
TECHNICAL FIELD
The present invention relates to an apparatus for extracting a speech feature vector in a distributed speech recognition terminal and a method thereof; and, more particularly, to an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing step used for speech coding with the pre-processing step used for speech feature vector extraction.
BACKGROUND ART
Distributed speech recognition (DSR) implements speech recognition with a simple-structured terminal such as a mobile phone: the simple-structured terminal extracts characteristics of the speech signals, and a high-performance speech recognition server performs speech recognition based on the characteristics received from the simple-structured terminal. That is, DSR is a dual processing system.
Generally, a Mel-frequency cepstral coefficient (MFCC) is used for speech recognition. The MFCC represents the frequency spectrum, expressed on the Mel scale, as sinusoidal components, and the MFCC is a speech feature vector, i.e., a speech recognition parameter representing the speech received from a user.
The terminal extracts the speech feature vector of the speech received from the user based on the MFCC, loads the speech feature vector into a bit stream so that it can be transmitted through a communication network, and transmits the bit stream to the speech recognition server. That is, the MFCCs extracted from the user's speech are mapped to the nearest vectors in a codebook having a predetermined number of codewords, and the mapped vectors are selected and transmitted as a bit stream. The codebook has a codeword for each group of similar values corresponding to the speech spoken by the user. Generally, a codeword is determined by extracting training data from a large amount of speech data and selecting a representative value from the extracted training data.
The speech recognition server dequantizes the speech feature vector loaded in the bit stream received from the terminal and recognizes the word corresponding to the speech based on a hidden Markov model (HMM) as the speech model. Herein, the HMM models a phoneme, i.e., the unit for recognizing speech, and completes words and sentences by matching the phonemes inputted to the speech recognition engine against the phonemes stored in the engine's database.
Recently, the mobile phone has been highlighted as a distributed speech recognition terminal in line with the digital convergence trend, and a module for speech signal processing, i.e., a speech coding module, is embedded in the mobile phone.
As described above, when the speech feature vector corresponding to the user's speech is extracted, pre-processing of the speech signals, specifically noise attenuation, is needed. However, the pre-processing step for speech coding and the pre-processing step for speech recognition are performed individually in general mobile phones. That is, the pre-processing of the user's speech is the same for speech coding and for speech recognition, but it is performed separately. Especially, since the pre-processing is performed in different pre-processing apparatuses, additional memory and operations are needed in a simple-structured terminal, which wastes resources.
In addition, the speech pre-processing for speech coding incurs internal delay in the terminal, which causes a switching delay between the speech coding process and the speech recognition process. For example, when the user is using the speech recognition function of the terminal and a call arrives, answering the incoming call is delayed. Hereinafter, the pre-processing for speech coding and the pre-processing for speech recognition in a conventional terminal will be described.
A conventional terminal includes a speech coding module and a distributed speech recognition front-end module.
The speech coding module includes a pre-processing unit for speech coding, a model parameter estimation unit, a first compression unit and a first bit stream transmitting unit. The distributed speech recognition front-end module includes a pre-processing unit for speech recognition, an MFCC front-end unit, a second compression unit and a second bit stream transmitting unit.
According to the conventional method, the terminal's speech coding module and distributed speech recognition front-end module each attenuate the noise mixed with the user's speech separately, because the pre-processed signals for speech coding and speech recognition are handled independently. Since the speech coding module and the distributed speech recognition front-end module perform the same function, a method for integrating speech coding and speech recognition by sharing the pre-processing steps is needed.
DISCLOSURE
TECHNICAL PROBLEM
An embodiment of the present invention is directed to providing an apparatus and method for extracting a noise-robust speech feature vector in a terminal having a speech coding function, by sharing the pre-processing steps used in speech coding with the pre-processing steps used for extracting the speech recognition feature vector.
TECHNICAL SOLUTION
In accordance with an aspect of the present invention, there is provided an apparatus for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal, including: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the high-pass filtered signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise estimation unit; the background noise estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding an update of the noise estimation value based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR estimated in the channel SNR estimation unit based on the sum of voice metrics acquired in the voice metric calculation unit; a channel gain computation unit for acquiring a linear channel gain based on the channel SNR modified in the channel SNR modifying unit and the background noise energy estimation value obtained in the background noise estimation unit; a frequency domain filter for applying the linear channel gain to the spectral signals converted in the frequency domain conversion unit; and a time domain conversion unit for converting the gain-applied spectral signals into speech signals in a time domain.
In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.

In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
In accordance with another aspect of the present invention, there is provided a distributed speech recognition terminal, including: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating noise in the low frequency signals passed by the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
In accordance with another aspect of the present invention, there is provided a method for extracting a noise-robust speech feature vector by sharing the pre-processing of speech coding in a distributed speech coding/recognition terminal, including the steps of: eliminating low frequency signals of speech signals received from outside; converting the filtered signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; issuing a noise estimation value updating command based on a total channel energy estimation value and the difference between a current power spectrum estimation value and an average long-term power spectrum estimation value; updating the background noise energy estimation value when the noise estimation value updating command is received; estimating a channel SNR of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the gain-applied spectral signals into time domain speech signals.
ADVANTAGEOUS EFFECTS
The present invention requires a small amount of memory, requires little computation, and improves the performance of speech recognition by sharing the pre-processing between speech coding and speech recognition.

Also, the present invention can prevent the delay caused by switching between the speech coding process and the speech recognition process, which is otherwise incurred by separate speech coding and speech feature vector extraction pre-processing steps.

In addition, the present invention can attenuate the noise mixed in the user's speech signal during both speech coding and speech feature vector extraction.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing preprocessing steps used in a speech coding in accordance with an embodiment of the present invention;
Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention;
Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing 11 kHz speech signal in accordance with an embodiment of the present invention; Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing 16 kHz speech signal in accordance with an embodiment of the present invention;
Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention;
Fig. 6 is a flowchart illustrating a training processing for generating an acoustic model in accordance with an embodiment of the present invention; and Fig. 7 is a graph showing speech recognition performance by using a speech feature vector in accordance with an embodiment of the present invention.
BEST MODE FOR THE INVENTION
The advantages, features and aspects of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter, so that the invention can easily be carried out by those skilled in the art to which the invention pertains. Also, when it is considered that a detailed description of a related art may unnecessarily obscure the points of the present invention, that description is not provided herein. Hereinafter, specific embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 1 illustrates an apparatus for extracting a noise-robust speech feature vector by sharing pre- processing steps at a speech coding in a distributed speech recognition terminal in accordance with an embodiment of the present invention.
The distributed speech recognition terminal, e.g., mobile phone, having the apparatus for extracting a noise-robust speech feature vector includes a speech coding module 150 and a distributed speech recognition front-end module 100, but as shown in Fig. 1, a speech coding/recognition pre-processing block 11 is shared by a pre-processing step of the speech coding module 150 and a pre-processing step of the distributed speech recognition front-end module 100.
That is, the distributed speech recognition front- end module 100 includes the speech coding/recognition pre-processing block 11, a speech feature vector extraction block, e.g., MFCC front-end block, 12, a first speech compression block 13 and a first bit stream transmission block 14. In addition, the speech coding module 150 includes the speech coding/recognition preprocessing block 11, a speech coding block 15, a second speech compression block 16 and a second bit stream transmission block 17.
Of course, the terminal includes a switch 50 for shifting between a speech coding mode and a speech recognition mode. According to the action of the switch 50, coded signals of the speech spoken by the user are transmitted to a mobile communication system through a voice traffic channel in the speech coding mode, and extracted speech feature vectors of the speech spoken by the user are transmitted to the speech recognition server through a packet data channel in the speech recognition mode.
Especially, the speech coding/recognition pre-processing block 11 attenuates noise in the 8 kHz input speech spoken by the user. In the present invention, a separate noise attenuation block is not used in the distributed speech recognition front-end module 100; the speech coding/recognition pre-processing block 11 is used as the noise attenuation block.
That is, the noise attenuation function for extracting noise-robust speech feature vectors (MFCCs) in the distributed speech recognition front-end module 100 is performed in the speech coding/recognition pre-processing block 11. Here, the speech coding/recognition pre-processing block 11 attenuates noise so that speech feature vectors (MFCCs) which are robust to noise can be extracted in the speech feature extraction block 12. The speech coding/recognition pre-processing block 11 is realized in a specification capable of performing both the pre-processing for speech coding and the pre-processing for speech recognition. The speech coding/recognition pre-processing block 11 in accordance with an embodiment of the present invention will be described in detail referring to Fig. 2. Since the constituent elements 12, 13, 14, 15, 16 and 17 of Fig. 1 are well known, their detailed description is omitted.
Fig. 2 is a detailed diagram illustrating a speech coding/recognition pre-processing block in accordance with an embodiment of the present invention. As shown in Fig. 2, the speech coding/recognition pre-processing block 11 in accordance with the present invention includes a high pass filter 21, a frequency domain conversion unit 22, a channel energy estimation unit 23, a channel SNR estimation unit 24, a voice metric calculation unit 25, a spectral deviation estimation unit 26, a noise update decision unit 27, a channel SNR modifying unit 28, a channel gain computation unit 29, a background noise estimation unit 30, a frequency domain filter 31 and a time domain conversion unit 32. In the present invention, the speech coding/recognition pre-processing block 11 may be implemented based on the IS-127 Enhanced Variable Rate Codec (EVRC) used in CDMA, whose specification is suitable both for the speech coding pre-processing for speech communication and for the speech feature pre-processing for speech recognition.
Meanwhile, the input speech signal sLFB(n), spoken by the user and inputted into the speech coding/recognition pre-processing block 11, is 16-bit uniform pulse code modulation (PCM) data with an 8-kHz sampling frequency.
Generally, before the speech coding and the speech feature vector extraction, noise mixed in the input speech signal has to be attenuated to improve the quality of the speech signal. That is, the speech coding/recognition pre-processing block 11 of the present invention mainly performs noise attenuation. Therefore, a noise attenuated signal s'(n) is outputted when the input speech signal sLFB(n) is inputted, as shown in Fig. 2. Below, each constituent element of the speech coding/recognition pre-processing block 11 will be described in detail.
The high pass filter 21 eliminates low frequency band signals of the input speech signal sLFB(n) inputted through a microphone; the cutoff frequency of the high pass filter 21 is 120 Hz.
The signal filtered by the high pass filter 21 is defined as shp(n), and shp(n) is the noise attenuation object signal. The frame size of the noise attenuation object signal is 10 ms, and the current frame is denoted 'm'. The frequency domain conversion unit 22 converts the filtered signal shp(n) of the high pass filter 21 into a frequency domain signal based on a smoothed trapezoidal window, i.e., windowing. The frequency domain conversion steps will be described in detail.
In the smoothed trapezoidal window, the first D samples of the input frame buffer d(m,n) of the m-th frame are overlapped with the last D samples of the previous frame. This overlapping is expressed as the following Eq. 1.

d(m,n) = d(m-1, L+n); 0 ≤ n < D    Eq. 1

Here, m is the current frame; n is a sample index of the input buffer d(m); L is the frame length, e.g., 80; and D is the overlap (delay) length in samples, e.g., 24. The remaining samples of the input buffer are pre-emphasized as the following Eq. 2.

d(m, D+n) = shp(n) + ξp·shp(n-1); 0 ≤ n < L    Eq. 2

Here, ξp is a pre-emphasis coefficient, e.g., -0.8. By Eq. 1, the input buffer has L+D samples, e.g., 104; the first D samples are the pre-emphasized overlap carried over from the previous frame, and the samples after the first D samples are the pre-emphasized input of the current frame. On this buffer, windowed signals are acquired using the smoothed trapezoidal window as the following Eq. 3.

g(n) = d(m,n)·sin²(π(n+0.5)/2D),       0 ≤ n < D
g(n) = d(m,n),                         D ≤ n < L
g(n) = d(m,n)·sin²(π(n-L+D+0.5)/2D),   L ≤ n < D+L
g(n) = 0,                              D+L ≤ n < M    Eq. 3

Here, M is the length of the discrete Fourier transform (DFT), e.g., 128. A spectral signal G(k) can then be acquired by the M-point DFT as the following Eq. 4.

G(k) = (2/M)·Σ_{n=0}^{M-1} g(n)·e^{-j2πnk/M}; 0 ≤ k < M    Eq. 4
The spectral signal G(k) transformed into the frequency domain signal in the frequency domain conversion unit 22 is used as an input signal of the channel energy estimation unit 23.
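For illustration only, the framing, pre-emphasis, windowing and transform of Eqs. 1 to 4 can be sketched in Python as follows; this is a minimal reading of the equations, not the IS-127 reference implementation, and the handling of the sample preceding the current frame (s_last) is a simplifying assumption.

    import numpy as np

    L, D, M = 80, 24, 128     # frame length, overlap length, DFT length (as in the text)
    XI_P = -0.8               # pre-emphasis coefficient

    def to_spectrum(s_hp, d_prev, s_last=0.0):
        """Eqs. 1-4: build the (L+D)-sample buffer, window it, take the scaled M-point DFT.
        s_hp: current L-sample high-pass-filtered frame; d_prev: previous (L+D)-sample buffer;
        s_last: last input sample of the previous frame (used by the pre-emphasis at n = 0)."""
        d = np.empty(L + D)
        d[:D] = d_prev[L:L + D]                        # Eq. 1: carry over the last D samples
        prev = np.concatenate(([s_last], s_hp[:-1]))
        d[D:] = s_hp + XI_P * prev                     # Eq. 2: pre-emphasis
        g = np.zeros(M)
        n = np.arange(D)
        g[:D] = d[:D] * np.sin(np.pi * (n + 0.5) / (2 * D)) ** 2               # rising edge
        g[D:L] = d[D:L]                                                        # flat top
        g[L:L + D] = d[L:L + D] * np.sin(np.pi * (n + D + 0.5) / (2 * D)) ** 2 # falling edge
        G = (2.0 / M) * np.fft.fft(g, M)               # Eq. 4
        return G, d

The returned buffer d is carried into the next call as d_prev, which realizes the D-sample overlap of Eq. 1.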
The channel energy estimation unit 23 acquires a channel energy estimation value corresponding to the current frame m of the spectral signal G(k) inputted from the frequency domain conversion unit 22 as the following Eq. 5.

Ech(m,i) = max{Emin, αch(m)·Ech(m-1,i) + (1-αch(m))·(1/(fH(i)-fL(i)+1))·Σ_{k=fL(i)}^{fH(i)} |G(k)|²}, 0 ≤ i < Nc    Eq. 5

Here, Emin is a minimum permissible channel energy value, e.g., 0.0625; αch(m) is a channel energy smoothing factor expressed as the following Eq. 6; and Nc is the number of combined channels, e.g., 16. In addition, fL(i) and fH(i) are the low frequency DFT bin and the high frequency DFT bin of the i-th channel, respectively, expressed as follows:
fL = {2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56},
fH = {3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63}

αch(m) = 0 for m ≤ 1, and αch(m) = 0.45 otherwise    Eq. 6

Since the channel energy smoothing factor αch(m) of the first frame is 0, the channel energy estimation value is initialized to the un-filtered channel energy value of the first frame.
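A minimal sketch of Eqs. 5 and 6 (the value 0.45 in Eq. 6 follows the IS-127 convention; the function and variable names are illustrative):

    import numpy as np

    F_L = np.array([2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56])
    F_H = np.array([3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63])

    def channel_energy(G, E_prev, m, e_min=0.0625):
        """Eq. 5: smoothed band energies of the spectrum G for the Nc = 16 channels."""
        alpha = 0.0 if m <= 1 else 0.45       # Eq. 6 (0.45 assumed per IS-127)
        E = np.empty(len(F_L))
        for i in range(len(F_L)):
            band = np.abs(G[F_L[i]:F_H[i] + 1]) ** 2
            E[i] = max(e_min, alpha * E_prev[i] + (1.0 - alpha) * band.mean())
        return E

Because alpha is 0 for the first frame, the initialization to the un-filtered channel energy described above falls out of the same formula.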
The channel SNR estimation unit 24 estimates the signal-to-noise ratio (SNR) existing in the channel.
That is, the channel SNR estimation unit 24 acquires quantized channel SNR indices as the following Eq. 7 based on the channel energy estimation value obtained in the channel energy estimation unit 23 and a background noise energy estimation value obtained in the background noise estimation unit 30.
σq(i) = max{0, min{89, round(10·log10(Ech(m,i)/En(m,i))/0.375)}}, 0 ≤ i < Nc    Eq. 7

Here, En(m,i), obtained in the background noise estimation unit 30, is the noise energy estimation value of the current channel, and the resulting quantized channel SNR index σq(i) ranges from 0 to 89.
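Eq. 7 reduces to a few lines; the 0.375-dB quantization step and the rounding follow the IS-127 convention assumed above:

    import numpy as np

    def quantized_snr_indices(E_ch, E_n):
        """Eq. 7: per-channel SNR in dB, quantized to indices 0..89."""
        snr_db = 10.0 * np.log10(E_ch / E_n)
        return np.clip(np.round(snr_db / 0.375), 0, 89).astype(int)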
The voice metric calculation unit 25 acquires a sum of voice metrics in the current channel as the following Eq. 8 based on the SNR, e.g., the quantized channel SNR indices, σq(i), estimated in the channel SNR estimation unit 24.
v(m) = Σ_{i=0}^{Nc-1} V(σq(i))    Eq. 8
Here, V(k) is a voice metric having 90 elements as follows:
V(k) = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
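Given the table V(k) above as an array, Eq. 8 is a single indexed sum; a sketch:

    import numpy as np

    def voice_metric_sum(sigma_q, V):
        """Eq. 8: v(m) = sum over channels of V(sigma_q(i));
        V is the 90-element table above, sigma_q the quantized SNR indices of Eq. 7."""
        return int(np.sum(np.asarray(V)[sigma_q]))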
The spectral deviation estimation unit 26 estimates a spectral deviation corresponding to the current channel signal based on the channel energy estimation value Ech(m,i) obtained in the channel energy estimation unit 23. The estimation process of the spectral deviation will be described.
First, a log power spectrum of the current channel is acquired based on the channel energy estimation value Ech(m,i) as the following Eq. 9.

EdB(m,i) = 10·log10(Ech(m,i)), 0 ≤ i < Nc    Eq. 9

Then, a difference value between the current power spectrum estimation value obtained by Eq. 9 and the average long-term power spectrum estimation value is acquired as the following Eq. 10.

ΔE(m) = Σ_{i=0}^{Nc-1} |ĒdB(m,i) - EdB(m,i)|    Eq. 10

Here, ĒdB(m,i) is the average long-term power spectrum estimation value obtained in the previous frame. Before the average long-term power spectrum estimation value is updated in the course of the spectral deviation estimation, its initial value is set to the log power spectrum estimation value of the first frame as the following Eq. 11.

ĒdB(m,i) = EdB(m,i) for the first frame, 0 ≤ i < Nc    Eq. 11

In addition, a total energy estimation value of the m-th frame is obtained based on the channel energy estimation value Ech(m,i) as the following Eq. 12. The total energy estimation value Etot(m) and the difference value ΔE(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value are inputted into the noise update decision unit 27 in order to update the background noise estimation value.

Etot(m) = 10·log10(Σ_{i=0}^{Nc-1} Ech(m,i))    Eq. 12
Also, an exponential window function factor α(m) is a function of the total energy estimation value Etot(m) and is obtained based on the following Eq. 13.

α(m) = αH - (αH - αL)·(EH - Etot(m))/(EH - EL)    Eq. 13

Here, the exponential window function factor α(m) obtained by Eq. 13 is limited between αL and αH as the following Eq. 14.

α(m) = max{αL, min{αH, α(m)}}    Eq. 14

Also, EH and EL are the dB-scale boundary energies of the linear interpolation of Etot(m) expressed by α(m), when α(m) is limited between αL and αH. Here, EH = 50, EL = 30, αH = 0.99 and αL = 0.50. For example, the exponential window function factor α(m) is determined as 0.745 for a signal having a relative energy of 40 dB. Finally, the average long-term power spectrum estimation value of the next frame is updated based on the exponential window function factor α(m) and EdB(m,i) as the following Eq. 15.

ĒdB(m+1,i) = α(m)·ĒdB(m,i) + (1-α(m))·EdB(m,i), 0 ≤ i < Nc    Eq. 15
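The bookkeeping of Eqs. 9 to 15 can be collected into one update step; a sketch with the constants given above (names are illustrative):

    import numpy as np

    A_H, A_L, E_H, E_L = 0.99, 0.50, 50.0, 30.0

    def spectral_deviation_update(E_ch, E_dB_bar):
        """Eqs. 9-15: log spectrum, deviation, total energy, long-term spectrum update.
        E_dB_bar is the running average long-term log power spectrum."""
        E_dB = 10.0 * np.log10(E_ch)                                # Eq. 9
        delta_E = np.abs(E_dB_bar - E_dB).sum()                     # Eq. 10
        E_tot = 10.0 * np.log10(E_ch.sum())                         # Eq. 12
        alpha = A_H - (A_H - A_L) * (E_H - E_tot) / (E_H - E_L)     # Eq. 13
        alpha = max(A_L, min(A_H, alpha))                           # Eq. 14
        E_dB_bar_next = alpha * E_dB_bar + (1.0 - alpha) * E_dB     # Eq. 15
        return delta_E, E_tot, E_dB_bar_next

For Etot(m) = 40 dB this sketch yields α(m) = 0.745, matching the example above.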
The noise update decision unit 27 issues a command, e.g., update_flag, ordering the background noise estimation unit 30 to update its noise estimation value, based on the sum of voice metrics v(m), the total channel energy estimation value Etot(m) and the difference value ΔE(m) between the current power spectrum estimation value and the average long-term power spectrum estimation value obtained in the spectral deviation estimation unit 26, according to the following logic expressed in pseudo code.
/* Normal update logic */
update_flag = FALSE
if (v(m) < UPDATE_THLD) {
    update_flag = TRUE
    update_cnt = 0
}
/* Forced update logic */
else if ((Etot(m) > NOISE_FLOOR_DB) and (ΔE(m) < DEV_THLD)) {
    update_cnt = update_cnt + 1
    if (update_cnt > UPDATE_CNT_THLD)
        update_flag = TRUE
}
/* "Hysteresis" logic to prevent long-term creeping of update_cnt */
if (update_cnt == last_update_cnt)
    hyster_cnt = hyster_cnt + 1
else
    hyster_cnt = 0
last_update_cnt = update_cnt
if (hyster_cnt > HYSTER_CNT_THLD)
    update_cnt = 0
Here, the constants of the logic expressed in the pseudo code are UPDATE_THLD = 35, NOISE_FLOOR_DB = 10·log10(Efloor), DEV_THLD = 28, UPDATE_CNT_THLD = 50 and HYSTER_CNT_THLD = 6.
The channel SNR modifying unit 28 modifies the values of the quantized channel SNR indices {σq} estimated in the channel SNR estimation unit 24 based on v(m), the sum of voice metrics in the current channel calculated in the voice metric calculation unit 25. The modified channel SNR indices σ''q are used as an input parameter of the channel gain computation unit 29. The following logic expressed in pseudo code shows the modification of the SNR estimation value.

/* Set or reset modify flag */
index_cnt = 0
for (i = NM to Nc-1 step 1) {
    if (σq(i) >= INDEX_THLD)
        index_cnt = index_cnt + 1
}
if (index_cnt < INDEX_CNT_THLD)
    modify_flag = TRUE
else
    modify_flag = FALSE

/* Modify the SNR indices to get {σ'q} */
if (modify_flag == TRUE)
    for (i = 0 to Nc-1 step 1)
        if ((v(m) <= METRIC_THLD) or (σq(i) <= SETBACK_THLD))
            σ'q(i) = 1
        else
            σ'q(i) = σq(i)
else
    {σ'q} = {σq}

/* Limit {σ'q} to the SNR threshold σth to get {σ''q} */
for (i = 0 to Nc-1 step 1)
    if (σ'q(i) < σth)
        σ''q(i) = σth
    else
        σ''q(i) = σ'q(i)

Here, the constants and threshold values of the logic expressed in the pseudo code are as follows:
NM = 5, INDEX_THLD = 12, INDEX_CNT_THLD = 5, METRIC_THLD = 45, SETBACK_THLD = 12, σth = 6
The channel gain computation unit 29 calculates a linear channel gain γch based on the modified channel SNR indices σ''q from the channel SNR modifying unit 28 and the background noise energy estimation value En(m) estimated in the background noise estimation unit 30. The process of the linear channel gain calculation will be described in detail.
First, a total gain of the current frame is acquired based on the following Eq. 16.

γn = max{γmin, -10·log10((1/Efloor)·Σ_{i=0}^{Nc-1} En(m,i))}    Eq. 16

Here, γmin is a minimum total gain, e.g., -13; Efloor is a noise floor energy, e.g., 1; and the background noise energy estimation value En(m,i) is the estimation value obtained in the background noise estimation unit 30. Then, a channel gain in dB is acquired based on the following Eq. 17.

γdB(i) = μg·(σ''q(i) - σth) + γn, 0 ≤ i < Nc    Eq. 17

Here, μg is the gain slope, e.g., 0.39. The channel gain in dB is then converted into a linear channel gain as the following Eq. 18.

γch(i) = min{1, 10^(γdB(i)/20)}, 0 ≤ i < Nc    Eq. 18
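Eqs. 16 to 18 combine into a short gain computation; a sketch with the constants given above (the names are illustrative):

    import numpy as np

    def channel_gains(sigma_q_mod, E_n, gamma_min=-13.0, e_floor=1.0, mu_g=0.39, sigma_th=6):
        """Eqs. 16-18: overall gain, per-channel dB gain, and linear channel gain.
        sigma_q_mod: modified SNR indices from the channel SNR modifying unit."""
        gamma_n = max(gamma_min, -10.0 * np.log10(E_n.sum() / e_floor))  # Eq. 16
        gamma_db = mu_g * (sigma_q_mod - sigma_th) + gamma_n             # Eq. 17
        return np.minimum(1.0, 10.0 ** (gamma_db / 20.0))                # Eq. 18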
The frequency domain filter 31 applies the linear channel gain γch calculated in the channel gain computation unit 29 to the spectral signal G(k) transformed in the frequency domain conversion unit 22 as the following Eq. 19.
H(k) = γch(i)·G(k), fL(i) ≤ k ≤ fH(i), 0 ≤ i < Nc
H(k) = G(k), 0 ≤ k < fL(0) or fH(Nc-1) < k ≤ M/2    Eq. 19

In Eq. 19, the spectral signal G(k) within the channels is scaled by the linear channel gain to produce H(k), while the bins outside the channels are left unmodified, i.e., H(k) = G(k). The upper half of the spectrum is then completed as the following Eq. 20, so that the magnitude of H(k) is even and its phase is odd.

H(M-k) = H*(k); 0 < k < M/2    Eq. 20

Here, the complex conjugate symmetry is needed so that the inverse DFT of H(k) yields a real-valued signal.
As described above, the background noise estimation unit 30 estimates the noise energy estimation value En (m) of noise signals existing in the current channel and updates the corresponding noise energy estimation value based on the command, i.e., update_flag, received from the noise update decision unit 27.
That is, if the update_flag is true, the background noise estimation unit 30 updates the channel noise estimation value of the next frame as the following Eq. 21.

En(m+1,i) = max{Emin, αn·En(m,i) + (1-αn)·Ech(m,i)}, 0 ≤ i < Nc    Eq. 21

Here, Emin is the minimum channel energy, e.g., 0.0625; and αn is a channel noise smoothing factor, e.g., 0.9. Meanwhile, the noise estimation values of the first 4 frames are initialized by the channel energy estimation values as the following Eq. 22.

En(m,i) = max{Einit, Ech(m,i)}, 1 ≤ m ≤ 4, 0 ≤ i < Nc    Eq. 22
Here, Einit is the minimum initial channel noise energy, e.g., 16.
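A direct reading of Eqs. 21 and 22 (in this sketch the gating by update_flag is left to the caller):

    import numpy as np

    def update_noise_estimate(E_n, E_ch, m, e_min=0.0625, e_init=16.0, alpha_n=0.9):
        """Eq. 22 for the first four frames, Eq. 21 thereafter."""
        if 1 <= m <= 4:
            return np.maximum(e_init, E_ch)                               # Eq. 22
        return np.maximum(e_min, alpha_n * E_n + (1.0 - alpha_n) * E_ch)  # Eq. 21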
The time domain conversion unit 32 converts the noise attenuated speech signals, i.e., the speech signals in the frequency domain inputted through the frequency domain filter 31, into speech signals in the time domain. The time domain conversion process will be described in detail. First, the signals filtered in the frequency domain filter 31 are transformed into time domain signals by the inverse DFT as the following Eq. 23.

h(m,n) = (1/2)·Σ_{k=0}^{M-1} H(k)·e^{j2πnk/M}; 0 ≤ n < M    Eq. 23

Then, overlap-and-add is applied to the result of Eq. 23 as the following Eq. 24.

h'(n) = h(m,n) + h(m-1, n+L), 0 ≤ n < M-L
h'(n) = h(m,n), M-L ≤ n < L    Eq. 24

Finally, de-emphasis is applied to Eq. 24, and the time domain speech signals are outputted as the following Eq. 25.

s'(n+240) = h'(n) + ζd·s'(n+239); 0 ≤ n < L    Eq. 25
Here, ζd is a de-emphasis factor, e.g., 0.8; and s'(n) is an output buffer which can accommodate 320 samples.
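The synthesis chain of Eqs. 23 to 25 can be sketched as follows; the (1/2)·Σ scale of Eq. 23, which compensates the 2/M scale of Eq. 4, is implemented as 0.5·M times numpy's normalized inverse FFT, and the frame-to-frame buffer management is simplified:

    import numpy as np

    L, M = 80, 128

    def to_time_domain(H, h_prev, s_buf, zeta_d=0.8):
        """Eqs. 23-25: inverse DFT, overlap-and-add, recursive de-emphasis.
        h_prev: h(m-1, .) from the previous frame; s_buf: 320-sample output buffer s'(n)."""
        h = 0.5 * M * np.real(np.fft.ifft(H, M))      # Eq. 23 (H is conjugate-symmetric)
        h_ola = h[:L].copy()
        h_ola[:M - L] += h_prev[L:M]                  # Eq. 24: overlap-and-add of first M-L samples
        for n in range(L):                            # Eq. 25: de-emphasis into s'(240..319)
            s_buf[n + 240] = h_ola[n] + zeta_d * s_buf[n + 239]
        return h, s_buf

The returned h is carried into the next call as h_prev, realizing the overlap-and-add of Eq. 24.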
As described above, the noise-attenuated speech signal s'(n) can be obtained in the speech coding/recognition pre-processing block 11. The noise attenuated speech signals s'(n) are inputted into the speech feature vector extraction block 12 of the distributed speech recognition front-end module 100 or the speech coding block 15 of the speech coding module 150, in the speech recognition mode or the speech coding mode, respectively.
Since the frame size of the noise attenuation object signal is 10 ms, as described above for the speech coding/recognition pre-processing block 11, the noise attenuation is performed once every 10 ms. Therefore, the noise attenuated speech signal, the output signal of the speech coding/recognition pre-processing block 11, is s'(n), 240 ≤ n < 320. Of course, it is well-known that the noise attenuated speech signal may be outputted differently according to the frame size of the noise attenuation object signal.
Meanwhile, referring to Fig. 2, the method corresponding to the speech coding/recognition pre-processing block 11 for the speech feature vector extracting module and the speech coding module consists of time-series processes that are well known in the speech signal processing field. Therefore, a detailed description of the method is omitted.
Fig. 3 illustrates a first expanded speech coding/recognition pre-processing block for processing an 11-kHz speech signal in accordance with an embodiment of the present invention; and Fig. 4 illustrates a second expanded speech coding/recognition pre-processing block for processing a 16-kHz speech signal in accordance with an embodiment of the present invention.
Referring to Fig. 2, an 8-kHz user speech signal is the noise attenuation object signal of the speech coding/recognition pre-processing block 11. The present invention further presents a speech coding/recognition pre-processing block for processing an 11-kHz speech signal in Fig. 3 and a speech coding/recognition pre-processing block for processing a 16-kHz speech signal in Fig. 4.
In Fig. 3, the first expanded speech coding/recognition pre-processing block for processing 11 kHz further includes a frequency down sampler 41, placed in front of the speech coding/recognition pre-processing block of Fig. 2, for converting the 11-kHz speech signal into an 8-kHz speech signal. The speech signal down-sampled in the frequency down sampler 41 is inputted into the speech coding/recognition pre-processing block 11.
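A sketch of the frequency down sampler 41; an 11.025-kHz source rate is assumed (common for "11 kHz" audio), for which the 8-kHz target corresponds to the exact rational ratio 320/441; for a literal 11-kHz rate the ratio would be 8/11:

    from scipy.signal import resample_poly

    def downsample_to_8k(x_11k):
        """Rational resampling 11025 Hz -> 8000 Hz (11025 * 320 / 441 = 8000)."""
        return resample_poly(x_11k, up=320, down=441)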
In Fig. 4, the second expanded speech coding/recognition pre-processing block for processing 16 kHz further includes a low pass quadrature-mirror filter (QMF LP) [DEC by 2] 46 and a high pass quadrature-mirror filter (QMF HP) [DEC by 2 and SI] 47 in front of the speech coding/recognition pre-processing block of Fig. 2.
The QMF LP 46 receives the inputted 16-kHz speech signals and outputs 0- to 4-kHz low frequency band signals, and the QMF HP 47 receives the inputted 16-kHz speech signals and outputs 4- to 8-kHz high frequency band signals.
In particular, the low frequency signals outputted from the QMF LP 46 are inputted into the speech coding/recognition pre-processing block, and the high frequency signals outputted from the QMF HP 47 are inputted into the speech feature vector extraction block 12, i.e., the MFCC front-end, of the distributed speech recognition front-end module 100. In the speech feature vector extraction block 12, speech feature vectors, e.g., MFCCs, are extracted from the inputted high frequency signals by using 26 Mel-filter banks.
The low frequency signals outputted from the QMF LP 46 are inputted into the speech feature vector extraction block 12 through the speech coding/recognition pre-processing block. Then, the low frequency signals and the high frequency signals outputted from the QMF HP 47 are combined into one signal in the speech feature vector extraction block 12. That is, before the log filter bank energy is converted into cepstrum coefficients, the high frequency signals and the low frequency signals are added. Moreover, log parameters (log-energy) for all frequency bands are obtained based on the high frequency signals and the low frequency signals.
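A sketch of the QMF analysis pair 46 and 47; the prototype filter design and the post-decimation spectral inversion (suggested by the "SI" label in Fig. 4) are assumptions for illustration, not taken from the patent:

    import numpy as np
    from scipy.signal import firwin, lfilter

    def qmf_split(x_16k, num_taps=64):
        """Split a 16-kHz signal into 0-4 kHz and 4-8 kHz branches, each decimated to 8 kHz."""
        h_lp = firwin(num_taps, 0.5)                   # half-band low-pass prototype
        h_hp = h_lp * (-1.0) ** np.arange(num_taps)    # quadrature mirror: H_hp(z) = H_lp(-z)
        low = lfilter(h_lp, 1.0, x_16k)[::2]           # QMF LP 46 output
        high = lfilter(h_hp, 1.0, x_16k)[::2]          # QMF HP 47 output
        high *= (-1.0) ** np.arange(len(high))         # spectral inversion ("SI") of the high band
        return low, high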
In addition, the expanded speech coding/recognition pre-processing blocks of Figs. 3 and 4 can be implemented according to the frequency expansion specification of the European Telecommunications Standards Institute (ETSI) DSR standard (ETSI ES 202 050 v1.1.3) in order to use 11-kHz or 16-kHz sampling frequency signals.
Fig. 5 is a flowchart illustrating a method for speech recognition in accordance with an embodiment of the present invention; Fig. 6 is a flowchart illustrating a training process for generating an acoustic model in accordance with an embodiment of the present invention; and Fig. 7 is a graph showing speech recognition performance based on a speech feature vector in accordance with an embodiment of the present invention.
The present invention can be applied to the distributed speech recognition terminal, e.g., a mobile phone, and its effect on the speech recognition performance needs to be verified.
Below, the speech recognition performance of the present invention will be examined based on the speech recognition process and the training process for generating the acoustic model. Fig. 5 shows the speech recognition process based on a hidden Markov model (HMM). Speech features are extracted from the speech spoken by the user 301, and then pattern matching 302 is performed by searching an acoustic model 303, a language model 304 and a pronunciation dictionary 305 according to the extracted speech features. A word or a sentence is recognized in response to the speech. A method suggested in the ETSI standard "ETSI ES 201 108" is used for the speech feature extraction 301. That is, the speech features are extracted from the speech signal through MFCC, the speech feature vector is formed as high-order coefficients, and a word stream having maximum probability is searched through pattern matching based on the acoustic model 303, the language model 304 and the pronunciation dictionary 305 in response to the speech feature vector. Here, for verifying the performance of the conventional speech recognition, a signal noise-attenuated by the pre-processing defined in the ETSI DSR standard, i.e., ETSI ES 202 050 v1.1.3, or a signal noise-attenuated by the pre-processing steps defined in IS-127 is used as the speech signal for extracting the speech characteristics.
Meanwhile, for verifying the performance of the present speech recognition, the noise attenuated signal outputted from the speech coding/recognition pre-processing block 11 is used as the speech signal for extracting the speech characteristics.
In the extraction of the speech characteristics, 13 MFCCs (c0, ..., c12) and log-energy are extracted by using the MFCC front-end module. The MFCCs, the log-energy, and their delta and delta-delta coefficients are used as parameters for the training of the acoustic model and for speech recognition.
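The delta and delta-delta appending can be sketched with the standard regression formula; the window size K = 2 is an assumption, as the patent does not specify it:

    import numpy as np

    def add_deltas(static, K=2):
        """static: (T, 13) array of MFCCs + log-energy per frame; returns (T, 39) features."""
        T = len(static)
        denom = 2.0 * sum(k * k for k in range(1, K + 1))
        def regress(x):
            pad = np.pad(x, ((K, K), (0, 0)), mode='edge')
            return sum(k * (pad[K + k:K + k + T] - pad[K - k:K - k + T])
                       for k in range(1, K + 1)) / denom
        delta = regress(static)
        return np.hstack([static, delta, regress(delta)])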
Moreover, HMM is used for the acoustic model 303.
In the present invention, a phone model in accordance with the language is used as the acoustic model. The training process for generating the context independent phone model will be described with reference to Fig. 6. First, a monophone-based model, as a context independent phone model, is generated based on the speech feature vectors extracted from training data at step S401.
Subsequently, forced alignment is performed based on the monophone-based model, so that a phone label file is newly generated at step S402.
Meanwhile, a triphone-based model, as a context dependent phone model, is generated by expanding the monophone-based model at step S403. Then, state-tying is performed, considering that the training data for the triphone-based model are small, at step S404.
Then, a final acoustic model is generated by increasing the number of mixture densities of a result acoustic model acquired by performing the state tying at step S405.
The language model 304 shown in Fig. 5 adopts a statistical estimation method. Here, the statistical estimation method statistically estimates the probability of available word sequences from a speech database in a predetermined environment. A representative language model adopting the statistical estimation method is the n-gram. In the n-gram, the probability of a word sequence is approximated by the product of conditional probabilities, each conditioned on the preceding words. In Fig. 5, a bigram language model is used.
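As a toy illustration of the bigram approximation (the add-k smoothing is an assumption; the patent specifies only that a bigram model is used):

    import math

    def bigram_logprob(sentence, unigram_counts, bigram_counts, vocab_size, k=1.0):
        """log P(w1..wn) ~ sum_i log P(w_i | w_{i-1}), with add-k smoothed bigram estimates."""
        words = ["<s>"] + sentence.split()
        logp = 0.0
        for prev, cur in zip(words, words[1:]):
            logp += math.log((bigram_counts.get((prev, cur), 0) + k) /
                             (unigram_counts.get(prev, 0) + k * vocab_size))
        return logp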
With respect to the pronunciation dictionary 305, the pronunciation dictionary provided by "CleanSent01" of SiTEC is used for Korean, and the "CMU dictionary V.0.6" provided by Carnegie Mellon University is used for English. In addition, pronunciations of phrasal words that are not supported by "CleanSent01" are supported by a pronunciation converter produced for this purpose based on the "standard pronunciation method of the standard language rule." Here, a phrasal word is composed of a word and an auxiliary word. The total number of phrasal words of the pronunciation dictionary provided by "CleanSent01" is 36,104, and the total number of phrasal words of the pronunciation dictionary for speech recognition is 223,857.
With respect to the speech DB, a sentence speech DB (e.g., CleanSent01) is used in the case of Korean, and the AURORA 4 DB (e.g., Wall Street Journal) is used in the case of English.
5000 sentences among the text data used in training and 3000 sentences among the 'speech recognition language model usage text DB' may be used for generating the language model. The Hidden Markov Model Toolkit (HTK) v3.1 is used to generate the language model, and the finally acquired language model includes a network of 31,582 words. In the speech recognition process, the word recognition rate using the conventional noise attenuated speech signal is 68.61%, and the word recognition rate using the noise attenuated speech signal in accordance with the present invention is 69.31%, referring to Fig. 7. That is, the speech recognition performance of the present invention is improved over that of the conventional method.
In short, a noise-robust speech feature vector can be extracted by sharing the speech coding pre-processing and the speech feature vector extracting pre-processing in a simple-structured terminal. Therefore, the speech recognition performance is improved with a small amount of memory and computation in the simple-structured terminal.
The above described method according to the present invention can be embodied as a program and stored on a computer readable recording medium. The computer readable recording medium is any data storage device that can store data which can be read by the computer system. The computer readable recording medium includes a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a floppy disk, a hard disk and an optical magnetic disk.
While the present invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

WHAT IS CLAIMED IS:
1. An apparatus for extracting a noise-robust speech feature vector by sharing a pre-processing step of speech coding in a distributed speech coding/recognition terminal, comprising: a high pass filter for eliminating low frequency signals from input speech signals; a frequency domain conversion unit for converting the signals without the low frequency signals into spectral signals in a frequency domain; a channel energy estimation unit for calculating a channel energy estimation value of the spectral signals of a current frame; a channel signal-to-noise ratio (SNR) estimation unit for estimating a channel SNR of the speech signals based on the channel energy estimation value acquired in the channel energy estimation unit and a background noise energy estimation value acquired in a background noise energy estimation unit; the background noise energy estimation unit for updating the background noise energy estimation value of the speech signals based on a command from a noise update decision unit; a voice metric calculation unit for acquiring a sum of voice metrics in a current channel based on the channel SNR; a spectral deviation estimation unit for estimating a spectral deviation of the speech signals based on the channel energy estimation value; the noise update decision unit for commanding to update the noise estimation value based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value estimated in the spectral deviation estimation unit; a channel SNR modifying unit for modifying the channel SNR estimated in the channel SNR estimation unit based on the sum of voice metrics acquired in the voice metric calculation unit; a channel gain computation unit for acquiring a linear channel gain based on the modified channel SNR from the channel SNR modifying unit and the background noise energy estimation value obtained in the background noise energy estimation unit; a frequency domain filter for applying the linear channel gain to the spectral signals converted in the frequency domain conversion unit; and a time domain conversion unit for converting the linear-channel-gain-applied spectral signals into speech signals in a time domain.
2. The apparatus of claim 1, wherein the speech signals outputted from the time domain conversion unit are noise-attenuated, and inputted into a speech feature vector extraction block of a speech feature vector extracting module or a speech coding block of a speech coding module.
3. The apparatus of claim 1, wherein a frame size of the filtered signals outputted from the high pass filter is 10 ms .
4. The apparatus of claim 1, wherein the frequency domain conversion unit converts the inputted signals into the spectral signals in the frequency domain by using a smoothed trapezoidal window.
5. The apparatus of claim 1, wherein an initialization of the channel energy estimation value with an un-filtered channel energy value of a first frame is permitted when a channel energy smoothing factor of the first frame is 0.
6. The apparatus of claim 1, wherein the SNR of the speech signals includes quantized channel SNR indices as the following Equation:

σq(i) = max{0, min{89, round(10·log10(Ech(m,i)/En(m,i))/0.375)}}, 0 ≤ i < Nc
7. A method for extracting a noise-robust speech feature vector by sharing a pre-processing step of speech coding in a distributed speech coding/recognition terminal, comprising the steps of: eliminating low frequency signals of speech signals received from outside; converting the signals without the low frequency signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; making a noise estimation value updating command based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value; when the noise estimation value updating command is received, updating a background noise energy estimation value; estimating a channel signal-to-noise ratio (SNR) of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the linear-channel-gain-applied spectral signals into time domain speech signals.
8. The method of claim 7, wherein the spectral deviation estimation step includes the steps of: calculating a log power spectrum estimation value of the speech signals in a current channel based on the channel energy estimation value; calculating the difference value between the current power spectrum estimation value and the average long-term power spectrum estimation value; calculating the total channel energy estimation value of a current frame based on the channel energy estimation value; calculating an exponential window function factor based on the total channel energy estimation value; and updating the average long-term power spectrum estimation value of the next frame based on the exponential window function factor and an initial value of the power spectrum estimation value.
9. The method of claim 8, wherein the average long-term power spectrum estimation value is initialized by the estimation value of the log power spectrum of a first frame.
10. The method of claim 7, wherein a background noise estimation value updating parameter of the noise estimation value updating step includes the total channel energy estimation value and the difference value between the current power spectrum estimation value and the average long-term power spectrum estimation value calculated in the spectral deviation estimating step.
11. The method of claim 7, wherein the linear channel gain obtaining step includes the steps of: calculating a total gain factor of the current frame in the current channel of the speech signals; and calculating a channel gain of the current channel of the speech signals.
12. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; and a speech coding/recognition pre-processing block for attenuating a noise in speech signals received from the outside, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
13. The terminal of claim 12, wherein the frequency of the speech signals inputted into the speech coding/recognition pre-processing block is 8 kHz.
14. The terminal of claim 12, wherein the speech feature vector extracting module includes: a speech feature vector extraction unit for extracting speech feature vectors of the speech signals pre-processed in the speech coding/recognition pre-processing block; a first compression unit for compressing the speech feature vectors extracted in the speech feature vector extraction unit; and a first bit stream transmission unit for transmitting bit stream data loading the speech feature vectors compressed in the first compression unit to the outside.
15. The terminal of claim 12, wherein the speech coding module includes: a speech coding unit for coding the speech signals pre-processed in the speech coding/recognition pre-processing block; a second compression unit for compressing the speech signals coded in the speech coding unit; and a second bit stream transmission unit for transmitting bit stream data loading the speech signals compressed in the second compression unit to the outside.
16. The terminal of claim 12, further comprising: a switch for shifting between a speech coding mode and a speech recognition mode.
17. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a frequency down-sampler for down-sampling speech signals received from the outside; and a speech coding/recognition pre-processing block for attenuating a noise in the speech signals down-sampled in the frequency down-sampler, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
18. The terminal of claim 17, wherein the frequency of the speech signals inputted into the frequency down-sampler is 11 kHz.
19. A distributed speech recognition terminal, comprising: a speech coding module for transmitting coded speech signals to the outside through a speech traffic channel in a speech coding mode; a speech feature vector extracting module for transmitting extracted speech feature vectors to the outside in a speech recognition mode; a low pass quadrature mirror filter for passing low frequency signals of speech signals received from the outside; a high pass quadrature mirror filter for passing high frequency signals of the speech signals; and a speech coding/recognition pre-processing block for attenuating a noise in the low frequency signals passed through the low pass quadrature mirror filter, wherein the speech signals inputted into the speech coding module and the speech feature vector extracting module are pre-processed in the speech coding/recognition pre-processing block.
20. The terminal of claim 19, wherein the frequency of the speech signals inputted into the low pass quadrature mirror filter and the high pass quadrature mirror filter is 16 kHz.
21. The terminal of claim 20, wherein the low pass quadrature mirror filter receives 16-kHz speech signals and outputs 0- to 4-kHz signals.
22. The terminal of claim 20, wherein the high pass quadrature mirror filter receives 16-kHz speech signals and outputs 4- to 8-kHz signals.
23. The terminal of claim 19, wherein the low frequency signals outputted from the low pass quadrature mirror filter are inputted into the speech coding/recognition pre-processing block, and the high frequency signals outputted from the high pass quadrature mirror filter are inputted into the speech feature vector extracting module.
24. The terminal of claim 23, wherein the low frequency signals and the high frequency signals are combined in the speech feature vector extracting module before a log filter bank energy is transformed into a cepstrum coefficient.
25. A computer-readable recording medium for recording a program that implements a method in a terminal having a processor, the method comprising the steps of: eliminating low frequency signals of speech signals received from outside; converting the signals without low frequency signals into spectral signals in a frequency domain; obtaining a channel energy estimation value of the spectral signals of a current frame; estimating a spectral deviation of the speech signals based on the obtained channel energy estimation value; making a noise estimation value updating command based on a total channel energy estimation value and a difference value between a current power spectrum estimation value and an average long-term power spectrum estimation value; when the noise estimation value updating command is received, updating the background noise energy estimation value; estimating a channel signal-to-noise ratio (SNR) of the speech signals based on the channel energy estimation value and the background noise energy estimation value; calculating a sum of voice metrics of the speech signals based on the channel SNR; modifying the channel SNR based on the sum of voice metrics; obtaining a linear channel gain based on the modified channel SNR and the background noise energy estimation value; applying the linear channel gain to the spectral signals; and converting the linear channel gain applied spectral signals into time domain speech signals.
PCT/KR2006/005831 2006-06-30 2006-12-28 Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding WO2008001991A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0061150 2006-06-30
KR1020060061150A KR100794140B1 (en) 2006-06-30 2006-06-30 Apparatus and Method for extracting noise-robust the speech recognition vector sharing the preprocessing step used in speech coding

Publications (1)

Publication Number Publication Date
WO2008001991A1 true WO2008001991A1 (en) 2008-01-03

Family

ID=38845730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/005831 WO2008001991A1 (en) 2006-06-30 2006-12-28 Apparatus and method for extracting noise-robust speech recognition vector by sharing preprocessing step used in speech coding

Country Status (2)

Country Link
KR (1) KR100794140B1 (en)
WO (1) WO2008001991A1 (en)


Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
KR101684554B1 (en) * 2015-08-20 2016-12-08 현대자동차 주식회사 Voice dialing system and method

Citations (4)

Publication number Priority date Publication date Assignee Title
JPH1097296A (en) * 1996-09-20 1998-04-14 Sony Corp Method and device for voice coding, and method and device for voice decoding
US5956683A (en) * 1993-12-22 1999-09-21 Qualcomm Incorporated Distributed voice recognition system
WO2000046794A1 (en) * 1999-02-08 2000-08-10 Qualcomm Incorporated Distributed voice recognition system
WO2003094152A1 (en) * 2002-04-30 2003-11-13 Qualcomm Incorporated Distributed voice recognition system utilizing multistream feature processing

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
KR100841096B1 (en) * 2002-10-14 2008-06-25 리얼네트웍스아시아퍼시픽 주식회사 Preprocessing of digital audio data for mobile speech codecs
KR100754439B1 (en) * 2003-01-09 2007-08-31 와이더댄 주식회사 Preprocessing of Digital Audio data for Improving Perceptual Sound Quality on a Mobile Phone
US20040260540A1 (en) 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
KR100636317B1 (en) * 2004-09-06 2006-10-18 삼성전자주식회사 Distributed Speech Recognition System and method
KR100592926B1 (en) * 2004-12-08 2006-06-26 주식회사 라이브젠 digital audio signal preprocessing method for mobile telecommunication terminal
JP2007097070A (en) * 2005-09-30 2007-04-12 Fujitsu Ten Ltd Structure for attaching speaker unit

Cited By (2)

Publication number Priority date Publication date Assignee Title
US20150154964A1 (en) * 2013-12-03 2015-06-04 Google Inc. Multi-path audio processing
US9449602B2 (en) * 2013-12-03 2016-09-20 Google Inc. Dual uplink pre-processing paths for machine and human listening

Also Published As

Publication number Publication date
KR20080002359A (en) 2008-01-04
KR100794140B1 (en) 2008-01-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06835531

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1), EPO FORM 1205A SENT ON 15/04/09.

122 Ep: pct application non-entry in european phase

Ref document number: 06835531

Country of ref document: EP

Kind code of ref document: A1