KR100794140B1

KR100794140B1 - Apparatus and Method for extracting noise-robust the speech recognition vector sharing the preprocessing step used in speech coding

Info

Publication number: KR100794140B1
Application number: KR1020060061150A
Authority: KR
Inventors: 류창선; 김재인; 김홍국; 윤재삼; 오유리
Original assignee: 주식회사 케이티
Priority date: 2006-06-30
Filing date: 2006-06-30
Publication date: 2008-01-10
Also published as: WO2008001991A1; KR20080002359A

Abstract

1. Technical field of the claimed invention

The present invention relates to an apparatus and method for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder.

2. Technical problem to be solved by the invention

The object of the present invention is to provide an apparatus and method for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder. In a terminal equipped with a user speech coding function, such as a mobile phone, the speech coding preprocessing and the speech feature vector extraction preprocessing are shared when extracting speech feature vectors (speech feature parameters) for speech recognition, so that the extracted vectors are robust to noise caused by the communication environment and the like.

3. Summary of the solution

The present invention provides an apparatus for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the apparatus comprising: a high pass filter that removes the low-frequency band of an externally input speech signal; a frequency domain converter that converts the signal from which the low-frequency band has been removed by the high pass filter into a spectral signal in the frequency domain; a channel energy estimator that calculates a channel energy estimate for the current frame of the spectral signal produced by the frequency domain converter; a channel signal-to-noise ratio (SNR) estimator that estimates the channel SNR of the speech signal based on the channel energy estimate calculated by the channel energy estimator and the background noise energy estimate calculated by a background noise estimator; the background noise estimator, which calculates and updates the background noise energy estimate for the speech signal in response to a command from a noise update determiner; a voice metric calculator that calculates the sum of the voice metrics over the channels of the speech signal based on the channel SNR estimated by the channel SNR estimator; a spectral deviation estimator that estimates the spectral deviation of the speech signal based on the channel energy estimate calculated by the channel energy estimator; the noise update determiner, which issues a noise estimate update command based on the total channel energy estimate produced by the spectral deviation estimator and on the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum; a channel SNR estimate modifier that modifies the channel SNR estimated by the channel SNR estimator based on the sum of voice metrics calculated by the voice metric calculator; a channel gain calculator that calculates a linear channel gain based on the channel SNR modified by the channel SNR estimate modifier and the background noise energy estimate calculated by the background noise estimator; a frequency domain filter that applies the linear channel gain calculated by the channel gain calculator to the spectral signal produced by the frequency domain converter; and a time domain converter that converts the spectral signal to which the linear channel gain has been applied by the frequency domain filter into a speech signal in the time domain.

4. Important uses of the invention

The present invention is used in the field of distributed speech recognition and related fields.

Distributed speech recognition system, terminal, speech coding preprocessing, speech coding, speech recognition, preprocessing sharing, noise attenuation, speech feature vector extraction

Description

Apparatus and Method for extracting noise-robust the speech recognition vector sharing the preprocessing step used in speech coding

FIG. 1 is a block diagram of an embodiment of an apparatus according to the present invention for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder.

FIG. 2 is a detailed block diagram of an embodiment of the speech coding/recognition preprocessor of FIG. 1.

FIG. 3A is a block diagram of an embodiment of the speech coding/recognition preprocessor extended for processing 11 kHz speech signals according to the present invention.

FIG. 3B is a block diagram of an embodiment of the speech coding/recognition preprocessor extended for processing 16 kHz speech signals according to the present invention.

FIG. 4 is a diagram illustrating an embodiment of the speech recognition process used in the present invention.

FIG. 5 is a flowchart of an embodiment of the training process for generating the acoustic model in FIG. 4.

FIG. 6 is a graph showing speech recognition performance using speech feature vectors extracted by applying the present invention.

* Explanation of reference numerals for the main parts of the drawings

100: speech feature vector extraction module
150: speech coding module
11: speech coding/recognition preprocessor
12: speech feature vector extraction unit
13: speech compression unit
14: bit stream transmission unit
21: high pass filter
22: frequency domain converter
23: channel energy estimator
24: channel SNR estimator
25: voice metric calculator
26: spectral deviation estimator
27: noise update determiner
28: channel SNR estimate modifier
29: channel gain calculator
30: background noise estimator
31: frequency domain filter
32: time domain converter

The present invention relates to an apparatus and method for extracting speech feature vectors in a distributed speech recognition terminal. More particularly, it relates to an apparatus and method that, in a terminal equipped with a user speech coding function such as a mobile phone, share the speech coding preprocessing and the speech feature vector extraction preprocessing when extracting speech feature vectors (speech feature parameters) for speech recognition, and thereby extract speech feature vectors that are robust to noise caused by the communication environment and the like.

Distributed speech recognition (DSR) is a technology that allows speech recognition to be implemented even on low-specification terminals such as mobile phones. It is a two-part processing system: the low-specification terminal identifies the features of the speech signal, and a high-specification speech recognition server performs speech recognition based on the speech features provided by the terminal.

In general, MFCCs (Mel-Frequency Cepstral Coefficients) are widely used in this field of speech recognition. An MFCC represents the shape of the frequency spectrum on the mel scale in terms of sinusoidal components, and serves as the speech feature vector (also called the speech feature parameter) that represents the speech input by the user.
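As a rough illustration of the idea above, the following Python sketch converts a frequency to the mel scale and turns log mel-band energies into cepstral coefficients with a cosine transform. The mel formula is the commonly used 2595·log10(1 + f/700) form; the function names and the choice of a DCT-II are assumptions made for illustration and are not specified in this document.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def cepstrum_from_log_mel(log_mel_energies: list[float], num_coeffs: int) -> list[float]:
    """Apply an unnormalized DCT-II to log mel-band energies.

    Each output coefficient is the projection of the log spectrum onto one
    cosine (sinusoidal) component, which is what a cepstral coefficient is.
    """
    n = len(log_mel_energies)
    return [
        sum(
            energy * math.cos(math.pi * k * (i + 0.5) / n)
            for i, energy in enumerate(log_mel_energies)
        )
        for k in range(num_coeffs)
    ]
```

For a perfectly flat log spectrum, only the zeroth coefficient is non-zero, which matches the intuition that higher coefficients describe the shape of the spectrum rather than its overall level.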

On the terminal side, a speech feature vector is extracted as MFCCs from the speech input by the user and, to put it in a form suitable for transmission over the communication network, is carried in a bit stream and transmitted to the speech recognition server. For example, from among the centroid vectors of a codebook with a finite number of entries, the vector closest to the MFCC of the input speech is selected and transmitted as bit stream data. Here, each codebook entry is the centroid of a cluster of similar values of user speech; typically, training data (training MFCC vectors) are extracted from a large amount of speech data, and the most representative values among the extracted training data are selected.
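The codebook lookup described above is a nearest-centroid search. A minimal sketch follows, assuming squared Euclidean distance and a codebook held as a plain list of vectors; the function names and the toy codebook in the usage note are hypothetical and not taken from this document.

```python
def nearest_codeword(vector: list[float], codebook: list[list[float]]) -> int:
    """Return the index of the codebook centroid closest to `vector`.

    Only this index needs to be placed in the bit stream.
    """
    best_index = 0
    best_distance = float("inf")
    for index, centroid in enumerate(codebook):
        # Squared Euclidean distance (an assumed distance measure).
        distance = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def dequantize(index: int, codebook: list[list[float]]) -> list[float]:
    """Receiver side: recover the centroid vector for a received index."""
    return codebook[index]
```

With a toy codebook `[[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]`, the vector `[0.9, 1.2]` maps to index 1, and the receiver reconstructs `[1.0, 1.0]` from that index using the same codebook.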

On the speech recognition server side, the speech feature vector carried in the bit stream received from the terminal is dequantized, and the word corresponding to the speech is recognized using an HMM (Hidden Markov Model) as the speech model. Here, the HMM models a basic unit of speech recognition, for example a phoneme; words and sentences are formed by combining the phonemes entering the speech recognition engine with the phonemes registered in the engine's database.

Meanwhile, in line with the recent trend toward digital convergence, the mobile phone has attracted attention as a distributed speech recognition terminal. Such a phone already carries its native user speech signal processing function, for example a speech coder.

As described above, extracting speech feature vectors from the user's speech requires preprocessing of the user speech signal, in particular noise attenuation. In a typical mobile phone, however, the preprocessing of the user speech signal for voice calls and the preprocessing for speech recognition are performed separately.

For example, although voice calls and speech recognition both process the same user speech signal, the mobile phone handles them with separate preprocessing steps, and in particular with separate preprocessing devices. This demands additional memory and computation on a low-specification mobile phone and wastes the phone's resources.

In addition, speech preprocessing such as speech coding incurs delay inside the terminal, which causes a switching delay between the voice call process and the speech recognition process on the phone. For example, if a call comes in while the user is using the speech recognition function on the phone, there is a delay before the call can be answered.

The preprocessing for voice calls and the preprocessing for speech recognition in a conventional mobile phone are described below.

A conventional mobile phone is equipped with a speech coding module (speech coder) and a speech feature vector extraction module (distributed speech recognition front-end module).

The speech coding module consists of a speech coding preprocessor (pre-processing for speech coding), a model parameter estimator (model parameter estimation), a speech compressor (compression), and a bit stream transmitter (bit stream).

The speech feature vector extraction module consists of a speech recognition preprocessor (pre-processing for speech recognition), a speech feature vector extractor (MFCC front-end), a speech compressor (compression), and a bit stream transmitter (bit stream).

As described above, the conventional approach mounts a speech coding preprocessor and a speech recognition preprocessor separately in the mobile phone as components for attenuating noise mixed into the user speech signal. This is because the preprocessed signals used in the speech coding process and in the speech feature vector extraction process differ from each other.

Therefore, considering that the speech coding preprocessor and the speech recognition preprocessor perform the same function, there is a strong need for a technique that shares the preprocessing between the speech coding process and the speech feature vector extraction process and handles it in an integrated manner.

The present invention has been proposed to solve the above problems and meet the above need. Its object is to provide an apparatus and method for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder: in a terminal equipped with a user speech coding function, such as a mobile phone, the speech coding preprocessing and the speech feature vector extraction preprocessing are shared when extracting speech feature vectors (speech feature parameters) for speech recognition, so that the extracted vectors are robust to noise caused by the communication environment and the like.

To achieve the above object, an apparatus of the present invention is an apparatus for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the apparatus comprising: a high pass filter that removes the low-frequency band of an externally input speech signal; a frequency domain converter that converts the signal from which the low-frequency band has been removed by the high pass filter into a spectral signal in the frequency domain; a channel energy estimator that calculates a channel energy estimate for the current frame of the spectral signal produced by the frequency domain converter; a channel signal-to-noise ratio (SNR) estimator that estimates the channel SNR of the speech signal based on the channel energy estimate calculated by the channel energy estimator and the background noise energy estimate calculated by a background noise estimator; the background noise estimator, which calculates and updates the background noise energy estimate for the speech signal in response to a command from a noise update determiner; a voice metric calculator that calculates the sum of the voice metrics over the channels of the speech signal based on the channel SNR estimated by the channel SNR estimator; a spectral deviation estimator that estimates the spectral deviation of the speech signal based on the channel energy estimate calculated by the channel energy estimator; the noise update determiner, which issues a noise estimate update command based on the total channel energy estimate produced by the spectral deviation estimator and on the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum; a channel SNR estimate modifier that modifies the channel SNR estimated by the channel SNR estimator based on the sum of voice metrics calculated by the voice metric calculator; a channel gain calculator that calculates a linear channel gain based on the channel SNR modified by the channel SNR estimate modifier and the background noise energy estimate calculated by the background noise estimator; a frequency domain filter that applies the linear channel gain calculated by the channel gain calculator to the spectral signal produced by the frequency domain converter; and a time domain converter that converts the spectral signal to which the linear channel gain has been applied by the frequency domain filter into a speech signal in the time domain.

Another apparatus of the present invention is an apparatus for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the apparatus comprising: speech coding means for causing an encoded speech signal to be transmitted externally over a voice traffic channel in voice call mode; speech feature vector extraction means for causing an extracted user speech feature vector to be transmitted externally in speech recognition mode; and a speech coding/recognition preprocessor for attenuating noise mixed into an externally input speech signal, wherein the speech signal input to the speech coding means and the speech signal input to the speech feature vector extraction means are both preprocessed by the speech coding/recognition preprocessor.

Yet another apparatus of the present invention is an apparatus for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the apparatus comprising: speech coding means for causing an encoded speech signal to be transmitted externally over a voice traffic channel in voice call mode; speech feature vector extraction means for causing an extracted user speech feature vector to be transmitted externally in speech recognition mode; a frequency down-sampler for down-sampling the frequency band of an externally input speech signal; and a speech coding/recognition preprocessor for attenuating noise mixed into the speech signal down-sampled by the frequency down-sampler, wherein the speech signal input to the speech coding means and the speech signal input to the speech feature vector extraction means are both preprocessed by the speech coding/recognition preprocessor.

Yet another apparatus of the present invention is an apparatus for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the apparatus comprising: speech coding means for causing an encoded speech signal to be transmitted externally over a voice traffic channel in voice call mode; speech feature vector extraction means for causing an extracted user speech feature vector to be transmitted externally in speech recognition mode; a low-pass quadrature mirror filter for passing the low-frequency band of an externally input speech signal; a high-pass quadrature mirror filter for passing the high-frequency band of the externally input speech signal; and a speech coding/recognition preprocessor for attenuating noise mixed into the speech signal passed by the low-pass quadrature mirror filter, wherein the speech signal input to the speech coding means and the speech signal input to the speech feature vector extraction means are both preprocessed by the speech coding/recognition preprocessor.

A method of the present invention is a method for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder, the method comprising: removing the low-frequency band of an externally input speech signal; converting the signal from which the low-frequency band has been removed into a spectral signal in the frequency domain; calculating a channel energy estimate for the current frame of the converted spectral signal; estimating the spectral deviation of the speech signal based on the calculated channel energy estimate; issuing a noise estimate update command based on the estimated total channel energy estimate and on the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum; upon receiving the noise estimate update command, calculating and updating the background noise energy estimate for the speech signal; estimating the channel signal-to-noise ratio (SNR) of the speech signal based on the calculated channel energy estimate and the background noise energy estimate; calculating the sum of the voice metrics over the channels of the speech signal based on the estimated channel SNR; modifying the estimated channel SNR based on the calculated sum of voice metrics; calculating a linear channel gain based on the modified channel SNR and the calculated background noise energy estimate; applying the calculated linear channel gain to the converted spectral signal; and converting the spectral signal to which the linear channel gain has been applied into a speech signal in the time domain.
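The SNR, voice metric, and gain steps of the method can be sketched as follows. This is only an illustration of the data flow: the text here gives no formulas, so the voice metric mapping, the SNR modification rule, the gain law, and every threshold and default value below are hypothetical assumptions.

```python
import math

EPS = 1e-10  # guards against log10(0) and division by zero


def channel_snr_db(channel_energy, noise_energy):
    """Per-channel SNR in dB from channel and background noise energies."""
    return [
        10.0 * math.log10(max(e, EPS) / max(n, EPS))
        for e, n in zip(channel_energy, noise_energy)
    ]


def voice_metric_sum(snr_db):
    """Hypothetical voice metric: each channel SNR clamped to [0, 20] dB, summed."""
    return sum(min(max(s, 0.0), 20.0) for s in snr_db)


def modify_snr(snr_db, metric_sum, metric_threshold=10.0, snr_floor_db=1.0):
    """Hypothetical rule: treat a low-metric frame as noise and flatten its SNRs."""
    if metric_sum <= metric_threshold:
        return [snr_floor_db] * len(snr_db)
    return [max(s, snr_floor_db) for s in snr_db]


def linear_channel_gains(snr_db, noise_energy, min_gain_db=-13.0,
                         slope=0.39, snr_ref_db=6.0):
    """Hypothetical gain law: a noise-dependent offset plus an SNR-dependent term."""
    total_noise_db = 10.0 * math.log10(max(sum(noise_energy), EPS))
    # More total noise means deeper overall attenuation, limited by min_gain_db.
    overall_db = min(0.0, max(min_gain_db, -total_noise_db))
    gains = []
    for s in snr_db:
        gain_db = min(0.0, slope * (s - snr_ref_db) + overall_db)  # never amplify
        gains.append(10.0 ** (gain_db / 20.0))  # dB to linear amplitude gain
    return gains
```

Under these assumptions, a channel whose energy is well above the noise estimate passes with unit gain, while a channel at the noise level is attenuated; the frequency domain filter would then multiply each channel's spectral bins by its gain before the conversion back to the time domain.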

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known techniques are omitted where they would unnecessarily obscure the gist of the invention. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of an embodiment of an apparatus according to the present invention for extracting noise-robust speech feature vectors in a distributed speech recognition terminal by sharing the preprocessing of the speech coder (hereinafter, the "speech feature vector extraction apparatus").

A distributed speech recognition terminal (e.g., a mobile phone) equipped with the speech feature vector extraction apparatus according to the present invention carries a speech coding module (speech coder) 150 and a speech feature vector extraction module (distributed speech recognition front-end module) 100. As shown in FIG. 1, the speech coding preprocessor of the speech coding module and the speech recognition preprocessor of the speech feature vector extraction module are shared as a single speech coding/recognition preprocessor 11.

That is, in the mobile phone according to the present invention, the speech feature vector extraction module 100 consists of the speech coding/recognition preprocessor 11, a speech feature vector extraction unit (MFCC front-end) 12, a speech compression unit (compression) 13, and a bit stream transmission unit (bit stream) 14.

Likewise, in the mobile phone according to the present invention, the speech coding module 150 consists of the speech coding/recognition preprocessor 11, a speech coder 15, a speech compression unit (compression) 16, and a bit stream transmission unit (bit stream) 17.

The mobile phone according to the present invention is also equipped with a switch 50 for transitioning between voice call mode and speech recognition mode. Depending on the operation of the switch 50, in voice call mode the encoded user speech signal is transmitted to the mobile communication system over a voice traffic channel, and in speech recognition mode the extracted user speech feature vector is transmitted to the speech recognition server over a wireless data channel.
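The role of the switch 50 can be sketched as a dispatch that always runs the shared preprocessor first and then routes the cleaned frame according to the current mode. The names `Mode` and `process_frame`, the callables passed in, and the channel labels are hypothetical; this is a sketch of the data flow, not the terminal's actual software interface.

```python
from enum import Enum


class Mode(Enum):
    VOICE_CALL = "voice_call"
    SPEECH_RECOGNITION = "speech_recognition"


def process_frame(frame, mode, preprocess, encode_speech, extract_features):
    """Run the shared preprocessor once, then route by mode.

    Returns a (channel_name, payload) pair naming the outgoing channel.
    """
    clean = preprocess(frame)  # shared speech coding/recognition preprocessor (11)
    if mode is Mode.VOICE_CALL:
        # Speech coding path (150): payload goes out on the voice traffic channel.
        return ("voice_traffic_channel", encode_speech(clean))
    # Feature extraction path (100): payload goes out on the wireless data channel.
    return ("wireless_data_channel", extract_features(clean))
```

Because both branches consume the same preprocessed frame, switching modes changes only the routing after the preprocessor rather than which preprocessor runs.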

In particular, the speech coding/recognition preprocessor 11 attenuates noise mixed into the user speech signal (input speech, 8 kHz). In the present invention, the speech feature vector extraction module is not given a separate noise attenuation device; the noise attenuation device used for speech coding preprocessing is used in its place.

For example, so that the speech feature vector extraction module can extract noise-robust speech feature vectors (MFCCs), the speech coding/recognition preprocessor 11 performs the noise attenuation, and it is implemented to a specification that can carry out both the signal preprocessing for voice calls and the signal preprocessing for speech recognition.

The speech coding/recognition preprocessor 11 proposed in the present invention is described in detail below with reference to FIG. 2. Detailed descriptions of the components denoted by reference numerals 12, 13, 14, 15, 16, and 17 in FIG. 1, which follow known art, are omitted.

FIG. 2 is a detailed block diagram of an embodiment of the speech coding/recognition preprocessor of FIG. 1.

As shown in FIG. 2, the speech coding/recognition preprocessor 11 according to the present invention includes a high-pass filter 21, a frequency domain converter 22, a channel energy estimator 23, a channel SNR estimator 24, a voice metric calculator 25, a spectral deviation estimator 26, a noise update determiner 27 (also called the background noise estimate updater), a channel SNR modifier 28, a channel gain calculator 29, a background noise estimator 30, a frequency domain filter 31, and a time domain converter 32.

In the present invention, the speech coding/recognition preprocessor 11 is implemented by adapting a specification that satisfies both the speech coding preprocessing for voice calls and the speech feature vector extraction preprocessing for speech recognition, for example IS-127 (EVRC), which is used in CDMA systems.

The user speech signal s_LFB(n) input to the speech coding/recognition preprocessor 11 is data in uniform PCM format (16 bits) with a sampling frequency of 8 kHz.

In general, noise mixed into the speech signal must be attenuated before speech coding and before speech feature vector extraction in order to improve signal quality. Noise attenuation is also the main function of the speech coding/recognition preprocessor 11 proposed here: as shown in the figure, the input user speech signal s_LFB(n) is output as the noise-attenuated signal s'(n).

Each component of the speech coding/recognition preprocessor 11 is described below.

The high-pass filter 21 removes the low-frequency band of the user speech signal s_LFB(n) received through a microphone or the like. The high-pass filter 21 is implemented with a cutoff frequency of 120 Hz.

Hereinafter, the signal filtered by the high-pass filter 21 is denoted s_hp(n). This signal s_hp(n) is the target of noise attenuation; its frame size is 10 ms, and the current frame is denoted m.

The frequency domain converter 22 transforms the signal s_hp(n) filtered by the high-pass filter 21 into the frequency domain, using smoothed trapezoidal windows (windowing). The frequency domain conversion proceeds as follows.

With the smoothed trapezoidal window, the first D samples of the input frame buffer {d(m,n)} of the m-th frame overlap the last D samples of the previous frame. This overlap is expressed by [Equation 1].

Here, m is the current frame, n is the sample index into the buffer {d(m)}, L = 80 is the frame length, and D = 24 is the overlap (or delay) in samples. The remaining samples of the input buffer are pre-emphasized as in [Equation 2].

Here, the pre-emphasis coefficient has a value of -0.8. In [Equation 1], the input buffer holds 104 samples (L + D = 104): the first D samples are the pre-emphasized overlap from the previous frame, and the following samples are the pre-emphasized input from the current frame. A smoothed trapezoidal window is then applied to this input buffer to obtain the windowed signal of [Equation 3].

Here, M = 128 is the length of the discrete Fourier transform (DFT). With this M, the spectral signal G(k) is obtained as in [Equation 4].

The spectral signal G(k) produced by the frequency domain converter 22 is used as the input to the channel energy estimator 23.
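The framing, pre-emphasis, windowing, and DFT steps can be sketched as follows. The constants L, D, M, and the pre-emphasis coefficient come from the text; [Equation 1] to [Equation 4] are not reproduced in the text, so the sin² taper and the 2/M DFT scale are assumptions modeled on the EVRC-style preprocessing the patent cites, and all function names are hypothetical.

```python
import cmath
import math

L = 80            # frame length in samples (10 ms at 8 kHz), from the text
D = 24            # overlap (delay) in samples, from the text
M = 128           # DFT length, from the text
PRE_EMPH = -0.8   # pre-emphasis coefficient, from the text


def build_frame_buffer(prev_buffer, frame, prev_sample=0.0):
    """Return the (L + D)-sample input buffer for the current frame.

    The first D samples are the last D samples of the previous buffer;
    the remaining L samples are the pre-emphasized current frame.
    """
    overlap = list(prev_buffer[L:L + D])
    emphasized = []
    last = prev_sample
    for x in frame:
        emphasized.append(x + PRE_EMPH * last)
        last = x
    return overlap + emphasized


def trapezoid_window(buf):
    """Apply a trapezoid window with sin^2 ramps (assumed shape), zero-pad to M."""
    out = []
    for n, x in enumerate(buf):
        if n < D:                       # rising ramp over the overlap
            w = math.sin(math.pi * (n + 0.5) / (2 * D)) ** 2
        elif n < L:                     # flat top
            w = 1.0
        else:                           # falling ramp over the last D samples
            w = math.sin(math.pi * (n - L + D + 0.5) / (2 * D)) ** 2
        out.append(w * x)
    return out + [0.0] * (M - len(out))


def dft(g):
    """Plain O(M^2) DFT; the 2/M scale factor is an assumption."""
    m = len(g)
    return [
        (2.0 / m) * sum(g[n] * cmath.exp(-2j * math.pi * n * k / m) for n in range(m))
        for k in range(m)
    ]
```

A direct DFT is used here only to keep the sketch dependency-free; a real implementation would use an FFT.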

The channel energy estimator 23 computes the channel energy estimate for the current frame m of the spectral signal G(k) received from the frequency domain converter 22, as in [Equation 5].

Here, E_min = 0.0625 is the minimum allowed channel energy, the channel energy smoothing factor is defined in [Equation 6], and N_c = 16 is the number of combined channels. f_L(i) and f_H(i) denote the lowest-frequency and highest-frequency DFT bins of the i-th channel, respectively, and are given as follows.

f_L = {2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56},

f_H = {3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63}.

In the channel energy estimate computed by [Equation 5], the channel energy smoothing factor is zero for the first frame (m = 1). This allows the channel energy estimate to be initialized to the unfiltered channel energy of the first frame.
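A minimal sketch of the channel energy estimate follows. The band edges, E_min, and N_c are taken from the text. [Equation 5] and [Equation 6] are not reproduced in the text, so the first-order smoothing form and the default `smoothing=0.45` for frames after the first are assumptions modeled on EVRC-style noise suppression; the function name is hypothetical.

```python
F_L = [2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56]   # from the text
F_H = [3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63]   # from the text
E_MIN = 0.0625   # minimum allowed channel energy, from the text
N_C = 16         # number of combined channels, from the text


def channel_energy(spectrum, prev_energy, frame_index, smoothing=0.45):
    """Smoothed per-channel energy for the current frame.

    The smoothing factor is 0 for the first frame (as the text states),
    so the estimate starts from the unfiltered energy. The value 0.45
    used afterwards is an assumed placeholder.
    """
    alpha = 0.0 if frame_index <= 1 else smoothing
    energies = []
    for i in range(N_C):
        bins = spectrum[F_L[i]:F_H[i] + 1]
        mean_power = sum(abs(g) ** 2 for g in bins) / len(bins)
        value = alpha * prev_energy[i] + (1.0 - alpha) * mean_power
        energies.append(max(E_MIN, value))
    return energies
```

The floor at E_min keeps every channel energy strictly positive, which matters later when logarithms and ratios are taken.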

The channel SNR estimator 24 estimates the signal-to-noise ratio (SNR) on each channel. Based on the channel energy estimate from the channel energy estimator 23 and the background noise energy estimate from the background noise estimator 30, it computes the quantized channel SNR indices as in [Equation 7].

Here, E_n(m), computed by the background noise estimator 30, is the current channel noise energy estimate. The quantized channel SNR indices computed from it by [Equation 7] take values between 0 and 89.
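The quantization can be sketched as follows. Only the 0 to 89 range is stated in the text; [Equation 7] itself is not reproduced, so the dB ratio and the `step_db=0.375` quantization step are assumptions modeled on EVRC-style noise suppression, and the function name is hypothetical.

```python
import math


def quantized_snr_indices(channel_energy, noise_energy, step_db=0.375):
    """Quantize the per-channel SNR (in dB) to integer indices in [0, 89].

    Both energy lists are assumed strictly positive (the estimators floor
    them at a minimum energy). Python's round() rounds halves to even.
    """
    indices = []
    for e_ch, e_n in zip(channel_energy, noise_energy):
        snr_db = 10.0 * math.log10(e_ch / e_n)
        indices.append(max(0, min(89, round(snr_db / step_db))))
    return indices
```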

The voice metric calculator 25 computes the sum of the voice metrics over the channels of the current frame, as in [Equation 8], from the SNR estimated by the channel SNR estimator 24, that is, from the quantized channel SNR indices.

Here, V(k) is the voice metric table, which consists of the following 90 elements.

V(k) = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
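The table lookup and sum can be sketched as follows. The table is copied from the text; [Equation 8] is not reproduced, so summing V over the quantized SNR index of each channel is an assumption consistent with the description, and the function name is hypothetical.

```python
# Voice metric table V(k), 90 entries, copied from the text.
VOICE_METRIC = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5,
    5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15,
    15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30,
    31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
    50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
]


def voice_metric_sum(snr_indices):
    """Sum the table entries selected by each channel's quantized SNR index."""
    return sum(VOICE_METRIC[q] for q in snr_indices)
```

With 16 channels the sum ranges from 32 (all indices at 0) to 800 (all indices at 89).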

The spectral deviation estimator 26 estimates the spectral deviation of the current signal from the channel energy estimate E_ch(m) produced by the channel energy estimator 23. The estimation proceeds as follows.

First, the log power spectrum of the current frame is computed by [Equation 9], with the channel energy estimate E_ch(m) as the parameter.

Then the difference between the current power spectrum estimate computed in [Equation 9] and the average long-term power spectrum estimate is computed by [Equation 10].

Here, the second term is the average long-term power spectrum estimate computed during the previous frame. Before it is first updated by the spectral deviation estimation process, this average long-term power spectrum is initialized to the log power spectrum estimate of the first frame, as in [Equation 11].

Next, the total channel energy estimate E_tot(m) for the current frame m is computed by [Equation 12], again with E_ch(m) as the parameter. The total channel energy estimate E_tot(m) and the difference between the current power spectrum estimate and the average long-term power spectrum estimate are passed to the noise update determiner 27 as the parameters used to update the background noise estimate.

The exponential window factor is computed as a function of the total channel energy estimate E_tot(m) by [Equation 13].

The exponential window factor computed by [Equation 13] is then limited by [Equation 14] to lie between a lower bound and an upper bound.

E_H and E_L are the energy endpoints, expressed in dB, for the linear interpolation that maps E_tot(m) to the exponential window factor, which is limited to the range above. The values of these constants are fixed [values omitted]. For example, a signal with a relative energy of 40 dB uses the exponential window factor obtained by this interpolation [value omitted] in the subsequent calculations.

Finally, using the exponential window factor computed above, the initial value of the long-term spectrum, and the other parameters, the average long-term power spectrum estimate for the next frame is updated by [Equation 15].
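The spectral deviation steps can be sketched as follows. [Equation 9] to [Equation 15] and the constant values are not reproduced in the text, so every formula here is an assumption modeled on EVRC-style noise suppression, and the defaults for `e_low`, `e_high`, `a_low`, and `a_high` are placeholders chosen for illustration only. All function names are hypothetical.

```python
import math


def log_spectrum(channel_energy):
    """Per-channel log power spectrum in dB (assumed form of Equation 9)."""
    return [10.0 * math.log10(e) for e in channel_energy]


def spectral_deviation(log_spec, long_term_log_spec):
    """Sum of absolute differences between current and long-term spectra."""
    return sum(abs(a - b) for a, b in zip(log_spec, long_term_log_spec))


def total_energy_db(channel_energy):
    """Total channel energy in dB (assumed form of Equation 12)."""
    return 10.0 * math.log10(sum(channel_energy))


def window_factor(e_tot, e_low=30.0, e_high=50.0, a_low=0.50, a_high=0.99):
    """Linearly interpolate the factor between the energy endpoints, then clamp.

    All four default constants are assumed placeholders, not patent values.
    """
    alpha = a_high - (a_high - a_low) * (e_high - e_tot) / (e_high - e_low)
    return max(a_low, min(a_high, alpha))


def update_long_term(long_term_log_spec, log_spec, alpha):
    """Exponentially weighted update of the long-term log spectrum."""
    return [alpha * lt + (1.0 - alpha) * cur
            for lt, cur in zip(long_term_log_spec, log_spec)]
```

A louder frame yields a factor closer to the upper bound, so the long-term spectrum changes slowly during speech and adapts faster during quiet frames.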

The noise update determiner 27 issues a command, for example update_flag, that tells the background noise estimator 30 whether to update its noise estimate. The decision follows the logic expressed in the following pseudocode and is based on the total channel energy estimate E_tot(m) and the difference between the current power spectrum estimate and the average long-term power spectrum estimate, both obtained from the spectral deviation estimator 26.

In the logic expressed by the pseudocode, the constants have the following values: UPDATE_THLD = 35, NOISE_FLOOR_DB = [expression omitted], DEV_THLD = 28, UPDATE_CNT_THLD = 50, and HYSTER_CNT_THLD = 6.

The channel SNR modifier 28 modifies the quantized channel SNR indices estimated by the channel SNR estimator 24, based on the sum V(m) of the voice metrics computed by the voice metric calculator 25. The modified channel SNR indices are used as input parameters of the channel gain calculator 29. The logic expressed in the following pseudocode shows how the channel SNR estimate is modified.

In the logic expressed by the pseudocode, the constants and thresholds have the following values.

The channel gain calculator 29 computes the linear channel gains from the channel SNR indices modified by the channel SNR modifier 28 and the background noise energy estimate E_n(m) from the background noise estimator 30. The linear channel gain calculation proceeds as follows.

First, the overall gain factor for the current frame is computed by [Equation 16].

Here, the minimum overall gain is -13, the noise floor energy is 1, and the background noise energy estimate E_n(m) is the value estimated by the background noise estimator 30.

Next, the channel gains (in dB) are computed by [Equation 17].

Here, the gain slope is 0.39. The resulting channel gains in dB are then preferably converted to linear channel gains as in [Equation 18].
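The gain calculation can be sketched as follows. The minimum overall gain, noise floor energy, and gain slope are taken from the text. [Equation 16] to [Equation 18] are not reproduced, so the formulas and the `snr_threshold=6` offset are assumptions modeled on EVRC-style noise suppression; the function names are hypothetical.

```python
import math

GAMMA_MIN_DB = -13.0   # minimum overall gain, from the text
E_FLOOR = 1.0          # noise floor energy, from the text
GAIN_SLOPE = 0.39      # gain slope, from the text


def overall_gain_db(noise_energy):
    """Overall gain factor in dB, floored at the minimum overall gain."""
    return max(GAMMA_MIN_DB, -10.0 * math.log10(sum(noise_energy) / E_FLOOR))


def linear_channel_gains(modified_snr_indices, noise_energy, snr_threshold=6):
    """Per-channel linear gains, capped at 1.0.

    snr_threshold is an assumed placeholder for the index offset.
    """
    gamma_n = overall_gain_db(noise_energy)
    gains = []
    for q in modified_snr_indices:
        gain_db = GAIN_SLOPE * (q - snr_threshold) + gamma_n
        gains.append(min(1.0, 10.0 ** (gain_db / 20.0)))
    return gains
```

Capping the linear gain at 1.0 means the filter only attenuates: channels with high SNR pass unchanged, while low-SNR channels are reduced by up to the overall gain.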

The frequency domain filter 31 applies the linear channel gains computed by the channel gain calculator 29 to the spectral signal G(k) produced by the frequency domain converter 22, as in [Equation 19].

In [Equation 19], the channel gains change the spectral signal G(k) into the signal H(k); in the frequency bands that the channel gains do not cover, the signal is unchanged, that is, H(k) = G(k). The remaining part of the spectrum is given by [Equation 20], from which it follows that the magnitude of H(k) must be even and its phase must be odd.

Here, the complex conjugate is needed so that the inverse DFT of H(k) can be taken.
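The filtering step can be sketched as follows. The band edges come from the text; [Equation 19] and [Equation 20] are not reproduced, so scaling each channel's bins by its gain and mirroring the upper half as complex conjugates is an assumption consistent with the description. The function name is hypothetical.

```python
F_L = [2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56]   # from the text
F_H = [3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63]   # from the text


def apply_channel_gains(spectrum, gains):
    """Scale the bins of each channel, then enforce conjugate symmetry.

    Bins outside every channel (0, 1, and M/2 for M = 128) pass through
    unchanged. Mirroring H(M - k) = conj(H(k)) gives an even magnitude
    and odd phase, so the inverse DFT is real-valued.
    """
    m = len(spectrum)
    filtered = list(spectrum)
    for i, gain in enumerate(gains):
        for k in range(F_L[i], F_H[i] + 1):
            filtered[k] = gain * spectrum[k]
    for k in range(1, m // 2):
        filtered[m - k] = filtered[k].conjugate()
    return filtered
```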

The background noise estimator 30, as described above, estimates the energy E_n(m) of the noise present on each channel, and it updates the corresponding noise estimate according to the command issued by the noise update determiner 27, for example update_flag.

That is, when update_flag is true, the background noise estimator 30 updates the channel noise estimate for the next frame by [Equation 21].

Here, the minimum channel energy E_min is 0.0625 and the channel noise smoothing factor is 0.9. The noise estimate is initialized from the channel energy estimate for each of the first four frames, as in [Equation 22].

Here, the minimum initial channel noise energy E_init is 16.
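The noise estimate update can be sketched as follows. E_min, the smoothing factor, and E_init are taken from the text. [Equation 21] and [Equation 22] are not reproduced, so the first-order smoothing form and the max-based initialization are assumptions modeled on EVRC-style noise suppression; the function names are hypothetical.

```python
E_MIN = 0.0625   # minimum channel energy, from the text
ALPHA_N = 0.9    # channel noise smoothing factor, from the text
E_INIT = 16.0    # minimum initial channel noise energy, from the text


def initial_noise_estimate(channel_energy):
    """Noise estimate for the first four frames: the channel energy, floored at E_INIT."""
    return [max(E_INIT, e) for e in channel_energy]


def update_noise_estimate(noise_energy, channel_energy, update_flag):
    """Return the noise estimate for the next frame.

    When update_flag is false the estimate is carried over unchanged.
    """
    if not update_flag:
        return list(noise_energy)
    return [max(E_MIN, ALPHA_N * n + (1.0 - ALPHA_N) * e)
            for n, e in zip(noise_energy, channel_energy)]
```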

The time domain converter 32 converts the noise-attenuated user speech signal received from the frequency domain filter 31, which is expressed in the frequency domain, into a user speech signal in the time domain. The time domain conversion proceeds as follows.

First, the signal filtered by the frequency domain filter 31 is converted to the time domain with the inverse DFT, as in [Equation 23].

Then overlap-and-add is applied to the result of [Equation 23] to complete the frequency domain filtering, as in [Equation 24].

Finally, de-emphasis is applied to the result of [Equation 24] to obtain the user speech signal in the time domain, as in [Equation 25].

Here, the de-emphasis factor is 0.8, and the output is written to an output buffer that holds 320 samples.
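The three time-domain steps can be sketched as follows. L, M, and the de-emphasis factor come from the text. [Equation 23] to [Equation 25] are not reproduced, so the 1/2 inverse-DFT scale (which pairs with an assumed 2/M forward scale) and the overlap region of M - L samples are assumptions modeled on EVRC-style noise suppression; the function names are hypothetical.

```python
import cmath
import math

L = 80          # frame length, from the text
M = 128         # DFT length, from the text
DE_EMPH = 0.8   # de-emphasis factor, from the text


def inverse_dft(spectrum):
    """Plain O(M^2) inverse DFT returning the real part; 1/2 scale is assumed."""
    m = len(spectrum)
    return [
        (0.5 * sum(spectrum[k] * cmath.exp(2j * math.pi * n * k / m)
                   for k in range(m))).real
        for n in range(m)
    ]


def overlap_add(current, previous):
    """Add the tail of the previous M-sample block onto the current one."""
    out = []
    for n in range(L):
        if n < M - L:
            out.append(current[n] + previous[n + L])
        else:
            out.append(current[n])
    return out


def de_emphasize(frame, prev_output=0.0):
    """First-order recursive filter that undoes the -0.8 pre-emphasis."""
    out = []
    last = prev_output
    for x in frame:
        last = x + DE_EMPH * last
        out.append(last)
    return out
```

De-emphasis with 0.8 is the inverse of pre-emphasis with -0.8: feeding the pre-emphasized sequence 1, 0.2, 0.2 into `de_emphasize` recovers the constant sequence 1, 1, 1.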

As described above, the speech coding/recognition preprocessor 11 yields the noise-attenuated user speech signal S'(n). Depending on whether the phone is in the voice call mode or the speech recognition mode, this signal S'(n) is input to the speech coder 15 of the speech coding module 150 or to the speech feature vector extraction unit 12 of the speech feature vector extraction module 100.

In addition, because the frame size of the noise attenuation target signal was set to 10 ms at the start of this description of the speech coding/recognition preprocessor 11, noise attenuation is performed on every 10 ms frame, and the user speech signal S'(n), the output signal of the speech coding/recognition preprocessor, is [expression omitted]. Those skilled in the art will readily understand that the user speech signal S'(n) may be output differently depending on the frame size chosen for the noise attenuation target signal.

The method corresponding to the apparatus described above with reference to FIG. 2, that is, to the speech coding/recognition preprocessor that serves as the preprocessing device for both the speech feature vector extraction module and the speech coding module, consists of the time-sequential process that corresponds to the apparatus, as is usual in the field of speech signal processing. The method flow of the present invention is therefore not described separately.

FIG. 3A is a block diagram of an embodiment of the speech coding/recognition preprocessor extended to process 11 kHz speech signals according to the present invention, and FIG. 3B is a block diagram of an embodiment of the speech coding/recognition preprocessor extended to process 16 kHz speech signals according to the present invention.

The speech coding/recognition preprocessor 11 described above with reference to FIG. 2 takes an 8 kHz user speech signal as its noise attenuation target. The present invention therefore also presents a speech coding/recognition preprocessor that processes an 11 kHz user speech signal (FIG. 3A) and one that processes a 16 kHz user speech signal (FIG. 3B).

In FIG. 3A, to process 11 kHz speech signals, a frequency down-sampler 41 that converts the 11 kHz user speech signal into an 8 kHz user speech signal is placed in front of the speech coding/recognition preprocessor shown in FIG. 2. The user speech signal down-sampled by the frequency down-sampler 41 is input to the speech coding/recognition preprocessor.

In FIG. 3B, to process 16 kHz speech signals, a low-pass quadrature-mirror filter (QMF LP) [DEC by 2] 46 and a high-pass quadrature-mirror filter (QMF HP) [DEC by 2 and SI] 47 are placed in front of the speech coding/recognition preprocessor shown in FIG. 2.

The low-pass quadrature-mirror filter 46 extracts the low-frequency band of 0 kHz to 4 kHz from the input 16 kHz speech signal, and the high-pass quadrature-mirror filter 47 extracts the high-frequency band of 4 kHz to 8 kHz from the same signal.

The low-frequency signal produced by the low-pass quadrature-mirror filter 46 is input to the speech coding/recognition preprocessor, whereas the high-frequency signal produced by the high-pass quadrature-mirror filter 47 is input directly to the speech feature vector extraction unit (MFCC front-end) 12 of the speech feature vector extraction module 100. This lets the speech feature vector extraction unit 12 use 26 Mel filter banks when extracting the speech feature vectors (MFCCs).

Thus the low-frequency signal from the low-pass quadrature-mirror filter 46 reaches the speech feature vector extraction unit 12 through the speech coding/recognition preprocessor, the high-frequency signal from the high-pass quadrature-mirror filter 47 reaches it directly, and the two are merged into one signal inside the speech feature vector extraction unit 12. Specifically, the high-frequency signal is merged with the low-frequency signal before the log filter bank energies are converted to cepstral coefficients. In addition, the log-energy parameter over the whole frequency band is computed from both the high-frequency and the low-frequency signals.

The extended speech coding/recognition preprocessors shown in FIGS. 3A and 3B may be implemented according to the frequency extension scheme of the ETSI DSR standard (ETSI ES 202 050 v1.1.3) so that signals sampled at 11 kHz or 16 kHz can be used.

FIG. 4 illustrates an embodiment of the speech recognition process used in the present invention, FIG. 5 is a flowchart of an embodiment of the training process for generating the acoustic model of FIG. 4, and FIG. 6 is a graph showing the speech recognition performance obtained with speech feature vectors extracted by applying the present invention.

The present invention described with reference to FIGS. 1, 2, 3A, and 3B can be applied to a distributed speech recognition terminal such as a mobile phone, and its effect on speech recognition performance needs to be verified.

The speech recognition performance obtained with the present invention is therefore examined below through the speech recognition process and the training process for generating the acoustic model.

FIG. 4 shows a speech recognition process based on the hidden Markov model (HMM). Speech features are extracted from the user's speech (301), and the acoustic model 303, the language model 304, and the pronunciation dictionary 305 are searched for these features so that the words and sentences corresponding to the speech are recognized by pattern matching (302).

Speech feature extraction (301) uses the method proposed in the European Telecommunications Standards Institute (ETSI) standard "ETSI ES 201 108". That is, speech features are extracted from the speech data as MFCCs and assembled into a high-order speech feature vector, and pattern matching with the acoustic model 303, the language model 304, and the pronunciation dictionary 305 finds the word sequence with the highest probability for this feature vector.

To evaluate the speech recognition performance of the conventional methods, the speech signal used for feature extraction is either a signal whose noise was attenuated by the preprocessing defined in the ETSI DSR standard (ETSI ES 202 050 v1.1.3) or a signal whose noise was attenuated by the preprocessing defined in IS-127.

To evaluate the speech recognition performance of the present invention, the speech signal used for feature extraction is a signal whose noise was attenuated by the speech coding/recognition preprocessor 11.

In this feature extraction process, 13th-order MFCCs and log-energy are extracted from the input signal with the MFCC front-end mentioned above. Of these, the 12th-order MFCCs (c0, …, c12), the log-energy, and their delta and delta-delta coefficients are used as the parameters for acoustic model training and recognition.

The acoustic model 303 uses HMMs; in the present invention, language-specific phoneme models serve as the acoustic model. The training process for generating these phoneme models is described below with reference to FIG. 5.

First, a monophone-based model, which is a context-independent phoneme model, is generated from the speech feature vectors extracted from the training data (401).

Next, forced alignment is performed with the generated monophone-based model to produce a new phoneme label file (402).

한편, 상기 생성한 모노폰 기반 모델을 확장시켜 음소 종속형 모델인 트라이폰(triphone) 기반 모델을 생성한다(403).Meanwhile, the generated monophone based model is extended to generate a triphone based model, which is a phoneme dependent model (403).

그런 후, 상기 생성한 트라이폰 기반 모델에 대한 훈련 데이터량이 적은 점을 고려해 상태 공유를 수행한다(404).Then, state sharing is performed in consideration of the small amount of training data for the generated triphone-based model (404).

그리고 나서, 상기 상태 공유를 수행한 결과로서 얻은 모델을 혼합 밀도 증가시켜 최종적인 음향 모델을 생성한다(405).Then, the model obtained as a result of the state sharing is increased by mixing density to generate a final acoustic model (405).

The language model 304 shown in FIG. 4 uses a statistics-based approach, that is, one that statistically estimates the probabilities of possible word sequences from a DB of speech uttered in a given situation. A representative statistical language model is the n-gram, which approximates the probability of a word sequence as the product of the conditional probabilities of each word given the n-1 words that precede it. A bigram is used in FIG. 4.
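For a bigram, the approximation reads P(w1, …, wm) ≈ ∏ P(wi | wi-1). The sketch below estimates these conditional probabilities by unsmoothed maximum likelihood from tokenized sentences. It is an illustration only: the sentence boundary tokens `<s>` and `</s>`, the function names, and the absence of smoothing are assumptions, and it does not reproduce how the language model in FIG. 4 was actually built.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count word histories and word pairs over a tokenized corpus."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]  # assumed boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts


def sentence_probability(words: List[str],
                         history_counts: Counter,
                         bigram_counts: Counter) -> float:
    """Product of P(w_i | w_{i-1}) with unsmoothed maximum-likelihood estimates."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob
```

A practical recognizer would add smoothing or back-off so that word pairs unseen in training do not receive zero probability.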

For the pronunciation dictionary 305, the pronunciation dictionary provided with SiTEC's "CleanSent01" is used for Korean, and "CMU dictionary V.0.6" provided by Carnegie Mellon University is used for English. For the pronunciation of word phrases (eojeol) not covered by "CleanSent01", a pronunciation converter built in-house according to the "Standard Pronunciation Rules" of the Standard Language Regulations was used. The pronunciation dictionary provided with "CleanSent01" contains 36,104 word phrases in total, and the pronunciation dictionary used by the speech recognizer contains 223,857.

For the speech DB used in FIG. 4, the "read-sentence speech DB (CleanSent01)" provided by the Speech Information Industry Technology Center (SiTEC) is used for Korean, and the "AURORA 4 DB (Wall Street Journal)" built by ETSI is used for English.

To generate the language model, 5,000 sentences from the text data used for training and 3,000 sentences selected from the "Text DB for speech recognition language models" provided by ETRI may be used. HTK (Hidden Markov Model Toolkit) v3.1 was used to generate the language model, and the final model is a network over 31,582 words.

FIG. 6 shows the word recognition rate (68.51 %) obtained when the noise-attenuated signal processed by the conventional method is used for speech recognition, and the word recognition rate (69.31 %) obtained when the noise-attenuated signal processed according to the present invention is used. As FIG. 6 shows, speech recognition performance with the present invention is higher than with the conventional method.
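To put the two word recognition rates of 68.51 % and 69.31 % in perspective, the short calculation below derives the absolute gain and the relative reduction in errors. Treating 100 % minus the word recognition rate as the error rate is an assumption made here for illustration; the function name `recognition_gain` is hypothetical.

```python
from typing import Tuple


def recognition_gain(baseline_pct: float, proposed_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative error reduction in %)."""
    absolute = proposed_pct - baseline_pct
    baseline_error = 100.0 - baseline_pct  # assumed definition of the error rate
    relative_error_reduction = 100.0 * absolute / baseline_error
    return absolute, relative_error_reduction


absolute, relative = recognition_gain(68.51, 69.31)
print(f"absolute gain: {absolute:.2f} percentage points")
print(f"relative error reduction: {relative:.2f} %")
```

Under that assumption the gain is 0.80 percentage points, which corresponds to roughly a 2.5 % relative reduction in errors.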

For example, a person skilled in the art will readily understand that the terminal of the present invention, by sharing the speech coding preprocessing and the speech feature vector extraction preprocessing to extract speech feature vectors that are robust to noise from the communication environment and the like, improves speech recognition performance while consuming less memory and computation on a low-specification terminal.

The method of the present invention described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Since a person of ordinary skill in the art to which the invention pertains can readily carry this out, it is not described in further detail.

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, since a person of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

By sharing the preprocessing for voice calls and speech recognition, the present invention improves speech recognition performance while consuming less memory and computation on a low-specification terminal.

The present invention also prevents the switching delay between the voice call process and the speech recognition process that is caused by the speech coding preprocessing and the speech feature vector extraction preprocessing in the terminal.

The present invention also properly attenuates the noise mixed into the user's speech signal for both speech coding and speech feature vector extraction.

Claims

An apparatus for extracting a speech feature vector robust to noise by sharing the preprocessing of a speech coder in a distributed speech recognition terminal, the apparatus comprising:

a high pass filter for removing the low frequency band signal of a speech signal received from the outside;

a frequency domain converter for converting the signal from which the low frequency band has been removed by the high pass filter into a spectral signal in the frequency domain;

a channel energy estimator for calculating a channel energy estimate for the current frame of the spectral signal converted by the frequency domain converter;

a channel signal-to-noise ratio estimator for estimating a channel signal-to-noise ratio for the speech signal based on the channel energy estimate calculated by the channel energy estimator and a background noise energy estimate calculated by a background noise estimator;

the background noise estimator, for calculating and updating the background noise energy estimate for the speech signal according to a command from a noise update determiner;

a voice metric calculator for calculating a sum of voice metrics on the channels for the speech signal based on the channel signal-to-noise ratio estimated by the channel signal-to-noise ratio estimator;

a spectral deviation estimator for estimating a spectral deviation of the speech signal based on the channel energy estimate calculated by the channel energy estimator;

the noise update determiner, for issuing a noise estimate update command based on the total channel energy estimate estimated by the spectral deviation estimator and on the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum;

a channel signal-to-noise ratio estimate modifier for modifying the channel signal-to-noise ratio estimated by the channel signal-to-noise ratio estimator based on the sum of the voice metrics calculated by the voice metric calculator;

a channel gain calculator for calculating a linear channel gain based on the channel signal-to-noise ratio modified by the channel signal-to-noise ratio estimate modifier and the background noise energy estimate calculated by the background noise estimator;

a frequency domain filter for applying the linear channel gain calculated by the channel gain calculator to the spectral signal converted by the frequency domain converter; and

a time domain converter for converting the spectral signal to which the linear channel gain has been applied by the frequency domain filter into a speech signal in the time domain.

The apparatus of claim 1, wherein the speech signal output from the time domain converter is a noise-attenuated signal and is delivered to a speech feature vector extractor of a speech feature vector extraction module mounted in the terminal or to a speech coder of a speech coding module.

The apparatus of claim 1, wherein the frame size of the signal filtered by the high pass filter is 10 ms.

The apparatus of claim 1, wherein the frequency domain converter converts the input signal into a spectral signal in the frequency domain using a smoothed trapezoid window.

The apparatus of claim 1, wherein, in calculating the channel energy estimate, the channel energy smoothing factor is zero for the first frame, which allows the channel energy estimate to be initialized to the unfiltered channel energy of the first frame.

The apparatus of claim 1, wherein the channel signal-to-noise ratio for the speech signal includes a quantized channel SNR index given by the following equation.

[Equation]

A method for extracting a speech feature vector robust to noise by sharing the preprocessing of a speech coder in a distributed speech recognition terminal, the method comprising:

removing the low frequency band signal of a speech signal received from the outside;

converting the signal from which the low frequency band has been removed into a spectral signal in the frequency domain;

calculating a channel energy estimate for the current frame of the converted spectral signal;

estimating a spectral deviation of the speech signal based on the calculated channel energy estimate;

issuing a noise estimate update command based on the estimated total channel energy estimate and on the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum;

calculating and updating a background noise energy estimate for the speech signal when the noise estimate update command is issued;

estimating a channel signal-to-noise ratio for the speech signal based on the calculated channel energy estimate and the background noise energy estimate;

calculating a sum of voice metrics on the channels for the speech signal based on the estimated channel signal-to-noise ratio;

modifying the estimated channel signal-to-noise ratio based on the calculated sum of the voice metrics;

calculating a linear channel gain based on the modified channel signal-to-noise ratio and the calculated background noise energy estimate;

applying the calculated linear channel gain to the converted spectral signal; and

converting the spectral signal to which the linear channel gain has been applied into a speech signal in the time domain.

The method of claim 7, wherein the spectral deviation estimation step comprises:

calculating a log power spectrum on the current channel for the speech signal using the channel energy estimate as a parameter;

calculating the difference between the calculated estimate of the current power spectrum and the estimate of the average long-term power spectrum;

calculating a total channel energy estimate for the current frame over the channels for the speech signal using the channel energy estimate as a parameter;

calculating an exponential window function factor using the calculated total channel energy estimate as a parameter; and

updating the estimate of the average long-term power spectrum for the next frame on the channel for the speech signal using the calculated exponential window function factor and the initial power spectrum value as parameters.

The method of claim 8, wherein the average long-term power spectrum is initialized to the estimate of the log power spectrum of the first frame on the channel for the speech signal.

The method of claim 7, wherein, in the noise update determination step, the parameters for updating the background noise estimate include the total channel energy estimate calculated in the spectral deviation estimation step and the difference between the estimate of the current power spectrum and the estimate of the average long-term power spectrum.

The method of claim 7, wherein the linear channel gain calculating step comprises:

calculating an overall gain factor for the current frame on the current channel for the speech signal; and

calculating the channel gain on the current channel for the speech signal.

An apparatus for extracting a speech feature vector robust to noise by sharing the preprocessing of a speech coder in a distributed speech recognition terminal, the apparatus comprising:

speech coding means for causing a coded speech signal to be transmitted to the outside through a voice traffic channel in a voice call mode;

speech feature vector extraction means for causing an extracted user speech feature vector to be transmitted to the outside in a speech recognition mode; and

a speech coding/recognition preprocessor for attenuating noise mixed into a speech signal received from the outside,

wherein the speech signal input to the speech coding means and the speech signal input to the speech feature vector extraction means are preprocessed by the speech coding/recognition preprocessor.

The apparatus of claim 12, wherein the speech signal input to the speech coding/recognition preprocessor has a frequency band of 8 kHz.

The apparatus of claim 12, wherein the speech feature vector extraction means comprises:

a speech feature vector extractor for extracting a speech feature vector from the speech signal preprocessed by the speech coding/recognition preprocessor;

a speech compressor for compressing the speech feature vector extracted by the speech feature vector extractor; and

a bit stream transmitter for converting the speech feature vector compressed by the speech compressor into bit stream data to be transmitted to the outside.

The apparatus of claim 12, wherein the speech coding means comprises:

a speech coder for coding the speech signal preprocessed by the speech coding/recognition preprocessor;

a speech compressor for compressing the speech signal coded by the speech coder; and

a bit stream transmitter for converting the speech signal compressed by the speech compressor into bit stream data to be transmitted to the outside.

The apparatus of claim 12, further comprising a switch for switching between the voice call mode and the speech recognition mode.

speech feature vector extraction means for causing an extracted user speech feature vector to be transmitted to the outside in a speech recognition mode;

a frequency down sampler for down-sampling the frequency band of a speech signal received from the outside; and

a speech coding/recognition preprocessor for attenuating noise mixed into the speech signal down-sampled by the frequency down sampler;

The apparatus of claim 17, wherein the speech signal input to the frequency down sampler has a frequency band of 11 kHz.

a low pass quadrature mirror filter for passing the low frequency band of a speech signal received from the outside;

a high pass quadrature mirror filter for passing the high frequency band of the speech signal received from the outside; and

a speech coding/recognition preprocessor for attenuating noise mixed into the speech signal passed by the low pass quadrature mirror filter.

The apparatus of claim 19, wherein the speech signal input to each of the low pass quadrature mirror filter and the high pass quadrature mirror filter has a frequency band of 16 kHz.

The apparatus of claim 20, wherein the low pass quadrature mirror filter separates a low frequency signal of 0 kHz to 4 kHz from the 16 kHz speech signal received from the outside.

The apparatus of claim 20, wherein the high pass quadrature mirror filter separates a high frequency signal of 4 kHz to 8 kHz from the 16 kHz speech signal received from the outside.

The apparatus of any one of claims 19 to 22, wherein the low frequency signal processed by the low pass quadrature mirror filter is input to the speech coding/recognition preprocessor, and the high frequency signal processed by the high pass quadrature mirror filter is input to the speech feature vector extraction means.

The apparatus of claim 23, wherein the low frequency signal and the high frequency signal are combined into a single signal in the speech feature vector extraction means, the high frequency signal being combined with the low frequency signal before the log filter bank energies are converted into cepstral coefficients.

A computer-readable recording medium having recorded thereon a program for realizing, in a terminal equipped with a processor:

a function of removing the low frequency band signal of a speech signal received from the outside;

a function of converting the signal from which the low frequency band has been removed into a spectral signal in the frequency domain;

a function of estimating a spectral deviation of the speech signal based on the calculated channel energy estimate;

a function of estimating a channel signal-to-noise ratio for the speech signal based on the calculated channel energy estimate and the background noise energy estimate; and

a function of converting the spectral signal to which the linear channel gain has been applied into a speech signal in the time domain.