KR100791349B1

KR100791349B1 - Method and Apparatus for coding speech signal in Distributed Speech Recognition system

Info

Publication number: KR100791349B1
Application number: KR1020060060378A
Authority: KR
Inventors: 김영준
Original assignee: 한국전자통신연구원
Priority date: 2005-12-08
Filing date: 2006-06-30
Publication date: 2008-01-07
Also published as: KR20070061190A

Abstract

The present invention relates to a method of encoding a speech signal in a distributed speech recognition system. A log spectrum is generated by applying a log function to a mel-filtered speech signal; a long-term gain and a short-term excitation signal are generated from that log spectrum; and these are quantized using a dual codebook of the long-term gain and the short-term excitation signal and then transmitted. By exploiting the property that the log spectrum lends itself to band-by-band correction, even feature vectors that the error correction code cannot repair can be corrected, which improves speech recognition performance in a distributed speech recognition system operating under poor network conditions.

Distributed Speech Recognition, Encoding, Decoding, Quantization, Cepstrum, Log Spectrum

Description

Coding method of speech signal in distributed speech recognition system and apparatus therefor {Method and Apparatus for coding speech signal in Distributed Speech Recognition system}

FIG. 1 is a block diagram schematically showing the configuration of a distributed speech recognition system according to the present invention.

FIG. 2 is a block diagram showing the DSR signal generator of FIG. 1 in detail.

FIG. 3 is a block diagram showing the quantization unit of FIG. 2 in detail.

FIG. 4 is a flowchart showing how a client encodes a speech signal in a distributed speech recognition system according to the present invention.

FIG. 5 is a flowchart showing how a server decodes a speech signal in a distributed speech recognition system according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: client    102: voice input unit

104: DSR signal generator    106: speech vocoder analyzer

108: adder    110: transmitter

150: server    152: receiver

154: log spectrum loss compensator    156: discrete cosine transform unit

158: speech recognizer    200: frame generator

210: preprocessor    220: Fourier transform unit

230: mel filtering unit    240: log spectrum generator

250: quantization unit    300: long-term gain predictor

310: short-term excitation signal predictor    320, 340: combiners

330: synthesis filter    350: error minimizer

The present invention relates to a method and apparatus for coding a speech signal in a distributed speech recognition system in which noise preprocessing for speech recognition is performed on the log spectrum, the stage that precedes the cepstrum, so that the relationship between noise and speech can be described clearly.

A distributed speech recognition system separates feature extraction from feature matching and has the two steps executed by two different processing units placed at two different locations, in an attempt to compensate for the degradation in system performance caused by transmitting speech. For example, a DSR (Distributed Speech Recognition) mobile or wireless communication system, or a client comprising a first communication device and a second mobile communication device, extracts feature vectors from the input speech signal and encodes them.

The client then transmits the encoded features to a server over an error-protected data channel.

The server receives the encoded recognition features and performs only feature matching; that is, it matches the encoded features against the features of a known data set.

In this conventional distributed speech recognition framework, because feature vectors are what is transmitted, every step up to feature vector extraction is necessarily tied to the client.

Among the feature vectors currently used for speech recognition, the cepstral feature vector is the most widely used. It is relatively robust to variations in speaker and speaking rate. In environments where speech recognition is actually deployed, however, it is mainly ambient noise, rather than these variations, that degrades recognition performance.

The cepstral feature vector has the drawback that it is difficult to establish a mathematically clear relationship for the combination of environmental noise and speech.

Moreover, noise preprocessing methods applied before the cepstrum must be integrated into the client, which has to operate with little computation and little memory. This makes it difficult in practice to use algorithms that handle noise accurately but are computationally expensive.

Furthermore, because the cepstrum, whose components are strongly independent, is what is transmitted, a lost frequency region can be recovered only from the output of the error correction code. If the correction code cannot recover it either, recognition cannot be performed.

Accordingly, an object of the present invention is to provide a method and apparatus for coding a speech signal in a distributed speech recognition system that perform noise preprocessing for speech recognition on the log spectrum, the stage preceding the cepstrum, so that the relationship between noise and speech can be described clearly.

Another object of the present invention is to provide a method and apparatus for coding a speech signal in a distributed speech recognition system in which, when the signal of a particular band is lost, it can be easily recovered using the properties of the log spectrum.

To achieve the above objects, according to one aspect of the present invention, there is provided a method of coding a speech signal in a distributed speech recognition system in which, to encode the speech signal, a log spectrum is generated by applying a log function to a mel-filtered speech signal, a long-term gain and a short-term excitation signal are generated using the generated log spectrum, and the result is quantized using a dual codebook of the generated long-term gain and short-term excitation signal and then transmitted.

According to another aspect of the present invention, there is provided a method of coding a speech signal in a distributed speech recognition system in which, to decode the speech signal, a log spectrum is received from a client, lost bands in the received log spectrum are compensated, a discrete cosine transform is applied to the compensated log spectrum to convert it to a cepstrum, and speech recognition is performed using the resulting cepstrum.

According to yet another aspect of the present invention, there is provided an apparatus for coding a speech signal in a distributed speech recognition system which, to encode the speech signal, comprises: a log spectrum generator that generates a log spectrum by applying a log function to a mel-filtered speech signal; a quantization unit that generates a long-term gain and a short-term excitation signal using the log spectrum from the log spectrum generator and quantizes using a dual codebook of the generated long-term gain and short-term excitation signal; and a transmitter that transmits the log spectrum quantized by the quantization unit.

According to yet another aspect of the present invention, there is provided an apparatus for coding a speech signal in a distributed speech recognition system which, to decode the speech signal, comprises: a receiver that receives a log spectrum from a client; a log spectrum loss compensator that compensates for lost bands by applying missing feature theory to the log spectrum output by the receiver; a discrete cosine transform unit that applies a discrete cosine transform to the compensated log spectrum to convert it to the cepstral domain; and a speech recognizer that performs speech recognition using the cepstrum produced by the discrete cosine transform unit.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically showing the configuration of a distributed speech recognition system according to the present invention, FIG. 2 is a block diagram showing the DSR signal generator of FIG. 1 in detail, and FIG. 3 is a block diagram showing the quantization unit of FIG. 2 in detail.

Referring to FIG. 1, the distributed speech recognition system includes a client 100 that quantizes the log spectrum of an input speech signal using a dual codebook, and a server 150 that recovers lost bands by correcting the log spectrum transmitted from the client 100 and then performs speech recognition.

The client 100 includes a voice input unit 102 that receives speech from a user, a DSR (Distributed Speech Recognition) signal generator 104, a speech vocoder analyzer 106, an adder 108, and a transmitter 110.

The DSR signal generator 104 extracts spectral data from the speech input received through the voice input unit 102 and generates a coded signal representing the extracted spectral data.

The DSR signal generator 104 is described in more detail with reference to FIGS. 2 and 3.

Referring to FIG. 2, the DSR signal generator 104 includes a frame generator 200, a preprocessor 210, a Fourier transform unit 220, a mel filtering unit 230, a log spectrum generator 240, and a quantization unit 250.

To obtain continuous short-time spectral information of the speech signal, the frame generator 200 divides the input speech signal into frames 20 to 30 msec long, taken approximately every 10 msec.
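The framing step can be sketched as follows. This is a minimal illustration, assuming an 8 kHz sampling rate, a 25 msec frame length, and a 10 msec shift; the function name `split_into_frames` and these specific values are hypothetical, since the text gives only approximate ranges.

```python
def split_into_frames(samples, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Split a sample sequence into overlapping frames.

    frame_ms and shift_ms are assumed values inside the ranges given in
    the text (20-30 msec frames taken roughly every 10 msec).
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += shift
    return frames
```

With these assumed values, consecutive frames overlap by 15 msec, which is what lets the later stages see a smoothly evolving short-time spectrum.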

The preprocessor 210 applies pre-emphasis and a Hamming window to the speech signal of each frame. Pre-emphasis boosts the high-frequency components of the input signal, and Hamming windowing multiplies the pre-emphasized speech signal by a window function to minimize discontinuities at the frame boundaries.
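A minimal sketch of these two preprocessing operations is given below. The pre-emphasis coefficient 0.97 is a commonly used value and is an assumption here, as are the function names; the source does not specify either.

```python
import math


def pre_emphasize(frame, coeff=0.97):
    """First-order high-frequency boost: y[n] = x[n] - coeff * x[n-1].

    coeff=0.97 is an assumed, commonly used value (not from the source).
    """
    if not frame:
        return []
    out = [frame[0]]
    for n in range(1, len(frame)):
        out.append(frame[n] - coeff * frame[n - 1])
    return out


def hamming_window(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def preprocess(frame, coeff=0.97):
    """Pre-emphasize a frame, then taper its edges with a Hamming window."""
    emphasized = pre_emphasize(frame, coeff)
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(emphasized, window)]
```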

The Fourier transform unit 220 applies a Fourier transform to the per-frame speech signal that has passed through the preprocessor 210 and outputs spectral information.

The mel filtering unit 230 defines a number of bands on the spectrum obtained by the Fourier transform and computes the energy in each band to produce the filter bank energies.

The log spectrum generator 240 applies a log function to the filter bank energies to produce the log spectrum.
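The mel filtering and log steps can be sketched as follows, taking a one-sided power spectrum as input. The triangular filter shape, the mel-scale formula, the 8 kHz sampling rate, and the function names are assumptions made for illustration; the choice of 23 bands follows the 23 log spectral values mentioned later in the description.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank_energies(power_spectrum, sample_rate=8000, num_bands=23):
    """Apply triangular mel-spaced filters to a one-sided power spectrum.

    power_spectrum[k] is assumed to cover 0 Hz .. sample_rate/2 uniformly.
    """
    num_bins = len(power_spectrum)
    if num_bins < 2:
        raise ValueError("need at least two spectral bins")
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # num_bands + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (num_bands + 1))
             for i in range(num_bands + 2)]
    energies = []
    for b in range(1, num_bands + 1):
        lo, mid, hi = edges[b - 1], edges[b], edges[b + 1]
        energy = 0.0
        for k in range(num_bins):
            f = k * nyquist / (num_bins - 1)
            if lo < f <= mid:
                weight = (f - lo) / (mid - lo)    # rising slope
            elif mid < f < hi:
                weight = (hi - f) / (hi - mid)    # falling slope
            else:
                weight = 0.0
            energy += weight * power_spectrum[k]
        energies.append(energy)
    return energies


def log_spectrum(energies, floor=1e-10):
    """Natural log of each band energy; floor avoids log(0)."""
    return [math.log(max(e, floor)) for e in energies]
```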

For clear transmission and efficient processing of the log spectrum produced by the log spectrum generator 240, the quantization unit 250 accurately quantizes the log spectrum using a dual codebook of the long-term gain and the short-term excitation signal, which are generated as predictions of the log spectrum by an analysis-by-synthesis technique.

The quantization unit 250 is described in more detail with reference to FIG. 3.

Referring to FIG. 3, the quantization unit 250 includes a long-term gain predictor 300, a short-term excitation signal predictor 310, a first combiner 320, a synthesis filter 330, a second combiner 340, and an error minimizer 350.

The long-term gain predictor 300 predicts the long-term gain by applying an analysis-by-synthesis technique to the log spectrum output by the log spectrum generator 240.

To look more closely at how the long-term gain predictor 300 predicts the long-term gain: in general, the relationship between speech and additive noise in the log spectrum is as given in Equation 1.

[Equation 1, given in three forms: time domain, frequency domain, and log spectrum domain]

Here, x(t), X(w), and x denote speech, and n(t), N(w), and n denote noise.
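The three forms of Equation 1 are not reproduced in this text. A commonly used additive-noise model in these three domains is sketched below as an assumption, not as the patent's own equation. The symbols y(t), Y(w), and y for the noisy observation are introduced here for illustration only, and the third line additionally assumes that x and n are log powers and that the speech and noise power spectra add.

```latex
\begin{align}
y(t) &= x(t) + n(t) && \text{(time domain)} \\
Y(w) &= X(w) + N(w) && \text{(frequency domain)} \\
y &= \log\left(e^{x} + e^{n}\right) = x + \log\left(1 + e^{\,n - x}\right) && \text{(log spectrum domain)}
\end{align}
```

Under these assumptions the noise enters the log spectrum as an explicit correction term added to the speech term x, which is the kind of clear noise-to-speech relationship the description contrasts with the cepstral domain.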

In addition, the statistical distribution of noise is generally known to follow the minimum of the values within a window over a given interval. Algorithms based on this minimum-value principle perform better at tracking the actual gain.

For accurate gain prediction, the values of several frames over a longer span are taken into account, which makes it possible to account for the temporal correlation between different frames.

There are several algorithms that take temporal correlation into account.

Predicting the long-term gain with one of them, the minimum statistics method, proceeds as follows.

In the spectral domain, the power spectral density satisfies the following relationship.

[Equation 2]

Here, the symbols of Equation 2 denote, in order, the noisy power spectral density, the current input signal, the past input signal, and time; k is the frequency; and the final symbol denotes the noise power.

Taking the logarithm of the noisy power spectral density of Equation 2 yields the long-term gain (g) of Equation 3.

[Equation 3]
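Because Equations 2 and 3 are not reproduced in this text, the following is only a generic sketch of minimum-statistics tracking followed by a logarithm, not the patent's formulas. The recursive smoothing form, the smoothing factor `alpha`, the window length, and the decision to take the log of the tracked minimum are all assumptions, and the function names are hypothetical.

```python
import math
from collections import deque


def track_noise_floor(power_frames, alpha=0.85, window=50):
    """Per-bin minimum of a recursively smoothed power spectrum.

    power_frames: list of frames, each a list of per-bin power values.
    alpha and window are assumed tuning values (not from the source).
    Returns one list of per-bin minima for each input frame.
    """
    if not power_frames:
        return []
    num_bins = len(power_frames[0])
    smoothed = list(power_frames[0])
    history = deque(maxlen=window)      # last `window` smoothed frames
    estimates = []
    for frame in power_frames:
        # Combine the past smoothed value with the current input power.
        smoothed = [alpha * s + (1.0 - alpha) * p
                    for s, p in zip(smoothed, frame)]
        history.append(smoothed)
        estimates.append([min(h[k] for h in history)
                          for k in range(num_bins)])
    return estimates


def long_term_gain(tracked_power, floor=1e-10):
    """Log of the tracked per-bin value (assumed reading of Equation 3)."""
    return [math.log(max(p, floor)) for p in tracked_power]
```

Taking the minimum over a sliding window is what lets the estimate ignore short bursts of speech energy while following slow changes across frames.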

The short-term excitation signal predictor 310 obtains the short-term excitation signal using the long-term gain obtained by the long-term gain predictor 300.

That is, the short-term excitation signal predictor 310 normalizes the log spectrum in each band using the long-term gain obtained in the log spectrum domain, which yields the short-term excitation signal of Equation 4.

[Equation 4]

The short-term excitation signal of Equation 4 can be expressed as in Equation 5.

[Equation 5]

Here, the symbol in Equation 5 denotes the currently input signal.
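Equations 4 and 5 are not reproduced in this text, so the exact normalization is unknown. The sketch below shows one plausible reading, stated as an assumption: in the log domain, normalizing by a gain corresponds to subtracting the per-band long-term gain from the log spectrum. The function names are hypothetical.

```python
def short_term_excitation(log_spec, gain):
    """Per-band residual after removing the long-term gain.

    Assumes log-domain normalization is a subtraction (not confirmed
    by the source, whose Equation 4 is unavailable).
    """
    if len(log_spec) != len(gain):
        raise ValueError("log spectrum and gain must have the same length")
    return [x - g for x, g in zip(log_spec, gain)]


def recombine(excitation, gain):
    """Inverse of short_term_excitation under the same assumption."""
    return [e + g for e, g in zip(excitation, gain)]
```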

According to another embodiment of the present invention, when building the codebook for the long-term gain, each codebook may instead be obtained through the relationship between the long-term gain and the short-term excitation signal, using an EM algorithm as in Equation 6.

[Equation 6]

The first combiner 320 combines the long-term gain obtained by the long-term gain predictor 300 with the short-term excitation signal obtained by the short-term excitation signal predictor 310 and passes the result to the synthesis filter 330.

The synthesis filter 330 receives the long-term gain signal and the short-term excitation signal from the first combiner 320 and generates an estimate of the input signal based on the quantized spectral coefficients and the combined excitation signal. It then passes this estimate to the second combiner 340.

The second combiner 340 subtracts the input signal estimate output by the synthesis filter 330 from the log spectrum output by the log spectrum generator 240. The difference between the input log spectrum and the input signal estimate is applied to the error minimizer 350.

The second combiner 340 also combines the long-term gain and short-term excitation signal output by the synthesis filter 330 with the log spectrum output by the log spectrum generator 240 to generate an index signal for each.

The error minimizer 350 generates a perceptually weighted error signal based on a weighting function and on the difference, output by the second combiner 340, between the input signal estimate and the input log spectrum.

The perceptually weighted error signal is then used to determine the optimal set of excitation-vector-related parameters that produces the best estimate of the input signal. The quantized spectrum and the optimal parameter set are then delivered to the server 150 through the transmitter 110.

In this way, the quantization unit 250 treats the long-term gain obtained by the long-term gain predictor 300 and the short-term excitation signal obtained by the short-term excitation signal predictor 310 as a combination of independent parameters reflecting long-term and short-term statistics, respectively.

Accordingly, as in speech coding techniques, the quantization unit 250 quantizes the combination of the long-term and short-term parameters using a dual codebook in the form of an adaptive codebook.
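The analysis-by-synthesis loop formed by the combiners, synthesis filter, and error minimizer can be sketched as an exhaustive search over two codebooks. This is an illustration under stated assumptions: the synthesis step is modelled as a simple per-band sum of a gain codeword and an excitation codeword, the weighting is a per-band weight vector, and all names are hypothetical. The source does not specify the synthesis filter or the weighting function.

```python
def search_dual_codebook(target, gain_codebook, excitation_codebook,
                         weights=None):
    """Return (gain_index, excitation_index, error) minimizing weighted error.

    target: log spectrum to quantize (list of floats).
    gain_codebook / excitation_codebook: lists of candidate vectors.
    weights: per-band error weights (assumed stand-in for the weighting
    function); defaults to uniform weights.
    """
    if weights is None:
        weights = [1.0] * len(target)
    best_gain, best_exc, best_err = None, None, float("inf")
    for gi, gain in enumerate(gain_codebook):
        for ei, exc in enumerate(excitation_codebook):
            # Synthesis (assumed): estimate = gain + excitation per band.
            estimate = [g + e for g, e in zip(gain, exc)]
            # Weighted squared difference between input and estimate.
            err = sum(w * (t - s) ** 2
                      for w, t, s in zip(weights, target, estimate))
            if err < best_err:
                best_gain, best_exc, best_err = gi, ei, err
    return best_gain, best_exc, best_err


def decode_indices(gain_index, exc_index, gain_codebook, excitation_codebook):
    """Rebuild the quantized log spectrum from the two transmitted indices."""
    return [g + e for g, e in zip(gain_codebook[gain_index],
                                  excitation_codebook[exc_index])]
```

Only the two indices need to be sent per frame in this sketch; the receiver holds the same codebooks and rebuilds the vector with `decode_indices`.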

The speech vocoder analyzer 106 extracts additional data about the speech input that can be used for speech reconstruction.

The adder 108 combines the coded signal from the DSR signal generator 104 and the additional data extracted by the speech vocoder analyzer 106 into a single signal and passes it to the transmitter 110.

The transmitter 110 may be a radio frequency transmitter or a transceiver.

The signal sent by the transmitter 110 is delivered to the server 150, which reconstructs the input speech signal using the log spectrum.

That is, the server 150 compensates for damage to the log spectrum transmitted from the client 100 and then extracts the cepstrum to perform speech recognition.

The server 150 includes a receiver 152 that receives the log spectrum transmitted from the client 100, a log spectrum loss compensator 154, a discrete cosine transform unit 156, and a speech recognizer 158.

The log spectrum loss compensator 154 uses the log spectrum received through the receiver 152 to determine whether any band has been lost and, if so, recovers the lost band from the received log spectrum.

The log spectrum (x) received by the receiver 152 is given by Equation 7.

[Equation 7]

Here, the two components in Equation 7 are the log spectrum of the lost bands and the log spectrum of the bands that were not lost.

Assuming that the correlation between the two codewords in the log spectrum is already known, the lost log spectrum is compensated according to Equation 8.

[Equation 8]

Here, the symbols in Equation 8 denote the spectrum recovered for the lost log spectrum and the mean.

As described above, the log spectrum loss compensator 154 recovers the lost log spectrum by applying missing feature theory, one of the enhancement methods for the log spectrum, to the log spectrum received through the receiver 152.

Missing feature theory takes the dimensions of the missing (damaged) part of the 23-dimensional log spectrum as one set and those of the undamaged part as another, and then restores the log spectrum of the damaged part using the undamaged dimensions and the correlation matrix.
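Equation 8 is not reproduced in this text. A standard way to restore missing dimensions from observed ones using a mean and a covariance (correlation) matrix is the conditional-mean estimate, sketched below as an assumption rather than as the patent's exact formula: lost = mean_lost + C_lost,obs · C_obs,obs⁻¹ · (observed − mean_obs). The function names and the pure-Python linear solver are illustrative.

```python
def solve_linear(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / aug[i][i]
    return z


def reconstruct_missing(x, missing, mean, cov):
    """Fill the missing dimensions of x from the observed ones.

    x: received log spectrum (values at missing indices are ignored).
    missing: indices of lost bands.
    mean, cov: statistics assumed to be known in advance.
    """
    lost = sorted(missing)
    observed = [i for i in range(len(x)) if i not in missing]
    if not lost:
        return list(x)
    if not observed:
        return list(mean)            # nothing to condition on
    cov_oo = [[cov[i][j] for j in observed] for i in observed]
    diff = [x[i] - mean[i] for i in observed]
    z = solve_linear(cov_oo, diff)   # C_oo^{-1} (x_o - mean_o)
    out = list(x)
    for m in lost:
        out[m] = mean[m] + sum(cov[m][o] * z[k]
                               for k, o in enumerate(observed))
    return out
```

The stronger the correlation between a lost band and the surviving bands, the further the estimate moves from the mean toward what the surviving bands imply.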

The discrete cosine transform unit 156 extracts a feature vector from the log spectrum in which the lost portion has been compensated by the log spectrum loss compensator 154.

That is, the discrete cosine transform unit 156 transforms the 23 log spectral values into the cepstral domain. Only the first 13 values are computed; the remaining 10 are not. The discrete cosine transform unit 156 then sends the extracted feature vector to the speech recognizer 158.
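The conversion from 23 log spectral values to 13 cepstral coefficients can be sketched with an unnormalized DCT-II. The exact DCT variant and scaling used in the patent are not stated, so both are assumptions here, and the function name is hypothetical.

```python
import math


def log_spectrum_to_cepstrum(log_spec, num_coeffs=13):
    """Unnormalized DCT-II of the log spectrum, truncated to num_coeffs.

    c[i] = sum_j log_spec[j] * cos(pi * i * (j + 0.5) / N), i = 0..num_coeffs-1
    Only the first num_coeffs outputs are computed; the rest are skipped.
    """
    n = len(log_spec)
    return [sum(f * math.cos(math.pi * i * (j + 0.5) / n)
                for j, f in enumerate(log_spec))
            for i in range(num_coeffs)]
```

Each cepstral coefficient mixes all 23 bands, which is why a single lost band is easier to repair before this transform than after it.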

The speech recognizer 158 then performs speech recognition using the extracted feature vector and outputs the result.

FIG. 4 is a flowchart showing how a client encodes a speech signal in a distributed speech recognition system according to the present invention.

Referring to FIG. 4, the client generates a log spectrum by applying a log function to the filter bank energies produced by mel filtering (S400).

After step S400, the client predicts the long-term gain using the generated log spectrum (S402) and predicts the short-term excitation signal using the predicted long-term gain (S404).

That is, the client predicts the long-term gain for the generated log spectrum by taking into account the temporal correlation between different frames. The client then normalizes the log spectrum in each band using the predicted long-term gain to predict the short-term excitation signal.

After step S404, the client performs quantization using a dual codebook of the long-term gain and the short-term excitation signal (S406). That is, the client treats the long-term gain and the short-term excitation signal as a combination of independent parameters reflecting long-term and short-term statistics, respectively, and quantizes the combination of the long-term and short-term parameters using a dual codebook in the form of an adaptive codebook.

The quantized log spectral values are transmitted to the server.

How the server recognizes the speech signal using the log spectrum transmitted from the client is described with reference to FIG. 5.

FIG. 5 is a flowchart showing how a server decodes a speech signal in a distributed speech recognition system according to the present invention.

Referring to FIG. 5, when a log spectrum is received from the client (S500), the server determines whether any band in the log spectrum has been lost (S502).

If step S502 finds a lost band in the log spectrum, the server recovers the lost band using the received log spectrum (S504).

That is, the server recovers the lost log spectrum by applying missing feature theory, one of the enhancement methods for the log spectrum.

After step S504, the server applies the log spectrum recovered in step S504 to the received log spectrum and extracts the cepstrum (S506).

After step S506, the server performs speech recognition using the extracted cepstrum (S508).

If step S502 finds no lost band in the received log spectrum, the server extracts the cepstrum from the received log spectrum (S510) and proceeds to step S508.

상술한 바와 같은 본 발명의 방법은 프로그램으로 구현되어 컴퓨터로 읽을 수 있는 형태로 기록매체에 저장될 수 있다. 이러한 과정은 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자가 용이하게 실시할 수 있으므로 더 이상 상세히 설명하지 않기로 한다.The method of the present invention as described above may be implemented in a program and stored in a recording medium in a computer-readable form. Since this process can be easily carried out by those skilled in the art will not be described in more detail.

본 발명은 상기 실시예에 한정되지 않으며, 많은 변형이 본 발명의 사상 내에서 당 분야에서 통상의 지식을 가진 자에 의하여 가능함은 물론이다.The present invention is not limited to the above embodiments, and many variations are possible by those skilled in the art within the spirit of the present invention.

상술한 바와 같이 본 발명에 따르면, 로그 스펙트럼이 대역별 보정이 강하다는 성질을 이용함으로써 에러 보정 코드에서 수정할 수 없는 특징벡터들도 보정함으로써 네트워크 상태가 좋지 않은 분산 음성 인식 시스템에서의 음성인식의 성능을 향상시킬 수 있는 분산 음성 인식 시스템에서 음성 신호의 코딩 방법 및 그 장치를 제공할 수 있다. As described above, according to the present invention, the performance of speech recognition in a distributed speech recognition system in which network conditions are poor by correcting feature vectors that cannot be corrected in an error correction code by using a property of strong band-by-band correction. The present invention provides a method and apparatus for coding a speech signal in a distributed speech recognition system.

In addition, by modeling the log spectrum as a long-term gain and a short-term excitation signal, the present invention can provide a method and apparatus for coding a speech signal in a distributed speech recognition system that enable efficient and accurate transmission of the log spectrum, instead of the cepstrum, in a distributed speech recognition system frame.
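One way to picture the gain and excitation model is a gain-shape split of each frame. The sketch below is illustrative only: it assumes a scalar per-frame gain tracked by recursive averaging and an excitation obtained by dividing each band by that gain. The smoothing factor `alpha`, the function names, and the gain estimator itself are assumptions made for this example and are not the patent's gain formula.

```python
def update_gain(prev_gain, frame, alpha=0.9):
    """Recursively smoothed mean magnitude of the frame (assumed gain estimator)."""
    level = sum(abs(v) for v in frame) / len(frame)
    return alpha * prev_gain + (1.0 - alpha) * level

def split_frame(frame, gain):
    """Normalize each band by the gain to get the excitation (assumed form e = z / g)."""
    return [v / gain for v in frame]

def rebuild_frame(excitation, gain):
    """Inverse of split_frame: z = g * e."""
    return [gain * e for e in excitation]
```

The gain must be positive for the split to be defined. Separating a slowly varying level from a per-band shape is what lets the two parts be quantized with separate codebooks.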

In addition, the present invention can provide a method and apparatus for coding a speech signal in a distributed speech recognition system that perform noise preprocessing for speech recognition on the log spectrum, the stage that precedes the cepstrum, so that the relationship between noise and speech can be described clearly.
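To illustrate why the log-spectral domain makes this relationship explicit: if speech and noise are assumed to be additive and uncorrelated in the power domain, the noisy log power in a band is a closed-form function of the clean and noise log powers. The sketch below shows this standard relation; it is not taken from the patent, and the function name is hypothetical.

```python
import math

def noisy_log_power(clean_log, noise_log):
    """Return log(exp(clean_log) + exp(noise_log)), computed in a numerically stable way."""
    hi = max(clean_log, noise_log)
    lo = min(clean_log, noise_log)
    return hi + math.log1p(math.exp(lo - hi))
```

When the noise is far below the speech, the result is essentially the clean log power; when the two are equal, the band rises by log 2.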

In addition, the present invention can provide a method and apparatus for coding a speech signal in a distributed speech recognition system in which, when the signal of a specific band is lost, that band can be easily recovered using the properties of the log spectrum.
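As a rough illustration of band recovery, the sketch below fills each lost band with the average of the bands that did arrive in the same frame. This is only the simplest form of imputation, chosen for brevity; the patent's compensation formula is not reproduced here, and the `None` marker and function name are assumptions.

```python
def fill_lost_bands(log_spectrum):
    """Replace None entries (lost bands) with the mean of the received bands."""
    received = [v for v in log_spectrum if v is not None]
    if not received:
        raise ValueError("no received bands to estimate from")
    mean = sum(received) / len(received)
    return [mean if v is None else v for v in log_spectrum]
```

For example, `fill_lost_bands([2.0, None, 4.0])` yields `[2.0, 3.0, 4.0]`. A practical system would more likely estimate lost bands from a statistical model of the log spectrum than from a plain average.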

Claims

A method of encoding a speech signal in a distributed speech recognition system, the method comprising:

generating a log spectrum by applying a log function to a speech signal on which mel filtering has been performed;

generating a long-term gain and a short-term excitation signal using the generated log spectrum; and

quantizing the generated long-term gain and short-term excitation signal using a double codebook, and then transmitting them.

The method of claim 1, wherein the long-term gain (g) is obtained by

[equation missing from the extracted text]

where [symbol missing] is the noisy power spectral density, obtained by

[equation missing from the extracted text]

and where [symbol missing] is the current input signal, [symbol missing] is the past input signal, [symbol missing] is the time, k is the frequency, and [symbol missing] is the power of the noise.

The method of claim 1, wherein the short-term excitation signal ([symbol missing]) is formed by normalizing the log spectrum in each band using the long-term gain, according to

[equation missing from the extracted text]

where g is the long-term gain and z(w) is the input signal.

The method of claim 1, wherein the quantization of the generated long-term gain and short-term excitation signal using the double codebook is performed by an analysis-by-synthesis technique.
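An analysis-by-synthesis search over two codebooks can be sketched as below: every pair of gain and excitation entries is synthesized, compared with the target log spectrum, and the pair with the smallest squared error is kept. The exhaustive search, the product form `g * e`, and all names are assumptions made for illustration, not the claimed quantizer.

```python
def search_codebooks(target, gain_codebook, excitation_codebook):
    """Return (gain_index, excitation_index, error) of the best-matching pair."""
    best = None
    for gi, g in enumerate(gain_codebook):
        for ei, e in enumerate(excitation_codebook):
            synth = [g * v for v in e]                               # synthesis step
            err = sum((t - s) ** 2 for t, s in zip(target, synth))   # analysis step
            if best is None or err < best[2]:
                best = (gi, ei, err)
    return best
```

Under this scheme only the two indices need to be transmitted; a receiver holding the same codebooks can rebuild the frame as the product of the selected entries.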

A method of decoding a speech signal in a distributed speech recognition system, the method comprising:

receiving a log spectrum from a client;

compensating for a lost band in the received log spectrum;

performing a discrete cosine transform on the log spectrum in which the lost band has been compensated, to convert it into a cepstrum; and

performing speech recognition using the converted cepstrum.

The method of claim 5, wherein the lost band in the received log spectrum is compensated using missing feature theory.

The method of claim 5, wherein the compensation of the lost band ([symbol missing]) is performed by

[equation missing from the extracted text]

where [symbol missing] is the log spectrum of the lost band, [symbol missing] is the log spectrum of the unlost band, and [symbol missing] is the average.

An apparatus for encoding a speech signal in a distributed speech recognition system, the apparatus comprising:

a log spectrum generator for generating a log spectrum by applying a log function to a speech signal on which mel filtering has been performed;

a quantization unit for generating a long-term gain and a short-term excitation signal using the log spectrum generated by the log spectrum generator, and for quantizing the generated long-term gain and short-term excitation signal using a double codebook; and

a transmitter for transmitting the log spectrum quantized by the quantization unit.

The apparatus of claim 8, wherein the quantization unit comprises:

a long-term gain predictor for predicting a long-term gain using the log spectrum generated by the log spectrum generator;

a short-term excitation signal predictor for generating a short-term excitation signal by normalizing the log spectrum in each band using the long-term gain predicted by the long-term gain predictor;

a first combiner for combining the long-term gain generated by the long-term gain predictor and the short-term excitation signal generated by the short-term excitation signal predictor;

a second combiner for combining the long-term gain and short-term excitation signal output from the first combiner with the log spectrum output from the log spectrum generator to generate an index signal for each; and

an error minimizer for minimizing the error of the signal output from the second combiner and feeding the result back to the long-term gain predictor and the short-term excitation signal predictor.

The apparatus of claim 8, wherein the transmitter is one of a radio frequency transmitter and a transceiver.

An apparatus for decoding a speech signal in a distributed speech recognition system, the apparatus comprising:

a receiver for receiving a log spectrum from a client;

a log spectrum loss compensator for compensating for a lost band by applying missing feature theory to the log spectrum output from the receiver;

a discrete cosine transform unit for performing a discrete cosine transform on the log spectrum compensated by the log spectrum loss compensator, thereby converting it into the cepstral domain; and

a speech recognition unit for performing speech recognition using the cepstrum produced by the discrete cosine transform unit.