KR100628170B1

KR100628170B1 - Apparatus and method for coding speech

Info

Publication number: KR100628170B1
Application number: KR1019990017424A
Authority: KR
Inventors: 레드코브빅터브이.; 티크호트스키아나토리아이; 마이보로다알렉산드르엘; 드조우리스키에우게네브이
Original assignee: 엘지전자 주식회사
Priority date: 1999-05-14
Filing date: 1999-05-14
Publication date: 2006-09-27
Anticipated expiration: 2019-05-14
Also published as: KR20000073865A

Abstract

The present invention relates to a speech compression apparatus and method that divide a speech signal into a plurality of frames, assign a frame classification to each frame, and determine speech modeling parameters based on the assigned classification. The voiced and unvoiced portions of the speech spectrum are synthesized separately using analysis by synthesis, which enables accurate correspondence between the voiced and unvoiced portions of the reconstructed signal. In particular, the frequency response of a special simulated signal, based on the previous and current frames, is used as the approximating function. The simulated signal is synthesized at the encoder side in the same manner in which it can be generated at the decoder side. In addition, two effective encoding methods are selected for encoding the spectral magnitudes. The present invention can therefore reconstruct synthesized digital speech efficiently and accurately.

Speech coding, coding apparatus

Description

Apparatus and method for coding speech {APPARATUS AND METHOD OF SPEECH CODING}

FIG. 1 is a block diagram showing an encoder according to the present invention.

FIG. 2 is a diagram showing scaling of the Hamming window response according to the present invention.

FIG. 3 is a diagram showing the direct method of speech model parameter determination according to the present invention.

FIG. 4 is a diagram showing the analysis-by-synthesis method of speech model parameter determination according to the present invention.

FIG. 5 is a block diagram showing the analysis-by-synthesis method of spectral magnitude determination according to the present invention.

FIG. 6 is a block diagram showing hybrid encoding of the spectral magnitude vector according to the present invention.

FIG. 7 is a block diagram showing a decoder according to the present invention.

FIG. 8 is a block diagram showing voiced speech synthesis according to the present invention.

FIG. 9 is a diagram showing an example of band/frequency correspondence in voiced speech synthesis according to the prior art.

FIG. 10 is a diagram showing the frequency responses of some bands in voiced speech synthesis according to the prior art.

FIG. 11 is a diagram showing an example of an excitation spectrum in voiced speech synthesis according to the prior art.

FIG. 12 is a diagram showing an example of band/frequency correspondence in the voiced speech synthesis method according to the present invention.

FIG. 13 is a diagram showing the frequency responses of some bands for the voiced speech synthesis method according to the present invention.

FIG. 14 is a diagram showing an example of a voiced excitation spectrum obtained using the voiced synthesis procedure according to the present invention.

FIG. 15 is a block diagram showing unvoiced speech synthesis.

The present invention relates to a communication system, and more particularly to a speech compression method for a communication system.

Many speech compression systems are known. In general, these systems can be divided into three types: time-domain, frequency-domain, and hybrid codecs. For low-bit-rate coding, however, the multi-band excitation (MBE) compression technique provides the best quality of decoded speech.

An MBE vocoder encodes a speech signal by first dividing the input speech into frames of limited length. These frames are transformed from the time domain to the frequency domain: the frequency spectrum of the framed and windowed signal is computed and then analyzed. Speech synthesis in the MBE vocoder requires speech model parameters such as the pitch value, a set of voiced/unvoiced decisions for the frequency bands, a set of spectral magnitudes, and the corresponding phase values. In low-bit-rate coding, the phase values are generally not transmitted.

There are many methods of spectral approximation, all of which are based on approximating the frequency bands with an excitation function. The most traditional excitation function is the frequency response of the Hamming window. However, the Hamming window response approximates the original spectrum well only for stationary speech signals. For non-stationary speech signals, a predefined excitation function does not match the actual shape of the spectrum closely enough for accurate approximation. For example, a change of pitch frequency during the analysis interval can widen the peaks in the spectral magnitude envelope, so that the peak width of the predefined excitation function no longer matches the width of the actual peak. Moreover, if the analyzed speech frame is a mixture of two different processes, the spectrum takes on a very complex shape that is difficult to approximate accurately with a simple predefined excitation function.

There are also several techniques for encoding the MBE parameters. Typically, simple scalar quantization is used to encode the pitch value, and band grouping is used to encode the voiced/unvoiced decisions. The most difficult task is encoding the spectral magnitudes, for which techniques such as vector quantization (VQ) and linear prediction are used. Many high-efficiency compression methods based on VQ have been proposed, one of which is the hierarchical structured codebook method for encoding the spectral magnitudes.

Although VQ allows accurate quantization in some problem areas, it is generally effective only for data similar to the data included in its "learning sequence". Other effective methods of encoding the spectral magnitudes are intra-frame and inter-frame linear prediction. The intra-frame method can encode the spectral magnitudes adequately, but its effectiveness drops considerably in low-bit-rate coding. The inter-frame prediction method is also good, but it is suitable only for stationary speech signals.

Speech synthesis according to the prior art is performed in accordance with the accepted speech model. In an MBE vocoder, two components, the voiced and unvoiced portions of the speech, are generally synthesized separately and then combined to produce the complete speech signal.

The unvoiced component of the speech is generated for the frequency bands that have been determined to be unvoiced. For each speech frame, a block of random noise is windowed and transformed to the frequency domain, where the parts of the spectrum corresponding to voiced harmonics are set to zero. The remaining spectral components, which correspond to the unvoiced portion of the speech, are normalized to the unvoiced harmonic magnitudes.

A different technique is used in the MBE method to generate the voiced component of the speech. Because voiced speech is modeled by its individual harmonics in the frequency domain, it can be implemented at the decoder as a bank of tuned oscillators. Each oscillator is defined by its amplitude, frequency, and phase, and one oscillator is assigned to each harmonic in the voiced regions of the frame.

However, variations in the parameters measured in adjacent frames can cause discontinuities at the frame edges, which can degrade speech quality considerably. During synthesis, therefore, the parameters of the current and previous frames are interpolated to smooth the transition at the frame boundary, yielding voiced speech that is continuous across the boundary.

Different interpolation methods can be used for amplitude, frequency, and phase. However, interpolation generally gives satisfactory results only when the pitch is steady. When the pitch changes rapidly, the conventional processing rules fail to give satisfactory results because of the traditional lacing of harmonics, which links the frequency bands that have the same number in neighboring speech frames. When the pitch frequency changes, a frequency difference arises between the laced harmonics, and this difference grows with the band number and with the size of the pitch change. As a result, undesirable artifacts appear in the decoded speech.

Accordingly, the present invention has been made to solve the problems of the prior art, and an object of the present invention is to provide a method that improves the quality of speech spectrum approximation for both voiced and unvoiced bands.

Another object of the present invention is to improve the encoding efficiency of the set of spectral magnitudes regardless of the encoding bit rate.

A further object of the present invention is to improve the quality of speech synthesis.

Other advantages, objects, and features of the invention will be set forth in the detailed description that follows, and in part will become apparent to those skilled in the art upon examination of the invention or may be learned from practice of the invention. The objects and advantages of the invention may be realized and attained as particularly pointed out in the claims.

To achieve the objects of the present invention as described herein, speech spectrum approximation is performed on a spectrum that is divided into a plurality of bands according to the pitch frequency of the speech frame. The pitch frequency of the speech signal is determined, the frequency bands are constructed, and voiced/unvoiced discrimination of the frequency bands is performed. An analysis-by-synthesis method of speech spectrum approximation is then used to calculate the spectral magnitudes.

More precise evaluation of the harmonic magnitudes at the encoder side improves the quality of the voiced portion of the reconstructed signal at the decoder side. Likewise, more precise calculation of the magnitudes in the unvoiced bands of the spectrum improves the quality of the noise portion of the reconstructed signal. Using the analysis-by-synthesis method for both voiced and unvoiced bands gives accurate correspondence between the voiced and unvoiced portions of the reconstructed signal.

The present invention also improves the encoding efficiency of the set of spectral magnitudes. In low-bit-rate encoding, the problem is to represent the spectral magnitude data with a fixed number of bits. Spectral magnitude encoding according to the present invention therefore addresses two main tasks: reducing the original number of spectral magnitudes to a fixed number, and encoding the reduced set. The method according to the present invention solves the first task effectively by using the wavelet transform (WT). If the speech signal is stationary, the second task can be solved effectively with inter-frame prediction.

In time intervals containing non-stationary signals, however, it is more effective not to use prediction. In this case the wavelet transform technique solves the encoding problem effectively. Higher encoding efficiency either improves the quality of the reconstructed speech signal at the same bit rate or reduces the bit rate required for the same quality level.

The present invention also improves the quality of speech synthesis. Speech synthesis is performed sequentially, frame by frame. Because the fundamental frequency is the basis for dividing the whole spectrum to be approximated into bands, a change of pitch produces a difference between the frequencies of the laced harmonics. The present invention uses the frequency correspondence between the laced bands of the current and previous frames. This provides accurate and reliable speech synthesis even when the pitch frequency changes or jumps. Apparent errors in pitch determination therefore do not lead to the dramatic consequences seen with conventional methods.

A preferred embodiment of the present invention will be described with reference to the MBE encoding method. The MBE vocoder was disclosed by D. W. Griffin and J. S. Lim in "Multiband Excitation Vocoder", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 8, 1988, pp. 1223-1235, which is incorporated herein in its entirety. Typically, the vocoder operates at a speech sampling rate of 8 kHz. The use of the speech signal encoder and decoder according to the present invention is described below.

FIG. 1 shows an embodiment of an encoder according to the present invention, which includes a speech model parameter determination unit (1) and a parameter encoding unit (2). The speech model parameter determination unit (1) includes a rectangular windowing unit (10), a Hamming windowing unit (20), a frame classification unit (30), a fast Fourier transform (FFT) unit (40), a pitch detection unit (3), a V/UV discrimination unit (80), and a spectral magnitude determination unit (90). These components are used to determine the MBE model parameters: the pitch frequency, the set of voiced/unvoiced decisions, the set of spectral magnitudes, and the phase values (which are not transmitted in this embodiment because the bit rate is very low).

The parameter encoding unit (2) includes a scalar quantization unit (100), a spectral magnitude wavelet reduction unit (110), a spectral magnitude hybrid encoding unit (120), and a multiplexer unit (130). These components are used to encode the MBE model parameters into a plurality of bits. The pitch detection unit (3) includes a pitch candidate set determination unit (50), a best candidate selection unit (60), and a best candidate refinement unit (70).

For parameter measurement, the speech signal is first divided into overlapping 30 ms segments with an advance of 20-24 ms. In the rectangular windowing unit (10), the signal is multiplied by a rectangular window function (denoted with the subscript R) so that the frame classification unit (30) can classify the frame. In the Hamming windowing unit (20), the signal is multiplied by a Hamming window function (denoted with the subscript H) so that the FFT unit (40) can compute the spectrum. To increase the frequency resolution, zeros are appended to the windowed frame before the FFT is performed, producing an array of FFT_LENGTH samples. Good frequency resolution is obtained with FFT_LENGTH = 2048, but FFT_LENGTH = 512 was used for real-time applications.
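
The framing, windowing, and zero-padding steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `frame_spectra`, the choice of a 20 ms advance (the text allows 20-24 ms), and the use of NumPy's real FFT are assumptions made for the example.

```python
import numpy as np

FS = 8000          # sampling rate in Hz (stated in the description)
FRAME_MS = 30      # segment length in ms (stated in the description)
ADVANCE_MS = 20    # assumed advance; the text allows 20-24 ms
FFT_LENGTH = 512   # real-time setting named in the description

def frame_spectra(signal, fs=FS):
    """Yield (rect_frame, spectrum) for each overlapping frame (hypothetical helper)."""
    frame_len = fs * FRAME_MS // 1000    # 240 samples at 8 kHz
    advance = fs * ADVANCE_MS // 1000    # 160 samples at 8 kHz
    hamming = np.hamming(frame_len)
    for start in range(0, len(signal) - frame_len + 1, advance):
        # A rectangular window is simply the unmodified slice of the signal.
        rect_frame = signal[start:start + frame_len]
        # rfft pads the Hamming-windowed frame with zeros up to FFT_LENGTH.
        spectrum = np.fft.rfft(rect_frame * hamming, n=FFT_LENGTH)
        yield rect_frame, spectrum
```

The rectangular-windowed frame would feed the time-domain frame classification, while the zero-padded spectrum feeds the frequency-domain stages.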

The frame classification unit (30) is an auxiliary unit with respect to the prior-art MBE model. It processes the speech frame in the time domain and generates a frame classification characteristic (denoted with the subscript f), which is used for more reliable signal processing in the pitch candidate set determination unit (50) and the V/UV discrimination unit (80). Frames are classified in two ways: first, by the character and range of the signal values along the frame; second, by the character of the signal oscillation within the frame.

In the first classification, the signal types within a frame are defined as shown in Table 1 below. This classification is based on a simultaneous examination of the signal sample values and of how their character changes within the frame. The types of signal oscillation within a frame are defined as shown in Table 2 below. This second classification is based on the zero crossings in the first and second parts of the current frame.

Table 1. First method of classifying speech frames

Frame type | Description
SILENCE | Very low amplitude frame with occasional single short noise peaks
FIZZLE | Rather low amplitude frame with systematic noise peaks
VOWEL | Pure vowel frame
VOWEL_FADING | Fading vowel frame
VOWEL_RISING | Rising vowel frame
PAST_VOWEL | Frame in which a fading vowel appears only at the start of the frame
BEFORE_VOWEL | Frame in which a rising vowel appears only at the end of the frame
CHAOS | All other types of frames

Table 2. Second method of classifying speech frames

Frame type | Description
WELK | Frame containing no oscillation
VIBRATION | Frame containing oscillation at both the start and the end
PAST_VIBRATION | Frame containing oscillation only at the start
BEFORE_VIBRATION | Frame containing oscillation only at the end

As a result, 32 (that is, 8×4) combined frame types are defined. Derived frame types are specified from the combined types through logical operations. An example of a derived frame type is:

SOME_OF_VOWEL 'AND' 'NOT' SOME_OF_VIBRATION, where

SOME_OF_VOWEL = VOWEL 'OR' VOWEL_FADING 'OR' VOWEL_RISING 'OR' PAST_VOWEL 'OR' BEFORE_VOWEL, and

SOME_OF_VIBRATION = VIBRATION 'OR' PAST_VIBRATION 'OR' BEFORE_VIBRATION.
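
One way to represent the two classifications and a derived type is with bit flags. The sketch below is only an illustration: the class names `SignalType` and `OscType` and the helper `is_vowel_without_vibration` are hypothetical, while the member names come from Tables 1 and 2.

```python
from enum import Flag, auto

class SignalType(Flag):       # Table 1 (class name is an assumption)
    SILENCE = auto()
    FIZZLE = auto()
    VOWEL = auto()
    VOWEL_FADING = auto()
    VOWEL_RISING = auto()
    PAST_VOWEL = auto()
    BEFORE_VOWEL = auto()
    CHAOS = auto()

class OscType(Flag):          # Table 2 (class name is an assumption)
    WELK = auto()
    VIBRATION = auto()
    PAST_VIBRATION = auto()
    BEFORE_VIBRATION = auto()

# Groupings named in the text, built with the 'OR' operation.
SOME_OF_VOWEL = (SignalType.VOWEL | SignalType.VOWEL_FADING | SignalType.VOWEL_RISING
                 | SignalType.PAST_VOWEL | SignalType.BEFORE_VOWEL)
SOME_OF_VIBRATION = (OscType.VIBRATION | OscType.PAST_VIBRATION
                     | OscType.BEFORE_VIBRATION)

def is_vowel_without_vibration(signal_type, osc_type):
    """Derived type: SOME_OF_VOWEL 'AND' 'NOT' SOME_OF_VIBRATION."""
    return bool(signal_type & SOME_OF_VOWEL) and not bool(osc_type & SOME_OF_VIBRATION)
```

With this representation, each frame carries one `SignalType` and one `OscType`, which gives the 8×4 = 32 combined types mentioned above.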

The frame classification characteristics obtained in this way are used for pitch detection and voiced/unvoiced discrimination. The operation of the pitch detection unit (3) using the frame classification characteristics is described next.

Reliable pitch frequency detection is one of the most important and most difficult tasks, especially in real-time applications. The pitch detection method according to the present invention offers an effective and reliable solution based on analysis in both the time and frequency domains. Pitch frequency detection consists of three steps. First, the pitch candidate set determination unit (50) determines a set of pitch candidates using auto-correlation function (ACF) analysis in the time domain. Second, the best candidate selection unit (60) evaluates all candidates in the frequency domain and selects the best one. Third, the best candidate refinement unit (70) refines the value of the best candidate in the frequency domain.

Applying short-time center clipping before the ACF is calculated makes pitch detection more reliable. After center clipping, the processed frame is low-pass filtered. Depending on the frame type determined by the frame classification unit (30), either the direct or the inverse order is used for the ACF calculation. Although the same formula is used in both cases, the direct order involves the sample pairs located at the start of the frame, whereas the inverse order involves the sample pairs located at the end of the frame. For example, the inverse order of ACF calculation is applied to frame types such as VOWEL_RISING and BEFORE_VOWEL, and the direct order is applied to frame types such as VOWEL_FADING and PAST_VOWEL.
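
The following sketch illustrates center clipping and the direct/inverse ordering of the ACF. It rests on several assumptions that the source does not confirm: the clipping level is taken as a fixed fraction of the frame peak, the ACF uses a fixed number of sample pairs per lag, the inverse order is modeled by time-reversing the frame, and the low-pass filtering step is omitted. The names `center_clip`, `acf`, and `INVERSE_ORDER_TYPES` are hypothetical.

```python
import numpy as np

# Frame types the text names for the inverse order (its list is open-ended).
INVERSE_ORDER_TYPES = {"VOWEL_RISING", "BEFORE_VOWEL"}

def center_clip(frame, ratio=0.3):
    """Classic center clipper; the 0.3 ratio is an assumed value."""
    c = ratio * np.max(np.abs(frame))
    return np.where(frame > c, frame - c,
                    np.where(frame < -c, frame + c, 0.0))

def acf(frame, max_lag, n_terms, inverse=False):
    """ACF over n_terms sample pairs per lag.

    Direct order uses pairs starting at the frame start; inverse order is
    modeled here by reversing the frame so the pairs come from the frame end.
    Requires max_lag + n_terms <= len(frame).
    """
    x = frame[::-1] if inverse else frame
    return np.array([np.dot(x[:n_terms], x[lag:lag + n_terms])
                     for lag in range(max_lag + 1)])
```

A caller would pick the order from the frame type, for example `acf(center_clip(frame), 100, 120, inverse=frame_type in INVERSE_ORDER_TYPES)`.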

The pitch candidate set determination unit (50) determines the set of pitch candidates, which includes all local maxima of the ACF located to the left of the time lag corresponding to the global maximum of the ACF. The number of candidates in the set can differ from frame to frame. The range of the fine search in the frequency domain is defined by the frequency corresponding to the global maximum of the ACF.
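
A minimal sketch of the candidate selection described above, assuming the ACF is given as a sequence indexed by lag. The function name `pitch_candidates`, the `min_lag` parameter (used to skip the trivial peak at lag 0), and the decision to include the global maximum itself in the set are assumptions for illustration.

```python
def pitch_candidates(r, min_lag):
    """Return lags of local ACF maxima left of the global maximum, plus the
    global-maximum lag itself (including it is an assumption)."""
    # Lag of the global ACF maximum, ignoring lags below min_lag.
    g = max(range(min_lag, len(r)), key=lambda lag: r[lag])
    candidates = [lag for lag in range(max(min_lag, 1), g)
                  if r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]]
    return candidates + [g]
```

For example, with `r = [1.0, 0.2, 0.5, 0.1, 0.3, 0.9, 0.4]` and `min_lag=1`, the global maximum is at lag 5 and the only local maximum to its left is at lag 2.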

In the best candidate selection unit (60), all candidates of the obtained set are evaluated and the best one is selected. Specifically, the best pitch frequency is sought in the neighborhood of each pitch candidate value using the criterion of minimum weighted summarized square error (SSE) of the approximation. The approximation is performed in the frequency domain, and its quality is measured as follows. For an examined pitch frequency p, the entire frequency range is divided into n frequency bands of width p Hz. The speech spectrum in each band is approximated by the scaled Hamming window response, and the SSE of the approximation is calculated by Equation (1):

[Equation (1)]

where the amplitude value A_i for the i-th band is calculated as follows:

[Equation (2)]

Here, S is the speech spectrum, W is the scaled Hamming window response, and a_i and b_i are the numbers (indices) of the harmonics corresponding to the beginning and end of the i-th band.
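
The equations themselves are not reproduced in this text, so the sketch below uses the standard least-squares form from the MBE literature as an assumption: for each band, A_i = Σ|S(j)|·W(j) / Σ W(j)², and the band error is Σ(|S(j)| − A_i·W(j))², with sums over j = a_i..b_i. The patent's exact Equations (1) and (2) may differ (for instance, in how the weight is applied). The function name and argument layout are hypothetical.

```python
import numpy as np

def approximation_sse(spec_mag, bands, responses):
    """Sum of per-band squared errors for one examined pitch (assumed form).

    spec_mag  : magnitude spectrum |S| as a 1-D array of FFT bins
    bands     : list of (a_i, b_i) inclusive bin indices for each band
    responses : list of arrays W for each band, same length as the band
    """
    total = 0.0
    for (a, b), w in zip(bands, responses):
        s = spec_mag[a:b + 1]
        amp = np.dot(s, w) / np.dot(w, w)          # least-squares amplitude A_i
        total += float(np.sum((s - amp * w) ** 2)) # band error SSE_i
    return total
```

Under this form, a band whose spectrum is an exact multiple of the window response contributes zero error, so the candidate pitch whose band layout best fits the harmonic peaks gives the smallest total.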

Traditionally, a Hamming window response of constant width is used in the MBE method. However, extensive experiments have shown that using a Hamming window response of fixed shape gives an unjustified advantage to low pitch frequencies (for example, sub-harmonics of the true pitch value).

In the preferred embodiment, as shown in FIG. 2, a special scale factor corresponding to the examined pitch value is used to scale the Hamming window response. Scaling is performed for frequencies lower than a fixed frequency F_scale. The value F_scale = 140 Hz was determined experimentally. Scaling is performed as follows. For a given FFT_LENGTH, out of the total number of harmonics in the spectrum obtained by the FFT, the original Hamming window response has N_orig components that differ significantly from zero. For FFT_LENGTH = 2048, N_orig was 3.1. At a low fundamental frequency F_0exam < F_scale, the scaled Hamming window response used as the excitation function should have a sharper shape, so the array of response values should have N_0 < N_orig components that differ significantly from zero. The number of these components is calculated as N_0 = int[N_orig · (FFT_LENGTH/2048) · (F_0exam/F_scale)]. To obtain the scaled response, a proportional sharpening procedure based on linear interpolation is applied to the original Hamming window response.
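
The N_0 formula and the interpolation-based sharpening can be sketched as below. The formula follows the text; the sharpening step is only one plausible reading of "proportional sharpening based on linear interpolation" (resampling the significant part of the response onto fewer points), and the function names and the example value of `n_orig` used in any call are hypothetical.

```python
import numpy as np

F_SCALE = 140.0  # Hz, experimentally determined value given in the text

def significant_components(f0_exam, n_orig, fft_length):
    """N_0 = int[N_orig * (FFT_LENGTH / 2048) * (F_0exam / F_scale)]."""
    return int(n_orig * (fft_length / 2048) * (f0_exam / F_SCALE))

def sharpen_response(orig_response, n0):
    """Resample the significant samples of the original response onto n0
    points by linear interpolation (assumed interpretation of the procedure)."""
    src = np.linspace(0.0, 1.0, len(orig_response))
    dst = np.linspace(0.0, 1.0, n0)
    return np.interp(dst, src, orig_response)
```

Because the result depends only on F_0exam, the sharpened responses for all examined pitch values below F_scale could be precomputed once.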

With the present invention, the approximation of the frequency bands corresponding to low pitch frequencies is improved. The appropriate sharpness of the approximating function in each band makes it possible to select the true pitch candidate. In real-time applications, the scaled Hamming window responses corresponding to the different values F_0exam < F_scale can all be tabulated and used as a look-up table. In addition, to avoid an unjustified advantage for high pitch frequencies (for example, multiples of the true pitch value), the first band is enlarged relative to the other bands in the calculation of SSE_1 so that a_1 = 1.

However, when the best candidate is selected, different parts of the spectrum may not be equally important. To account for this, a weighting coefficient Q is introduced into the SSE calculation. A piecewise-linear weight Q was used, as given by equation (3) below.

Here, bf is the harmonic number at which fading of the weight begins and ef is the harmonic number at which fading of the weight ends, where 0 ≤ bf < ef. The best pitch frequency value obtained is then refined in the optimal candidate refinement unit 70, which searches for the optimal pitch frequency in the vicinity of the pitch candidate value using the minimum of the approximation error without weighting.
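A piecewise-linear weight of this kind might look like the sketch below. Equation (3) itself is not reproduced in this text, so the shape used here (weight 1 up to bf, falling linearly to 0 at ef) is an assumption, and the names `fading_weight` and `weighted_sse` are hypothetical.

```python
def fading_weight(k, bf, ef):
    """Assumed piecewise-linear weight Q for harmonic number k.

    Q = 1 up to bf, fades linearly between bf and ef, and is 0 from ef on.
    """
    if not 0 <= bf < ef:
        raise ValueError("expected 0 <= bf < ef")
    if k <= bf:
        return 1.0
    if k >= ef:
        return 0.0
    return (ef - k) / (ef - bf)


def weighted_sse(errors, bf, ef):
    """Sum of per-harmonic squared errors, each scaled by its weight.

    `errors[k]` is the squared error attributed to harmonic number k.
    """
    return sum(fading_weight(k, bf, ef) * e for k, e in enumerate(errors))
```

With such a weight, errors at high harmonic numbers contribute less to candidate selection, while the unweighted error is still available for the later refinement step.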

An important feature of the MBE method is that voiced/unvoiced decisions are made for each frequency band of the original spectrum rather than for the whole frame. Harmonic components may be present in fricative frames, and voiced frames may contain some noise bands. The voiced/unvoiced decision for each frequency band of the original spectrum is made in the V/UV discrimination unit 80. In low-bit-rate implementations of the MBE vocoder, decisions are made for groups of adjacent frequency bands. In the preferred embodiment, adaptive band division (in proportion to the number of bands) is used.

The voicing discrimination process starts once the pitch value has been obtained. For discrimination, the original spectrum is divided into frequency bands according to the given pitch value, and each frequency band is approximated by the scaled frequency response of the Hamming window. The frequency response is scaled by the same technique, and for the same reasons, as described for the optimal candidate selection unit 60. Scaling provides the correct relationship between the width of the frequency band and the approximating window. It is also very important to align the position of the approximating window precisely with the position of the frequency band peak.

The noise-to-signal ratio (NSR) of the approximation defines the voiced/unvoiced property of a group of frequency bands. The threshold for the NSR value depends on the classification characteristics of the current frame. For example, if the amplitude characteristic of the frame belongs to both VOWEL types but not to a VIBRATION type, the threshold is increased by the divisor

so that a voiced decision is made for clearly voiced frames. Conversely, for clearly consonant frames, where the amplitude characteristic does not belong to a VOWEL type but belongs to one of the VIBRATION types, the threshold is decreased by the divisor

so that an unvoiced decision is made for clearly consonant frames. If the classification is not clear, the threshold is left unchanged at its predefined value.

The approximation quality measure, the NSR, is calculated by the following equation.

Here, NSR_i is the noise-to-signal ratio of the i-th band group, which comprises the bands from n_i to n_{i+1} - 1, where n_i is the band number of the first frequency band of the i-th group; Err_m is the summed squared error of the approximation for the m-th band; S(k) is the magnitude of the k-th harmonic of the approximated spectrum; and a_i and b_i are the harmonic numbers at the beginning and end of the i-th band group, where a_i is the harmonic number of the first harmonic of the n_i-th band and b_i is the harmonic number of the last harmonic of the (n_{i+1} - 1)-th band.

Err_m is determined separately for each band in the group. For this determination, the position of the scaled Hamming window response is tuned to the frequency band peak in voiced frames. This yields accurate voiced/unvoiced decisions and is carried out as follows. The Err_m value is calculated for several positions of the approximating window relative to the center of the frequency band. The window position giving the minimum Err_m is then selected, and the best NSR_i value for the whole band group is obtained from the minimum Err_m values of the bands in the group. The voiced/unvoiced decision is then made using the NSR criterion described above.
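The position tuning and the group NSR can be illustrated with a small sketch. The NSR equation is not reproduced in this text, so the ratio used here (sum of the minimum Err_m values over the energy of the group's spectrum) is an assumption inferred from the symbol definitions; the sketch also works on real magnitudes with integer bin shifts, and all function names are hypothetical.

```python
def window_at(window, offset):
    """Window sample `offset` bins from the window peak (zero outside it)."""
    i = offset + len(window) // 2
    return window[i] if 0 <= i < len(window) else 0.0


def band_error(spectrum, lo, hi, window, shift):
    """Least-squares error of fitting the shifted window to spectrum[lo:hi]."""
    mid = (lo + hi - 1) // 2 + shift          # bin where the window peak sits
    w = [window_at(window, k - mid) for k in range(lo, hi)]
    s = spectrum[lo:hi]
    energy = sum(x * x for x in w)
    gain = sum(a * b for a, b in zip(s, w)) / energy if energy else 0.0
    return sum((a - gain * b) ** 2 for a, b in zip(s, w))


def tuned_band_error(spectrum, lo, hi, window, max_shift=1):
    """Minimum Err_m over several window positions around the band center."""
    return min(band_error(spectrum, lo, hi, window, d)
               for d in range(-max_shift, max_shift + 1))


def group_nsr(spectrum, bands, window, max_shift=1):
    """Assumed NSR_i: summed minimum band errors over the group's energy."""
    noise = sum(tuned_band_error(spectrum, lo, hi, window, max_shift)
                for lo, hi in bands)
    signal = sum(x * x for lo, hi in bands for x in spectrum[lo:hi])
    return noise / signal if signal else 0.0
```

A group would then be declared voiced when `group_nsr` falls below the frame-dependent threshold described earlier.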

The determination of spectral magnitudes using the AbS method will now be described. In general, the purpose of the analysis and coding functions is to obtain, at the transmitting side, the data needed to generate speech at the receiving side. Under the MBE model, speech is generated from speech model parameters comprising the pitch value, which induces the system of harmonic bands; the set of voiced/unvoiced decisions for the frequency bands; the set of spectral amplitudes; and the corresponding phase values.

As shown in FIG. 3, the speech model parameters can be calculated directly, that is, explicitly, and output to the encoder. Analysis by synthesis, however, defines all or some of the speech model parameters implicitly before they are output to the encoder. Referring to FIG. 4, the conceptual scheme of the AbS method for parameter determination comprises a quality evaluation unit, a synthesis unit, and a search unit. The same synthesis unit is used to generate speech at the transmitting side and at the receiving side. A set of model parameters P is sought that yields a synthesized signal s~(t) closest to the actual speech signal s(t) according to a given criterion. The search for the optimal set P is performed as an iterative process in which the vector P is varied at each iteration and the value of the objective function is evaluated. The optimal vector of model parameters is encoded and transmitted to the synthesis unit.

In one embodiment of the present invention, the spectral amplitudes are estimated by the AbS method, based on the voiced/unvoiced decisions and the pitch frequency value, which are obtained by direct calculation. The AbS method for estimating the spectral amplitudes is interpreted as defined above. At the transmitting side, a synthesis unit identical to that of the receiving side is used to generate speech. Consequently, the same rules are used at both sides to interpolate amplitude, phase, and frequency from the previous frame to the current frame.

For a given pitch value and a given set of voiced/unvoiced decisions, a set of spectral amplitudes is sought that yields a synthesized signal whose spectrum is closest to the actual speech spectrum S. The criterion is the minimum SSE of the approximation of the spectrum S by the synthesized spectrum. The search for the optimal spectral magnitudes can be performed as an iterative process: the vector M is varied at each iteration and the value of the objective function is evaluated. The optimal values found are encoded and transmitted.

In the preferred embodiment of the present invention, a one-iteration scheme for magnitude determination is proposed, because it is suitable for real-time implementation. Magnitude determination according to the invention is based on the linearity of the Fourier transform and on linearization of the speech signal processing.

A model set of spectral magnitudes is formed by assigning a fixed etalon (reference) value M_m = e to every magnitude to be determined (in particular, the etalon value may equal unity: M_m = 1). For the given pitch value and the voicing decision sets V^p and V^c of the previous and current frames, a model speech signal is synthesized with the assigned etalon (unit) spectral amplitudes. The spectrum of the synthesized signal is calculated and compared with the spectrum S of the actual speech signal. The comparison is made separately in each band.

By analogy with the analysis of the response to a unit disturbance in linear system theory, the part of the synthesized spectrum belonging to the m-th band is interpreted as the response of a linearized system to the m-th spectral component of unit amplitude. The part S_m of the actual spectrum belonging to the m-th band can then be approximated as in equation (5) below.

Here, E_m is the approximation error. The value μ_m is the factor that gives the best approximation of S_m. The values μ_m for all bands are therefore calculated by the least squares method.
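The least-squares step can be sketched per band as below. The sketch assumes complex spectra and a real-valued factor μ_m, for which the minimizer of the squared error has the closed form Re(sum of S · conj(model)) / sum of |model|^2; the text does not say whether complex values or magnitudes are compared, and the function names are hypothetical.

```python
def band_gain(actual, model):
    """Real factor mu minimizing sum |actual[k] - mu * model[k]|^2."""
    energy = sum(abs(x) ** 2 for x in model)
    if energy == 0.0:
        return 0.0
    return sum((s * x.conjugate()).real for s, x in zip(actual, model)) / energy


def estimate_magnitudes(actual, model, bands, etalon=1.0):
    """Spectral magnitudes M_m = etalon * mu_m, one per (lo, hi) band."""
    return [etalon * band_gain(actual[lo:hi], model[lo:hi]) for lo, hi in bands]
```

Here `model` stands for the spectrum of the signal synthesized with etalon amplitudes and `actual` for the spectrum S, so one synthesis pass followed by one fit per band gives all magnitudes without iteration.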

As a result, the approximation coefficients that minimize the summed squared error of the approximation of the spectrum S by the synthesized spectrum are determined. Owing to linearity (or quasi-linearity), these multiplicative coefficients can be taken as the values of the spectral magnitudes (M_m = e · μ_m, or M_m = 1 · μ_m for a unit etalon), for which the synthesized signal has the spectrum closest to the spectrum S of the actual speech signal. These values μ_m are encoded and transmitted. At the receiving side, they are used to assign the spectral amplitude values for synthesis of the output speech signal.

A block diagram of spectral magnitude determination according to the present invention is shown in FIG. 5. The voiced and unvoiced magnitudes are calculated separately. The voiced spectral magnitudes are calculated by the Bands' Correspondence Map Building unit 91, the voiced synthesis unit 92, the Hamming window unit 93, the FFT unit 94, and the voiced magnitude evaluation unit 95. The Bands' Correspondence Map Building unit 91 and the voiced synthesis unit 92, which are used to produce the voiced excitation spectrum, are identical to the Bands' Correspondence Map Building unit 160 and the voiced synthesis unit 170 used for voiced speech synthesis at the decoder side shown in FIG. 7. Because the excitation signal is synthesized at the encoder side in the same way as it is generated at the decoder side, its frequency response is well suited for use as the approximating function.

As described above, the set of input parameters for the voiced synthesis unit comprises the pitch frequency, the voicing decision vector, and the spectral magnitude vector for the current frame, together with the band correspondence map built by the Bands' Correspondence Map Building unit. The detailed operation of the Bands' Correspondence Map Building unit 91 and the voiced synthesis unit 92 is described later with reference to the decoder side (see the Bands' Correspondence Map Building unit 160 and the voiced synthesis unit 170). It should be noted, however, that these units synthesize the output speech signal in the time domain from the given set of input parameters for the current frame and from the analogous set of parameters for the previous frame, which is stored in the Previous Frame Parameters Accumulator built into the voiced synthesis unit.

To produce the voiced excitation spectrum, the spectral amplitudes of the voiced bands are set by assigning a fixed etalon value, here equal to 1. Assuming that the elements of the voicing decision vector are 1 for voiced bands and 0 otherwise, the assignment can be expressed as follows.

The signal output by the voiced synthesis unit 92 is windowed by the Hamming window unit 93 and processed by the FFT unit 94. After the transform, the output represents the voiced excitation spectrum S^v_e. An example of a voiced excitation spectrum obtained by the voiced synthesis procedure according to the present invention is shown in FIG. 14. The voiced part of the spectrum has a regular structure, while the unvoiced part is close to zero. The spectrum retains these properties even when the pitch frequency and the voicing decisions change, which is important for accurate spectral approximation.

The voiced excitation spectrum obtained is used for voiced magnitude evaluation in the voiced magnitude evaluation unit 95, which estimates the magnitudes by the least squares method, approximating the voiced bands of the actual spectrum S separately by the excitation spectrum S^v_e. In voiced frames, the position of the excitation spectrum clip relative to the frequency band is tuned by shifting the spectrum to either side of the band center. The clip position that gives the best NSR of the approximation is then selected for magnitude evaluation, which is performed by the least squares method.

The set of voiced magnitude values M^(v) obtained is only one part of the spectral magnitude vector M. The set of unvoiced spectral magnitudes is the other part. The unvoiced spectral magnitudes are calculated by the Synchronized Noise Generation unit 96, the Hamming window unit 97, the FFT unit 98, and the unvoiced magnitude evaluation unit 99 shown in FIG. 5. The Synchronized Noise Generation unit 96 produces a white noise signal with a unit amplitude range. As in the procedure for obtaining the voiced magnitudes, the noise is processed in the same way at the encoder and decoder sides. In addition, a synchronization property is provided at the encoding side, which improves the approximation of the unvoiced speech spectrum.

The signal from the Synchronized Noise Generation unit 96 is windowed by the Hamming window unit 97 and processed by the FFT unit 98. In the unvoiced magnitude evaluation unit 99, the spectral magnitude of each unvoiced band is calculated by the least squares method. The resulting set of unvoiced spectral magnitudes M^(uv) is combined with the set of voiced magnitudes M^(v) to obtain the spectral magnitude vector M.

Referring to FIG. 1, encoding of the speech model parameters according to the present invention comprises three parts. The pitch frequency is encoded by the scalar quantization unit 100. The pitch frequency value is restricted to a frequency range, for example [50, 400] Hz, and quantized to 256 levels (8 bits). The maximum error of the pitch frequency representation is then 0.684 Hz. The quantized value is passed to the multiplexing unit 130. The vector V of voiced/unvoiced decisions is simply passed to the multiplexing unit 130 unchanged.
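The stated maximum error follows from a uniform quantizer: the step is (400 - 50) / 256 = 1.3671875 Hz, and half a step is about 0.684 Hz. A minimal sketch, assuming uniform cells with reconstruction at the cell center (the text does not specify the reconstruction rule; the names are hypothetical):

```python
F_MIN, F_MAX, LEVELS = 50.0, 400.0, 256
STEP = (F_MAX - F_MIN) / LEVELS          # 1.3671875 Hz per level


def quantize_pitch(f0):
    """Map a pitch frequency in Hz to an 8-bit index (0..255)."""
    f0 = min(max(f0, F_MIN), F_MAX)      # restrict to the allowed range
    return min(int((f0 - F_MIN) / STEP), LEVELS - 1)


def dequantize_pitch(index):
    """Reconstruct the pitch at the center of the quantization cell."""
    return F_MIN + (index + 0.5) * STEP
```

With center reconstruction the representation error never exceeds STEP / 2, which agrees with the 0.684 Hz figure quoted above.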

The spectral magnitude vector M is encoded in two stages. First, the spectral magnitude vector is reduced by the spectral magnitudes wavelet reduction unit 110. Next, hybrid encoding is applied to the reduced spectral magnitude vector by the spectral magnitudes hybrid encoding unit 120. The reduction of the spectral magnitude vector according to the present invention is described in detail below.

First, the logarithm of the elements of the vector M is taken, as expressed by the following formula:

Here, m defines the dimension of the vector M. The value of m depends on the pitch frequency and varies with time. Next, the vector of dimension m is transformed into a vector of fixed dimension r. Because a wavelet transform (WT) is later used to encode this vector, r can be chosen such that r = l · 2^n, where l is a positive integer and n is the intended number of WT steps. In the preferred embodiment, the dimensionality of the vector is reduced to r = 16, with n = 3 and l = 2.

The reduction operation is irreversible. With the procedure described, however, the original vector can be reconstructed with high accuracy. If the dimension of the original vector already equals r, no reduction is needed. Otherwise, a procedure comprising the following steps is performed:

- a cubic spline is constructed from the elements of the vector;

- the smallest number s is calculated such that s = r · 2^k ≥ m, k = 0, 1, 2, ...;

- a new uniform grid with s nodes is constructed, and the values of the cubic spline are calculated at these nodes;

- k wavelet transform steps are applied to the resulting set of s values;

- the resulting r low-pass wavelet coefficients form the elements of the reduced vector, while the high-pass coefficients are discarded.
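The steps above can be sketched as follows. Two simplifications are assumptions of this sketch: linear interpolation stands in for the cubic spline, and the low-pass branch uses the unnormalized lifting form of the (5,3) biorthogonal filter with symmetric extension at the edges. All function names are hypothetical.

```python
def padded_size(m, r=16):
    """Smallest s = r * 2**k with s >= m; returns (s, k)."""
    k = 0
    while r * 2 ** k < m:
        k += 1
    return r * 2 ** k, k


def resample_linear(values, count):
    """Evaluate `values` on a uniform grid of `count` nodes (spline stand-in)."""
    if len(values) == 1:
        return [values[0]] * count
    step = (len(values) - 1) / (count - 1)
    out = []
    for j in range(count):
        pos = j * step
        i = min(int(pos), len(values) - 2)
        out.append(values[i] + (pos - i) * (values[i + 1] - values[i]))
    return out


def lowpass_53(x):
    """One (5,3) lifting step on an even-length list; keeps the low-pass half."""
    n, half = len(x), len(x) // 2
    detail = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # mirror at the edge
        detail.append(x[2 * i + 1] - 0.5 * (x[2 * i] + right))
    smooth = []
    for i in range(half):
        left = detail[i - 1] if i > 0 else detail[0]         # mirror at the edge
        smooth.append(x[2 * i] + 0.25 * (left + detail[i]))
    return smooth


def reduce_vector(values, r=16):
    """Reduce a variable-length vector to r low-pass wavelet coefficients."""
    s, k = padded_size(len(values), r)
    y = resample_linear(values, s)
    for _ in range(k):
        y = lowpass_53(y)   # high-pass coefficients are discarded
    return y
```

For example, a frame with m = 47 magnitudes gives s = 64 and k = 2, so two wavelet steps bring the vector down to r = 16 values.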

The number k of WT steps in this procedure is not fixed and may differ from one signal frame to another.

Hybrid encoding of the spectral magnitude vector according to the present invention is now described in detail with reference to FIG. 6. The reduced vector is encoded twice: by a wavelet scheme in the wavelet encoding unit 121, and by an inter-frame prediction scheme in the Inter-Frame Prediction Encoder unit 122. During the two encodings, the effectiveness of each scheme is measured using the NSR criterion, and the better scheme is selected by the comparator unit 123 as the basic encoding scheme for the vector.

In the wavelet encoding unit 121, n WT steps are applied to the vector. Both the l low-pass and the r - l high-pass wavelet coefficients are quantized. A lattice quantization technique is used to encode the low-pass wavelet coefficients, and adaptive scalar quantization is applied to the high-pass wavelet coefficients. Owing to the nature of the high-pass wavelet coefficients, the scalar quantizers are constructed to be symmetric about zero. In the preferred embodiment, the number of WT steps is n = 3 and the number of low-pass wavelet coefficients is l = 2. A biorthogonal (5,3) filter is used as the WT filter in both the reduction stage and the encoding stage.

In the Inter-Frame Prediction Encoder unit 122, inter-frame prediction of the spectral magnitudes is used as the competing encoding scheme. Inter-frame prediction exploits the similarity of the spectral magnitudes of neighboring frames and is highly effective for stationary signals. The prediction error is encoded using adaptive scalar quantization.

Decoding is performed simultaneously with encoding. This is needed both for the operation of the inter-frame prediction scheme and for measuring the quality of the candidate encodings. Using competing encoding schemes such as wavelet and inter-frame prediction side by side further increases the efficiency of the method according to the invention. The comparator unit 123 compares the effectiveness of the two schemes and sends the data of the better one, together with a decision bit, to the multiplexing unit 130. The multiplexing unit 130 combines all the coded parameter values into a plurality of output bits to form the bitstream.
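The comparator's role can be sketched as a choice between two locally decoded candidates. The NSR definition (squared error over reference energy) and the bit convention (0 for wavelet, 1 for inter-frame prediction) are assumptions of this sketch, and the function names are hypothetical.

```python
def nsr(reference, decoded):
    """Noise-to-signal ratio of a decoded vector against the reference."""
    noise = sum((a - b) ** 2 for a, b in zip(reference, decoded))
    signal = sum(a * a for a in reference)
    if signal == 0.0:
        return 0.0 if noise == 0.0 else float("inf")
    return noise / signal


def choose_scheme(reference, decoded_wavelet, decoded_prediction):
    """Return (decision_bit, nsr) for the scheme with the lower NSR.

    Assumed convention: bit 0 selects the wavelet scheme, bit 1 selects
    inter-frame prediction; ties go to the wavelet scheme.
    """
    nsr_wavelet = nsr(reference, decoded_wavelet)
    nsr_prediction = nsr(reference, decoded_prediction)
    if nsr_prediction < nsr_wavelet:
        return 1, nsr_prediction
    return 0, nsr_wavelet
```

Because both candidates are decoded at the encoder, the comparison uses exactly the vectors the decoder would reconstruct.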

FIG. 7 is a block diagram of the decoder, which decodes the input bits and synthesizes digital speech. The demultiplexing unit 140 separates the plurality of input bits according to the accepted data structure. The model parameter decoding unit 150 decodes the parameters that determine the output speech. The model parameter decoding unit 150 operates in the reverse manner to the model parameter encoding units (see the scalar quantization unit 100, the spectral magnitudes wavelet reduction unit 110, and the spectral magnitudes hybrid encoding unit 120).

The Bands' Correspondence Map Building unit 160 builds a map that forms the pairs of laced frequency bands from the pitch frequency values of the current and previous frames. The voiced part of the speech is generated by the voiced synthesis unit 170, and the unvoiced part by the unvoiced synthesis unit 180. The summing unit 190 produces the synthesized digital speech by adding the outputs of the voiced synthesis unit 170 and the unvoiced synthesis unit 180.

The voiced part S^v(n) of the synthesized signal is formed as a sum of the appropriate harmonic components, expressed as follows.

Here, each term of the sum, defined for n = 0, ..., L - 1, is the harmonic component signal corresponding to the m-th frequency band, L is the length of the non-overlapping part of the speech frame, and I^v is the set of frequency bands determined to be voiced.

Using the time index (sample number) n within the frame, the harmonic component signal can be expressed as follows.

Here, A_m(n) is the amplitude of the m-th harmonic, interpolated between the beginning and the end of the frame, and θ_m(n) is the phase of the harmonic signal.
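A bank of oscillators of this kind might be sketched as below. The synthesis equations are not reproduced in this text, so the form A_m(n) · cos θ_m(n), the linear amplitude interpolation, and the phase obtained by accumulating a linearly interpolated angular frequency are all assumptions; the function names are hypothetical.

```python
import math


def synthesize_harmonic(a_prev, a_cur, w_prev, w_cur, phase0, length):
    """One harmonic over `length` samples; returns (samples, end_phase).

    Amplitude and angular frequency (radians per sample) are interpolated
    linearly from the previous-frame value to the current-frame value.
    """
    samples = []
    phase = phase0
    for n in range(length):
        t = n / length
        amplitude = a_prev + (a_cur - a_prev) * t
        samples.append(amplitude * math.cos(phase))
        phase += w_prev + (w_cur - w_prev) * t   # keeps the phase continuous
    return samples, phase


def synthesize_voiced(harmonics, length):
    """Sum of all voiced harmonics; each item is the argument tuple
    (a_prev, a_cur, w_prev, w_cur, phase0) for synthesize_harmonic."""
    frame = [0.0] * length
    for a_prev, a_cur, w_prev, w_cur, phase0 in harmonics:
        samples, _ = synthesize_harmonic(a_prev, a_cur, w_prev, w_cur,
                                         phase0, length)
        frame = [x + y for x, y in zip(frame, samples)]
    return frame
```

Carrying the returned end phase into the next frame is what provides phase continuity from one frame to the next in this sketch.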

Although speech synthesis raises several important issues, such as interpolation of harmonic amplitudes, interpolation of harmonic angular frequencies, and continuity of harmonic phases, one of the most important problems is caused by the interaction of frequency bands between frames. In vocoders of the MBE type, the harmonic components of the current frame are laced with the harmonic components of the previous frame for synthesis. In the prior art, harmonics belonging to frequency bands with the same number in neighboring speech frames were laced.

In the preferred embodiment of speech synthesis according to the present invention, harmonics of approximately the same frequency are laced, on the basis of the frequency band correspondence map that has been built. A block diagram of voiced speech synthesis according to the present invention is shown in FIG. 8.

The set of input parameters for voiced speech synthesis comprises the pitch frequency, the voicing decision vector, and the spectral magnitude vector for the current frame, together with the band correspondence map built by the Bands' Correspondence Map Building unit 160. The set of parameters of the previous frame, stored in the Previous Frame Parameters Accumulator 171, is also used for synthesis. The Lacing Controller unit 172 controls the operation of the Phase Interpolator unit 173, the Angular Frequency Interpolator unit 174, and the Amplitude Interpolator unit 175 by selecting the type of approximation according to the voicing state of the laced bands. The Bank of Controlled Oscillators unit 176 performs voiced speech synthesis using equation (7).

A distinguishing feature of the present invention is the band correspondence map building unit 160, which determines how harmonics are laced. In the prior art, harmonics belonging to the same-numbered frequency bands in neighboring speech frames are laced. An example of band/frequency correspondence in prior-art harmonic synthesis is shown in FIG. 9. The pitch frequency is 100 Hz in the previous frame and 83.7 Hz in the current frame, and the number of bands is 39 in the previous frame and 47 in the current frame. As shown, a small change in pitch frequency produces a large change in harmonic frequency, particularly at high harmonic numbers.
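The size of this mismatch can be checked with a little arithmetic. The sketch below assumes that band k is centred on the k-th harmonic at k times the pitch frequency (this indexing convention is an assumption for illustration) and uses the two pitch values of the FIG. 9 example. The function name `same_index_gap` is hypothetical.

```python
# Pitch frequencies from the FIG. 9 example (Hz).
F_PREV = 100.0
F_CUR = 83.7


def same_index_gap(k, f_prev=F_PREV, f_cur=F_CUR):
    """Frequency gap (Hz) between the k-th harmonics of two frames.

    Assumes harmonic k lies at k * pitch frequency (illustrative convention).
    """
    return abs(k * f_prev - k * f_cur)


if __name__ == "__main__":
    for k in (7, 18, 33):
        print(f"band {k}: {k * F_PREV:.1f} Hz vs {k * F_CUR:.1f} Hz, "
              f"gap {same_index_gap(k):.1f} Hz")
```

Under this assumption the gap grows in proportion to k, which is why same-index lacing affects the higher bands most.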

FIG. 10 shows the frequency responses of the 7th, 18th, and 33rd harmonic bands according to the prior art. These bands are voiced in both the current and previous frames. With the band correspondence described above, the frequency difference between laced harmonics (for example, bands 7, 18, and 33 in FIG. 10) causes differences in the amplitude and width of the frequency responses. This leads to interaction between the responses of different frequency bands and to a distorted excitation spectrum, as shown in FIG. 11. Moreover, when the pitch jumps, undesirable artifacts appear in the decoded speech.

An example of band/frequency correspondence in harmonic synthesis according to the present invention is shown in FIG. 12. Harmonic synthesis is performed on the basis of direct and inverse maps that provide the correspondence between the frequency bands of the current and previous frames. As shown, the index numbers of corresponding bands may differ, but the band frequencies differ only slightly from the beginning to the end of the frequency range (see Δf for band 33 in FIG. 12).

The frequency responses of the 7th, 18th, and 33rd harmonic bands according to the present invention are shown in FIG. 13; as can be seen, the harmonic bands have the same amplitude and width. The small hillocks near the main peaks correspond to the fading harmonics of the previous frame. The frequency response of the excitation signal, shown in FIG. 14, exhibits a regular structure. Note that the different bands neither interact nor overlap in an excitation signal constructed in this way. This allows the amplitudes to be estimated more accurately and reliably, without dramatic effects caused by changes in pitch frequency.

Thus, the pairs of harmonics with the most similar frequencies in the current and previous frames are selected and laced. The unlaced harmonics of the previous frame decrease smoothly to zero amplitude, and the unlaced harmonics of the current frame increase smoothly to their given amplitudes.
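The pairing described above can be sketched as follows. The map formulas themselves are not reproduced in this text, so the nearest-frequency rule below is an assumed stand-in, not the patent's direct or inverse map. It also assumes harmonic k lies at k times the pitch frequency, and the name `lace_by_frequency` is hypothetical.

```python
def lace_by_frequency(f_prev, n_prev, f_cur, n_cur):
    """Pair harmonics of two frames by nearest frequency.

    Harmonic k is assumed to lie at k * pitch (k = 1..n). Returns
    (pairs, fading, rising): pairs is a list of (m_p, m_c); fading lists
    unpaired previous-frame harmonics, rising unpaired current-frame ones.
    """
    # Walk the frame with the coarser harmonic grid so that each of its
    # harmonics can claim a distinct nearest neighbour in the finer grid.
    prev_is_coarse = f_prev >= f_cur
    if prev_is_coarse:
        f_a, n_a, f_b, n_b = f_prev, n_prev, f_cur, n_cur
    else:
        f_a, n_a, f_b, n_b = f_cur, n_cur, f_prev, n_prev

    pairs, used_b = [], set()
    for a in range(1, n_a + 1):
        b = int(a * f_a / f_b + 0.5)  # index of the nearest harmonic
        if 1 <= b <= n_b and b not in used_b:
            used_b.add(b)
            pairs.append((a, b) if prev_is_coarse else (b, a))

    paired_prev = {p for p, _ in pairs}
    paired_cur = {c for _, c in pairs}
    fading = [k for k in range(1, n_prev + 1) if k not in paired_prev]
    rising = [k for k in range(1, n_cur + 1) if k not in paired_cur]
    return pairs, fading, rising


if __name__ == "__main__":
    pairs, fading, rising = lace_by_frequency(100.0, 39, 83.7, 47)
    print(len(pairs), "laced pairs;", len(fading), "fading;", len(rising), "rising")
```

With the FIG. 9 values (100 Hz and 39 bands, then 83.7 Hz and 47 bands), every laced pair in this sketch differs by at most half of the finer harmonic spacing, and the harmonics left without a partner are the ones that would fade out or rise.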

The following notation is used in the detailed description of the present invention. For the frequency bands m_c and m_p of the current and previous frames, [expression not reproduced] or [expression not reproduced]. If frequency band m_c of the current frame (or m_p of the previous frame) is determined to be voiced, then m_c ∈ I_c^V or m_p ∈ I_p^V. If frequency band m_c or m_p is determined to be unvoiced, then m_c ∉ I_c^V or m_p ∉ I_p^V.

Let ω_0^c be the pitch frequency for the current frame (ω_0^c = 2π f_0^c), and let N_c be the number of frequency bands (N_c = [expression not reproduced], which involves the sampling frequency). Then {M_mc}, m_c = 0, ..., M_c − 1, is the set of magnitudes for the frequency bands, and I_c^V is the set of frequency bands determined to be voiced. Likewise, for the previous frame, ω_0^p; N_p; {M_mp}, m_p = 0, ..., M_p − 1; and I_p^V are the pitch frequency, the number of frequency bands, the set of magnitudes, and the set of voiced frequency bands.

Voiced synthesis is performed by the bank of controlled oscillators unit 176, whose operation can be expressed by the following formulas. If the pitch frequency increases [conditions not reproduced], the voiced portion S^v(n) of the synthesized signal is calculated by summing over the band pairs m = 0, ..., N_p − 1:

[equation not reproduced]

Here, m is the index of the band pair. The m-th band pair <m_p, m_c> consists of band m_p and band m_c, where m_p = m and m_c is obtained from m through the direct map [expression not reproduced], which provides the correspondence between the frequency bands of the previous and current frames.

If the pitch frequency decreases [conditions not reproduced], the voiced portion S^v(n) of the synthesized signal can be calculated by summing over the band pairs m = 0, ..., N_c − 1:

[equation not reproduced]

The m-th band pair <m_p, m_c> consists of band m_p and band m_c, where m_c = m and m_p is obtained from m through the inverse map [expression not reproduced], which provides the correspondence between the frequency bands of the current and previous frames.

If the pitch frequency is constant [conditions not reproduced], the voiced portion S^v(n) of the synthesized signal can be calculated without any mapping:

[equation not reproduced]

The lacing controller unit 172 governs the operation of the phase interpolator unit 173, the angular frequency interpolator unit 174, and the amplitude interpolator unit 175. Three modes of interpolation are possible, depending on the voicing state of the laced bands. If m_p ∈ I_p^V and m_c ∈ I_c^V for the m-th band pair <m_p, m_c>, a continuing harmonic is generated. Amplitude interpolation is performed according to the following formula:

[equation not reproduced]

Here, M_mp and M_mc are the magnitudes for the previous and current frames associated with bands m_p and m_c; n = 0, ..., L − 1 is the sample index; L is the length of the non-overlapping part of the speech frame; and R is the length of the lacing interval (0 < R < L).

Interpolation of phase and angular frequency is performed according to the following formulas:

[equations not reproduced]

Here, θ_mc(0) denotes the phase of the m_c-th harmonic at the start of the current frame, which equals the phase of the corresponding harmonic at the end of the non-overlapping part of the previous frame, that is, θ_mc(0) = θ_mp(L).
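The phase continuity condition can be illustrated with a short sketch. The interpolation formulas are not reproduced in this text, so a linear change of angular frequency across the frame is assumed here purely for illustration. The name `phase_track` and the use of radians per sample are likewise assumptions.

```python
def phase_track(theta_start, w_prev, w_cur, frame_len):
    """Accumulate phase over one frame while the angular frequency
    (radians per sample) moves linearly from w_prev towards w_cur.

    Returns frame_len + 1 phases; the last one is the start phase of
    the next frame, which gives theta_mc(0) = theta_mp(L).
    """
    phases = [theta_start]
    for n in range(frame_len):
        w_n = w_prev + (w_cur - w_prev) * n / frame_len  # assumed linear law
        phases.append(phases[-1] + w_n)
    return phases


if __name__ == "__main__":
    prev_frame = phase_track(0.0, 0.10, 0.10, 10)
    cur_frame = phase_track(prev_frame[-1], 0.10, 0.20, 10)
    print(prev_frame[-1], cur_frame[0])  # the two values coincide
```

Starting each frame from the final phase of the previous one keeps the sinusoid free of phase jumps at the frame boundary, whatever interpolation law is used inside the frame.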

If m_p ∈ I_p^V and m_c ∉ I_c^V for the m-th band pair <m_p, m_c>, a fading harmonic is generated, and interpolation is performed according to equations (14) and (15):

[equations not reproduced]

If m_p ∉ I_p^V and m_c ∈ I_c^V for the m-th band pair <m_p, m_c>, a rising harmonic is generated, and harmonic amplitude interpolation is performed according to equations (16) and (17):

[equations not reproduced]

Here, θ_mc(0) denotes the phase of the m_c-th harmonic at the start of the current frame, which equals the initial phase value θ_0.
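The three interpolation modes can be sketched as amplitude envelopes. Because the interpolation equations are not reproduced in this text, the envelope below is an assumption: a linear ramp over the first R samples of the frame, after which the end value is held. The name `amplitude_track` and the mode labels are hypothetical.

```python
def amplitude_track(mode, mag_prev, mag_cur, frame_len, lace_len):
    """Per-sample amplitude of one harmonic over a frame of length L.

    mode is "continuing", "fading" or "rising". A linear ramp over the
    first lace_len samples (0 < R < L) is assumed; the end value is then
    held for the rest of the frame.
    """
    if not 0 < lace_len < frame_len:
        raise ValueError("lace_len must satisfy 0 < R < L")
    start, end = {
        "continuing": (mag_prev, mag_cur),  # voiced in both frames
        "fading": (mag_prev, 0.0),          # voiced only in the previous frame
        "rising": (0.0, mag_cur),           # voiced only in the current frame
    }[mode]
    track = []
    for n in range(frame_len):
        t = min(n / lace_len, 1.0)
        track.append(start + (end - start) * t)
    return track


if __name__ == "__main__":
    print(amplitude_track("fading", 2.0, 0.0, 8, 4))
    print(amplitude_track("rising", 0.0, 3.0, 8, 4))
```

A fading harmonic reaches zero and a rising harmonic reaches its target amplitude within the lacing interval, so neither produces an abrupt step at the frame boundary.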

FIG. 15 is a block diagram of unvoiced speech synthesis according to the present invention. It includes a synchronized noise generator unit 181 on the decoder side, synchronized with the synchronized noise generator unit 96 on the encoder side. The noise used for synthesis by the decoder is therefore identical to the noise used for analysis by the encoder. The time-domain white noise waveform obtained from the white noise generator is windowed by the Hamming window unit 182, and the result is processed by the FFT unit 183. The noise spectrum is scaled by the magnitudes M_m of the bands determined to be unvoiced, and the amplitudes of the voiced bands are set to zero.

The spectral transformation is performed by the noise spectrum transformation unit 184. The transformed spectrum undergoes an inverse fast Fourier transform in the IFFT unit 185, using the phase values of the original noise signal. Then, in the add and overlap unit 186, the resulting noise signal is overlapped with the noise signal of the previous frame, held in the buffer 187, to produce the unvoiced portion. In the adder 190, the synthesized digital speech is formed by summing the voiced and unvoiced portions.
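The unvoiced path can be sketched end to end with the standard library only. Several details here are assumptions for illustration: a shared integer seed stands in for the synchronization between the two noise generators, a naive DFT stands in for the FFT and IFFT units, bands are given as ranges of DFT bins, and all function names are hypothetical.

```python
import cmath
import math
import random


def synced_noise(seed, n):
    """White noise in [-1, 1); encoder and decoder share the seed."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for the FFT and IFFT units."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def unvoiced_frame(seed, frame_len, band_edges, magnitudes, voiced):
    """Shape windowed noise band by band and return a real time signal.

    band_edges holds (lo, hi) DFT-bin ranges within 0..frame_len // 2.
    """
    noise = synced_noise(seed, frame_len)
    spec = dft([s * w for s, w in zip(noise, hamming(frame_len))])
    shaped = [0j] * frame_len
    for (lo, hi), mag, is_voiced in zip(band_edges, magnitudes, voiced):
        gain = 0.0 if is_voiced else mag    # voiced bands are zeroed
        for k in range(lo, hi):
            shaped[k] = gain * spec[k]      # a real gain keeps the noise phase
            if 0 < k < frame_len / 2:       # mirror bin so the output is real
                shaped[frame_len - k] = shaped[k].conjugate()
    return [v.real for v in dft(shaped, inverse=True)]


def overlap_add(prev_tail, frame, hop):
    """Add the buffered tail of the previous frame to the new frame."""
    head = list(frame[:hop])
    for i, v in enumerate(prev_tail[:hop]):
        head[i] += v
    return head, list(frame[hop:])
```

Because both sides draw the noise from the same seed, calling `unvoiced_frame` with identical arguments at the encoder and at the decoder yields identical samples, which is the property the synchronized generators are meant to provide.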

The embodiments described above are merely illustrative and do not limit the present invention. The teachings herein can readily be applied to other types of apparatus. The detailed description of the invention is illustrative and does not limit the scope of the claims. Many alternatives, modifications, and variations will be apparent to those skilled in the art.

Included in the above.

Claims

An analysis-by-synthesis method for determining spectral envelope information in a speech coding system that synthesizes a digital speech signal from a data structure created by dividing a speech signal into a plurality of frames, determining a pitch frequency, determining voicing information indicating whether each of a plurality of frequency bands in each frame should be synthesized as a voiced or unvoiced band, and processing the speech frames to determine spectral envelope information indicating the spectral magnitudes of the frequency bands, the method comprising:

a) forming a group of spectral magnitude models by assigning fixed etalon values;

b) synthesizing a model speech signal for the spectral magnitude model group, using the voicing decision group and the pitch frequencies determined for the previous and current frames, in the same manner as the received speech signal is synthesized at the decoder;

c) calculating a spectrum of the model speech signal;

d) estimating the spectrum of the actual speech signal by the spectrum of the model speech signal; and

e) encoding the coefficients obtained from said estimated spectrum.

The method of claim 1,

In said step a), said spectral magnitude model group is formed separately for the voiced and unvoiced portions of a model speech signal spectrum.

The method of claim 2,

In step a), the spectral magnitude model group for the voiced portion of the spectrum is formed by assigning a fixed etalon value of 1 in voiced bands and 0 otherwise.

The method of claim 2,

In step d), the voiced portion of the spectrum is estimated using the least squares method, with the position of each voiced excitation spectrum clip tuned to the position of the frequency band.

The method of claim 2,

In step b), the unvoiced portion of the model speech spectrum is synthesized by producing a white noise signal in a unit amplitude range to provide synchronization characteristics of the synthesis method.

The method of claim 2,

In said step d), the unvoiced portion of the model speech signal spectrum is estimated by unvoiced excitation spectrum for each frequency band using least squares method.

a) reducing the number of spectral magnitudes;

b) evaluating a plurality of spectral magnitude encoding methods;

c) performing spectral magnitude encoding by selecting one of the plurality of spectral magnitude encoding methods according to the evaluation.

The method of claim 7, wherein

In said step a), reducing said number of spectral magnitudes is based on a wavelet transform technique.

The method of claim 8,

In step b), the plurality of spectral magnitude encoding methods are wavelet transform technology and inter-frame prediction.

A method of synthesizing a digital speech signal from a data structure created by dividing a speech signal into a plurality of frames, determining a pitch frequency, determining voicing information indicating whether each of a plurality of frequency bands in each frame should be synthesized as a voiced or unvoiced band, and processing the speech frames to determine spectral envelope information indicating the spectral magnitudes of the frequency bands, the method comprising:

a) constructing a frequency correspondence between the bands of the current and previous frames;

b) synthesizing the voiced speech components by lacing pairs of harmonics that have the most similar frequencies in the current and previous frames, using the constructed band frequency correspondence, wherein all unpaired harmonics of the previous frame decrease smoothly to zero amplitude and all unpaired harmonics of the current frame increase smoothly to a predetermined amplitude;

c) synthesizing the speech components for the unvoiced frequency bands; and

d) synthesizing a speech signal by combining the synthesized speech components for the voiced and unvoiced frequency bands.

The method of claim 10,

In said step a), the band frequency correspondence is configured by forming direct and inverse maps of the frequency band system derived by the pitch frequencies of the previous and current frames.

A system for coding and decoding a speech signal, comprising a speech signal coder and a speech signal decoder, wherein

the speech signal coder comprises:

a processor for dividing an input digital speech signal into a plurality of frames and analyzing the input digital speech signal in the time and frequency domains;

an orthogonal transforming unit for transforming each frame to provide spectral data on the frequency axis;

a pitch determination device for determining a pitch frequency for each frame;

a voiced/unvoiced decision device for generating a voiced/unvoiced decision group using the determined pitch frequency;

a spectral magnitude determination device for measuring spectral magnitudes using an analysis-by-synthesis method; and

a parameter encoding device for encoding the determined pitch frequency, the measured spectral magnitudes, and the voiced/unvoiced decisions for each of the plurality of frames and combining the encoded data into a plurality of bits; and

the speech signal decoder comprises:

a parameter decoding device for decoding the plurality of bits and providing the pitch frequency, spectral magnitudes, and voiced/unvoiced decisions for each of the plurality of frames;

a band frequency correspondence map forming apparatus for constructing a band frequency correspondence map between the bands of the current and previous frames; and

a signal synthesizing apparatus for synthesizing a speech signal from the decoded pitch frequency, spectral magnitudes, and voiced/unvoiced decisions using the band frequency correspondence map.

The method of claim 12,

The coder further includes a frame classifier that assigns a frame classification to each frame in the time domain according to the character and range of the signal values varying along the frame and the signal oscillation characteristics of the first and second portions of the frame, and the voiced/unvoiced decision device generates the voiced/unvoiced decision group based on the assigned frame classification.

The method of claim 13,

The voiced/unvoiced decision device uses an adaptive threshold according to the assigned frame classification.

The method of claim 13, wherein the pitch determination device comprises:

a pitch candidate group determining apparatus for determining a pitch candidate group based on analysis of a normalized autocorrelation function in direct or inverse order according to the assigned frame classification;

an optimum candidate selection device for selecting an optimal candidate by measuring the pitch candidate group in the frequency domain; and

an optimum candidate improvement device for improving the optimal candidate value in the frequency domain.

The method of claim 15,

The optimum candidate selection device measures the pitch candidate group by a window function response scaled to obtain an appropriate sharpness of the approximating function in each band and to provide true pitch candidate selection.

The method of claim 16,

The window function response is scaled for pitch frequencies lower than a predetermined frequency F_scale.

The method of claim 17,

The window function response is scaled by a proportional sharpening procedure.

The method of claim 18,

The proportional sharpening procedure is performed by linear interpolation.

The method of claim 19,

The window function response scaled for different pitch frequencies is used as a look-up table.

The method of claim 12,

The parameter encoding device comprises:

a scalar quantization device for quantizing the pitch frequency value;

a spectral magnitude wavelet reduction device for reducing the dimension of the spectral magnitude vector;

a spectral magnitude hybrid encoding device for encoding the reduced spectral magnitude vector by wavelet technology; and

a multiplexing device for combining the encoded data into a plurality of bits.

The method of claim 21,

The spectral magnitude hybrid encoding device comprises:

a wavelet encoding device for encoding the reduced spectral magnitude vector;

an inter-frame prediction encoding device for encoding the reduced spectral magnitude vector; and

a comparator for comparing the efficiency of the wavelet encoding device with that of the inter-frame prediction encoding device and outputting the data and decision bit corresponding to the better device to the multiplexing device.

The method of claim 12,

The signal synthesizing apparatus comprises:

a voiced synthesizer for synthesizing the voiced speech components by lacing pairs of harmonics that have the most similar frequencies in the current and previous frames, wherein all unpaired harmonics of the previous frame decrease smoothly to zero amplitude and all unpaired harmonics of the current frame increase smoothly to a predetermined amplitude;

an unvoiced synthesizer for synthesizing the speech components for the unvoiced frequency bands; and

an adder for synthesizing a model speech signal by summing the synthesized speech components for the voiced and unvoiced frequency bands.

The method of claim 12,

The spectral magnitude determination device comprises:

a band frequency correspondence map forming apparatus for constructing a frequency correspondence between the bands of the current and previous frames;

a voiced synthesizer for synthesizing a model voiced signal for the spectral magnitude model group based on the constructed band frequency correspondence, the pitch frequencies, and the voicing decision groups for the previous and current frames;

a first window device for processing the model voiced signal;

an orthogonal transform device for transforming the model voiced signal windowed by the first window device into the frequency domain;

a voiced magnitude evaluation device for evaluating the voiced magnitudes of the transformed model signal by the least squares method;

a synchronized noise generator for generating a model white noise signal having a unit amplitude range;

a second window device for processing the model white noise signal;

an orthogonal transform device for transforming the model white noise signal windowed by the second window device into the frequency domain; and

an unvoiced magnitude evaluation device for evaluating the unvoiced magnitudes of the transformed model white noise signal by the least squares method.

The method of claim 24,

The voiced synthesizer assigns a fixed etalon value of 1 in voiced bands and 0 otherwise to form the model signal for the group of spectral magnitude models.

The method of claim 12,

The voiced/unvoiced decision device generates the voiced/unvoiced decision group using a window function response scaled to obtain an appropriate sharpness of the approximating function in each band and to provide true voiced/unvoiced decision generation.

The method of claim 26,

The method of claim 27,

The window function response is scaled by a proportional sharpening procedure.

The method of claim 28,

The proportional sharpening procedure is performed by linear interpolation.

The method of claim 29,

The window function response scaled for different pitch frequencies is used as a lookup table.

The method of claim 30,

The voiced/unvoiced decision device tunes the position of the scaled response to the position of a frequency band peak.

A method of coding and decoding a speech signal, the method comprising:

The voice signal coding method is

(a) dividing an input digital voice signal into a plurality of frames to be analyzed in a time and frequency domain;

(b) orthogonally transforming each frame to provide spectral data on the frequency axis;

(c) determining a pitch frequency for each frame;

(d) generating a voiced/unvoiced decision group using the determined pitch frequency;

(e) measuring the spectral magnitude using an analysis-by-synthesis method;

(f) encoding the determined pitch frequency, measured spectral magnitude, and the generated voiced / unvoiced decision for each of the plurality of frames;

(g) combining the encoded pitch frequency, the measured spectral magnitude, and the voiced / unvoiced decision into a plurality of bits,

The voice signal decoding method

(aa) decoding the plurality of bits to provide pitch frequency, spectral magnitude, and voiced / unvoiced determination for each of the plurality of frames;

(bb) configuring a band frequency correspondence between the bands of the current and previous frame;

(cc) synthesizing a speech signal from the decoded pitch frequency, spectral magnitude, and voiced/unvoiced decision using the band frequency correspondence.

The method of claim 32,

The coding method

classifying each frame in the time domain and assigning it a frame classification according to the character and range of the signal values varying along the frame and the signal oscillation characteristics of the first and second portions of the frame; and

in step (d), using an adaptive threshold based on the assigned frame classification.

The method of claim 33,

Step (c) is

(i) determining a pitch candidate group based on an analysis of normalized autocorrelation functions using direct or reverse order according to the assigned frame classification;

(ii) measuring a pitch candidate group in the frequency domain to select an optimal candidate; And

(iii) improving the optimal candidate value in the frequency domain.

The method of claim 34,

In step (ii),

The pitch candidate group is measured by a window function response scaled to obtain an appropriate sharpness of each band approximation function and to provide true pitch candidate selection.

The method of claim 35, wherein

The method of claim 36,

The window function response is scaled by a proportional sharpening procedure.

The method of claim 37, wherein

The proportional sharpening procedure is performed by linear interpolation.

The method of claim 38,

The window function response scaled for different pitch frequencies is used as a lookup table.

The method of claim 32,

The step (cc) is

Synthesizing the voiced speech components by lacing pairs of harmonics that have the most similar frequencies in the current and previous frames, wherein all unpaired harmonics of the previous frame decrease smoothly to zero amplitude and all unpaired harmonics of the current frame increase smoothly to a predetermined amplitude;

Synthesizing a speech component for the unvoiced frequency band; And

Synthesizing a model speech signal by combining the synthesized speech components in the voiced and unvoiced frequency bands.

The method of claim 32,

In step (d), the voiced/unvoiced decision group is generated using a window function response scaled to obtain an appropriate sharpness of the approximating function in each band and to provide true voiced/unvoiced decision generation.

42. The method of claim 41 wherein

The method of claim 42, wherein

The window function response is scaled by a proportional sharpening procedure.

The method of claim 43,

The proportional sharpening procedure is performed by linear interpolation.

The method of claim 44,

42. The method of claim 41 wherein

said step (d) further comprises tuning the position of said scaled response to the position of a frequency band peak.