KR20020081374A - Closed-loop multimode mixed-domain linear prediction speech coder

Info

Publication number: KR20020081374A
Application number: KR1020027011306A
Authority: KR
Inventors: Amitava Das
Original assignee: Qualcomm Incorporated
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2002-10-26
Also published as: ES2269112T3; EP1259957A1; CN1266674C; EP1259957B1; CN1437747A; DE60031002T2; JP4907826B2; HK1055833A1; AU2000233851A1; DE60031002D1; ATE341074T1; JP2003525473A; WO2001065544A1; KR100711047B1

Abstract

A closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism for selecting a coding mode for the coder based upon the speech content of frames input to the coder. Transition speech (i.e., from unvoiced speech to voiced speech, or vice versa) frames are encoded with the high-rate, time-domain coding mode, which may be a CELP coding mode. Voiced speech frames are encoded with the low-rate, frequency-domain coding mode, which may be a harmonic coding mode. Phase parameters are not encoded by the frequency-domain coding mode, and are instead modeled in accordance with, e.g., a quadratic phase model. For each speech frame encoded with the frequency-domain coding mode, the initial phase value is taken to be the initial phase value of the immediately preceding speech frame encoded with the frequency-domain coding mode. If the immediately preceding speech frame was encoded with the time-domain coding mode, the initial phase value of the current speech frame is computed from the decoded speech frame information of the immediately preceding, time-domain-encoded speech frame. Each speech frame encoded with the frequency-domain coding mode may be compared with the corresponding input speech frame to obtain a performance measure. If the performance measure falls below a predefined threshold value, the input speech frame is encoded with the time-domain coding mode.

Description

CLOSED-LOOP MULTIMODE MIXED-DOMAIN LINEAR PREDICTION SPEECH CODER

The transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by appropriate coding, transmission, and resynthesis at the receiver, the data rate can be reduced significantly.

Devices that compress speech by extracting parameters related to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters and then quantizes the parameters into a binary representation, i.e., a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and resynthesizes the speech frames using the dequantized parameters.

The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has N_i bits and the data packet produced by the speech coder has N_o bits, the compression factor achieved by the speech coder is C_r = N_i/N_o. The challenge is to retain high voice quality in the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of N_o bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
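As a worked illustration of the compression factor, taken here as the ratio of input bits to output bits per frame, the following minimal sketch evaluates it for one assumed operating point. The frame size (160 samples), the 8 bits per sample, and the 160-bit packet are illustrative assumptions, and the function name is hypothetical.

```python
def compression_factor(bits_in: int, bits_out: int) -> float:
    """Return the ratio N_i / N_o for a single frame."""
    if bits_out <= 0:
        raise ValueError("N_o must be positive")
    return bits_in / bits_out


# Assumed example values: a 20 ms frame of 160 samples at 8 bits per sample
# (64 kbps) coded into a 160-bit packet (8 kbps).
n_i = 160 * 8   # input bits per frame
n_o = 160       # output bits per frame
print(compression_factor(n_i, n_o))  # 8.0
```

Under the same assumptions, halving the packet to 80 bits per frame would double the factor to 16.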

Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with the known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).

A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978). In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, N_o, for each frame) or at a variable rate (in which different bit rates are used for different types of frame content). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and incorporated herein by reference.

Time-domain coders such as the CELP coder typically rely on a high number of bits, N_o, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits per frame is relatively large (e.g., 8 kbps or above). At low bit rates (4 kbps and below), however, time-domain coders fail to retain high quality and robust performance because of the limited number of available bits. At low bit rates, the limited codebook space degrades the waveform-matching capability of conventional time-domain coders, which are deployed successfully in higher-rate commercial applications.

There is presently a surge of research interest and a strong commercial need to develop high-quality speech coders operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss conditions. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of coder specifications and deliver robust performance under channel error conditions.

For coding at low bit rates, various methods of spectral, or frequency-domain, speech coding have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input speech frame with a set of spectral parameters, rather than to mimic the time-varying speech waveform precisely. The spectral parameters are then encoded, and an output speech frame is created from the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation (MBE) coders, sinusoidal transform coders (STC), and harmonic coders (HC). Such frequency-domain coders offer a high-quality parametric model with a compact set of parameters that can be quantized accurately with the small number of bits available at low bit rates.

Nevertheless, low-bit-rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism and renders the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate, frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-dequantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (i.e., the major pulses will not be in sync). It has therefore proven difficult to adopt any closed-loop performance measure, such as signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.
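The alignment problem can be made concrete with a small numerical sketch. It compares a single sinusoid with a copy that has exactly the same amplitude and frequency but a different starting phase, and computes a plain waveform SNR. The 200 Hz tone, the 90 degree offset, and the function names are illustrative assumptions and are not taken from the patent.

```python
import math


def sinusoid(amplitude, freq_hz, phase, n_samples=160, fs=8000):
    """One frame of a sampled sinusoid (defaults: 160 samples at 8 kHz)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / fs + phase)
            for n in range(n_samples)]


def snr_db(reference, test):
    """Waveform SNR in dB between a reference frame and a test frame."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


original = sinusoid(1.0, 200.0, 0.0)
# Same amplitude and frequency, but an artificial 90 degree phase offset.
shifted = sinusoid(1.0, 200.0, math.pi / 2)
print(round(snr_db(original, shifted), 2))  # about -3.01 dB
```

Although the two frames would sound essentially alike, the waveform SNR is negative, which is why a closed-loop measure is only meaningful once the synthesized and input frames are time-aligned.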

Multimode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multimode technique is described in Amitava Das et al., Multimode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multimode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism typically examines the input speech frame and decides which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters with respect to certain temporal and spectral characteristics, and basing a mode decision on the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures.

Based on the foregoing, it would be desirable to provide a low-bit-rate, frequency-domain coder that estimates phase information more accurately. It would further be desirable to provide a multimode, mixed-domain coder that time-domain encodes certain speech frames and frequency-domain encodes other speech frames based on the speech content of the frames. It would still further be desirable to provide a mixed-domain coder that can time-domain encode certain speech frames and frequency-domain encode other speech frames in accordance with a closed-loop coding mode decision mechanism. Thus, there is a need for a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech produced by the coder and the original speech input to the coder.

The present invention pertains generally to the field of speech processing, and more specifically to a method and apparatus for closed-loop, multimode, mixed-domain coding of speech.

FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders.

FIG. 2 is a block diagram of an encoder that can be used in a multimode, mixed-domain linear prediction (MDLP) speech coder.

FIG. 3 is a block diagram of a decoder that can be used in a multimode, MDLP speech coder.

FIG. 4 is a flow chart illustrating MDLP encoding steps performed by an MDLP encoder that can be used in the encoder of FIG. 2.

FIG. 5 is a flow chart illustrating a speech coding decision process.

FIG. 6 is a block diagram of a closed-loop, multimode, MDLP speech coder.

FIG. 7 is a block diagram of a spectral coder that can be used in the coder of FIG. 6 or the encoder of FIG. 2.

FIG. 8 is a graph of amplitude versus frequency, illustrating the amplitudes of sinusoids in a harmonic coder.

FIG. 9 is a flow chart illustrating a mode decision process in a multimode, MDLP speech coder.

FIG. 10A is a graph of speech signal amplitude versus time, and FIG. 10B is a graph of linear prediction (LP) residual amplitude versus time.

FIG. 11A is a graph of rate/mode versus frame index under closed-loop encoding decision, FIG. 11B is a graph of perceptual signal-to-noise ratio (PSNR) versus frame index under closed-loop decision, and FIG. 11C is a graph of both rate/mode and PSNR versus frame index in the absence of closed-loop encoding decision.

The present invention is directed to a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech produced by the coder and the original speech input to the coder. Accordingly, in one aspect of the invention, a multimode, mixed-domain speech processor advantageously includes a coder having at least one time-domain coding mode and at least one frequency-domain coding mode, and a closed-loop mode-selection device coupled to the coder and configured to select a coding mode for the coder based on the content of frames processed by the speech processor.

In another aspect of the invention, a method of processing frames advantageously includes the steps of applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame; frequency-domain encoding the input frame if the speech content of the input frame indicates steady-state voiced speech; comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and time-domain encoding the input frame if the performance measure falls below a predefined threshold value.

In another aspect of the invention, a multimode, mixed-domain speech processor advantageously includes means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame; means for frequency-domain encoding the input frame if the speech content of the input frame indicates steady-state voiced speech; means for time-domain encoding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech; means for comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and means for time-domain encoding the input frame if the performance measure falls below a predefined threshold value.

In FIG. 1, a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission over a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal s_SYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted over a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal s_SYNTH(n).

The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, including, e.g., pulse code modulation (PCM), companded μ-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data, where each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates may be used. As used herein, the terms "full rate" or "high rate" generally refer to data rates that are greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates that are less than or equal to 4 kbps. Varying the data transmission rate is advantageous because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
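The frame arithmetic in this paragraph can be checked with a short sketch. The sampling rate, frame duration, and four data rates come from the text; the constant and function names are hypothetical, and dropping a trailing partial frame is an assumption made only to keep the example small.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate from the exemplary embodiment
FRAME_MS = 20           # frame duration from the exemplary embodiment
RATES_BPS = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of speech samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(rate_bps, frame_ms=FRAME_MS):
    """Bit budget of one frame at the given data rate."""
    return rate_bps * frame_ms // 1000


def split_into_frames(samples, frame_len):
    """Split a sample list into consecutive full frames.

    Assumption: a trailing partial frame is simply dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


for name, rate in RATES_BPS.items():
    print(name, bits_per_frame(rate))
```

With these values a frame holds 160 samples, and the per-frame bit budgets are 160, 80, 40, and 20 bits for full, half, quarter, and eighth rate respectively.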

The first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech coder. It is understood by those skilled in the art that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine may be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Application Ser. No. 08/197,417, entitled "VOCODER ASIC", filed February 16, 1994, assigned to the assignee of the present invention, and incorporated herein by reference.

In accordance with one embodiment, as illustrated in FIG. 2, a multimode, mixed-domain linear prediction (MDLP) encoder 100 that may be used in a speech coder includes a mode decision module 102, a pitch estimation module 104, a linear prediction (LP) analysis module 106, an LP analysis filter 108, an LP quantization module 110, and an MDLP residual encoder 112. Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108. The mode decision module 102 produces a mode index I_M and a mode M based on the periodicity and other extracted parameters, such as the energy, spectral tilt, and zero-crossing rate, of each input speech frame s(n). Various methods of classifying speech frames according to periodicity are described in U.S. Application Ser. No. 08/815,354, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING", filed March 11, 1997, assigned to the assignee of the present invention, and incorporated herein by reference. Such methods are also incorporated into the Telecommunications Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733.
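The kind of per-frame measurements a mode decision module might use can be sketched as follows. This is only an illustrative classifier, not the method of the cited application or of IS-127/IS-733: the feature definitions, the lag range, the thresholds, and the class labels are all assumptions, and spectral tilt is omitted for brevity.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def periodicity(frame, min_lag=20, max_lag=120):
    """Peak normalized autocorrelation over an assumed range of pitch lags."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = frame[lag:], frame[:-lag]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0:
            best = max(best, num / den)
    return best


def classify(frame, energy_floor=1e-4, periodicity_min=0.8, zcr_max=0.3):
    """Toy open-loop decision; all thresholds are hypothetical."""
    if frame_energy(frame) < energy_floor:
        return "silence"
    if periodicity(frame) >= periodicity_min and zero_crossing_rate(frame) <= zcr_max:
        return "voiced"
    return "unvoiced_or_transition"


tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
print(classify(tone))  # voiced
```

A real mode decision would combine more features and smooth decisions across frames; the point here is only that the decision is made from the input frame alone, without looking at the coder output.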

The pitch estimation module 104 produces a pitch index I_P and a lag value P_0 based on each input speech frame s(n). The LP analysis module 106 performs linear predictive analysis on each input speech frame s(n) to generate an LP parameter a. The LP parameter a is provided to the LP quantization module 110. The LP quantization module 110 also receives the mode M and performs the quantization process in a mode-dependent manner. The LP quantization module 110 produces an LP index I_LP and a quantized LP parameter. The LP analysis filter 108 receives the quantized LP parameter in addition to the input speech frame s(n). The LP analysis filter 108 generates an LP residual signal R[n], which represents the error between the input speech frame s(n) and the speech reconstructed from the quantized linear prediction parameters. The LP residual R[n], the mode M, and the quantized LP parameter are provided to the MDLP residual encoder 112. Based on these values, the MDLP residual encoder 112 produces a residual index I_R and a quantized residual signal in accordance with steps described below with reference to the flow chart of FIG. 4.
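The LP analysis and LP analysis filter stages can be sketched with the textbook autocorrelation method and the Levinson-Durbin recursion. This is a generic illustration and not necessarily how modules 106 and 108 are implemented: quantization of the LP parameters is skipped, windowing is omitted, and the function names and the zero-history convention are assumptions.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with s[n] ~ sum_k a[k] * s[n - k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break  # degenerate frame: remaining coefficients stay at zero
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # updated prediction error energy
    return a[1:]


def lp_residual(frame, coeffs):
    """R[n] = s[n] - sum_k a[k] * s[n - k], assuming zero history before the frame."""
    out = []
    for n in range(len(frame)):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(frame[n] - pred)
    return out


# Synthetic first-order signal: the predictor should recover roughly 0.9.
frame = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
residual = lp_residual(frame, coeffs)
```

For this synthetic frame almost all of the energy is removed by prediction, leaving a residual that is close to a single impulse, which is the property the residual encoder exploits.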

In FIG. 3, a decoder 200 that may be used in a speech coder includes an LP parameter decoding module 202, a residual decoding module 204, a mode decoding module 206, and an LP synthesis filter 208. The mode decoding module 206 receives and decodes a mode index I_M, generating therefrom a mode M. The LP parameter decoding module 202 receives the mode M and an LP index I_LP. The LP parameter decoding module 202 decodes the received values to produce a quantized LP parameter. The residual decoding module 204 receives a residual index I_R, a pitch index I_P, and the mode index I_M. The residual decoding module 204 decodes the received values to generate a quantized residual signal. The quantized residual signal and the quantized LP parameter are provided to the LP synthesis filter 208, which synthesizes a decoded output speech signal therefrom.
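The final decoder stage can be illustrated with a generic all-pole synthesis filter, which regenerates speech samples by adding the prediction from past outputs back to the decoded residual. This is a minimal sketch under the usual LP convention, not the patent's implementation of filter 208; the function name and the zero initial filter state are assumptions.

```python
def lp_synthesis_filter(residual, coeffs):
    """All-pole synthesis: s[n] = r[n] + sum_k a[k] * s[n - k].

    Assumes zero filter state before the first sample of the frame.
    """
    out = []
    for n, r in enumerate(residual):
        pred = sum(c * out[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(r + pred)
    return out


# A unit impulse through a first-order filter decays as 0.9 ** n.
impulse = [1.0] + [0.0] * 9
print([round(x, 4) for x in lp_synthesis_filter(impulse, [0.9])])
```

In a running decoder the filter state would be carried over from the previous frame rather than reset to zero.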

With the exception of the MDLP residual encoder 112, the operation and implementation of the various modules of the encoder 100 of FIG. 2 and the decoder 200 of FIG. 3 are described in the aforementioned U.S. Pat. No. 5,414,796 and in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978).

In accordance with one embodiment, an MDLP encoder (not shown) performs the steps shown in the flow chart of FIG. 4. The MDLP encoder may be the MDLP residual encoder 112 of FIG. 2. In step 300 the MDLP encoder checks whether the mode M is full rate (FR), quarter rate (QR), or eighth rate (ER). If the mode M is FR, QR, or ER, the MDLP encoder proceeds to step 302. In step 302 the MDLP encoder applies the corresponding rate (FR, QR, or ER, depending on the value of M) to the residual index I_R. For FR mode, time-domain coding, which is high-precision, high-rate coding and may advantageously be CELP coding, is applied to an LP residual frame or, alternatively, to a speech frame. The frame is then transmitted (after further signal processing, including digital-to-analog conversion and modulation). In one embodiment the frame is an LP residual frame representing prediction error. In an alternate embodiment the frame is a speech frame representing speech samples.

If, on the other hand, the mode M was not FR, QR, or ER in step 300 (i.e., if the mode M is half rate (HR)), the MDLP encoder proceeds to step 304. In step 304 spectral coding, which is advantageously harmonic coding, is applied to the LP residual or speech signal at half rate. The MDLP encoder then proceeds to step 306. In step 306 a distortion measure D is obtained by decoding the encoded speech and comparing it with the original input frame. The MDLP encoder then proceeds to step 308. In step 308 the distortion measure D is compared with a predefined threshold value T. If the distortion measure D is greater than the threshold T, the corresponding quantized parameters for the half-rate, spectrally encoded frame are modulated and transmitted. If, on the other hand, the distortion measure D is not greater than the threshold T, the MDLP encoder proceeds to step 310. In step 310 the decoded frame is re-encoded in the time domain at full rate. Any conventional high-rate, high-precision coding algorithm, such as, advantageously, CELP coding, may be used. The FR-mode quantized parameters associated with the frame are then modulated and transmitted.
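The flow of FIG. 4 can be sketched as follows. This is only an illustration: the four callables passed in (`time_encode`, `spectral_encode`, `spectral_decode`, `measure`) are hypothetical placeholders for the coders and the comparison described above, and `measure` is assumed to return a value D for which larger means a better match, in keeping with the rule that a frame is kept at half rate when D is greater than T. The sketch also assumes that step 310 re-encodes the input frame.

```python
def mdlp_encode(frame, mode, threshold,
                time_encode, spectral_encode, spectral_decode, measure):
    """Return (rate, parameters) for one frame, following steps 300-310."""
    # Steps 300/302: FR, QR, and ER frames are time-domain encoded directly.
    if mode in ("FR", "QR", "ER"):
        return mode, time_encode(frame, mode)

    # Step 304: half-rate spectral (e.g., harmonic) coding.
    params = spectral_encode(frame)

    # Step 306: decode and compare with the original input frame.
    d = measure(frame, spectral_decode(params))

    # Step 308: keep the half-rate result only if D exceeds the threshold T.
    if d > threshold:
        return "HR", params

    # Step 310: otherwise fall back to full-rate time-domain coding
    # (assumption: the input frame is what gets re-encoded).
    return "FR", time_encode(frame, "FR")
```

Passing the coders in as arguments keeps the decision logic separate from any particular CELP or harmonic coder implementation.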

As illustrated in the flow chart of FIG. 5, a closed-loop, multimode, MDLP speech coder in accordance with one embodiment follows a set of steps in processing speech samples for transmission. In step 400 the speech coder receives digital samples of a speech signal in successive frames. Upon receiving a given frame, the speech coder proceeds to step 402. In step 402 the speech coder detects the energy of the frame. The energy is a measure of the speech activity of the frame. Speech detection is performed by summing the squares of the amplitudes of the digitized speech samples and comparing the resulting energy against a threshold value. In one embodiment the threshold value adapts based on the changing level of background noise. An exemplary variable-threshold speech activity detector is described in the aforementioned U.S. Pat. No. 5,414,796. Some unvoiced speech sounds can be extremely low-energy samples that may be mistakenly encoded as background noise. To prevent this, the spectral tilt of low-energy samples may be used to distinguish unvoiced speech from background noise, as described in the aforementioned U.S. Pat. No. 5,414,796.

After detecting the energy of the frame, the speech coder proceeds to step 404. In step 404 the speech coder determines whether the detected frame energy is sufficient to classify the frame as containing speech information. If the detected frame energy falls below a predefined threshold level, the speech coder proceeds to step 406. In step 406 the speech coder encodes the frame as background noise (i.e., nonspeech, or silence). In one embodiment the background noise frame is time-domain encoded at eighth rate, or 1 kbps. If in step 404 the detected frame energy meets or exceeds the predefined threshold level, the frame is classified as speech and the speech coder proceeds to step 408.

In step 408 the speech coder determines whether the frame is periodic. Various known methods of periodicity determination include, e.g., the use of zero crossings and the use of normalized autocorrelation functions (NACFs). In particular, the use of zero crossings and NACFs to detect periodicity is described in U.S. application Ser. No. 08/815,354, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING," filed Mar. 11, 1997, assigned to the assignee of the present invention, and incorporated herein by reference. In addition, the above methods used to distinguish voiced speech from unvoiced speech are incorporated into the Telecommunication Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733. If the frame is determined in step 408 not to be periodic, the speech coder proceeds to step 410. In step 410 the speech coder encodes the frame as unvoiced speech. In one embodiment unvoiced speech frames are time-domain encoded at quarter rate, or 2 kbps. If in step 408 the frame is determined to be periodic, the speech coder proceeds to step 412.
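The two periodicity cues named above can be sketched in a few lines. This is a generic illustration of zero-crossing counting and of a normalized autocorrelation at a single candidate lag; it is not the specific procedure of the referenced application or of IS-127/IS-733, and the function names are hypothetical.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def nacf(frame, lag):
    """Normalized autocorrelation of the frame at one lag (0 < lag < len(frame)).

    Values near 1 indicate strong periodicity at that lag.
    """
    x = frame[lag:]
    y = frame[:-lag]
    num = sum(p * q for p, q in zip(x, y))
    den = math.sqrt(sum(p * p for p in x) * sum(q * q for q in y))
    return num / den if den > 0 else 0.0
```

A voiced frame typically shows few zero crossings and an NACF close to 1 at its pitch lag, while noise-like unvoiced speech shows many zero crossings and a low NACF at every lag.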

In step 412 the speech coder determines whether the frame is sufficiently periodic, using periodicity detection methods known in the art, as described in, e.g., the aforementioned U.S. application Ser. No. 08/815,354. If the frame is determined not to be sufficiently periodic, the speech coder proceeds to step 414. In step 414 the frame is time-domain encoded as transition speech (i.e., transition from unvoiced speech to voiced speech). In one embodiment the transition speech frame is time-domain encoded at full rate, or 8 kbps.

If the speech coder determines in step 412 that the frame is sufficiently periodic, the speech coder proceeds to step 416. In step 416 the speech coder encodes the frame as voiced speech. In one embodiment voiced speech frames are spectrally encoded at half rate, or 4 kbps. Advantageously, the voiced speech frames are spectrally encoded with a harmonic coder, as described below with reference to FIG. 7. Alternatively, other spectral coders could be used, such as, e.g., sinusoidal transform coders or multiband excitation coders, as known in the art. The speech coder then proceeds to step 418. In step 418 the speech coder decodes the encoded voiced speech frame. The speech coder then proceeds to step 420. In step 420 the decoded voiced speech frame is compared with the corresponding input speech samples for that frame to obtain a measure of synthesized speech distortion and to determine whether the half-rate, voiced-speech, spectral coding model is operating within acceptable limits. The speech coder then proceeds to step 422.

In step 422 the speech coder determines whether the error between the decoded voiced speech frame and the input speech samples corresponding to that frame falls below a predefined threshold value. In accordance with one embodiment, this determination is made in the manner described below with reference to FIG. 6. If the encoding distortion falls below the predefined threshold value, the speech coder proceeds to step 424. In step 424 the speech coder transmits the frame as voiced speech, using the parameters of step 416. If in step 422 the encoding distortion meets or exceeds the predefined threshold value, the speech coder proceeds to step 414, time-domain encoding the frame of digitized speech samples received in step 400 as transition speech at full rate.

Steps 400-410 comprise an open-loop encoding decision mode. Steps 412-426, on the other hand, comprise a closed-loop encoding decision mode.
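The classification of FIG. 5 reduces to a short decision chain, sketched below. The function name, its arguments, and the returned labels are illustrative assumptions; the energy, the two periodicity decisions, and the distortion value are taken as already computed by the steps described above.

```python
def classify_frame(energy, energy_threshold,
                   periodic, sufficiently_periodic,
                   distortion, distortion_threshold):
    """Return (speech class, rate, coding domain) for one frame."""
    # Steps 404/406: low-energy frames are coded as background noise.
    if energy < energy_threshold:
        return ("background", "eighth", "time")
    # Steps 408/410: non-periodic speech is coded as unvoiced.
    if not periodic:
        return ("unvoiced", "quarter", "time")
    # Steps 412/414: periodic but not sufficiently so: transition speech.
    if not sufficiently_periodic:
        return ("transition", "full", "time")
    # Steps 416-424: voiced speech survives only if the closed-loop
    # distortion check falls below the threshold.
    if distortion < distortion_threshold:
        return ("voiced", "half", "spectral")
    # Step 422 -> 414: otherwise re-encode as transition speech at full rate.
    return ("transition", "full", "time")
```

In a real coder the distortion would only be evaluated after spectral encoding and decoding in steps 416-420; it is passed in here to keep the sketch free of any particular coder.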

In one embodiment, shown in FIG. 6, a closed-loop, multimode, MDLP speech coder includes an analog-to-digital converter (A/D) 500 coupled to a frame buffer 502, which in turn is coupled to a control processor 504. An energy calculator 506, a voiced speech detector 508, a background noise encoder 510, a high-rate, time-domain encoder 512, and a low-rate, spectral encoder 514 are coupled to the control processor 504. A spectral decoder 516 is coupled to the spectral encoder 514, and an error calculator 518 is coupled to the spectral decoder 516 and to the control processor 504. A threshold comparator 520 is coupled to the error calculator 518 and to the control processor 504. A buffer 522 is coupled to the spectral encoder 514, the spectral decoder 516, and the threshold comparator 520.

In the embodiment of FIG. 6, the speech coder components are advantageously implemented as firmware or other software-driven modules in the speech coder, which itself advantageously resides in a DSP or an ASIC. The speech coder components could equally well be implemented in a number of other known ways. The control processor 504 may advantageously be a microprocessor, but may alternatively be implemented with a controller, a state machine, or discrete logic.

In the multimode coder of FIG. 6, speech signals are provided to the A/D 500. The A/D 500 converts the analog signals to frames of digitized speech samples, S(n). The digitized speech samples are provided to the frame buffer 502. The control processor 504 takes the digitized speech samples from the frame buffer 502 and provides them to the energy calculator 506. The energy calculator 506 computes the energy, E, of the speech samples in accordance with

$$E = \sum_{n=0}^{159} S^2(n),$$

where the frames are 20 ms long and the sampling rate is 8 kHz, so that each frame contains 160 samples.

The control processor 504 compares the calculated speech energy with a speech activity threshold. If the calculated energy is below the speech activity threshold, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the background noise encoder 510. The background noise encoder 510 encodes the frame using the minimum number of bits necessary to preserve an estimate of the background noise.
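A minimal sketch of the energy calculation and of the routing decision made by the control processor 504 follows. The constant and function names are illustrative, and the routing labels are hypothetical stand-ins for the background noise encoder 510 and the voiced speech detector 508.

```python
FRAME_MS = 20
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples per frame


def frame_energy(samples):
    """Sum of squared sample amplitudes over the frame."""
    return sum(s * s for s in samples)


def route_by_energy(samples, activity_threshold):
    """Pick the next processing block for a frame based on its energy."""
    if frame_energy(samples) < activity_threshold:
        return "background_noise_encoder"   # element 510
    return "voiced_speech_detector"         # element 508
```

A frame whose energy equals the threshold is routed to the voiced speech detector, matching the greater-than-or-equal rule used for speech frames.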

If the calculated energy is greater than or equal to the speech activity threshold, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the voiced speech detector 508. The voiced speech detector 508 determines whether the level of periodicity of the speech frame would allow efficient coding using low-rate spectral encoding. Methods of determining the level of periodicity in a speech frame are well known in the art and include, e.g., the use of normalized autocorrelation functions (NACFs) and zero crossings. These and other methods are described in the aforementioned U.S. application Ser. No. 08/815,354.

The voiced speech detector 508 provides a signal to the control processor 504 indicating whether the speech frame contains speech of sufficient periodicity to be efficiently encoded by the spectral encoder 514. If the voiced speech detector 508 determines that the speech frame lacks sufficient periodicity, the control processor 504 directs the digitized speech samples to the high-rate encoder 512, which time-domain encodes the speech at a predetermined maximum data rate. In one embodiment the predetermined maximum data rate is 8 kbps, and the high-rate encoder 512 is a CELP coder.

If the voiced speech detector 508 initially determines that the speech signal has sufficient periodicity to be efficiently encoded by the spectral encoder 514, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the spectral encoder 514. An exemplary spectral encoder is described below with reference to FIG. 7.

The spectral encoder 514 extracts the estimated pitch frequency F_0, the amplitudes A_I of the harmonics of the pitch frequency, and voicing information V_c. The spectral encoder 514 provides these parameters to the buffer 522 and to the spectral decoder 516. The spectral decoder 516 may advantageously be analogous to the encoder's decoder in conventional CELP encoders. The spectral decoder 516 generates synthesized speech samples, Ŝ(n), in accordance with a spectral decoding format (described below with reference to FIG. 7) and provides the synthesized speech samples to the error calculator 518. The control processor 504 sends the speech samples S(n) to the error calculator 518.

The error calculator 518 computes the mean square error (MSE) between each speech sample, S(n), and each corresponding synthesized speech sample, Ŝ(n), in accordance with

$$\mathrm{MSE} = \sum_{n=0}^{159} \left( S(n) - \hat{S}(n) \right)^2.$$

The computed MSE is provided to the threshold comparator 520, which determines whether the level of distortion is within acceptable boundaries, i.e., whether the level of distortion falls below a predefined threshold value.

If the computed MSE is within acceptable limits, the threshold comparator 520 provides a signal to the buffer 522, and the spectrally encoded data is output from the speech coder. If, on the other hand, the MSE is not within acceptable limits, the threshold comparator 520 provides a signal to the control processor 504, which in turn directs the digitized samples from the frame buffer 502 to the high-rate, time-domain encoder 512. The time-domain encoder 512 encodes the frames at a predetermined maximum rate, and the contents of the buffer 522 are discarded.
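The roles of the error calculator 518 and the threshold comparator 520 can be sketched as follows. The function names are hypothetical, and the error is written as a plain sum of squared differences over the frame, which is an assumption about the normalization; dividing by the frame length would only rescale the threshold.

```python
def frame_error(original, synthesized):
    """Sum of squared differences between input and synthesized samples."""
    return sum((s - t) ** 2 for s, t in zip(original, synthesized))


def within_limits(error, threshold):
    """True when the distortion falls below the predefined threshold."""
    return error < threshold
```

When `within_limits` returns True, the buffered spectral parameters would be released for output; otherwise they would be discarded and the frame handed to the high-rate, time-domain encoder.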

In the embodiment of FIG. 6, the type of spectral coding used is harmonic coding, as described below with reference to FIG. 7, but it could alternatively be any type of spectral coding, such as, e.g., sinusoidal transform coding or multiband excitation coding. The use of multiband excitation coding is described in, e.g., U.S. Pat. No. 5,195,166, and the use of sinusoidal transform coding is described in, e.g., U.S. Pat. No. 4,865,068.

For transition frames, and for voiced frames for which the phase distortion threshold is equal to or less than the periodicity parameter, the multimode coder of FIG. 6 advantageously employs CELP coding at full rate, or 8 kbps, by means of the high-rate, time-domain encoder 512. Alternatively, any other known form of high-rate, time-domain coding may be used for such frames. Thus, transition frames (and voiced frames that are not sufficiently periodic) are coded with high precision, so that the waveforms at input and output are well matched, with phase information being well preserved. In one embodiment, after a predefined number of consecutive voiced frames for which the threshold exceeds the periodicity measure have been processed, the multimode coder switches from half-rate spectral coding to full-rate CELP coding for one frame, regardless of the determination of the threshold comparator 520.

The energy calculator 506 and the voiced speech detector 508, in conjunction with the control processor 504, comprise open-loop encoding decisions. In contrast, the spectral encoder 514, the spectral decoder 516, the error calculator 518, the threshold comparator 520, and the buffer 522, in conjunction with the control processor 504, comprise a closed-loop encoding decision.

In one embodiment, described with reference to FIG. 7, spectral coding, and advantageously harmonic coding, is used to encode sufficiently periodic voiced frames at a low bit rate. Spectral coders are generally defined as algorithms that attempt to preserve the time evolution of speech spectral characteristics in a perceptually meaningful way by modeling and encoding each frame of speech in the frequency domain. The essential parts of such algorithms are: (1) spectral analysis or parameter estimation; (2) parameter quantization; and (3) synthesis of the output speech waveform with the decoded parameters. The objective is thus to preserve the important characteristics of the short-term speech spectrum with a set of spectral parameters, to encode the parameters, and then to synthesize the output speech using the decoded spectral parameters. Typically, the output speech is synthesized as a weighted sum of sinusoids. The amplitudes, frequencies, and phases of the sinusoids are the spectral parameters estimated during analysis.

"Analysis by synthesis" is a well-known technique in CELP coding, but the technique is not used in spectral coding. The primary reason analysis by synthesis is not applied to spectral coders is that, due to the loss of initial phase information, the mean square error (MSE) of the synthesized speech can be high even though the speech model is functioning properly from a perceptual point of view. Hence, another advantage of accurately generating the initial phase is the resulting ability to compare the speech samples directly with the reconstructed speech in order to determine whether the speech model is encoding speech frames accurately.

In spectral coding, the output speech frame is synthesized as

$$S[n] = S_v[n] + S_{uv}[n], \quad n = 1, 2, \ldots, N,$$

where N is the number of samples per frame, and S_v and S_uv are the voiced and unvoiced components, respectively. A sum-of-sinusoids synthesis process creates the voiced component,

$$S_v[n] = \sum_{k=1}^{L} A(k, n) \cos\bigl( 2\pi f_k n + \theta(k, n) \bigr),$$

where L is the total number of sinusoids, f_k are the frequencies of interest in the short-term spectrum, A(k, n) are the amplitudes of the sinusoids, and θ(k, n) are the phases of the sinusoids. The amplitude, frequency, and phase parameters are estimated from the short-term spectrum of the input frame by a spectral analysis process. The unvoiced component may be created together with the voiced component in a single sum-of-sinusoids synthesis, or it may be computed separately by a dedicated unvoiced synthesis process and then added back to S_v.
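The sum-of-sinusoids synthesis of the voiced component can be sketched directly from its definition. In this illustration the frequencies f_k are assumed to be expressed in cycles per sample, and the amplitude and phase tracks are supplied as hypothetical callables of (k, n).

```python
import math


def synthesize_voiced(amplitude, phase, freqs, num_samples):
    """Return [S_v[1], ..., S_v[N]] as a sum of L = len(freqs) sinusoids.

    amplitude(k, n) and phase(k, n) give A(k, n) and theta(k, n);
    freqs[k-1] is f_k in cycles per sample (assumed normalization).
    """
    out = []
    for n in range(1, num_samples + 1):
        out.append(sum(
            amplitude(k, n) * math.cos(2 * math.pi * f * n + phase(k, n))
            for k, f in enumerate(freqs, start=1)
        ))
    return out
```

For example, a single unit-amplitude, zero-phase sinusoid at f_1 = 0.25 yields the repeating pattern 0, -1, 0, 1 (up to rounding), i.e., one cycle every four samples.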

In the embodiment of FIG. 7, a particular type of spectral coder called a harmonic coder is used to spectrally encode sufficiently periodic voiced frames at a low bit rate. Harmonic coders characterize a frame as a sum of sinusoids, analyzing small segments of the frame. Each sinusoid in the sum of sinusoids has a frequency that is an integer multiple of the pitch, F_0, of the frame. In an alternate embodiment, in which the particular type of spectral coder used is other than a harmonic coder, the sinusoid frequencies for each frame are taken from a set of real numbers between 0 and 2π. In the embodiment of FIG. 7, the amplitudes and phases of each sinusoid in the sum are advantageously selected so that the sum best matches the signal over one period, as illustrated by the graph of FIG. 8. Harmonic coders typically employ an external classification, labeling each input speech frame as either "voiced" or "unvoiced." For a voiced frame, the frequencies of the sinusoids are restricted to the harmonics of the estimated pitch F_0, i.e., f_k = kF_0. For unvoiced speech, the peaks of the short-term spectrum are used to determine the sinusoids. The amplitudes and phases are interpolated to mimic their evolution over the frame as

$$A(k, n) = C_1(k)\, n + C_2(k)$$

$$\theta(k, n) = B_1(k)\, n^2 + B_2(k)\, n + B_3(k)$$

where the coefficients [C_i(k), B_i(k)] are estimated from the instantaneous values of the amplitudes, frequencies, and phases at the specified frequency locations f_k (= kF_0), obtained from the short-term Fourier transform (STFT) of a windowed input speech frame. The parameters to be transmitted per sinusoid are the amplitude and the frequency. The phase is not transmitted, but is instead modeled in accordance with any of several known techniques, including, e.g., the quadratic phase model.
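The linear amplitude track and quadratic phase track can be evaluated as below. The helper `linear_coeffs`, which derives C_1 and C_2 from amplitudes at the two ends of a frame, is an illustrative assumption; the text only states that the coefficients are estimated from instantaneous STFT values.

```python
def amplitude_track(c1, c2, n):
    """A(k, n) = C1(k) * n + C2(k) for one sinusoid."""
    return c1 * n + c2


def phase_track(b1, b2, b3, n):
    """theta(k, n) = B1(k) * n**2 + B2(k) * n + B3(k) for one sinusoid."""
    return b1 * n * n + b2 * n + b3


def linear_coeffs(a_start, a_end, num_samples):
    """Hypothetical helper: fit C1, C2 so that A(0) = a_start and A(N) = a_end."""
    return (a_end - a_start) / num_samples, a_start
```

With these definitions, B_3(k) plays the role of the initial phase of sinusoid k at the start of the frame, while B_1(k) and B_2(k) shape how the phase advances across the frame.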

As illustrated in FIG. 7, a harmonic coder includes a pitch extractor 600 coupled to windowing logic 602, together with discrete Fourier transform (DFT) and harmonic analysis logic 604. The pitch extractor 600, which receives speech samples S(n) as input, is also coupled to the DFT and harmonic analysis logic 604. The DFT and harmonic analysis logic 604 is coupled to a residual encoder 606. The pitch extractor 600, the DFT and harmonic analysis logic 604, and the residual encoder 606 are each coupled to a parameter quantizer 608. The parameter quantizer 608 is coupled to a channel encoder 610, which in turn is coupled to a transmitter 612. The transmitter 612 is coupled to a receiver 614 by means of a standard radio-frequency (RF) interface such as, e.g., a code division multiple access (CDMA) over-the-air interface. The receiver 614 is coupled to a channel decoder 616, which in turn is coupled to an inverse quantizer 618. The inverse quantizer 618 is coupled to a sum-of-sinusoids speech synthesizer 620. Also coupled to the sum-of-sinusoids speech synthesizer 620 is a phase estimator 622, which receives previous-frame information as input. The sum-of-sinusoids speech synthesizer 620 is configured to generate a synthesized speech output, s_SYNTH(n).

The pitch extractor 600, the windowing logic 602, the DFT and harmonic analysis logic 604, the residual encoder 606, the parameter quantizer 608, the channel encoder 610, the channel decoder 616, the inverse quantizer 618, the sum-of-sinusoids speech synthesizer 620, and the phase estimator 622 can be implemented in a variety of ways known in the art, including, e.g., firmware or software modules. The transmitter 612 and the receiver 614 may be implemented with any equivalent standard RF components known in the art.

In the harmonic coder of FIG. 7, the input samples S(n) are received by the pitch extractor 600, which extracts pitch frequency information F_0. The samples are then multiplied by a suitable windowing function by the windowing logic 602 to allow analysis of small segments of a speech frame. Using the pitch information supplied by the pitch extractor 600, the DFT and harmonic analysis logic 604 computes the DFT of the samples to generate complex spectral points from which the harmonic amplitudes A_I are extracted, as illustrated by the graph of FIG. 8, in which L denotes the total number of harmonics. The DFT is provided to the residual encoder 606, which extracts voicing information V_c.
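The windowing and harmonic analysis steps can be sketched as follows. The Hamming window, the evaluation of the DFT only at multiples of F_0, and the normalization by half the window sum are illustrative choices, not details taken from the text; the function names are hypothetical.

```python
import cmath
import math


def hamming(length):
    """Symmetric Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def harmonic_amplitudes(frame, f0_hz, fs_hz, num_harmonics):
    """Estimate the amplitude of each harmonic k * F0, k = 1..L."""
    window = hamming(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    norm = sum(window) / 2.0  # a unit-amplitude cosine maps to about 1.0
    amps = []
    for k in range(1, num_harmonics + 1):
        omega = 2 * math.pi * k * f0_hz / fs_hz
        # DFT of the windowed frame evaluated at the k-th harmonic frequency.
        point = sum(x * cmath.exp(-1j * omega * n)
                    for n, x in enumerate(windowed))
        amps.append(abs(point) / norm)
    return amps
```

For a 160-sample frame containing a cosine of amplitude 2 at 200 Hz, sampled at 8 kHz, the first estimated amplitude is close to 2 and the higher harmonics are close to zero.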

As shown in FIG. 8, the V_c parameter denotes a point on the frequency axis above which the spectrum is characteristic of an unvoiced speech signal and is no longer harmonic. Below the point V_c, in contrast, the spectrum is harmonic and characteristic of voiced speech.

The A₁, F₀, and V_c components are provided to the parameter quantizer 608, which quantizes the information. The quantized information is provided in the form of packets to the channel encoder 610, which quantizes the packets at a low bit rate such as half rate or 4 kbps. The packets are provided to the transmitter 612, which modulates the packets and transmits the resulting signal over the air to the receiver 614. The receiver 614 receives and demodulates the signal and passes the encoded packets to the channel decoder 616. The channel decoder 616 decodes the packets and provides the decoded packets to the inverse quantizer 618. The inverse quantizer 618 dequantizes the information, which is then provided to the sum-of-sinusoids speech synthesizer 620.

The sum-of-sinusoids speech synthesizer 620 is configured to synthesize a plurality of sinusoids that model the short-term speech spectrum in accordance with the above equation for S[n]. The frequencies of the sinusoids, f_k, are multiples, or harmonics, of the fundamental frequency F₀, which is the frequency of pitch periodicity for quasi-periodic (i.e., transition) voiced speech segments.

The sum-of-sinusoids speech synthesizer 620 also receives phase information from the phase estimator 622. The phase estimator 622 receives previous-frame information, i.e., the A₁, F₀, and V_c parameters for the immediately preceding frame. The phase estimator 622 also receives the N reconstructed samples of the previous frame, where N is the frame length (i.e., the number of samples per frame). The phase estimator 622 determines the initial phase for the current frame based on the information for the previous frame and provides this initial phase determination to the sum-of-sinusoids speech synthesizer 620. Based on the information for the current frame and on the initial phase calculation performed by the phase estimator 622 from the past-frame information, the sum-of-sinusoids speech synthesizer 620 generates synthetic speech frames, as described below.
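
The role of the initial phase in sum-of-sinusoids synthesis can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not the synthesizer 620 itself: the pitch is held constant within the frame, so each harmonic's phase advances linearly, and any higher-order term of the document's phase model is omitted. The function name, its parameters, and the returned end-of-frame phases are hypothetical.

```python
import math


def synthesize_frame(amplitudes, f0_hz, initial_phases, frame_length, sample_rate_hz):
    """Sum harmonically related cosines; return (samples, end_phases).

    amplitudes[k-1] and initial_phases[k-1] belong to the k-th harmonic.
    end_phases holds each harmonic's phase at the first sample of the NEXT
    frame, so it can be passed in as that frame's initial phases.
    """
    samples = [0.0] * frame_length
    end_phases = []
    for k, (amp, phase0) in enumerate(zip(amplitudes, initial_phases), start=1):
        step = 2.0 * math.pi * k * f0_hz / sample_rate_hz  # phase advance per sample
        for n in range(frame_length):
            samples[n] += amp * math.cos(step * n + phase0)
        end_phases.append((phase0 + step * frame_length) % (2.0 * math.pi))
    return samples, end_phases


# Hypothetical usage: two consecutive 80-sample frames with the phase carried over.
first, carried = synthesize_frame([1.0, 0.5], 210.0, [0.3, 1.0], 80, 8000)
second, _ = synthesize_frame([1.0, 0.5], 210.0, carried, 80, 8000)
```

Carrying the end phases into the next frame keeps the waveform continuous across the frame boundary, which is why a poor initial phase estimate shows up as misalignment with the input speech.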

As described above, harmonic coders synthesize, or reconstruct, speech frames by using previous-frame information and predicting that the phase varies linearly from frame to frame. In the synthesis model described above, which is commonly referred to as the quadratic phase model, the coefficient B₃(k) represents the initial phase for the current voiced frame being synthesized. In determining the phase, conventional harmonic coders either set the initial phase to zero or generate an initial phase value randomly or with some pseudo-random generation method. To predict the phase more accurately, the phase estimator 622 uses one of two possible methods for determining the initial phase, depending on whether the immediately preceding frame was determined to be a voiced speech frame (i.e., a sufficiently periodic frame) or a transition speech frame. If the previous frame was a voiced speech frame, the final estimated phase value of that frame is used as the initial phase value of the current frame. If, on the other hand, the previous frame was classified as a transition frame, the initial phase value for the current frame is obtained from the spectrum of the previous frame, which is obtained by performing a DFT of the decoder output for the previous frame. The phase estimator 622 thus makes use of accurate phase information that is already available (because the previous frame, being a transition frame, was processed at full rate).

In one embodiment a closed-loop, multimode, MDLP speech coder follows the speech processing steps illustrated in the flow chart of FIG. 9. The speech coder encodes the LP residue of each input speech frame by choosing the most appropriate coding mode. Certain modes encode the LP residue, or speech residue, in the time domain, while other modes represent the LP residue, or speech residue, in the frequency domain. The set of modes is full rate, time domain for transition frames (T mode); half rate, frequency domain for voiced frames (V mode); quarter rate, time domain for unvoiced frames (U mode); and eighth rate, time domain for noise frames (N mode).

Either the speech signal or the corresponding LP residue may be encoded by following the steps shown in FIG. 9. The waveform characteristics of noise, unvoiced, transition, and voiced speech are shown as functions of time in the graph of FIG. 10A. The waveform characteristics of noise, unvoiced, transition, and voiced LP residue are shown as functions of time in the graph of FIG. 10B.

In step 700 an open-loop mode decision is made as to which of the four modes (T, V, U, or N) to apply to the input speech residue S(n). If T mode is to be applied, the speech residue is processed under T mode, i.e., at full rate in the time domain, in step 702. If U mode is to be applied, the speech residue is processed under U mode, i.e., at quarter rate in the time domain, in step 704. If N mode is to be applied, the speech residue is processed under N mode, i.e., at eighth rate in the time domain, in step 706. If V mode is to be applied, the speech residue is processed under V mode, i.e., at half rate in the frequency domain, in step 708.
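
The four modes and their rates can be summarized as a small lookup table. The Python sketch below is only an illustration: the dictionary layout, the field names, and the `average_rate` helper are assumptions, and rates are expressed as fractions of full rate rather than in bits per second.

```python
# Mode table taken from the description: domain, rate relative to full rate,
# and the frame type each mode is intended for.
MODES = {
    "T": {"domain": "time", "rate": 1.0, "frame_type": "transition"},
    "V": {"domain": "frequency", "rate": 0.5, "frame_type": "voiced"},
    "U": {"domain": "time", "rate": 0.25, "frame_type": "unvoiced"},
    "N": {"domain": "time", "rate": 0.125, "frame_type": "noise"},
}


def average_rate(mode_sequence):
    """Average coding rate, as a fraction of full rate, over a sequence of mode labels."""
    if not mode_sequence:
        raise ValueError("mode_sequence must not be empty")
    return sum(MODES[mode]["rate"] for mode in mode_sequence) / len(mode_sequence)


# Hypothetical usage: one transition frame, two voiced frames, one noise frame.
example_rate = average_rate(["T", "V", "V", "N"])
```

Because only transition frames are sent at full rate, a sequence dominated by voiced, unvoiced, and noise frames averages well below full rate.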

In step 710 the speech encoded in step 708 is decoded and compared with the input speech residue S(n), and a performance measure D is computed. In step 712 the performance measure D is compared with a predefined threshold value T. If the performance measure D is greater than or equal to the threshold value T, the spectrally encoded speech residue of step 708 is approved for transmission in step 714. If, on the other hand, the performance measure D is less than the threshold value T, the input speech residue S(n) is processed under T mode in step 716. In an alternate embodiment, no performance measure is computed and no threshold value is defined. Instead, after a predefined number of speech residue frames have been processed under V mode, the next frame is processed under T mode.
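
Steps 708 through 716 amount to an encode, decode, compare, and fall-back loop, sketched below in Python. The choice of signal-to-noise ratio in decibels as the performance measure D is an assumption made for illustration, and the three coder callables (`encode_v`, `decode_v`, `encode_t`) are hypothetical placeholders for the V-mode and T-mode coders.

```python
import math


def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB between a reference frame and its decoded version."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def closed_loop_encode(frame, encode_v, decode_v, encode_t, threshold_db):
    """Try V mode first; keep it only if the performance measure reaches the threshold."""
    v_packet = encode_v(frame)                         # step 708
    measure = snr_db(frame, decode_v(v_packet))        # step 710
    if measure >= threshold_db:                        # step 712
        return "V", v_packet                           # step 714: approve for transmission
    return "T", encode_t(frame)                        # step 716: fall back to T mode
```

Raising `threshold_db` sends more frames to T mode and raises the average rate; lowering it does the opposite, which is how the threshold trades quality against bit rate.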

Advantageously, the decision steps shown in FIG. 9 allow the high-bit-rate T mode to be used only when necessary. The periodicity of voiced speech segments is exploited with the lower-bit-rate V mode, while a lapse in quality is prevented by switching to full rate when the V mode does not perform adequately. Accordingly, very high voice quality approaching full-rate voice quality can be produced at an average rate significantly lower than full rate. Moreover, the target voice quality can be controlled by the performance measure selected and the threshold chosen.

The "updates" to the T mode also improve the performance of subsequent applications of the V mode by keeping the model phase track close to the phase track of the input speech. When performance in the V mode is inadequate, the closed-loop performance check of steps 710 and 712 switches to the T mode, which "refreshes" the initial phase value and thereby improves the performance of subsequent V-mode processing, because the model phase track can again come close to the original input speech phase track. For example, as shown in the graphs of FIGS. 11A-C, the fifth frame from the beginning does not perform adequately in the V mode, as evidenced by the PSNR distortion measure used. Consequently, without the closed-loop decision and update, the modeled phase track deviates significantly from the original input speech phase track, as shown in FIG. 11C, leading to severe degradation in PSNR. Performance for subsequent frames processed in the V mode also degrades. Under the closed-loop decision, however, the fifth frame is switched to T-mode processing, as shown in FIG. 11A. As shown in FIG. 11B, the performance of subsequent frames processed in the V mode also improves, as evidenced by the improvement in PSNR.

The decision steps shown in FIG. 9 improve the quality of the V-mode representation by providing an extremely accurate initial phase estimate, which ensures that the resulting V-mode synthesized speech residue signal is accurately time-aligned with the original input speech residue S(n). The initial phases for the first V-mode-processed speech residue segment are derived from the immediately preceding decoded frame in the following manner. For each harmonic, the initial phase is set equal to the final estimated phase of the previous frame if the previous frame was processed under V mode. For each harmonic, the initial phase is set equal to the actual harmonic phase of the previous frame if the previous frame was processed under T mode. The actual harmonic phase of the previous frame may be derived by taking a DFT of the past decoded residue using the entire previous frame. Alternatively, the actual harmonic phase of the previous frame may be derived by taking a DFT of the past decoded frame in a pitch-synchronous manner, processing various pitch periods of the previous frame.
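
The two-way rule for the initial phase can be sketched in Python as follows. Several details are assumptions not stated in the text: the Hann window, the use of the previous frame's pitch `f0_hz`, the choice to reference the DFT phase to the first sample of the current frame (so that the measured phase is already advanced to the frame boundary), and the error raised for previous modes other than V or T. All names are hypothetical.

```python
import cmath
import math


def harmonic_phases_at_frame_end(decoded_frame, f0_hz, num_harmonics, sample_rate_hz):
    """Measure each harmonic's phase from the whole decoded frame via a windowed DFT.

    The time origin is placed at index len(decoded_frame), i.e. the first sample
    of the following frame, so the result can serve directly as an initial phase.
    """
    length = len(decoded_frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]
    phases = []
    for k in range(1, num_harmonics + 1):
        omega = 2.0 * math.pi * k * f0_hz / sample_rate_hz
        point = sum(w * x * cmath.exp(-1j * omega * (n - length))
                    for n, (w, x) in enumerate(zip(window, decoded_frame)))
        phases.append(cmath.phase(point))
    return phases


def initial_phases(prev_mode, prev_end_phases, prev_decoded_frame,
                   f0_hz, num_harmonics, sample_rate_hz):
    """Pick the initial phases for a V-mode frame from the previous frame's mode."""
    if prev_mode == "V":
        # Previous frame was frequency-domain coded: reuse its final estimated phases.
        return list(prev_end_phases)
    if prev_mode == "T":
        # Previous frame was time-domain coded: measure phases from its decoded output.
        return harmonic_phases_at_frame_end(prev_decoded_frame, f0_hz,
                                            num_harmonics, sample_rate_hz)
    raise ValueError("the text defines the rule only for previous V or T frames")
```

The pitch-synchronous alternative mentioned in the text would replace the single whole-frame DFT with DFTs over individual pitch periods; it is not shown here.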

Thus, a novel closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder has been described. The various illustrative logical blocks and algorithm steps described in connection with the embodiments disclosed herein may be implemented or performed with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate or transistor logic, discrete hardware components such as registers and FIFOs, a processor executing a set of firmware instructions, or any conventional programmable software module and a processor. The processor may advantageously be a microprocessor, but in the alternative the processor may be any conventional processor, controller, microcontroller, or state machine. The software module may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. The data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description are advantageously represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Preferred embodiments of the present invention have thus been shown and described. Various alterations may nevertheless be made to the embodiments disclosed herein without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited except in accordance with the following claims.

Claims

A multimode, mixed-domain speech processor, comprising:

a coder having at least one time-domain coding mode and at least one frequency-domain coding mode; and

a closed-loop mode-selection device coupled to the coder and configured to select a coding mode for the coder based on the content of frames processed by the speech processor.

The speech processor of claim 1,

wherein the coder encodes speech frames.

The speech processor of claim 1,

wherein the coder encodes the linear prediction residue of speech frames.

The speech processor of claim 1,

wherein the at least one time-domain coding mode comprises a coding mode that codes frames at a first coding rate,

the at least one frequency-domain coding mode comprises a coding mode that codes frames at a second coding rate,

and the second coding rate is less than the first coding rate.

The speech processor of claim 1,

wherein the at least one frequency-domain coding mode comprises a harmonic coding mode.

The speech processor of claim 1,

further comprising a comparison circuit coupled to the coder and configured to compare uncoded frames with frames coded with the at least one frequency-domain coding mode and to generate a performance measure based on the comparison,

wherein the coder applies the at least one time-domain coding mode only if the performance measure falls below a predefined threshold value, and otherwise applies the at least one frequency-domain coding mode.

The speech processor of claim 1,

wherein the coder applies the at least one time-domain coding mode to each frame that immediately follows a predefined number of consecutively processed frames coded with the at least one frequency-domain coding mode.

The speech processor of claim 1,

wherein the at least one frequency-domain coding mode represents the short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequencies, phases, and amplitudes,

the phases are modeled with a polynomial representation and an initial phase value,

and the initial phase value is either (1) the final estimated phase value of the previous frame if the previous frame was coded with the at least one frequency-domain coding mode, or (2) a phase value derived from the short-term spectrum of the previous frame if the previous frame was coded with the at least one time-domain coding mode.

9. The speech processor of claim 8,

wherein the sinusoidal frequencies for each frame are integer multiples of the pitch frequency of the frame.

9. The speech processor of claim 8,

wherein the sinusoidal frequencies are taken from a set of real numbers between 0 and 2π.

A method of processing frames, comprising the steps of:

applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame;

frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech;

time-domain coding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech;

comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and

time-domain coding the input frame if the performance measure falls below a predefined threshold value.

12. The method of claim 11,

Wherein the frame is a linear prediction residual frame.

12. The method of claim 11,

Wherein the frame is a voice frame.

12. The method of claim 11,

The time domain coding step comprises coding the frame at a first coding rate,

The frequency domain coding step includes coding the frame at a second coding rate,

And the second coding rate is less than the first coding rate.

12. The method of claim 11,

Wherein the frequency domain coding step comprises harmonic coding.

12. The method of claim 11,

wherein the frequency-domain coding step comprises representing the short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequencies, phases, and amplitudes,

the phases are modeled with a polynomial representation and an initial phase value,

and the initial phase value is either (1) the final estimated phase value of the previous frame if the previous frame was frequency-domain coded, or (2) a phase value derived from the short-term spectrum of the previous frame if the previous frame was time-domain coded.

17. The method of claim 16,

Wherein the sinusoidal frequency for each frame is taken from a set of real numbers between 0 and 2π.

A multimode, mixed-domain speech processor, comprising:

means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame;

means for frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech;

means for time-domain coding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech;

means for comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and

means for time-domain coding the input frame if the performance measure falls below a predefined threshold value.

20. The speech processor of claim 19,

wherein the input frame is a linear prediction residue frame.

20. The speech processor of claim 19,

wherein the input frame is a speech frame.

20. The speech processor of claim 19,

wherein the means for time-domain coding comprises means for coding the frame at a first coding rate,

the means for frequency-domain coding comprises means for coding the frame at a second coding rate,

and the second coding rate is less than the first coding rate.

20. The speech processor of claim 19,

wherein the means for frequency-domain coding comprises a harmonic coder.

20. The speech processor of claim 19,

wherein the means for frequency-domain coding comprises means for representing the short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequencies, phases, and amplitudes,

the phases are modeled with a polynomial representation and an initial phase value,

and the initial phase value is either (1) the final estimated phase value of the immediately preceding frame if the preceding frame was frequency-domain coded, or (2) a phase value derived from the short-term spectrum of the immediately preceding frame if the preceding frame was time-domain coded.

25. The method of claim 24,