KR100711047B1

KR100711047B1 - Closed-loop multimode mixed-domain linear prediction speech coder

Info

Publication number: KR100711047B1
Application number: KR1020027011306A
Authority: KR
Inventors: 다스아미타바
Original assignee: 퀄컴 인코포레이티드
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2007-04-24
Also published as: KR20020081374A; JP2003525473A; CN1266674C; CN1437747A; DE60031002D1; JP4907826B2; AU2000233851A1; EP1259957A1; ES2269112T3; WO2001065544A1; ATE341074T1; DE60031002T2; EP1259957B1; HK1055833A1

Abstract

A closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism for selecting a coding mode for the coder based upon the speech content of frames input to the coder. Transition speech (i.e., from unvoiced speech to voiced speech, or vice versa) frames are encoded with the high-rate, time-domain coding mode, which may be a CELP coding mode. Voiced speech frames are encoded with the low-rate, frequency-domain coding mode, which may be a harmonic coding mode. Phase parameters are not encoded by the frequency-domain coding mode, and are instead modeled in accordance with, e.g., a quadratic phase model. For each speech frame encoded with the frequency-domain coding mode, the initial phase value is taken to be the initial phase value of the immediately preceding speech frame encoded with the frequency-domain coding mode. If the immediately preceding speech frame was encoded with the time-domain coding mode, the initial phase value of the current speech frame is computed from the decoded speech frame information of the immediately preceding, time-domain-encoded speech frame. Each speech frame encoded with the frequency-domain coding mode may be compared with the corresponding input speech frame to obtain a performance measure. If the performance measure falls below a predefined threshold value, the input speech frame is encoded with the time-domain coding mode.

Description

Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder {CLOSED-LOOP MULTIMODE MIXED-DOMAIN LINEAR PREDICTION SPEECH CODER}

The present invention relates generally to speech processing, and more particularly to closed-loop, multimode, mixed-domain speech coding methods and apparatus.

Transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through speech analysis followed by appropriate coding, transmission, and resynthesis at the receiver, the data rate can be reduced significantly.

Devices that compress speech by extracting parameters related to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into a binary representation, i.e., a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and resynthesizes the speech frames using the dequantized parameters.

The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing the natural redundancies inherent in speech. Digital compression is achieved by representing the input speech frame with a set of parameters and using quantization to represent the parameters with a set of bits. If the input speech frame has N_i bits and the data packet produced by the speech coder has N_o bits, the compression factor achieved by the speech coder is

$C_r = N_i / N_o$.

The challenge is to retain high voice quality in the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process performs at the target bit rate of N_o bits per frame. The goal of the speech model is therefore to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
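As a worked illustration of the compression factor, the following minimal sketch computes N_i / N_o for one frame. The 20 ms frame length, the 64 kbps input rate, and the 8 kbps output rate are taken from figures quoted elsewhere in this description; the function and variable names are hypothetical.

```python
def compression_factor(bits_in: int, bits_out: int) -> float:
    """Compression factor C_r = N_i / N_o for a single frame."""
    if bits_out <= 0:
        raise ValueError("bits_out must be positive")
    return bits_in / bits_out


# Assumed example: one 20 ms frame of 64 kbps digitized speech
# compressed into an 8 kbps data packet.
FRAME_MS = 20
n_i = 64_000 * FRAME_MS // 1000  # bits in the input frame
n_o = 8_000 * FRAME_MS // 1000   # bits in the coded packet
c_r = compression_factor(n_i, n_o)
```

Under these assumptions the input frame carries 1280 bits and the packet 160 bits, so the coder compresses by a factor of 8.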

Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by using high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and use a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored code vectors, in accordance with known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).

A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978). In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. CELP coding thus divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, N_o, for each frame) or at a variable rate (in which different bit rates are used for different types of frame content). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and incorporated herein by reference.

Time-domain coders such as the CELP coder typically rely on a high number of bits per frame, N_o, to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits per frame is relatively large (e.g., 8 kbps or above). At low bit rates (4 kbps and below), however, time-domain coders fail to retain high quality and robust performance because of the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.

There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss. Various recent speech coding standardization efforts are another direct driving force behind research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of coder specifications and deliver robust performance under channel error conditions.

For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded, and an output frame of speech is created from the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model with a compact set of parameters that can be accurately quantized with the small number of bits available at low bit rates.

Nevertheless, low-bit-rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism and renders the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate, frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-dequantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (i.e., the major pulses will not be in sync). It has therefore proven difficult to adopt any closed-loop performance measure, such as signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.

Multimode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multimode coding technique is described in Amitava Das et al., Multimode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multimode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and decides which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters with respect to certain temporal and spectral characteristics, and basing a mode decision on the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures.

Based on the foregoing, it would be desirable to provide a low-bit-rate, frequency-domain coder that estimates phase information more accurately. It would further be advantageous to provide a multimode, mixed-domain coder that time-domain encodes certain speech frames and frequency-domain encodes other speech frames based upon the speech content of the frames. It would be still further desirable to provide a mixed-domain coder that can time-domain encode certain speech frames and frequency-domain encode other speech frames in accordance with a closed-loop coding mode decision mechanism. Thus, there is a need for a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech produced by the coder and the original speech input to the coder.

The present invention is directed to a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech produced by the coder and the original speech input to the coder. Accordingly, in one aspect of the invention, a multimode, mixed-domain speech processor advantageously includes a coder having at least one time-domain coding mode and at least one frequency-domain coding mode, and a closed-loop mode-selection device coupled to the coder and configured to select a coding mode for the coder based upon the content of frames processed by the speech processor.

In another aspect of the invention, a method of processing frames advantageously includes the steps of: applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of the input frame; frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech; comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and time-domain coding the input frame if the performance measure falls below a predefined threshold value.
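The method steps above can be sketched as a small decision routine. This is only an illustration under stated assumptions: the performance measure is taken to be a plain SNR in dB, the 10 dB threshold is arbitrary, and `freq_coder` and `time_coder` are hypothetical callables that return the decoded frame for their respective modes. None of these choices is specified by the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
Coder = Callable[[Frame], List[float]]  # hypothetical: returns the decoded frame


def snr_db(reference: Frame, decoded: Frame) -> float:
    """Signal-to-noise ratio of the decoded frame against the input, in dB."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def select_mode(frame: Frame, steady_voiced: bool, freq_coder: Coder,
                time_coder: Coder, threshold_db: float = 10.0) -> Tuple[str, List[float]]:
    """Open-loop choice first, then a closed-loop check on the result."""
    if not steady_voiced:
        # Open-loop decision: anything but steady-state voiced speech
        # goes to the time-domain mode.
        return "time", time_coder(frame)
    candidate = freq_coder(frame)
    if snr_db(frame, candidate) < threshold_db:
        # Closed-loop fallback: the frequency-domain result is too far
        # from the input, so re-encode the frame in the time domain.
        return "time", time_coder(frame)
    return "frequency", candidate


# Toy coders for illustration only.
frame = [math.sin(0.1 * n) for n in range(160)]
close_freq = lambda f: [0.99 * x for x in f]  # nearly matches the input
bad_freq = lambda f: [-x for x in f]          # misaligned output, very low SNR
exact_time = lambda f: list(f)

mode_ok, _ = select_mode(frame, True, close_freq, exact_time)
mode_fallback, _ = select_mode(frame, True, bad_freq, exact_time)
mode_unvoiced, _ = select_mode(frame, False, close_freq, exact_time)
```

With these toy coders the first call keeps the frequency-domain result, while the second and third end up in the time-domain mode.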

In yet another aspect of the invention, a multimode, mixed-domain speech processor advantageously includes: means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of the input frame; means for frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech; means for time-domain coding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech; means for comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and means for time-domain coding the input frame if the performance measure falls below a predefined threshold value.

FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders.

FIG. 2 is a block diagram of an encoder that can be used in a multimode, mixed-domain linear prediction (MDLP) speech coder.

FIG. 3 is a block diagram of a decoder that can be used in a multimode, MDLP speech coder.

FIG. 4 is a flowchart illustrating MDLP encoding steps performed by an MDLP encoder that can be used in the encoder of FIG. 2.

FIG. 5 is a flowchart illustrating a speech coding decision process.

FIG. 6 is a block diagram of a closed-loop, multimode, MDLP speech coder.

FIG. 7 is a block diagram of a spectral coder that can be used in the coder of FIG. 6 or the encoder of FIG. 2.

FIG. 8 is a graph of amplitude versus frequency, illustrating the amplitudes of sinusoids in a harmonic coder.

FIG. 9 is a flowchart illustrating a mode decision process in a multimode, MDLP speech coder.

FIG. 10A is a graph of speech signal amplitude versus time, and FIG. 10B is a graph of linear prediction (LP) residual amplitude versus time.

FIG. 11A is a graph of rate/mode versus frame index under a closed-loop encoding decision, FIG. 11B is a graph of perceptual signal-to-noise ratio (PSNR) versus frame index under a closed-loop decision, and FIG. 11C is a graph of both rate/mode and PSNR versus frame index in the absence of a closed-loop encoding decision.

In FIG. 1, a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission over a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal s_SYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted over a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal s_SYNTH(n).

The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, including, e.g., pulse code modulation (PCM) and companded μ-law or A-law. As is known in the art, the speech samples s(n) are organized into frames of input data, each frame comprising a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, so that each 20 ms frame comprises 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Other data rates may also be used. As used herein, the terms "full rate" and "high rate" generally refer to data rates greater than or equal to 8 kbps, and the terms "half rate" and "low rate" generally refer to data rates less than or equal to 4 kbps. Varying the data transmission rate is advantageous because lower bit rates can be used selectively for frames containing relatively little speech information. As those skilled in the art will appreciate, other sampling rates, frame sizes, and data transmission rates may be used.
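The frame arithmetic in this embodiment can be checked with a short sketch. The sampling rate, frame duration, and four data rates are those quoted above; the names `RATES_BPS` and `split_into_frames` are hypothetical, and dropping a trailing partial frame is an assumption made only to keep the example small.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
RATES_BPS = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}

# Samples carried by one frame at the stated sampling rate.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000

# Bit budget of one coded frame at each data rate.
bits_per_frame = {name: bps * FRAME_MS // 1000 for name, bps in RATES_BPS.items()}


def split_into_frames(samples, frame_len=samples_per_frame):
    """Split a sample sequence into full frames (a trailing partial frame is dropped)."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


frames = split_into_frames(list(range(400)))
```

A 20 ms frame therefore holds 160 samples and is coded with 160, 80, 40, or 20 bits at full, half, quarter, and eighth rate respectively.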

The first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech coder. Those skilled in the art will understand that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Application Ser. No. 08/197,417, entitled "VOCODER ASIC", filed Feb. 16, 1994, assigned to the assignee of the present invention, and incorporated herein by reference.

In accordance with one embodiment, as illustrated in FIG. 2, a multimode, mixed-domain linear prediction (MDLP) encoder 100 that may be used in a speech coder includes a mode decision module 102, a pitch estimation module 104, a linear prediction (LP) analysis module 106, an LP analysis filter 108, an LP quantization module 110, and an MDLP residual encoder 112. Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108. The mode decision module 102 produces a mode index I_M and a mode M based upon the periodicity and other extracted parameters, such as energy, spectral tilt, and zero-crossing rate, of each input speech frame s(n). Various methods of classifying speech frames according to periodicity are described in U.S. Application Ser. No. 08/815,354, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING", filed Mar. 11, 1997, assigned to the assignee of the present invention, and incorporated herein by reference. Such methods are also incorporated into the Telecommunication Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733.
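To make the extracted parameters concrete, the sketch below computes three of the features named above (energy, spectral tilt, zero-crossing rate) for one frame and applies a toy classification rule. It is not the classification method of the referenced application or standards: the thresholds are arbitrary assumptions, the normalized lag-1 autocorrelation is used as a simple stand-in for spectral tilt, and periodicity is left out for brevity.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_tilt(frame):
    """Normalized lag-1 autocorrelation: near +1 for low-frequency-dominated
    frames, near -1 for high-frequency-dominated frames."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    return r1 / r0


def open_loop_class(frame, energy_floor=1e-4, zcr_voiced_max=0.25, tilt_voiced_min=0.5):
    """Toy open-loop decision with assumed thresholds."""
    if frame_energy(frame) < energy_floor:
        return "silence"
    if zero_crossing_rate(frame) <= zcr_voiced_max and spectral_tilt(frame) > tilt_voiced_min:
        return "voiced"
    return "unvoiced"


# Synthetic 160-sample frames at an assumed 8 kHz sampling rate.
voiced_like = [math.sin(2 * math.pi * 100 * n / 8000 + 0.3) for n in range(160)]
noise_like = [0.5 if n % 2 == 0 else -0.5 for n in range(160)]
silent = [0.0] * 160
labels = [open_loop_class(f) for f in (voiced_like, noise_like, silent)]
```

A practical mode decision would combine such features with a periodicity measure and tuned thresholds before producing the mode M.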

The pitch estimation module 104 produces a pitch index I_P and a lag value P_0 based upon each input speech frame s(n). The LP analysis module 106 performs linear predictive analysis on each input speech frame s(n) to generate an LP parameter a. The LP parameter a is provided to the LP quantization module 110. The LP quantization module 110 also receives the mode M, and performs the quantization process in a mode-dependent manner. The LP quantization module 110 produces an LP index I_LP and a quantized LP parameter $\hat{a}$. The LP analysis filter 108 receives the quantized LP parameter $\hat{a}$ in addition to the input speech frame s(n). The LP analysis filter 108 generates an LP residual signal R[n], which represents the error between the input speech frame s(n) and the speech reconstructed from the quantized linear prediction parameters $\hat{a}$. The LP residual R[n], the mode M, and the quantized LP parameter $\hat{a}$ are provided to the MDLP residual encoder 112. Based upon these values, the MDLP residual encoder 112 produces a residual index I_R and a quantized residual signal $\hat{R}[n]$ in accordance with steps described below with reference to the flowchart of FIG. 4.
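The LP analysis and analysis-filter stages can be illustrated with a minimal sketch. It assumes the common autocorrelation method with a Levinson-Durbin recursion and uses unquantized coefficients, whereas the encoder described above filters with the quantized parameters; the function names and the prediction order are hypothetical.

```python
def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] such that
    s_hat[n] = sum_k a[k] * s[n - k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at zero
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(s, a):
    """R[n] = s[n] - sum_k a[k] * s[n - 1 - k]; samples before the frame are taken as zero."""
    out = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(s[n] - pred)
    return out


# Toy signal: a decaying exponential, which a first-order predictor models well.
s = [0.9 ** n for n in range(200)]
a = levinson_durbin(autocorr(s, 2), 2)
residual = lp_residual(s, a)
```

For this toy input the first coefficient comes out close to 0.9 and the residual carries far less energy than the signal, which is the redundancy removal that the later residual coding relies on.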

In FIG. 3, a decoder 200 that may be used in a speech coder includes an LP parameter decoding module 202, a residual decoding module 204, a mode decoding module 206, and an LP synthesis filter 208. The mode decoding module 206 receives and decodes a mode index I_M, generating therefrom a mode M. The LP parameter decoding module 202 receives the mode M and an LP index I_LP. The LP parameter decoding module 202 decodes the received values to produce a quantized LP parameter $\hat{a}$. The residual decoding module 204 receives a residual index I_R, a pitch index I_P, and the mode index I_M. The residual decoding module 204 decodes the received values to generate a quantized residual signal $\hat{R}[n]$. The quantized residual signal $\hat{R}[n]$ and the quantized LP parameter $\hat{a}$ are provided to the LP synthesis filter 208, which synthesizes a decoded output speech signal $\hat{s}[n]$ therefrom.

With the exception of the MDLP residual encoder 112, the operation and implementation of the various modules of the encoder 100 of FIG. 2 and the decoder 200 of FIG. 3 are described in the aforementioned U.S. Pat. No. 5,414,796 and in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978).

According to one embodiment, an MDLP encoder (not shown) performs the steps shown in the flowchart of FIG. 4. The MDLP encoder may be the MDLP residual encoder 112 of FIG. 2. In step 300, the MDLP encoder checks whether the mode M is full rate (FR), quarter rate (QR), or eighth rate (ER). If the mode M is FR, QR, or ER, the MDLP encoder proceeds to step 302. In step 302, the MDLP encoder applies the corresponding rate (FR, QR, or ER, depending on the value of M) to the residual index I_R. Time-domain coding, which for the FR mode is high-precision, high-rate coding and may preferably be CELP coding, is applied to an LP residual frame or, alternatively, to a speech frame. The frame is then transmitted (after further signal processing, including digital-to-analog conversion and modulation). In one embodiment, the frame is an LP residual frame representing prediction error. In another embodiment, the frame is a speech frame representing speech samples.

If, on the other hand, the mode M in step 300 is not FR, QR, or ER (i.e., the mode M is half rate (HR)), the MDLP encoder proceeds to step 304. In step 304, spectral coding, which is preferably harmonic coding, is applied to the LP residual or speech signal at half rate. The MDLP encoder then proceeds to step 306. In step 306, a distortion measure D is obtained by decoding the encoded speech and comparing it with the original input frame. The MDLP encoder then proceeds to step 308. In step 308, the distortion measure D is compared with a predefined threshold T. If the distortion measure D is greater than the threshold T, the corresponding quantized parameters for the half-rate, spectrally encoded frame are modulated and transmitted. If, on the other hand, the distortion measure D is not greater than the threshold T, the MDLP encoder proceeds to step 310. In step 310, the decoded frame is re-encoded in the time domain at full rate. Any conventional high-rate, high-precision coding algorithm, preferably CELP coding, may be used. The FR-mode quantized parameters associated with the frame are then modulated and transmitted.

As shown in the flowchart of FIG. 5, a closed-loop, multimode, MDLP speech coder according to one embodiment of the present invention follows a series of steps in processing speech samples for transmission. In step 400, the speech coder receives digital samples of a speech signal in successive frames. Upon receiving a given frame, the speech coder proceeds to step 402. In step 402, the speech coder detects the energy of the frame. The energy is a measure of the speech activity of the frame. Speech detection is performed by summing the squares of the amplitudes of the digitized speech samples and comparing the resulting energy with a threshold. In one embodiment, the threshold adapts to the changing level of background noise. An exemplary variable-threshold speech activity detector is described in the aforementioned U.S. Pat. No. 5,414,796. Some unvoiced speech sounds can consist of extremely low-energy samples and may therefore be mistakenly encoded as background noise. To prevent this, the spectral tilt of low-energy samples may be used to distinguish unvoiced speech from background noise, as described in the aforementioned U.S. Pat. No. 5,414,796.

After detecting the energy of the frame, the speech coder proceeds to step 404. In step 404, the speech coder determines whether the detected frame energy is sufficient to classify the frame as containing speech information. If the detected frame energy falls below a predefined threshold level, the speech coder proceeds to step 406. In step 406, the speech coder encodes the frame as background noise (i.e., nonspeech, or silence). In one embodiment, the background noise frame is time-domain encoded at 1/8 rate, or 1 kbps. If, in step 404, the detected frame energy meets or exceeds the predefined threshold level, the frame is classified as speech and the speech coder proceeds to step 408.

In step 408, the speech coder determines whether the frame is periodic. Various known methods of determining periodicity include, for example, the use of zero crossings and the use of normalized autocorrelation functions (NACFs). In particular, the use of zero crossings and NACFs to detect periodicity is described in U.S. application Ser. No. 08/815,354, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING," filed Mar. 11, 1997, assigned to the assignee of the present invention, and incorporated herein by reference. In addition, the above methods used to distinguish voiced speech from unvoiced speech are incorporated into the Telecommunication Industry Association Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733. If, in step 408, the frame is determined not to be periodic, the speech coder proceeds to step 410. In step 410, the speech coder encodes the frame as unvoiced speech. In one embodiment, unvoiced speech frames are time-domain encoded at 1/4 rate, or 2 kbps. If, in step 408, the frame is determined to be periodic, the speech coder proceeds to step 412.
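
The two periodicity cues named above, zero crossings and the normalized autocorrelation function, can be illustrated with a minimal sketch. This is a generic textbook formulation, not the method of the cited application; the function names and the lag search range of 20 to 120 samples are assumptions made for illustration only.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples of the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def nacf(frame, lag):
    """Normalized autocorrelation of the frame at one lag (value in [-1, 1])."""
    head = frame[lag:]                 # samples x[n] for n >= lag
    tail = frame[:len(frame) - lag]    # samples x[n - lag]
    num = sum(x * y for x, y in zip(head, tail))
    den = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
    return num / den if den > 0.0 else 0.0


def peak_nacf(frame, min_lag=20, max_lag=120):
    """Largest NACF over a hypothetical pitch-lag search range (assumed)."""
    return max(nacf(frame, lag) for lag in range(min_lag, max_lag + 1))
```

A strongly periodic frame gives a peak NACF close to 1 at its pitch lag, while noise-like unvoiced frames tend to show a low peak NACF and a high zero-crossing count.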

In step 412, the speech coder determines whether the frame is sufficiently periodic, using periodicity detection methods known in the art, for example, as described in the aforementioned U.S. application Ser. No. 08/815,354. If the frame is determined not to be sufficiently periodic, the speech coder proceeds to step 414. In step 414, the frame is time-domain encoded as transition speech (i.e., a transition from unvoiced speech to voiced speech). In one embodiment, the transition speech frame is time-domain encoded at full rate, or 8 kbps.

If, in step 412, the speech coder determines that the frame is sufficiently periodic, the speech coder proceeds to step 416. In step 416, the speech coder encodes the frame as voiced speech. In one embodiment, voiced speech frames are spectrally encoded at half rate, or 4 kbps. Preferably, the voiced speech frames are spectrally encoded with a harmonic coder, as described below with reference to FIG. 7. Alternatively, other spectral coders known in the art may be used, such as sinusoidal transform coders or multiband excitation coders. The speech coder then proceeds to step 418. In step 418, the speech coder decodes the encoded voiced speech frame. The speech coder then proceeds to step 420. In step 420, the decoded voiced speech frame is compared with the corresponding input speech samples for that frame to obtain a measure of synthesized speech distortion and to determine whether the half-rate, voiced-speech, spectral coding model is operating within acceptable limits. The speech coder then proceeds to step 422.

In step 422, the speech coder determines whether the error between the decoded voiced speech frame and the input speech samples corresponding to that frame falls below a predefined threshold. According to one embodiment, this determination is made in the manner described below with reference to FIG. 6. If the encoding distortion falls below the predefined threshold, the speech coder proceeds to step 424. In step 424, the speech coder transmits the frame as voiced speech, using the parameters of step 416. If, in step 422, the encoding distortion meets or exceeds the predefined threshold, the speech coder proceeds to step 414 and time-domain encodes the frame of digitized speech samples received in step 400 as transition speech at full rate.

Steps 400 through 410 constitute an open-loop encoding decision mode. Steps 412 through 426, on the other hand, constitute a closed-loop encoding decision mode.
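
The decision flow of FIG. 5 can be summarized in a short sketch. It only mirrors the branching described in the text (steps 404 through 424); the `Mode` enumeration, the function name, and the argument names are hypothetical, and the periodicity flags and the closed-loop error are assumed to have been computed elsewhere.

```python
from enum import Enum


class Mode(Enum):
    BACKGROUND_EIGHTH_RATE = "1/8 rate, time domain"    # step 406
    UNVOICED_QUARTER_RATE = "1/4 rate, time domain"     # step 410
    TRANSITION_FULL_RATE = "full rate, time domain"     # step 414
    VOICED_HALF_RATE = "half rate, spectral"            # steps 416 and 424


def select_mode(energy, energy_threshold, is_periodic,
                is_sufficiently_periodic, coding_error, error_threshold):
    """Return the coding mode for one frame, following the FIG. 5 branches."""
    # Step 404: energy below the threshold means background noise.
    if energy < energy_threshold:
        return Mode.BACKGROUND_EIGHTH_RATE
    # Step 408: speech that is not periodic is coded as unvoiced speech.
    if not is_periodic:
        return Mode.UNVOICED_QUARTER_RATE
    # Step 412: periodic but not sufficiently periodic means transition speech.
    if not is_sufficiently_periodic:
        return Mode.TRANSITION_FULL_RATE
    # Step 422 (closed loop): keep the half-rate spectral encoding only if
    # the error of the decoded frame falls below the threshold.
    if coding_error < error_threshold:
        return Mode.VOICED_HALF_RATE
    # Otherwise fall back to full-rate time-domain coding (step 414).
    return Mode.TRANSITION_FULL_RATE
```

Only the last branch requires encoding and decoding the frame before the mode is known, which is what distinguishes the closed-loop part of the flow from the open-loop part.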

In one embodiment, as shown in FIG. 6, a closed-loop, multimode, MDLP speech coder includes an analog-to-digital converter (A/D) 500 coupled to a frame buffer 502, which in turn is coupled to a control processor 504. An energy calculator 506, a voiced speech detector 508, a background noise encoder 510, a high-rate, time-domain encoder 512, and a low-rate spectral encoder 514 are coupled to the control processor 504. A spectral decoder 516 is coupled to the spectral encoder 514, and an error calculator 518 is coupled to the spectral decoder 516 and to the control processor 504. A threshold comparator 520 is coupled to the error calculator 518 and to the control processor 504. A buffer 522 is coupled to the spectral encoder 514, to the spectral decoder 516, and to the threshold comparator 520.

In the embodiment of FIG. 6, the speech coder components are preferably implemented as firmware or other software-driven modules in the speech coder, which itself preferably resides in a DSP or an ASIC. The speech coder components could equally well be implemented in a number of other known ways. The control processor 504 may preferably be a microprocessor, but may alternatively be implemented with a controller, a state machine, or discrete logic.

In the multimode coder of FIG. 6, speech signals are provided to the A/D 500. The A/D 500 converts the analog signals into frames of digitized speech samples S(n). The digitized speech samples are provided to the frame buffer 502. The control processor 504 takes the digitized speech samples from the frame buffer 502 and provides them to the energy calculator 506. The energy calculator 506 computes the energy E of the speech samples, where the frames are 20 ms long and the sampling rate is 8 kHz.
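
As a rough illustration of the energy computation, the sketch below sums the squared sample values over one frame, as the description of step 402 suggests. The frame length of 160 samples follows from 20 ms at 8 kHz; the function names, the absence of any normalization, and the fixed threshold argument are assumptions.

```python
FRAME_MS = 20
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 20 ms at 8 kHz


def frame_energy(samples):
    """Sum of squared sample values over one frame (no normalization assumed)."""
    return sum(s * s for s in samples)


def is_speech(samples, activity_threshold):
    """True when the frame energy meets or exceeds the activity threshold."""
    return frame_energy(samples) >= activity_threshold
```

In practice the threshold would track the background noise level rather than stay fixed, as noted for the variable-threshold detector of U.S. Pat. No. 5,414,796.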

The control processor 504 compares the computed speech energy with a speech activity threshold. If the computed energy is below the speech activity threshold, the control processor 504 routes the digitized speech samples from the frame buffer 502 to the background noise encoder 510. The background noise encoder 510 encodes the frame using the minimum number of bits needed to preserve an estimate of the background noise.

If the computed energy is greater than or equal to the speech activity threshold, the control processor 504 routes the digitized speech samples from the frame buffer 502 to the voiced speech detector 508. The voiced speech detector 508 determines whether the level of periodicity of the speech frame allows efficient coding using low-rate spectral encoding. Methods of determining the level of periodicity in a speech frame are known in the art and include, for example, the use of normalized autocorrelation functions (NACFs) and zero crossings. These and other methods are described in the aforementioned U.S. application Ser. No. 08/815,354.

The voiced speech detector 508 provides the control processor 504 with a signal indicating whether the speech frame contains speech of sufficient periodicity to be encoded efficiently by the spectral encoder 514. If the voiced speech detector 508 determines that the speech frame lacks sufficient periodicity, the control processor 504 routes the digitized speech samples to the high-rate encoder 512, which time-domain encodes the speech at a predetermined maximum data rate. In one embodiment, the predetermined maximum data rate is 8 kbps, and the high-rate encoder 512 is a CELP coder.

If the voiced speech detector 508 initially determines that the speech signal has sufficient periodicity to be encoded efficiently by the spectral encoder 514, the control processor 504 routes the digitized speech samples from the frame buffer 502 to the spectral encoder 514. An exemplary spectral encoder is described below with reference to FIG. 7.

The spectral encoder 514 extracts the estimated pitch frequency F₀, the amplitudes A_I of the harmonics of the pitch frequency, and voicing information V_c. The spectral encoder 514 provides these parameters to the buffer 522 and to the spectral decoder 516. The spectral decoder 516 may preferably be analogous to the decoder embedded in the encoder of a conventional CELP coder. The spectral decoder 516 generates synthesized speech samples in accordance with a spectral decoding format (described below with reference to FIG. 7) and provides the synthesized speech samples to the error calculator 518. The control processor 504 sends the speech samples S(n) to the error calculator 518.

The error calculator 518 computes the mean square error (MSE) between each speech sample S(n) and each corresponding synthesized speech sample. The computed MSE is provided to the threshold comparator 520, which determines whether the level of distortion is within acceptable bounds, i.e., whether the level of distortion falls below a predefined threshold.
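
The closed-loop check performed by the error calculator 518 and the threshold comparator 520 might be sketched as follows. The normalization by the frame length and the function names are assumptions, since the exact expression used for the MSE is not reproduced in this text.

```python
def mean_square_error(original, synthesized):
    """Average squared difference between input and synthesized samples."""
    if len(original) != len(synthesized):
        raise ValueError("frames must have the same number of samples")
    if not original:
        raise ValueError("frames must not be empty")
    total = sum((s - y) ** 2 for s, y in zip(original, synthesized))
    return total / len(original)


def keep_spectral_encoding(original, synthesized, threshold):
    """True when the distortion falls below the predefined threshold."""
    return mean_square_error(original, synthesized) < threshold
```

When `keep_spectral_encoding` returns False, the frame would be handed to the high-rate, time-domain encoder instead of being transmitted in its spectrally encoded form.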

If the computed MSE is within acceptable limits, the threshold comparator 520 provides a signal to the buffer 522, and the spectrally encoded data is output from the speech coder. If, on the other hand, the MSE is not within acceptable limits, the threshold comparator 520 provides a signal to the control processor 504, which in turn routes the digitized samples from the frame buffer 502 to the high-rate, time-domain encoder 512. The time-domain encoder 512 encodes the frame at the predetermined maximum rate, and the contents of the buffer 522 are discarded.

In the embodiment of FIG. 6, the type of spectral coding used is harmonic coding, described below with reference to FIG. 7, but it could alternatively be any type of spectral coding, such as sinusoidal transform coding or multiband excitation coding. The use of multiband excitation coding is described, for example, in U.S. Pat. No. 5,195,166, and the use of sinusoidal transform coding is described, for example, in U.S. Pat. No. 4,865,068.

For transition frames, and for voiced frames for which the phase distortion threshold is equal to or less than the periodicity parameter, the multimode coder of FIG. 6 preferably uses CELP coding at full rate, or 8 kbps, by way of the high-rate, time-domain encoder 512. Alternatively, any other known form of high-rate, time-domain coding may be used for these frames. Transition frames (and voiced frames that are not sufficiently periodic) are thus coded with high precision, so that the waveforms at input and output are well matched and phase information is well preserved. In one embodiment, after a predefined number of consecutive voiced frames for which the threshold exceeds the periodicity measure have been processed, the multimode coder switches from half-rate spectral coding to full-rate CELP coding for one frame, regardless of the determination of the threshold comparator 520.

The energy calculator 506 and the voiced speech detector 508, in conjunction with the control processor 504, constitute the open-loop encoding decisions. In contrast, the spectral encoder 514, the spectral decoder 516, the error calculator 518, the threshold comparator 520, and the buffer 522, in conjunction with the control processor 504, constitute the closed-loop encoding decision.

In one embodiment, described with reference to FIG. 7, spectral coding, and preferably harmonic coding, is used to encode sufficiently periodic voiced frames at a low bit rate. Spectral coders are generally defined as algorithms that attempt to preserve the time evolution of speech spectral characteristics in a perceptually meaningful way by modeling and encoding each frame of speech in the frequency domain. The essential parts of such algorithms are: (1) spectral analysis or parameter estimation; (2) parameter quantization; and (3) synthesis of the output speech waveform from the decoded parameters. The objective is thus to preserve the important characteristics of the short-term speech spectrum with a set of spectral parameters, encode the parameters, and then synthesize the output speech using the decoded spectral parameters. Typically, the output speech is synthesized as a weighted sum of sinusoids. The amplitudes, frequencies, and phases of the sinusoids are the spectral parameters estimated during analysis.

"Analysis by synthesis" is a well-known technique in CELP coding, but it has not been used in spectral coding. The main reason analysis by synthesis is not applied to spectral coders is that, owing to the loss of initial phase information, the mean square error (MSE) of the synthesized speech can be high even though the speech model is functioning properly from a perceptual standpoint. Accordingly, another advantage of accurately generating the initial phase is the resulting ability to compare the speech samples directly with the reconstructed speech and thereby determine whether the speech model is encoding the speech frames accurately.

In spectral coding, the output speech frame is synthesized as

S[n] = S_v[n] + S_uv[n], n = 1, 2, ..., N

where N is the number of samples per frame, and S_v and S_uv are the voiced and unvoiced components, respectively. A sum-of-sinusoids synthesis process generates the voiced component as

S_v[n] = Σ_{k=1}^{L} A(k, n) cos(2π f_k n + θ(k, n))

where L is the total number of sinusoids, f_k are the frequencies of interest in the short-term spectrum, A(k, n) are the amplitudes of the sinusoids, and θ(k, n) are the phases of the sinusoids. The amplitude, frequency, and phase parameters are estimated from the short-term spectrum of the input frame by a spectral analysis process. The unvoiced component may be generated together with the voiced component in a single sum-of-sinusoids synthesis, or it may be computed separately by a dedicated unvoiced synthesis process and then added back to S_v.
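
A minimal sketch of sum-of-sinusoids synthesis for the voiced component is given below. It assumes the usual form in which each of the L sinusoids contributes A(k, n) cos(2π f_k n + θ(k, n)), with f_k expressed in cycles per sample; the function name and the use of callables for A and θ are choices made for this illustration only.

```python
import math


def synthesize_voiced(num_samples, freqs, amplitude, phase):
    """Return S_v[n] for n = 1..num_samples as a sum of sinusoids.

    freqs     -- sequence of f_k in cycles per sample (assumed units), k = 1..L
    amplitude -- callable A(k, n) giving the amplitude of sinusoid k at sample n
    phase     -- callable theta(k, n) giving the phase of sinusoid k at sample n
    """
    out = []
    for n in range(1, num_samples + 1):
        sample = sum(
            amplitude(k, n) * math.cos(2.0 * math.pi * f_k * n + phase(k, n))
            for k, f_k in enumerate(freqs, start=1)
        )
        out.append(sample)
    return out
```

For a harmonic coder the frequencies would be the multiples of the pitch, so `freqs` could be built as `[k * f0 for k in range(1, L + 1)]` for a pitch `f0` in cycles per sample.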

In the embodiment of FIG. 7, a particular type of spectral coder called a harmonic coder is used to spectrally encode sufficiently periodic voiced frames at a low bit rate. The harmonic coder characterizes a frame as a sum of sinusoids, analyzing small segments of the frame. Each sinusoid in the sum has a frequency that is an integer multiple of the pitch F₀ of the frame. In an alternative embodiment, in which the particular type of spectral coder used is not a harmonic coder, the sinusoid frequencies for each frame are taken from the set of real numbers between 0 and 2π. In the embodiment of FIG. 7, the amplitude and phase of each sinusoid in the sum are selected so that the sum best matches the signal over one period, as shown in the graph of FIG. 8. Harmonic coders typically employ an external classification that labels each input speech frame as "voiced" or "unvoiced." For a voiced frame, the frequencies of the sinusoids are restricted to the harmonics of the estimated pitch F₀, i.e., f_k = kF₀. For unvoiced speech, the peaks of the short-term spectrum are used to determine the sinusoids. The amplitudes and phases are interpolated to mimic their evolution over the frame as

A(k, n) = C₁(k)*n + C₂(k)

θ(k, n) = B₁(k)*n² + B₂(k)*n + B₃(k)

where the coefficients [C_i(k), B_i(k)] are estimated from the instantaneous values of the amplitudes, frequencies, and phases at the specified frequency locations f_k (= kF₀), obtained from the short-term Fourier transform (STFT) of a windowed input speech frame. The parameters to be transmitted per sinusoid are the amplitude and the frequency. The phase is not transmitted; instead it is modeled in accordance with any of several known techniques, including, for example, a quadratic phase model.
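
One way to obtain coefficients for the linear amplitude track and the quadratic phase track is sketched below. The fitting rule is an assumption, not taken from the text: the amplitude is interpolated between its values at the start and end of the frame, and the phase is chosen so that it starts at a given value and its slope changes linearly from a start value to an end value. All function and parameter names are hypothetical.

```python
def amplitude_coeffs(a_start, a_end, n_samples):
    """Coefficients (C1, C2) of A(n) = C1*n + C2 with A(0)=a_start, A(N)=a_end."""
    c1 = (a_end - a_start) / n_samples
    c2 = a_start
    return c1, c2


def phase_coeffs(theta_start, slope_start, slope_end, n_samples):
    """Coefficients (B1, B2, B3) of theta(n) = B1*n^2 + B2*n + B3.

    The phase equals theta_start at n = 0 and its slope d(theta)/dn moves
    linearly from slope_start at n = 0 to slope_end at n = N (assumed rule).
    """
    b1 = (slope_end - slope_start) / (2.0 * n_samples)
    b2 = slope_start
    b3 = theta_start
    return b1, b2, b3


def amplitude(c1, c2, n):
    """Evaluate the linear amplitude track at sample n."""
    return c1 * n + c2


def phase(b1, b2, b3, n):
    """Evaluate the quadratic phase track at sample n."""
    return b1 * n * n + b2 * n + b3
```

Because the phase is modeled rather than transmitted, the decoder needs only a starting phase and the slope values at the frame boundaries to rebuild the whole track under this kind of rule.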

As shown in FIG. 7, the harmonic coder includes a pitch extractor 600 coupled to windowing logic 602 and to discrete Fourier transform (DFT) and harmonic analysis logic 604. The pitch extractor 600, which receives speech samples S(n) as input, is also coupled to the DFT and harmonic analysis logic 604. The DFT and harmonic analysis logic 604 is coupled to a residual encoder 606. The pitch extractor 600, the DFT and harmonic analysis logic 604, and the residual encoder 606 are each coupled to a parameter quantizer 608. The parameter quantizer 608 is coupled to a channel encoder 610, which in turn is coupled to a transmitter 612. The transmitter 612 is coupled to a receiver 614 by way of a standard radio-frequency (RF) interface, such as a code division multiple access (CDMA) over-the-air interface. The receiver 614 is coupled to a channel decoder 616, which in turn is coupled to an inverse quantizer 618. The inverse quantizer 618 is coupled to a sum-of-sinusoids speech synthesizer 620. A phase estimator 622, which receives previous-frame information as input, is also coupled to the sum-of-sinusoids speech synthesizer 620. The sum-of-sinusoids speech synthesizer 620 is configured to generate a synthesized speech output s_SYNTH(n).

The pitch extractor 600, the windowing logic 602, the DFT and harmonic analysis logic 604, the residual encoder 606, the parameter quantizer 608, the channel encoder 610, the channel decoder 616, the inverse quantizer 618, the sum-of-sinusoids speech synthesizer 620, and the phase estimator 622 can be implemented in a variety of ways known in the art, including, for example, as firmware or software modules. The transmitter 612 and the receiver 614 may be implemented with any equivalent standard RF components known in the art.

In the harmonic coder of FIG. 7, the input samples S(n) are received by the pitch extractor 600, which extracts pitch frequency information F₀. The windowing logic 602 then multiplies the samples by a suitable window function so that small segments of the speech frame can be analyzed. Using the pitch information supplied by the pitch extractor 600, the DFT and harmonic analysis logic 604 computes the DFT of the samples to generate complex spectral points from which the harmonic amplitudes A₁ are extracted, as shown in the graph of FIG. 8, in which L denotes the total number of harmonics. The DFT is provided to the residual encoder 606, which extracts voicing information V_c.
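The analysis path just described (window the samples, take the DFT, read off one amplitude per pitch harmonic) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the Hamming window, the direct evaluation of the spectrum at each harmonic frequency, the normalisation by the window sum, and the names `hamming` and `harmonic_amplitudes` are all assumptions made for illustration.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming window (an assumed choice; the text only says 'suitable')."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def harmonic_amplitudes(frame, f0_hz, fs_hz):
    """Estimate one amplitude per harmonic of f0_hz for a single frame.

    The windowed frame's spectrum is evaluated at l * f0_hz for l = 1..L,
    where L is the number of harmonics strictly below fs_hz / 2.
    """
    window = hamming(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    gain = sum(window)  # compensates for the amplitude lost to the window
    n_harmonics = math.ceil(fs_hz / (2.0 * f0_hz)) - 1
    amps = []
    for l in range(1, n_harmonics + 1):
        omega = 2.0 * math.pi * l * f0_hz / fs_hz  # radians per sample
        point = sum(x * cmath.exp(-1j * omega * n)
                    for n, x in enumerate(windowed))
        amps.append(2.0 * abs(point) / gain)
    return amps
```

For a 20 ms frame at 8 kHz (160 samples) and a 200 Hz pitch, this returns 19 amplitudes; a real coder would typically reuse a single FFT rather than evaluating each harmonic separately.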

As shown in FIG. 8, the V_c parameter denotes a point on the frequency axis at or above which the spectrum is characteristic of an unvoiced speech signal and is no longer harmonic. Below the point V_c, in contrast, the spectrum is harmonic and characteristic of voiced speech.
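As a small illustration of the role of V_c, the following Python sketch splits the harmonic numbers of a frame into those below the cutoff (treated as voiced) and those at or above it (treated as unvoiced). The function name `split_harmonics`, its arguments, and the choice to express V_c in hertz are assumptions for illustration only.

```python
import math


def split_harmonics(f0_hz, fs_hz, vc_hz):
    """Return (voiced, unvoiced) lists of harmonic numbers, split at vc_hz."""
    # Harmonics strictly below the Nyquist frequency fs_hz / 2.
    n_harmonics = math.ceil(fs_hz / (2.0 * f0_hz)) - 1
    numbers = range(1, n_harmonics + 1)
    voiced = [l for l in numbers if l * f0_hz < vc_hz]
    unvoiced = [l for l in numbers if l * f0_hz >= vc_hz]
    return voiced, unvoiced
```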

The A₁, F₀, and V_c components are provided to the parameter quantizer 608, which quantizes the information. The quantized information is provided in the form of packets to the channel encoder 610, which quantizes the packets at a low bit rate such as half rate, or 4 kbps. The packets are provided to the transmitter 612, which modulates the packets and transmits the resulting signal over the air to the receiver 614. The receiver 614 receives and demodulates the signal and passes the encoded packets to the channel decoder 616. The channel decoder 616 decodes the packets and provides the decoded packets to the dequantizer 618. The dequantizer 618 dequantizes the information, which is then provided to the sum-of-sinusoids speech synthesizer 620.

The sum-of-sinusoids speech synthesizer 620 is configured to synthesize a plurality of sinusoids modeling the short-term speech spectrum in accordance with the above equation for S[n]. The frequencies of the sinusoids, f_k, are multiples, or harmonics, of the fundamental frequency F₀, which is the frequency of pitch periodicity for quasi-periodic (i.e., transition) voiced speech segments.
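A sum-of-sinusoids synthesizer of this kind can be sketched in a few lines of Python. The sketch assumes, for illustration, that the fundamental frequency is interpolated linearly across the frame, which makes each harmonic's phase quadratic in the sample index; the exact polynomial used by the coder is given by the equation referenced above and is not reproduced here. The name `synthesize_frame` and all of its parameters are hypothetical.

```python
import math


def synthesize_frame(amps, f0_prev_hz, f0_cur_hz, init_phases, n_samples, fs_hz):
    """Synthesize one frame as a sum of harmonically related sinusoids.

    Harmonic k (1-based) has instantaneous frequency k * f0(n), where f0(n)
    moves linearly from f0_prev_hz to f0_cur_hz over the frame, so its phase
    is quadratic in n. Returns (samples, final_phases).
    """
    w_start = 2.0 * math.pi * f0_prev_hz / fs_hz  # rad/sample at n = 0
    w_end = 2.0 * math.pi * f0_cur_hz / fs_hz     # rad/sample at n = N
    slope = (w_end - w_start) / n_samples

    def base_phase(n):
        # Running integral of the linearly varying fundamental frequency.
        return w_start * n + 0.5 * slope * n * n

    samples = []
    for n in range(n_samples):
        samples.append(sum(
            a * math.cos(phi0 + k * base_phase(n))
            for k, (a, phi0) in enumerate(zip(amps, init_phases), start=1)))
    # Phase of each harmonic at the frame boundary, usable as the next
    # frame's initial phase.
    final_phases = [(phi0 + k * base_phase(n_samples)) % (2.0 * math.pi)
                    for k, phi0 in enumerate(init_phases, start=1)]
    return samples, final_phases
```

Feeding the returned `final_phases` back in as `init_phases` for the next frame keeps the waveform continuous across the frame boundary.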

The sum-of-sinusoids speech synthesizer 620 also receives phase information from the phase estimator 622. The phase estimator 622 receives previous-frame information, i.e., the A₁, F₀, and V_c parameters for the immediately preceding frame. The phase estimator 622 also receives the N reconstructed samples of the previous frame, where N is the frame length (i.e., N is the number of samples per frame). The phase estimator 622 determines the initial phase for the frame based on the information for the previous frame. The initial phase determination is provided to the sum-of-sinusoids speech synthesizer 620. Based on the information for the current frame and on the initial phase calculation performed by the phase estimator 622 from the past-frame information, the sum-of-sinusoids speech synthesizer 620 generates a synthetic speech frame, as described below.

As described above, the harmonic coder synthesizes, or reconstructs, speech frames by using previous-frame information and predicting that the phase varies linearly from frame to frame. In the synthesis model described above, commonly referred to as the quadratic phase model, the coefficient B₃(k) represents the initial phase for the current voiced frame being synthesized. In determining the phase, conventional harmonic coders either set the initial phase to zero or generate the initial phase value randomly or with some pseudo-random generation method. To predict the phase more accurately, the phase estimator 622 uses one of two possible methods for determining the initial phase, depending on whether the immediately preceding frame was determined to be a voiced speech frame (i.e., a sufficiently periodic frame) or a transition speech frame. If the previous frame was a voiced speech frame, the final estimated phase value of that frame is used as the initial phase value of the current frame. If, on the other hand, the previous frame was classified as a transition frame, the initial phase value for the current frame is obtained from the spectrum of the previous frame, which is obtained by performing a DFT of the decoder output for the previous frame. Thus, the phase estimator 622 makes use of accurate phase information that is already available (because the previous frame, being a transition frame, was processed at full rate).

In one embodiment, a closed-loop, multimode, MDLP speech coder follows the speech processing steps illustrated in the flow chart of FIG. 9. The speech coder encodes the LP residual of each input speech frame by choosing the most appropriate encoding mode. Certain modes encode the LP residual, or speech residual, in the time domain, while other modes represent the LP residual, or speech residual, in the frequency domain. The set of modes is: full rate, time domain for transition frames (T mode); half rate, frequency domain for voiced frames (V mode); quarter rate, time domain for unvoiced frames (U mode); and eighth rate, time domain for noise frames (N mode).

Either a speech signal or the corresponding LP residual may be encoded by following the steps shown in FIG. 9. The waveform characteristics of noise, unvoiced, transition, and voiced speech are shown as functions of time in the graph of FIG. 10A. The waveform characteristics of noise, unvoiced, transition, and voiced LP residual are shown as functions of time in the graph of FIG. 10B.

In step 700, an open-loop mode decision is made as to which of the four modes (T, V, U, or N) to apply to the input speech residual S(n). If T mode is to be applied, the speech residual is processed in T mode, i.e., at full rate in the time domain, in step 702. If U mode is to be applied, the speech residual is processed in U mode, i.e., at quarter rate in the time domain, in step 704. If N mode is to be applied, the speech residual is processed in N mode, i.e., at eighth rate in the time domain, in step 706. If V mode is to be applied, the speech residual is processed in V mode, i.e., at half rate in the frequency domain, in step 708.
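The four modes and the steps that handle them can be summarised as a lookup table. The Python sketch below is illustrative only: the open-loop classifier itself is not shown (its output is taken as given), and the names `MODES`, `STEP_FOR_MODE`, and `average_rate` are hypothetical. The `average_rate` helper simply shows how a mix of modes yields an average rate below full rate.

```python
from fractions import Fraction

# mode -> (frame class, coding domain, rate relative to full rate)
MODES = {
    "T": ("transition", "time", Fraction(1, 1)),
    "V": ("voiced", "frequency", Fraction(1, 2)),
    "U": ("unvoiced", "time", Fraction(1, 4)),
    "N": ("noise", "time", Fraction(1, 8)),
}

# Flow-chart step of FIG. 9 that processes each open-loop decision.
STEP_FOR_MODE = {"T": 702, "U": 704, "N": 706, "V": 708}


def average_rate(mode_sequence):
    """Average coding rate, as a fraction of full rate, over a mode sequence."""
    rates = [MODES[m][2] for m in mode_sequence]
    return sum(rates, Fraction(0)) / len(rates)
```

For instance, one transition frame, three voiced frames, one unvoiced frame, and one noise frame average to 23/48 of full rate.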

In step 710, the speech encoded in step 708 is decoded and compared with the input speech residual S(n), and a performance measure D is calculated. In step 712, the performance measure D is compared with a predefined threshold T. If the performance measure D is greater than or equal to the threshold T, the spectrally encoded speech residual of step 708 is approved for transmission in step 714. If, on the other hand, the performance measure D is less than the threshold T, the input speech residual S(n) is processed in T mode in step 716. In an alternate embodiment, no performance measure is calculated and no threshold is defined. Instead, after a predefined number of speech residual frames have been processed in V mode, the next frame is processed in T mode.
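The closed-loop check of steps 708 to 716, together with the alternate counter-based embodiment, can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the performance measure is taken to be a PSNR in decibels with the peak defined as the largest absolute reference sample, and `encode_v` and `decode_v` are hypothetical callables standing in for the V-mode encoder and decoder.

```python
import math


def psnr_db(reference, decoded):
    """Peak signal-to-noise ratio in dB (assumed form of the measure D)."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0.0:
        return math.inf
    peak = max(abs(r) for r in reference)
    if peak == 0.0:
        return -math.inf
    return 10.0 * math.log10(peak * peak / mse)


def closed_loop_mode(residual, encode_v, decode_v, threshold_db):
    """Return (mode, packet, D): keep V mode only if D meets the threshold."""
    packet = encode_v(residual)               # step 708: V-mode encoding
    d = psnr_db(residual, decode_v(packet))   # step 710: decode and compare
    if d >= threshold_db:                     # step 712: D >= T ?
        return "V", packet, d                 # step 714: approve transmission
    return "T", None, d                       # step 716: re-encode in T mode


def needs_refresh(recent_modes, max_v_run):
    """Alternate embodiment: force T mode after max_v_run consecutive V frames."""
    run = 0
    for mode in reversed(recent_modes):
        if mode != "V":
            break
        run += 1
    return run >= max_v_run
```

Raising `threshold_db` trades a higher average bit rate for higher quality, since more frames fall back to T mode.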

Advantageously, the decision steps shown in FIG. 9 allow the high-bit-rate T mode to be used only when necessary, exploiting the periodicity of voiced speech segments with the lower-bit-rate V mode while preventing any loss of quality by switching to full rate when V mode does not perform adequately. Accordingly, very high voice quality approaching full-rate voice quality can be produced at an average rate significantly lower than full rate. Moreover, the target voice quality can be controlled by the selected performance measure and the chosen threshold.

Furthermore, the "update" to T mode improves the performance of subsequent applications of V mode by keeping the model phase track close to the phase track of the input speech. When performance in V mode is inadequate, the closed-loop performance check of steps 710 and 712 switches to T mode, "refreshing" the initial phase value and thereby improving the performance of subsequent V-mode processing, which allows the model phase track to come close to the original input speech phase track again. For example, as shown in the graphs of FIGS. 11A-C, the fifth frame from the beginning does not perform adequately in V mode, as evidenced by the PSNR distortion measure used. Consequently, without the closed-loop decision and update, the modeled phase track deviates significantly from the original input speech phase track, as shown in FIG. 11C, causing severe degradation in PSNR. The performance of subsequent frames processed in V mode also degrades. Under the closed-loop decision, however, the fifth frame is switched to T-mode processing, as shown in FIG. 11A. As shown in FIG. 11B, and as evidenced by the improvement in PSNR, the performance of subsequent frames processed in V mode also improves.

The decision steps shown in FIG. 9 improve the quality of the V-mode representation by providing highly accurate initial phase estimates, so that the resulting V-mode synthesized speech residual signal is accurately time-aligned with the original input speech residual S(n). The initial phase for the first V-mode-processed speech residual segment is derived from the immediately preceding decoded frame in the following manner. For each harmonic, if the previous frame was processed in V mode, the initial phase is set equal to the final estimated phase of the previous frame. For each harmonic, if the previous frame was processed in T mode, the initial phase is set equal to the actual harmonic phase of the previous frame. The actual harmonic phase of the previous frame can be derived by taking the DFT of the past decoded residual using the entire previous frame. Alternatively, the actual harmonic phase of the previous frame can be derived by taking the DFT of the past decoded frame in a pitch-synchronous manner, by processing various pitch periods of the previous frame.
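The two-way rule for the initial phase can be sketched in Python as follows. The sketch uses the whole-frame DFT variant and makes one assumption that the text does not spell out: the measured phase is referenced to the end of the previous frame (the boundary with the current frame), which is done here by shifting the time index in the DFT kernel. The names `measured_harmonic_phases` and `initial_phases` and their parameters are hypothetical, and the case of a previous frame coded in U or N mode is not covered by the text, so it raises an error.

```python
import cmath
import math


def measured_harmonic_phases(prev_decoded, f0_hz, fs_hz, n_harmonics):
    """Phase of each pitch harmonic of the previous decoded frame.

    The spectrum is evaluated at k * f0_hz with time measured from the end
    of the frame (assumed convention), so each result is the harmonic's
    phase at the boundary with the current frame.
    """
    n_samples = len(prev_decoded)
    phases = []
    for k in range(1, n_harmonics + 1):
        omega = 2.0 * math.pi * k * f0_hz / fs_hz
        point = sum(x * cmath.exp(-1j * omega * (n - n_samples))
                    for n, x in enumerate(prev_decoded))
        phases.append(cmath.phase(point) % (2.0 * math.pi))
    return phases


def initial_phases(prev_mode, prev_final_phases, prev_decoded,
                   f0_hz, fs_hz, n_harmonics):
    """Choose the initial phases for the current V-mode frame."""
    if prev_mode == "V":
        # Carry over the final estimated phases of the previous V-mode frame.
        return list(prev_final_phases)
    if prev_mode == "T":
        # Measure the phases from the accurately decoded T-mode frame.
        return measured_harmonic_phases(prev_decoded, f0_hz, fs_hz, n_harmonics)
    raise ValueError("previous mode not covered by this sketch: " + prev_mode)
```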

Thus, a novel closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder has been described. The various illustrative logical blocks and algorithm steps described in connection with the embodiments disclosed herein may be implemented or performed with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate or transistor logic, discrete hardware components such as, for example, registers and FIFOs, a processor executing a set of firmware instructions, or any conventional programmable software module and a processor. The processor may preferably be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The software module may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. The data, instructions, commands, information, signals, bits, symbols, and chips referred to throughout the specification may preferably be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Preferred embodiments of the present invention have thus been shown and described. Various changes may nevertheless be made to the embodiments disclosed herein without departing from the spirit and scope of the invention. Accordingly, the present invention is not to be limited except in accordance with the following claims.

Claims

A multimode, mixed-domain speech processor, comprising:

a coder having at least one time-domain coding mode and at least one frequency-domain coding mode; and

a closed-loop mode-selection device coupled to the coder and configured to apply the at least one time-domain coding mode if the output of the frequency-domain coding mode is distorted beyond an acceptable range.

The speech processor of claim 1,

wherein the coder encodes speech frames.

The speech processor of claim 1,

wherein the coder encodes the linear prediction residual of speech frames.

The speech processor of claim 1,

wherein the at least one time-domain coding mode comprises a coding mode that codes frames at a first coding rate,

the at least one frequency-domain coding mode comprises a coding mode that codes frames at a second coding rate, and

the second coding rate is lower than the first coding rate.

The speech processor of claim 1,

wherein the at least one frequency-domain coding mode comprises a harmonic coding mode.

The speech processor of claim 1,

further comprising a comparison circuit coupled to the coder, the comparison circuit being configured to compare an uncoded frame with the frame as coded in the at least one frequency-domain coding mode and to generate a performance measure based on the comparison,

wherein the coder applies the at least one time-domain coding mode only if the performance measure is lower than a predetermined threshold, and otherwise applies the at least one frequency-domain coding mode.

The speech processor of claim 1,

wherein the coder applies the at least one time-domain coding mode to each frame that immediately follows a predetermined number of consecutive frames coded in the at least one frequency-domain coding mode.

A multimode, mixed-domain speech processor, comprising:

a coder having at least one time-domain coding mode and at least one frequency-domain coding mode, wherein the at least one frequency-domain coding mode represents the short-term spectrum of each frame as a plurality of sinusoids having a parameter set comprising frequency, phase, and amplitude, wherein the phase is modeled with a polynomial representation and an initial phase value, the initial phase value being one of (1) the final estimated phase value of the previous frame if the previous frame was coded in the at least one frequency-domain coding mode, or (2) a phase value derived from the short-term spectrum of the previous frame if the previous frame was coded in the at least one time-domain coding mode; and

a closed-loop mode-selection device coupled to the coder and configured to select a coding mode for the coder based on the content of a frame processed by the speech processor.

The speech processor of claim 8,

wherein the sinusoid frequency for each frame is an integer multiple of the pitch frequency of the frame.

The speech processor of claim 8,

wherein the sinusoid frequency for each frame is taken from the set of real numbers between 0 and 2π.

A method of processing frames, comprising:

applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame;

frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech;

time-domain coding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech;

comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and

time-domain coding the input frame if the performance measure is lower than a predetermined threshold.

The method of claim 11, wherein

the frames are linear prediction residual frames.

The method of claim 11, wherein

the frames are speech frames.

The method of claim 11, wherein

the time-domain coding step comprises coding a frame at a first coding rate,

the frequency-domain coding step comprises coding a frame at a second coding rate, and

the second coding rate is lower than the first coding rate.

The method of claim 11, wherein

the frequency-domain coding step comprises harmonic coding.

The method of claim 11, wherein

the frequency-domain coding step comprises representing the short-term spectrum of each frame as a plurality of sinusoids having a parameter set comprising frequency, phase, and amplitude,

the phase is modeled with a polynomial representation and an initial phase value, and

the initial phase value is one of (1) the final estimated phase value of the previous frame if the previous frame was frequency-domain coded, or (2) a phase value derived from the short-term spectrum of the previous frame if the previous frame was time-domain coded.

The method of claim 16,

wherein the sinusoid frequency for each frame is an integer multiple of the pitch frequency of the frame.

The method of claim 16,

A multimode, mixed-domain speech processor, comprising:

means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame;

means for frequency-domain coding the input frame if the speech content of the input frame indicates steady-state voiced speech;

means for time-domain coding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech;

means for comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and

means for time-domain coding the input frame if the performance measure is lower than a predetermined threshold.

The speech processor of claim 19,

wherein the input frame is a linear prediction residual frame.

The speech processor of claim 19,

wherein the input frame is a speech frame.

The speech processor of claim 19,

wherein the means for time-domain coding comprises means for coding a frame at a first coding rate,

the means for frequency-domain coding comprises means for coding a frame at a second coding rate, and

the second coding rate is lower than the first coding rate.

The speech processor of claim 19,

wherein the means for frequency-domain coding comprises a harmonic coder.

The speech processor of claim 19,

wherein the means for frequency-domain coding comprises means for representing the short-term spectrum of each frame as a plurality of sinusoids having a parameter set comprising frequency, phase, and amplitude, and

the initial phase value is one of (1) the final estimated phase value of the previous frame if the previous frame was frequency-domain coded, or (2) a phase value derived from the short-term spectrum of the immediately preceding frame if the previous frame was time-domain coded.

The method of claim 24,