JP2003525473A - Closed-loop multimode mixed-domain linear prediction speech coder

Info

Publication number: JP2003525473A
Application number: JP2001564148A
Authority: JP
Inventors: Amitava Das (ダス、アミタバ)
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2003-08-26
Anticipated expiration: 2020-02-29
Also published as: ES2269112T3; ATE341074T1; AU2000233851A1; EP1259957B1; DE60031002D1; DE60031002T2; WO2001065544A1; CN1266674C; JP4907826B2; KR100711047B1; CN1437747A; EP1259957A1; HK1055833A1; KR20020081374A

Abstract

(57) [Abstract] A closed-loop, multimode, mixed-domain linear prediction (MDLP) speech coder includes a high-rate, time-domain coding mode, a low-rate, frequency-domain coding mode, and a closed-loop mode-selection mechanism that selects a coding mode for the coder based on the speech content of frames input to the coder. Frames of transition speech (i.e., from unvoiced speech to voiced speech, or vice versa) are coded with a high-rate, time-domain coding mode such as a CELP coding mode. Frames of voiced speech are coded with a low-rate, frequency-domain coding mode such as a harmonic coding mode. Phase parameters are not coded by the frequency-domain coding mode; instead they are modeled in accordance with, e.g., a quadratic phase model. For each speech frame coded with the frequency-domain coding mode, the initial phase value is taken to be the initial phase value of the immediately preceding speech frame coded with the frequency-domain coding mode. If the immediately preceding speech frame was coded with the time-domain coding mode, the initial phase value of the current speech frame is computed from the decoded speech frame information of that preceding, time-domain-coded speech frame. Each speech frame coded with the frequency-domain coding mode can be compared with the corresponding input speech frame to obtain a performance measure. If the performance measure falls below a predefined threshold value, the input speech frame is coded with the time-domain coding mode.

Description

Detailed Description of the Invention

Background of the Invention

I. Field of the Invention

【０００１】The present invention pertains generally to the field of speech processing, and more specifically to a method and apparatus for closed-loop, multimode, mixed-domain coding of speech.

II. Background

【０００２】Transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by appropriate coding, transmission, and resynthesis at the receiver, the data rate can be reduced significantly.

【０００３】Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into a binary representation, i.e., a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and resynthesizes the speech frames using the unquantized parameters.

【０００４】The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits N_i and the data packet produced by the speech coder has a number of bits N_0, the compression factor achieved by the speech coder is C_r = N_i/N_0. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, i.e., the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of N_0 bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, i.e., the target voice quality, with a small set of parameters for each frame.
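As a rough numerical illustration of the compression factor C_r = N_i/N_0, the following minimal sketch plugs in example numbers. The frame size (160 samples), the 8 bits per sample, and the 8 kbps coded rate are assumptions chosen to match the 64 kbps and 8 kbps figures mentioned elsewhere in this document; the function name is hypothetical.

```python
def compression_factor(input_bits: int, packet_bits: int) -> float:
    """Return C_r = N_i / N_0 for a single frame."""
    if packet_bits <= 0:
        raise ValueError("packet_bits must be positive")
    return input_bits / packet_bits


# Assumed example: a 20 ms frame at 8 kHz with 8 bits per sample (64 kbps),
# coded into a packet at 8 kbps.
SAMPLES_PER_FRAME = 160
BITS_PER_SAMPLE = 8

n_i = SAMPLES_PER_FRAME * BITS_PER_SAMPLE  # bits in the input frame
n_0 = 8000 * 20 // 1000                    # bits in the coded packet
c_r = compression_factor(n_i, n_0)
```

Under these assumed numbers the input frame holds 1280 bits and the packet 160 bits; halving the coded rate doubles the compression factor.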

【０００５】Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho & R.M. Gray, Vector Quantization and Signal Compression (1992).

【０００６】A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residual. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, N_0, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
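The short-term LP analysis step described above can be sketched as follows. This is a minimal illustration using the textbook autocorrelation method with the Levinson-Durbin recursion, not the specific procedure of any cited coder; the function names, the prediction order, and the test signal are all assumptions made for the example.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for analysis-filter coefficients a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # nothing left to predict
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(frame, a):
    """Apply the analysis filter: e[n] = sum_k a[k] * s[n - k]."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]


# Assumed test signal: a decaying exponential, which a first-order
# predictor models almost exactly.
signal = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(signal, 1), 1)
residual = lp_residual(signal, coeffs)
```

For this signal the single predictor coefficient comes out very close to -0.9, and the residual carries essentially all of its energy in the first sample. That is the sense in which LP analysis removes short-term redundancy before the residual is coded.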

【０００７】Time-domain coders such as the CELP coder typically rely upon a high number of bits, N_0, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits, N_0, per frame is relatively large (e.g., 8 kbps or above). However, at low bit rates (4 kbps and below), time-domain coders fail to retain high quality and robust performance because of the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.

【０００８】There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of coder specifications and deliver robust performance under channel error conditions.

【０００９】For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R.J. McAulay & T.F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W.B. Kleijn & K.K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded, and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.

【００１０】Nevertheless, low-bit-rate coding imposes the critical constraint of a limited coding resolution, or a limited codebook space, which limits the effectiveness of a single coding mechanism and renders the coder unable to represent various types of speech segments under various background conditions with equal accuracy. For example, conventional low-bit-rate, frequency-domain coders do not transmit phase information for speech frames. Instead, the phase information is reconstructed by using a random, artificially generated initial phase value and linear interpolation techniques. See, e.g., H. Yang et al., Quadratic Phase Interpolation for Voiced Speech Synthesis in the MBE Model, in 29 Electronic Letters 856-57 (May 1993). Because the phase information is artificially generated, even if the amplitudes of the sinusoids are perfectly preserved by the quantization-unquantization process, the output speech produced by the frequency-domain coder will not be aligned with the original input speech (e.g., the major pulses will not be in sync). It has therefore proven difficult to adopt any closed-loop performance measure, such as signal-to-noise ratio (SNR) or perceptual SNR, in frequency-domain coders.
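The effect of phase misalignment on a waveform-matching measure can be seen with a small numerical sketch. The tone frequency, sampling rate, frame length, and function names below are assumptions chosen for illustration; the point is only that two sinusoids with identical amplitude but different phase give a poor SNR.

```python
import math


def snr_db(reference, test):
    """Waveform SNR in dB of `test` measured against `reference`."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, test))
    if signal == 0.0:
        raise ValueError("reference has zero energy")
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def tone(phase, freq_hz=200.0, rate_hz=8000.0, n=160):
    """One frame of a unit-amplitude sinusoid (assumed test signal)."""
    return [math.sin(2.0 * math.pi * freq_hz * i / rate_hz + phase)
            for i in range(n)]


original = tone(0.0)
snr_aligned = snr_db(original, tone(0.0))            # identical waveform
snr_shifted = snr_db(original, tone(math.pi / 2.0))  # same amplitude, new phase
```

With a quarter-cycle phase offset the SNR is about -3 dB even though the amplitude spectrum is unchanged, which illustrates why a closed-loop SNR-style comparison only becomes meaningful once the synthesized phase tracks the input.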

【００１１】Multimode coding techniques have been employed to perform low-rate speech coding in conjunction with an open-loop mode decision process. One such multimode coding technique is described in Amitava Das et al., Multimode and Variable-Rate Coding of Speech, in Speech Coding and Synthesis ch. 7 (W.B. Kleijn & K.K. Paliwal eds., 1995). Conventional multimode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment, such as voiced speech, unvoiced speech, or background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and decides which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing the mode decision upon the evaluation. The mode decision is thus made without knowing in advance the exact condition of the output speech, i.e., how close the output speech will be to the input speech in terms of voice quality or other performance measures.

【００１２】Based on the foregoing, it would be desirable to provide a low-bit-rate, frequency-domain coder that estimates phase information more accurately. It would be further advantageous to provide a multimode, mixed-domain coder that time-domain encodes certain speech frames and frequency-domain encodes other speech frames based upon the speech content of the frames. It would be still more desirable to provide a mixed-domain coder that can time-domain encode certain speech frames and frequency-domain encode other speech frames in accordance with a closed-loop coding mode decision mechanism. Thus, there is a need for a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech generated by the coder and the original speech input to the coder.

Summary of the Invention

【００１３】The present invention is directed to a closed-loop, multimode, mixed-domain speech coder that ensures time synchrony between the output speech generated by the coder and the original speech input to the coder. Accordingly, in one aspect of the invention, a multimode, mixed-domain speech processor advantageously includes a coder having at least one time-domain coding mode and at least one frequency-domain coding mode, and a closed-loop mode-selection device coupled to the coder and configured to select a coding mode for the coder based upon the content of frames processed by the speech processor.

【００１４】In another aspect of the invention, a method of processing frames advantageously includes the steps of applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of the input frame; frequency-domain encoding the input frame if the speech content of the input frame indicates steady-state voiced speech; time-domain encoding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech; comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and time-domain encoding the input frame if the performance measure falls below a predefined threshold value.

【００１５】In another aspect of the invention, a multimode, mixed-domain speech processor advantageously includes means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based upon the speech content of the input frame; means for frequency-domain encoding the input frame if the speech content of the input frame indicates steady-state voiced speech; means for time-domain encoding the input frame if the speech content of the input frame indicates anything other than steady-state voiced speech; means for comparing the frequency-domain-encoded frame with the input frame to obtain a performance measure; and means for time-domain encoding the input frame if the performance measure falls below a predefined threshold value.

Detailed Description of the Preferred Embodiments

【００１６】In FIG. 1 a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal S_SYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted on a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal S_SYNTH(n).

【００１７】The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art, including pulse code modulation (PCM), companded μ-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data, each frame comprising a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to-frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Alternatively, other data rates may be used. As used herein, the terms "full rate" or "high rate" generally refer to data rates greater than or equal to 8 kbps, and the terms "half rate" or "low rate" generally refer to data rates less than or equal to 4 kbps. Varying the data transmission rate is advantageous because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
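The framing arithmetic in the exemplary embodiment can be checked with a short sketch. The constant and function names are hypothetical; the numbers (8 kHz sampling, 20 ms frames, and the four data rates) are the ones given in the paragraph above.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
RATES_BPS = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(rate_bps, frame_ms=FRAME_MS):
    """Bit budget of one frame at the given data rate."""
    return rate_bps * frame_ms // 1000


def split_into_frames(samples, frame_len):
    """Split samples into whole frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


frame_len = samples_per_frame()
budgets = {name: bits_per_frame(bps) for name, bps in RATES_BPS.items()}
frames = split_into_frames(list(range(400)), frame_len)
```

At these settings a frame holds 160 samples, and the per-frame bit budgets are 160, 80, 40, and 20 bits for full, half, quarter, and eighth rate respectively.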

【００１８】The first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech coder. It is understood by those of skill in the art that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Pat. No. 5,727,123, assigned to the assignee of the present invention and fully incorporated herein by reference, and in U.S. application Ser. No. 08/197,417, entitled VOCODER ASIC, filed Feb. 16, 1994, assigned to the assignee of the present invention, and fully incorporated herein by reference.

【００１９】In accordance with one embodiment, as illustrated in FIG. 2, a multimode, mixed-domain linear prediction (MDLP) encoder 100 that may be used in a speech coder includes a mode decision module 102, a pitch estimation module 104, a linear prediction (LP) analysis module 106, an LP analysis filter 108, an LP quantization module 110, and an MDLP residual encoder 112. Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108. The mode decision module 102 produces a mode index I_M and a mode M based upon the periodicity and other extracted parameters, such as energy, spectral tilt, and zero-crossing rate, of each input speech frame s(n). Various methods of classifying speech frames according to periodicity are described in U.S. application Ser. No. 08/815,354, entitled METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING, filed Mar. 11, 1997, assigned to the assignee of the present invention, and fully incorporated herein by reference. Such methods are also incorporated into the Telecommunication Industry Association Industry Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733.
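To make the kind of per-frame parameters used by a mode decision more concrete, the sketch below computes frame energy, zero-crossing rate, and a simple periodicity estimate, and then applies a toy rule to them. This is not the classification method of the cited application or standards: the function names, the lag range, and both thresholds are assumptions made purely for illustration.

```python
import math


def frame_energy(frame):
    """Sum of squared sample amplitudes."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)


def periodicity(frame, min_lag=20, max_lag=147):
    """Peak normalized autocorrelation over an assumed pitch-lag range."""
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        head, tail = frame[:len(frame) - lag], frame[lag:]
        denom = math.sqrt(frame_energy(head) * frame_energy(tail))
        if denom > 0.0:
            best = max(best, sum(a * b for a, b in zip(head, tail)) / denom)
    return best


def open_loop_domain(frame, min_periodicity=0.8, max_zcr=0.3):
    """Toy rule (assumed thresholds): strongly periodic, low-ZCR frames are
    treated as steady-state voiced and sent to frequency-domain coding."""
    if periodicity(frame) >= min_periodicity and zero_crossing_rate(frame) <= max_zcr:
        return "frequency-domain"
    return "time-domain"


voiced_like = [math.sin(2.0 * math.pi * i / 40.0 + 0.3) for i in range(160)]
noisy_like = [1.0 if i % 2 == 0 else -1.0 for i in range(160)]
```

A real classifier would combine more parameters, such as the spectral tilt mentioned above, and would track them across frames; the sketch only shows how the individual measurements are formed.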

【００２０】[0020]

【数１】 [Equation 1]

【００２１】[0021]

【数２】 [Equation 2]

【００２２】With the exception of the MDLP residual encoder 112, the operation and implementation of the various modules of the encoder 100 of FIG. 2 and the decoder 200 of FIG. 3 are known in the art and are described in the aforementioned U.S. Pat. No. 5,414,796 and in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978).

【００２３】In accordance with one embodiment, an MDLP encoder (not shown) performs the steps shown in the flow chart of FIG. 4. The MDLP encoder could be the MDLP residual encoder 112 of FIG. 2. In step 300 the MDLP encoder checks whether the mode M is full rate (FR), quarter rate (QR), or eighth rate (ER). If the mode M is FR, QR, or ER, the MDLP encoder proceeds to step 302. In step 302 the MDLP encoder applies the corresponding rate (FR, QR, or ER, depending on the value of M) to the residual index I_R. Time-domain coding, which for the FR mode is high-precision, high-rate coding and may advantageously be CELP coding, is applied to an LP residual frame or, alternatively, to a speech frame. The frame is then transmitted (after further signal processing, including digital-to-analog conversion and modulation). In one embodiment the frame is an LP residual frame representing prediction error. In an alternate embodiment the frame is a speech frame representing speech samples.

[0024] If, on the other hand, in step 300 the mode M is not FR, QR, or ER (i.e., the mode M is half rate (HR)), the MDLP encoder proceeds to step 304. In step 304 spectral coding, advantageously harmonic coding, is applied at half rate to the LP residual or, alternatively, to the speech signal. The MDLP encoder then proceeds to step 306. In step 306 a distortion measure D is obtained by decoding the coded speech and comparing it with the original input frame. The MDLP encoder then proceeds to step 308. In step 308 the distortion measure D is compared with a predefined threshold T. If the distortion measure D is greater than the threshold T, the corresponding quantized parameters of the half-rate, spectrally coded frame are modulated and transmitted. If, on the other hand, the distortion measure D is less than or equal to the threshold T, the MDLP encoder proceeds to step 310. In step 310 the decoded frame is re-coded in the time domain at full rate. Any conventional high-rate, high-precision coding algorithm may be used, advantageously CELP coding. The FR-mode quantized parameters associated with the frame are then modulated and transmitted.
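The FIG. 4 flow can be sketched as a small dispatch routine. This is an illustrative sketch only: the function name `mdlp_encode` and the callables passed to it are hypothetical stand-ins for the coders the text describes, and the direction of the threshold comparison simply follows the paragraph above as written.

```python
def mdlp_encode(frame, mode, time_domain_encode, spectral_encode,
                spectral_decode, distortion, threshold):
    """Return (rate_used, parameters) following steps 300-310 of FIG. 4.

    All callables are hypothetical placeholders for real coders:
      time_domain_encode(frame, rate) -> parameters (e.g. a CELP coder)
      spectral_encode(frame)          -> parameters (e.g. a harmonic coder)
      spectral_decode(parameters)     -> decoded frame
      distortion(original, decoded)   -> measure D
    """
    # Step 300: FR, QR and ER frames go straight to time-domain coding.
    if mode in ("FR", "QR", "ER"):
        # Step 302: code at the rate named by the mode.
        return mode, time_domain_encode(frame, mode)

    # Step 304: the remaining mode is HR, coded spectrally at half rate.
    params = spectral_encode(frame)

    # Step 306: decode and compare with the original input frame.
    decoded = spectral_decode(params)
    d = distortion(frame, decoded)

    # Step 308: comparison direction as stated in the text.
    if d > threshold:
        return "HR", params

    # Step 310: the text re-codes the decoded frame at full rate.
    return "FR", time_domain_encode(decoded, "FR")
```

Passing the coders in as arguments keeps the control flow separate from any particular CELP or harmonic coder implementation.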

[0025] As shown in the flowchart of FIG. 5, a closed-loop, multimode MDLP speech coder in accordance with one embodiment follows a set of steps in processing speech samples for transmission. In step 400 the speech coder receives digital samples of a speech signal in successive frames. Upon receiving a given frame, the speech coder proceeds to step 402. In step 402 the speech coder detects the energy of the frame. The energy is a measure of the speech activity of the frame. Speech detection is performed by summing the squares of the amplitudes of the digitized speech samples and comparing the resulting energy against a threshold. In one embodiment the threshold adapts to the changing level of background noise. An exemplary variable-threshold speech activity detector is described in the above-referenced U.S. Pat. No. 5,414,796. Some unvoiced speech sounds consist of extremely low-energy samples and may be coded by mistake as background noise. To prevent this, the spectral tilt of low-energy samples is used to distinguish unvoiced speech from background noise, as described in the above-referenced U.S. Pat. No. 5,414,796.

[0026] After detecting the energy of the frame, the speech coder proceeds to step 404. In step 404 the speech coder determines whether the detected frame energy is sufficient to classify the frame as containing speech information. If the detected frame energy falls below a predefined threshold level, the speech coder proceeds to step 406. In step 406 the speech coder codes the frame as background noise (i.e., nonspeech, or silence). In one embodiment the background noise frame is time-domain coded at eighth rate, or 1 kilobit per second (kbps). If in step 404 the detected frame energy meets or exceeds the predefined threshold level, the frame is classified as speech and the speech coder proceeds to step 408.

[0027] In step 408 the speech coder determines whether the frame is periodic. Various known methods of determining periodicity include, e.g., the use of zero crossings and the use of normalized autocorrelation functions (NACFs). In particular, the use of zero crossings and NACFs to detect periodicity is described in U.S. Application Ser. No. 08/815,354, entitled METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING, filed Mar. 11, 1997, assigned to the assignee of the present invention, and fully incorporated herein by reference. In addition, the above methods of distinguishing voiced speech from unvoiced speech are incorporated into the Telecommunications Industry Association Industry Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733. If the frame is determined in step 408 not to be periodic, the speech coder proceeds to step 410. In step 410 the speech coder codes the frame as unvoiced speech. In one embodiment unvoiced speech frames are time-domain coded at quarter rate, or 2 kbps. If in step 408 the frame is determined to be periodic, the speech coder proceeds to step 412.
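As a rough illustration of the two periodicity cues named above, the following sketch computes a zero-crossing count and a normalized autocorrelation at a candidate pitch lag. It is not the method of the referenced application or of IS-127/IS-733; the function names and the lag search range are assumptions chosen for an 8 kHz, 160-sample frame.

```python
import math


def zero_crossings(samples):
    """Count sign changes between neighbouring samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


def nacf(samples, lag):
    """Normalized autocorrelation of the frame at one lag, in [-1, 1]."""
    idx = range(lag, len(samples))
    num = sum(samples[n] * samples[n - lag] for n in idx)
    e_now = sum(samples[n] ** 2 for n in idx)
    e_lag = sum(samples[n - lag] ** 2 for n in idx)
    den = math.sqrt(e_now * e_lag)
    return num / den if den > 0 else 0.0


def max_nacf(samples, min_lag=20, max_lag=120):
    """Largest NACF over an assumed pitch-lag range (in samples)."""
    return max(nacf(samples, lag) for lag in range(min_lag, max_lag + 1))
```

A strongly periodic frame yields a maximum NACF close to 1, while noise-like unvoiced frames tend to show a low NACF and many zero crossings; the thresholds applied to these values are a design choice not given in the text.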

[0028] In step 412 the speech coder determines whether the frame is sufficiently periodic, using periodicity detection methods known in the art, e.g., as described in the above-referenced U.S. Application Ser. No. 08/815,354. If the frame is determined not to be sufficiently periodic, the speech coder proceeds to step 414. In step 414 the frame is time-domain coded as transition speech (i.e., a transition from unvoiced speech to voiced speech). In one embodiment the transition speech frame is time-domain coded at full rate, or 8 kbps.

[0029] If the speech coder determines in step 412 that the frame is sufficiently periodic, the speech coder proceeds to step 416. In step 416 the speech coder codes the frame as voiced speech. In one embodiment voiced speech frames are spectrally coded at half rate, or 4 kbps. Advantageously, the voiced speech frames are spectrally coded with a harmonic coder, as described below with reference to FIG. 7. Alternatively, other spectral coders known in the art, such as sinusoidal transform coders or multiband excitation coders, could be used. The speech coder then proceeds to step 418. In step 418 the speech coder decodes the coded voiced speech frame. The speech coder then proceeds to step 420. In step 420 the decoded voiced speech frame is compared with the corresponding input speech samples for that frame to obtain a measure of the distortion of the synthesized speech and to determine whether the half-rate, voiced-speech, spectral coding model is operating within acceptable limits. The speech coder then proceeds to step 422.

[0030] In step 422 the speech coder determines whether the error between the decoded voiced speech frame and the input speech frame corresponding to that frame falls below a predefined threshold. In one embodiment this determination is made in the manner described below with reference to FIG. 6. If the coding distortion falls below the predefined threshold, the speech coder proceeds to step 426. In step 426 the speech coder transmits the frame as voiced speech, using the parameters of step 416. If in step 422 the coding distortion meets or exceeds the predefined threshold, the speech coder proceeds to step 414 and time-domain codes the frame of digitized speech samples received in step 400 as transition speech at full rate.

[0031] It should be noted that steps 400 through 410 comprise an open-loop coding decision mode. Steps 412 through 426, on the other hand, comprise a closed-loop coding decision mode.
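The open-loop and closed-loop decisions of FIG. 5 can be summarized in one routine. This is a sketch under stated assumptions: the text does not say how "periodic" and "sufficiently periodic" are measured, so a single periodicity score with two hypothetical thresholds is used here, and `closed_loop_error` is a hypothetical callable that runs the half-rate encode/decode/compare of steps 416-420 only when it is actually needed.

```python
def select_mode(energy, periodicity, closed_loop_error, *,
                energy_threshold, periodic_threshold,
                strongly_periodic_threshold, error_threshold):
    """Return (frame class, rate name, nominal kbps) for one frame."""
    # Open loop, steps 402-406: low-energy frames are background noise.
    if energy < energy_threshold:
        return ("background noise", "ER", 1)

    # Open loop, steps 408-410: non-periodic speech is unvoiced.
    if periodicity < periodic_threshold:
        return ("unvoiced", "QR", 2)

    # Step 412 -> 414: periodic but not sufficiently so is transition speech.
    if periodicity < strongly_periodic_threshold:
        return ("transition", "FR", 8)

    # Closed loop, steps 416-422: try half-rate spectral coding and
    # keep it only if the coding error is below the threshold.
    if closed_loop_error() < error_threshold:
        return ("voiced", "HR", 4)  # step 426

    # Otherwise fall back to full-rate time-domain coding (step 414).
    return ("transition", "FR", 8)
```

Deferring the closed-loop computation behind a callable reflects the flowchart: frames rejected by the open-loop tests never incur the cost of spectral encoding and decoding.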

[0032] In one embodiment, shown in FIG. 6, a closed-loop, multimode MDLP speech coder includes an analog-to-digital converter (A/D) 500 coupled to a frame buffer 502, which in turn is coupled to a control processor 504. An energy calculator 506, a voiced speech detector 508, a background noise encoder 510, a high-rate, time-domain encoder 512, and a low-rate, spectral encoder 514 are coupled to the control processor 504. A spectral decoder 516 is coupled to the spectral encoder 514, and an error calculator 518 is coupled to the spectral decoder 516 and to the control processor 504. A threshold comparator 520 is coupled to the error calculator 518 and to the control processor 504. A buffer 522 is coupled to the spectral encoder 514, the spectral decoder 516, and the threshold comparator 520.

[0033] In the embodiment of FIG. 6, the speech coder components are advantageously implemented as firmware or other software-driven modules in the speech coder, which itself advantageously resides in a DSP or an ASIC. Those skilled in the art will understand that the speech coder components could equally well be implemented in a number of other known ways. The control processor 504 is advantageously a microprocessor, but may instead be implemented with a controller, state machine, or discrete logic.

[0034] In the multimode coder of FIG. 6, speech signals are provided to the A/D 500. The A/D 500 converts the analog signals to digitized speech samples S(n). The digitized speech samples are provided to the frame buffer 502. The control processor 504 takes the digitized speech samples from the frame buffer 502 and provides them to the energy calculator 506. The energy calculator 506 computes the energy E of the speech samples in accordance with the following equation:

[Equation 3]
$$E = \sum_{n=0}^{159} S^{2}(n)$$

[0035] where the frame is 20 ms long and the sampling rate is 8 kHz. The calculated energy E is sent to the control processor 504.
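The frame arithmetic and the energy computation described above can be checked with a few lines of code. The constant and function names are illustrative only; the per-frame bit budgets are nominal values derived from the rates quoted in the text (8, 4, 2, and 1 kbps) and the 20 ms frame length.

```python
FRAME_MS = 20            # frame length stated in the text
SAMPLE_RATE_HZ = 8000    # sampling rate stated in the text
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples


def frame_energy(samples):
    """Energy E as the sum of squared sample amplitudes."""
    return sum(s * s for s in samples)


def bits_per_frame(rate_kbps):
    """Nominal bit budget of one frame: kbit/s times ms gives bits."""
    return int(rate_kbps * FRAME_MS)


# Nominal budgets for the four rates named in the text.
RATE_BITS = {name: bits_per_frame(kbps)
             for name, kbps in (("FR", 8), ("HR", 4), ("QR", 2), ("ER", 1))}
```

With these figures a full-rate frame carries 160 bits and an eighth-rate frame 20 bits, which is why background noise is worth detecting before any costlier coder runs.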

[0036] The control processor 504 compares the calculated speech energy with a speech activity threshold. If the calculated energy is below the speech activity threshold, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the background noise encoder 510. The background noise encoder 510 codes the frame using the minimum number of bits necessary to preserve an estimate of the background noise.

[0037] If the calculated energy is greater than or equal to the speech activity threshold, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the voiced speech detector 508. The voiced speech detector 508 determines whether the periodicity of the speech frame would allow efficient coding using low-bit-rate spectral coding. Methods of determining the level of periodicity in a speech frame are well known in the art and include, e.g., the use of normalized autocorrelation functions (NACFs) and zero crossings. These and other methods are described in the above-referenced U.S. Application Ser. No. 08/815,354.

[0038] The voiced speech detector 508 provides a signal to the control processor 504 indicating whether the speech frame contains speech of sufficient periodicity to be efficiently coded by the spectral encoder 514. If the voiced speech detector 508 determines that the speech frame lacks sufficient periodicity, the control processor 504 directs the digitized speech samples to the high-rate encoder 512, which time-domain codes the speech at a predetermined maximum data rate. In one embodiment the predetermined maximum data rate is 8 kbps, and the high-rate encoder 512 is a CELP coder.

[0039] If the voiced speech detector 508 initially determines that the speech signal has sufficient periodicity to be efficiently coded by the spectral encoder 514, the control processor 504 directs the digitized speech samples from the frame buffer 502 to the spectral encoder 514. An exemplary spectral encoder is described in detail below with reference to FIG. 7.

[0040]

[Equation 4]

[0041]

[Equation 5]

[0042] If the computed MSE is within acceptable limits, the threshold comparator 520 provides a signal to the buffer 522, and the spectrally coded data is output from the speech coder. If, on the other hand, the MSE is not within acceptable limits, the threshold comparator 520 provides a signal to the control processor 504, which directs the digitized samples from the frame buffer 502 to the high-rate, time-domain encoder 512. The time-domain encoder 512 codes the frame at the predetermined maximum rate, and the contents of the buffer 522 are discarded.
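The closed-loop check performed by the spectral decoder 516, error calculator 518, threshold comparator 520, and buffer 522 might look like the sketch below. The error formula is an assumption (a plain per-sample mean squared error), since the equations defining the MSE are not reproduced in this text, and "within acceptable limits" is modeled as "less than or equal to a threshold". The encoder and decoder callables are hypothetical.

```python
def mean_squared_error(original, synthesized):
    """Plain per-sample mean squared error (assumed form of the MSE)."""
    if len(original) != len(synthesized):
        raise ValueError("frames must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, synthesized)) / len(original)


def closed_loop_encode(frame, spectral_encode, spectral_decode,
                       time_domain_encode, mse_threshold):
    """Return (coder used, packet) for one frame judged sufficiently periodic."""
    packet = spectral_encode(frame)         # held in buffer 522
    synthesized = spectral_decode(packet)   # spectral decoder 516
    error = mean_squared_error(frame, synthesized)  # error calculator 518

    # Threshold comparator 520: release the buffered packet if acceptable.
    if error <= mse_threshold:
        return "spectral", packet

    # Otherwise the buffered packet is dropped and the original samples
    # are coded again by the high-rate time-domain encoder 512.
    return "time-domain", time_domain_encode(frame)
```

Note that the fallback codes the original samples from the frame buffer, not the spectrally decoded frame, matching the data path described for FIG. 6.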

[0043] In the embodiment of FIG. 6, the type of spectral coding employed is harmonic coding, which is described below with reference to FIG. 7, but in alternative embodiments it may be any type of spectral coding, such as sinusoidal transform coding or multiband excitation coding. The use of multiband excitation coding is described in, e.g., U.S. Pat. No. 5,195,166, and the use of sinusoidal transform coding is described in, e.g., U.S. Pat. No. 4,865,068.

[0044] For transition frames, and for voiced frames for which the phase distortion threshold is equal to or less than the periodicity parameter, the multimode coder of FIG. 6 advantageously employs CELP coding at full rate, or 8 kbps, by means of the high-rate, time-domain coder 512. Alternatively, any other known form of high-rate, time-domain coding may be used for such frames. Transition frames (and voiced frames that are not sufficiently periodic) are therefore coded with high precision, so that the waveforms at input and output are well matched and phase information is well preserved. In one embodiment, after processing a predetermined number of consecutive voiced frames for which the threshold exceeds the periodicity measure, the multimode coder switches a frame from half-rate spectral coding to full-rate CELP coding, regardless of the determination of the threshold comparator 520.

[0045] It should be noted that the energy calculator 506 and the voiced speech detector 508, in conjunction with the control processor 504, comprise open-loop coding decisions. In contrast, the spectral encoder 514, the spectral decoder 516, the error calculator 518, the threshold comparator 520, and the buffer 522, in conjunction with the control processor 504, comprise a closed-loop coding decision.

[0046] In one embodiment, described with reference to FIG. 7, spectral coding, advantageously harmonic coding, is used to code sufficiently periodic voiced frames at a low bit rate. Spectral coders are generally defined as algorithms that attempt to preserve the time evolution of speech spectral characteristics in a perceptually significant manner by modeling and coding each frame of speech in the frequency domain. The essential parts of such algorithms are (1) spectral analysis or parameter estimation, (2) parameter quantization, and (3) synthesis of the output speech waveform from the decoded parameters. The objective is thus to preserve the important characteristics of the short-term speech spectrum with a set of spectral parameters, and to synthesize the output speech using the decoded spectral parameters. Typically, the output speech is synthesized as a weighted sum of sinusoids. The amplitudes, frequencies, and phases of the sinusoids are the spectral parameters estimated during analysis.

[0047] While "analysis by synthesis" is a well-known technique in CELP coding, the technique has not been exploited in spectral coding. The main reason analysis by synthesis is not applied to spectral coders is that, because of the loss of initial phase information, the mean square energy (MSE) of the synthesized speech can be high even when the speech model is functioning properly from a perceptual standpoint. Generating the initial phase accurately therefore has the further advantage of allowing the speech samples to be compared directly with the reconstructed speech, to determine whether the speech model is coding the speech frames accurately.

[0048] In spectral coding, the output speech frame can be synthesized as S[n] = S_v[n] + S_uv[n], n = 1, 2, ..., N, where N is the number of samples per frame and S_v and S_uv are the voiced and unvoiced components, respectively. A sum-of-sinusoids synthesis process generates the voiced component in accordance with the following equation:

[Equation 6]

[0049] The amplitude, frequency, and phase parameters are estimated from the short-term spectrum of the input frame by a spectral analysis process. The unvoiced component may be generated together with the voiced part in a single sum-of-sinusoids synthesis, or it may be computed separately by a dedicated unvoiced synthesis process and then added back to S_v.

[0050] In the embodiment of FIG. 7, a particular type of spectral coder called a harmonic coder is used to spectrally code sufficiently periodic voiced frames at a low bit rate. Harmonic coders characterize a frame as a sum of sinusoids, analyzing small segments of the frame. Each sinusoid in the sum has a frequency that is an integer multiple of the pitch F_0 of the frame. In an alternative embodiment, in which a particular type of spectral coder other than a harmonic coder is used, the sinusoid frequencies for each frame are taken from the set of real numbers between 0 and 2π. In the embodiment of FIG. 7, the amplitude and phase of each sinusoid in the sum are advantageously selected so that the sum best matches the signal over one period, as illustrated by the graph of FIG. 8. Harmonic coders typically employ an external classification, labeling each input speech frame as voiced or unvoiced. For a voiced frame, the frequencies of the sinusoids are restricted to the harmonics of the estimated pitch F_0, i.e., f_k = kF_0. For unvoiced speech, the peaks of the short-term spectrum are used to determine the sinusoids. The amplitudes and phases are interpolated to mimic their evolution over the frame, as shown in the following equation:

[Equation 7]

[0051] The parameters transmitted per sinusoid are the amplitude and the frequency. The phase is not transmitted, but is instead modeled in accordance with any of several known techniques, including, e.g., the quadratic phase model or any conventional polynomial representation of phase.
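A minimal sketch of sum-of-sinusoids synthesis with a quadratic phase track follows. The exact interpolation equations are not reproduced in this text, so the forms used here are assumptions: each harmonic amplitude is interpolated linearly across the frame, and the phase of harmonic k is a quadratic polynomial in n that results from a pitch moving linearly from `pitch_start` to `pitch_end` (both in cycles per sample). All names are hypothetical.

```python
import math


def synthesize_voiced(num_samples, pitch_start, pitch_end,
                      amps_start, amps_end, initial_phases):
    """Synthesize the voiced component as a sum of harmonics k = 1..L.

    Harmonic k has frequency k * F0 (f_k = k * F0 as in the text).
    Assumed models: amplitude linear in n, phase quadratic in n.
    """
    out = []
    for n in range(num_samples):
        frac = n / num_samples
        sample = 0.0
        tracks = zip(amps_start, amps_end, initial_phases)
        for k, (a0, a1, phi0) in enumerate(tracks, start=1):
            amp = a0 + (a1 - a0) * frac
            # Integral of the linearly varying pitch gives n and n^2 terms;
            # phi0 is the initial phase of this harmonic.
            cycles = pitch_start * n + (pitch_end - pitch_start) * n * n / (2 * num_samples)
            sample += amp * math.cos(2 * math.pi * k * cycles + phi0)
        out.append(sample)
    return out
```

For example, `synthesize_voiced(160, 0.0125, 0.0125, [1.0], [1.0], [0.0])` produces a steady 100 Hz tone at an 8 kHz sampling rate. Because only amplitudes and pitch are transmitted, the decoder needs some rule for `initial_phases`, which is the role of the phase model.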

[0052] As shown in FIG. 7, the harmonic coder includes a pitch extractor 600 coupled to windowing logic 602, which in turn is coupled to Discrete Fourier Transform (DFT) and harmonic analysis logic 604. The pitch extractor 600, which receives speech samples S(n) as input, is also coupled to the DFT and harmonic analysis logic 604. The DFT and harmonic analysis logic 604 is coupled to a residual encoder 606. The pitch extractor 600, the DFT and harmonic analysis logic 604, and the residual encoder 606 are each coupled to a parameter quantizer 608. The parameter quantizer 608 is coupled to a channel encoder 610, which in turn is coupled to a transmitter 612. The transmitter 612 is coupled to a receiver 614 over an over-the-air interface by means of a standard radio-frequency (RF) interface such as, e.g., code division multiple access (CDMA). The receiver 614 is coupled to a channel decoder 616, which in turn is coupled to a dequantizer 618. The dequantizer 618 is coupled to a sum-of-sinusoids speech synthesizer 620. Also coupled to the sum-of-sinusoids speech synthesizer 620 is a phase estimator 622, which receives previous-frame information as input. The sum-of-sinusoids speech synthesizer 620 is configured to generate a synthesized speech output S_SYNTH(n).

[0053] The pitch extractor 600, windowing logic 602, DFT and harmonic analysis logic 604, residual encoder 606, parameter quantizer 608, channel encoder 610, channel decoder 616, dequantizer 618, sum-of-sinusoids speech synthesizer 620, and phase estimator 622 can be implemented in a variety of different ways known to those skilled in the art, including, e.g., firmware or software modules. The transmitter 612 and receiver 614 may be implemented with corresponding standard RF components known to those skilled in the art.

[0054] In the harmonic coder of FIG. 7, the input samples S(n) are received by the pitch extractor 600, which extracts the pitch frequency information F_0. The samples are then multiplied by a suitable windowing function in the windowing logic 602 to allow analysis of small segments of the speech frame. Using the pitch information supplied by the pitch extractor 600, the DFT and harmonic analysis logic 604 computes the DFT of the samples to generate complex spectral points, from which the harmonic amplitudes A_I are extracted, as illustrated by the graph of FIG. 8, in which L denotes the total number of harmonics. The DFT is provided to the residual encoder 606, which extracts the voicing information V_c.

[0055] It should be noted that the V_c parameter denotes a point on the frequency axis, as shown in FIG. 8, above which the spectrum is characteristic of an unvoiced speech signal and is no longer harmonic. Below the point V_c, in contrast, the spectrum is harmonic and characteristic of voiced speech.
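The analysis side can be illustrated by evaluating a windowed DFT at the pitch harmonics and reading off their magnitudes. This sketch assumes a Hann window and a simple magnitude normalization; neither is specified in the text, and the function names are hypothetical. The second helper shows one way to apply the voicing cutoff V_c, treating harmonics below it as voiced.

```python
import cmath
import math


def harmonic_amplitudes(samples, pitch, num_harmonics):
    """Estimate amplitudes of harmonics k*pitch (pitch in cycles/sample).

    A Hann window is applied (an assumption), then the DFT is evaluated
    directly at each harmonic frequency instead of on a full bin grid.
    """
    total = len(samples)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (total - 1))
              for n in range(total)]
    norm = sum(window)
    amps = []
    for k in range(1, num_harmonics + 1):
        x = sum(w * s * cmath.exp(-2j * math.pi * k * pitch * n)
                for n, (w, s) in enumerate(zip(window, samples)))
        # A real sinusoid of amplitude A contributes about A/2 * sum(window).
        amps.append(2 * abs(x) / norm)
    return amps


def voiced_harmonics(pitch, num_harmonics, cutoff):
    """Harmonic indices whose frequency lies below the voicing cutoff V_c."""
    return [k for k in range(1, num_harmonics + 1) if k * pitch < cutoff]
```

Evaluating the transform only at k·F_0 keeps the sketch short; a production coder would more likely compute one FFT and sample it near each harmonic.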

[0056] The A_I, F_0, and V_c components are provided to the parameter quantizer 608, which quantizes the information. The quantized information is provided in the form of packets to the channel encoder 610, which quantizes the packets at a low bit rate such as, e.g., half rate, or 4 kbps. The packets are provided to the transmitter 612, which modulates the packets and transmits the resulting signal over the air to the receiver 614. The receiver 614 receives and demodulates the signal, passing the coded packets to the channel decoder 616. The channel decoder 616 decodes the packets and provides the decoded packets to the dequantizer 618. The dequantizer 618 dequantizes the information. The information is provided to the sum-of-sinusoids speech synthesizer 620.

[0057] The sum-of-sinusoids speech synthesizer 620 is configured to synthesize a plurality of sinusoids modeling the short-term speech spectrum in accordance with the above equation for S[n]. The frequencies of the sinusoids, f_k, are multiples or harmonics of the fundamental frequency F_0, which is the frequency of pitch periodicity for quasi-periodic (i.e., transition) voiced speech segments.

[0058] In addition, the sum-of-sinusoids speech synthesizer 620 receives phase information from the phase estimator 622. The phase estimator 622 receives previous-frame information, i.e., the A_I, F_0, and V_c parameters for the immediately preceding frame. The phase estimator 622 also receives the N reconstructed samples of the previous frame, where N is the frame length (i.e., N is the number of samples per frame). The phase estimator 622 determines the initial phase of the frame based on the information for the previous frame. The initial phase determination is provided to the sum-of-sinusoids speech synthesizer 620. Based on the information for the current frame and the initial phase computation performed by the phase estimator 622 from the past frame information, the sum-of-sinusoids speech synthesizer 620 generates synthetic speech frames as described above.

【００５９】As already described, a harmonic coder synthesizes, i.e., reconstructs, speech frames by using previous-frame information and predicting that the phase varies linearly from frame to frame. The synthesis model described above is commonly referred to as a quasi-phase model; in this model the coefficient B_3(k) represents the initial phase of the current voiced frame being synthesized. When determining the phase, conventional harmonic coders either set the initial phase to zero or generate the initial phase value randomly or with a pseudo-random generation method. To predict the phase more accurately, the phase estimator 622 uses one of two possible methods for determining the initial phase, depending on whether the immediately preceding frame was a voiced speech frame (i.e., a sufficiently periodic frame) or a transition speech frame. If the previous frame was a voiced speech frame, the estimated final phase value of that frame is used as the initial phase value of the current frame. If, on the other hand, the previous frame was classified as a transition frame, the initial phase value of the current frame is obtained from the spectrum of the previous frame, which is obtained by performing a DFT of the decoder output for the previous frame. The phase estimator 622 can therefore use accurate phase information that is already available (because the previous frame, being a transition frame, was processed at full rate).
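The carry-over of the final phase in the voiced case can be illustrated with a minimal sketch. It assumes, purely for illustration, that the pitch stays constant within a frame, so that the phase of the k-th harmonic advances linearly by 2πk·F_0·N/f_s over the N samples of the frame. The function name `final_phases` and the sampling-rate parameter are hypothetical and do not appear in the specification, whose phase model may include further terms.

```python
import math


def final_phases(initial, f0_hz, frame_len, sample_rate_hz):
    """Advance each harmonic phase linearly over one frame (assumed model).

    initial[k-1] is the initial phase of harmonic k, in radians.
    The result is wrapped to the interval [0, 2*pi).
    """
    two_pi = 2.0 * math.pi
    return [
        (phi + two_pi * k * f0_hz * frame_len / sample_rate_hz) % two_pi
        for k, phi in enumerate(initial, start=1)
    ]


# The estimated final phases of a voiced frame would then serve as the
# initial phases of the next frame coded in the frequency domain.
next_initial = final_phases([0.0, 1.0], f0_hz=125.0, frame_len=160,
                            sample_rate_hz=8000.0)
```

With these example values the first harmonic completes 2.5 cycles in the frame and the second completes 5, so only the first harmonic's phase changes modulo 2π.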

【００６０】In one embodiment, a closed-loop multimode MDLP speech coder follows the speech processing steps shown in the flowchart of FIG. 9. The speech coder encodes the LP residual of each input speech frame by selecting the most appropriate coding mode. Certain modes encode the LP residual, i.e., the speech residual, in the time domain, while other modes represent the LP residual, i.e., the speech residual, in the frequency domain. The set of modes comprises full-rate time domain for transition frames (T mode); half-rate frequency domain for voiced frames (V mode); quarter-rate time domain for unvoiced frames (U mode); and eighth-rate time domain for noise frames (N mode).
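The mode set can be summarized as a small lookup table. The following is only an illustrative sketch: the names `CodingMode`, `MODES`, and `average_rate` are hypothetical, and rates are expressed as fractions of full rate rather than as bit rates, which this passage does not specify.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class CodingMode:
    name: str          # mode letter used in the text
    frame_class: str   # type of frame the mode is intended for
    rate: Fraction     # fraction of full rate
    domain: str        # "time" or "frequency"


MODES = {
    "T": CodingMode("T", "transition", Fraction(1, 1), "time"),
    "V": CodingMode("V", "voiced", Fraction(1, 2), "frequency"),
    "U": CodingMode("U", "unvoiced", Fraction(1, 4), "time"),
    "N": CodingMode("N", "noise", Fraction(1, 8), "time"),
}


def average_rate(mode_sequence):
    """Mean rate, as a fraction of full rate, over a sequence of mode letters."""
    return sum(MODES[m].rate for m in mode_sequence) / len(mode_sequence)
```

For example, one transition frame followed by three voiced frames (`"TVVV"`) averages five eighths of full rate.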

【００６１】Those skilled in the art will appreciate that either the speech signal or the corresponding LP residual may be encoded by following the steps shown in FIG. 9. The waveform characteristics of noise, unvoiced, transition, and voiced speech can be seen as a function of time in the graph of FIG. 10a. The waveform characteristics of the noise, unvoiced, transition, and voiced LP residual can be seen as a function of time in the graph of FIG. 10b.

【００６２】In step 700, an open-loop mode decision is made as to which one of the four modes (T, V, U, or N) to apply to the input speech residual S(n). If T mode is to be applied, then in step 702 the speech residual is processed in T mode, i.e., at full rate, in the time domain. If U mode is to be applied, then in step 704 the speech residual is processed in U mode, i.e., at quarter rate, in the time domain. If N mode is to be applied, then in step 706 the speech residual is processed in N mode, i.e., at eighth rate, in the time domain. If V mode is to be applied, then in step 708 the speech residual is processed in V mode, i.e., at half rate, in the frequency domain.

【００６３】In step 710, the speech encoded in step 708 is decoded and compared with the input speech residual S(n), and a performance measure D is computed. In step 712, the performance measure D is compared with a predetermined threshold T. If the performance measure D is greater than or equal to the threshold T, then in step 714 the spectrally encoded speech residual from step 708 is approved for transmission. If, on the other hand, the performance measure D is less than the threshold T, then in step 716 the input speech residual S(n) is processed in T mode. In an alternative embodiment, no performance measure is computed and no threshold is defined. Instead, after a predetermined number of speech residual frames have been processed in V mode, the next frame is processed in T mode.
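The closed-loop check of steps 710 to 716 and the alternative counter-based embodiment can be sketched as follows. This is a minimal illustration under stated assumptions: the passage does not define the performance measure D, so a plain signal-to-noise ratio in dB is used here as a stand-in, and the function names and the `max_v_frames` parameter are hypothetical.

```python
import math


def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB; an assumed stand-in for the measure D."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def closed_loop_mode(residual, v_decoded, threshold_db):
    """Keep V mode when D >= threshold, otherwise fall back to T mode."""
    d = snr_db(residual, v_decoded)
    mode = "V" if d >= threshold_db else "T"
    return mode, d


def forced_refresh_mode(consecutive_v_frames, max_v_frames):
    """Alternative embodiment: T mode after a fixed run of V-mode frames."""
    return "T" if consecutive_v_frames >= max_v_frames else "V"
```

In the first variant the V-mode output is transmitted only when it reproduces the input residual closely enough; in the second, no comparison is made and the switch to T mode is driven purely by the frame count.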

【００６４】The decision steps shown in FIG. 9 advantageously use the high-bit-rate T mode only when necessary, exploiting the periodicity of voiced speech segments with the lower-bit-rate V mode while preventing any loss of quality by switching to full rate when V mode does not perform adequately. Accordingly, very high speech quality approaching full-rate speech quality can be produced at an average rate considerably lower than full rate. Moreover, the target speech quality can be controlled through the performance measure selected and the threshold chosen.

【００６５】The "updates" to T mode also improve the performance of subsequent V-mode applications by keeping the model phase track close to the phase track of the input speech. When V-mode performance is inadequate, the closed-loop performance check of steps 710 and 712 switches to T mode, which "refreshes" the initial phase value and brings the model phase track back close to the original input speech phase track, thereby improving the performance of subsequent V-mode processing. For example, as shown in the graphs of FIGS. 11a to 11c, the fifth frame from the beginning does not perform adequately in V mode, as evidenced by the PSNR distortion measure used. Consequently, without a closed-loop decision and update, the modeled phase track deviates considerably from the original input speech phase track and the PSNR degrades severely, as shown in FIG. 11c. Moreover, the performance of subsequent frames processed in V mode degrades. Under a closed-loop decision, however, the fifth frame is switched to T-mode processing, as shown in FIG. 11a. The performance of the fifth frame is considerably improved by the update, as evidenced by the improvement in PSNR shown in FIG. 11b. In addition, the performance of subsequent frames processed in V mode also improves.

【００６６】The decision steps shown in FIG. 9 improve the quality of the V-mode representation by providing a highly accurate initial phase estimate, ensuring that the resulting V-mode synthesized speech residual signal is accurately time-aligned with the original input speech residual S(n). The initial phase for the first V-mode-processed speech residual segment is derived from the immediately preceding decoded frame in the following manner. For each harmonic, if the previous frame was processed in V mode, the initial phase is set equal to the estimated final phase of the previous frame. For each harmonic, if the previous frame was processed in T mode, the initial phase is set equal to the actual harmonic phase of the previous frame. The actual harmonic phase of the previous frame may be derived by taking a DFT of the past decoded residual using the entire previous frame. Alternatively, the actual harmonic phase of the previous frame may be derived by taking a DFT of the past decoded frame in a pitch-synchronous manner, by processing various pitch periods of the previous frame.
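The two-way choice of initial phase can be sketched as below. This is one possible reading rather than the specified procedure: it assumes an integer pitch period in samples, takes the DFT over only the last pitch period of the decoded previous frame (a simple pitch-synchronous variant), and uses hypothetical names throughout. With a DFT length equal to the pitch period, bin k falls exactly on the k-th harmonic, and for a periodic signal the phase one full period later, i.e., at the start of the next frame, is the same modulo 2π.

```python
import cmath
import math


def pitch_synchronous_phases(decoded, pitch_period, num_harmonics):
    """Harmonic phases from a DFT over the last pitch period of `decoded`."""
    segment = decoded[-pitch_period:]
    phases = []
    for k in range(1, num_harmonics + 1):
        # Direct DFT at bin k of a length-`pitch_period` transform.
        bin_k = sum(
            x * cmath.exp(-2j * math.pi * k * n / pitch_period)
            for n, x in enumerate(segment)
        )
        phases.append(cmath.phase(bin_k) % (2.0 * math.pi))
    return phases


def initial_phases(prev_mode, num_harmonics, *, prev_final_phases=None,
                   prev_decoded=None, pitch_period=None):
    """Initial phase per harmonic for a frame about to be coded in V mode."""
    if prev_mode == "V":
        # Previous frame was frequency-domain coded: reuse its final phases.
        return list(prev_final_phases[:num_harmonics])
    if prev_mode == "T":
        # Previous frame was time-domain coded: measure phases from its output.
        return pitch_synchronous_phases(prev_decoded, pitch_period,
                                        num_harmonics)
    raise ValueError("previous mode must be 'V' or 'T' in this sketch")
```

The whole-frame DFT alternative mentioned in the text would replace `pitch_synchronous_phases` with a transform over all N samples of the previous frame; that variant is not shown here.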

【００６７】A novel closed-loop multimode mixed-domain linear prediction (MDLP) speech coder has thus been described. Those skilled in the art will understand that the various illustrative logical blocks and algorithm steps described in connection with the embodiments disclosed herein may be implemented or performed with a digital signal processor (DSP), an application specific integrated circuit (ASIC), discrete gate or transistor logic, discrete hardware components such as registers and FIFOs, a processor executing a set of firmware instructions, or any conventional programmable software module and a processor. The processor is advantageously a microprocessor, but may alternatively be any conventional processor, controller, microprocessor, or state machine. The software module may reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Those skilled in the art will further appreciate that the data, instructions, commands, information, signals, bits, symbols, and chips referenced throughout the above description are advantageously represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

【００６８】Preferred embodiments of the present invention have thus been shown and described. It will be apparent to one of ordinary skill in the art, however, that numerous modifications may be made to the embodiments described herein without departing from the spirit or scope of the invention. Accordingly, the invention is not to be limited except in accordance with the claims.

[Brief description of drawings]

FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders.

FIG. 2 is a block diagram of an encoder that can be used in a multimode mixed-domain linear prediction (MDLP) speech coder.

FIG. 3 is a block diagram of a decoder that can be used in a multimode MDLP speech coder.

FIG. 4 is a flowchart illustrating MDLP encoding steps performed by an MDLP encoder that can be used in the encoder of FIG. 2.

FIG. 5 is a flowchart illustrating a speech coding decision process.

FIG. 6 is a block diagram of a closed-loop multimode MDLP speech coder.

FIG. 7 is a block diagram of a spectral coder that can be used in the coder of FIG. 6 or the encoder of FIG. 2.

FIG. 8 is a graph of amplitude versus frequency, illustrating the amplitudes of the sinusoids in a harmonic coder.

FIG. 9 is a flowchart illustrating a mode decision process in a multimode MDLP speech coder.

FIG. 10a is a graph of speech signal amplitude versus time, and FIG. 10b is a graph of linear prediction (LP) residual amplitude versus time.

FIG. 11a is a graph of rate/mode versus frame index under a closed-loop coding decision, FIG. 11b is a graph of perceptual signal-to-noise ratio (PSNR) versus frame index under a closed-loop decision, and FIG. 11c is a graph of both rate/mode and PSNR versus frame index in the absence of a closed-loop coding decision.

Continuation of front page

(51) Int.Cl.7, identification code, FI, theme code (reference): G10L 19/12 G10L 9/14 C H03M 7/36 D

(81) Designated states: EP (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OA (BF, BJ, CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG), AP (GH, GM, KE, LS, MW, SD, SL, SZ, TZ, UG, ZW), EA (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), AE, AL, AM, AT, AU, AZ, BA, BB, BG, BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM, EE, ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ, UA, UG, UZ, VN, YU, ZA, ZW

[Continuation of abstract] ... is calculated from the decoded speech frame information of the speech frame coded in the time domain. Each speech frame coded in the frequency-domain coding mode can be compared with the corresponding input speech frame to obtain a performance measure. When the performance measure is below a predetermined threshold, the input speech frame is coded in the time-domain coding mode.

Claims

1. A multimode mixed-domain speech processor, comprising: a coder having at least one time-domain coding mode and at least one frequency-domain coding mode; and a closed-loop mode selection device coupled to the coder and configured to select a coding mode of the coder based on the content of frames processed by the speech processor.

2. The speech processor of claim 1, wherein the coder encodes speech frames.

3. The speech processor of claim 1, wherein the coder encodes a linear prediction residual of speech frames.

4. The speech processor of claim 1, wherein the at least one time-domain coding mode comprises a coding mode that codes frames at a first coding rate, and the at least one frequency-domain coding mode comprises a coding mode that codes frames at a second coding rate, the second coding rate being lower than the first coding rate.

5. The speech processor of claim 1, wherein the at least one frequency-domain coding mode comprises a harmonic coding mode.

6. The speech processor of claim 1, further comprising a comparison circuit coupled to the coder and configured to compare uncoded frames with frames coded in the at least one frequency-domain coding mode and to generate a performance measure based on the comparison, wherein the coder applies the at least one time-domain coding mode only when the performance measure is below a predetermined threshold, the coder otherwise applying the at least one frequency-domain coding mode.

7. The speech processor of claim 1, wherein the coder applies the at least one time-domain coding mode to each frame that immediately follows a predetermined number of consecutively processed frames coded in the at least one frequency-domain coding mode.

8. The speech processor of claim 1, wherein the at least one frequency-domain coding mode represents a short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequency, phase, and amplitude, the phase being modeled with a polynomial representation and an initial phase value, and wherein the initial phase value is either (1) the estimated final phase value of the previous frame when the previous frame was coded in the at least one frequency-domain coding mode, or (2) a phase value obtained from the short-term spectrum of the previous frame when the previous frame was coded in the at least one time-domain coding mode.

9. The speech processor of claim 8, wherein the frequencies of the sinusoids in each frame are integer multiples of the pitch frequency of the frame.

10. The speech processor of claim 8, wherein the frequencies of the sinusoids in each frame are taken from a set of real numbers between 0 and 2π.

11. A method of processing frames, comprising the steps of: applying an open-loop coding mode selection process to each successive input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame; coding the input frame in the frequency domain when the speech content of the input frame indicates steady-state voiced speech; coding the input frame in the time domain when the speech content of the input frame indicates anything other than steady-state voiced speech; comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and coding the input frame in the time domain when the performance measure is below a predetermined threshold.

12. The method of claim 11, wherein the frame is a linear prediction residual frame.

13. The method of claim 11, wherein the frame is a speech frame.

14. The method of claim 11, wherein the step of coding in the time domain comprises coding the frame at a first coding rate, and the step of coding in the frequency domain comprises coding the frame at a second coding rate, the second coding rate being lower than the first coding rate.

15. The method of claim 11, wherein the step of coding in the frequency domain comprises harmonic coding.

16. The method of claim 11, wherein the step of coding in the frequency domain comprises representing a short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequency, phase, and amplitude, the phase being modeled with a polynomial representation and an initial phase value, and wherein the initial phase value is either (1) the estimated final phase value of the previous frame when the previous frame was coded in the frequency domain, or (2) a phase value obtained from the short-term spectrum of the previous frame when the previous frame was coded in the time domain.

17. The method of claim 16, wherein the frequencies of the sinusoids in each frame are integer multiples of the pitch frequency of the frame.

18. The method of claim 16, wherein the frequencies of the sinusoids in each frame are taken from a set of real numbers between 0 and 2π.

19. A multimode mixed-domain speech processor, comprising: means for applying an open-loop coding mode selection process to an input frame to select either a time-domain coding mode or a frequency-domain coding mode based on the speech content of the input frame; means for coding the input frame in the frequency domain when the speech content of the input frame indicates steady-state voiced speech; means for coding the input frame in the time domain when the speech content of the input frame indicates anything other than steady-state voiced speech; means for comparing the frequency-domain-coded frame with the input frame to obtain a performance measure; and means for coding the input frame in the time domain when the performance measure is below a predetermined threshold.

20. The speech processor of claim 19, wherein the input frame is a linear prediction residual frame.

21. The speech processor of claim 19, wherein the input frame is a speech frame.

22. The speech processor of claim 19, wherein the means for coding in the time domain comprises means for coding the frame at a first coding rate, and the means for coding in the frequency domain comprises means for coding the frame at a second coding rate, the second coding rate being lower than the first coding rate.

23. The speech processor of claim 19, wherein the means for coding in the frequency domain comprises a harmonic coder.

24. The speech processor of claim 19, wherein the means for coding in the frequency domain comprises means for representing a short-term spectrum of each frame with a plurality of sinusoids having a set of parameters including frequency, phase, and amplitude, the phase being modeled with a polynomial representation and an initial phase value, and wherein the initial phase value is either (1) the estimated final phase value of the immediately preceding frame when the immediately preceding frame was coded in the frequency domain, or (2) a phase value obtained from the short-term spectrum of the immediately preceding frame when the immediately preceding frame was coded in the time domain.

25. The speech processor of claim 24, wherein the frequencies of the sinusoids in each frame are integer multiples of the pitch frequency of the frame.

26. The speech processor of claim 24, wherein the frequencies of the sinusoids in each frame are taken from a set of real numbers between 0 and 2π.