JPH08211895A

JPH08211895A - System and method for evaluation of pitch lag as well as apparatus and method for coding of sound

Info

Publication number: JPH08211895A
Application number: JP7295266A
Authority: JP
Inventors: Huan-Yu Su; フアン−ユー・スー; Tom Hong Li; トム・ホン・リー
Original assignee: Rockwell International Corp
Current assignee: Boeing North American Inc
Priority date: 1994-11-21
Filing date: 1995-11-14
Publication date: 1996-08-20
Also published as: EP0713208A2; DE69525508T2; EP0713208B1; DE69525508D1; EP0713208A3

Abstract

PROBLEM TO BE SOLVED: To provide a compact and accurate pitch estimation system incorporating multi-resolution analysis for speech coding. SOLUTION: A pitch estimation device and method estimate the pitch lag of speech using a multi-resolution approach. The speech is sampled, a discrete Fourier transform (DFT) is applied, and the resulting amplitude is squared. A second DFT is then performed on the squared amplitude to transform the speech samples into another domain, where an initial pitch lag is obtained at low resolution. After the low-resolution pitch lag has been estimated, a refinement algorithm based on minimizing the prediction error in the time domain is applied. The refined pitch lag can then be used directly in speech coding.

Description

Detailed Description of the Invention

[0001] BACKGROUND OF THE INVENTION: Signal modeling and parameter estimation play an increasingly important role in data compression, decompression, and coding. To model basic speech sounds, a speech signal must be sampled as a discrete waveform and processed digitally. In one type of signal coding technique, called linear predictive coding (LPC), the signal value at any particular time index is modeled as a linear function of previous values, so that a subsequent signal is linearly predicted from the values that precede it. As a result, an efficient signal representation can be determined by estimating and applying certain prediction parameters to represent the signal. LPC techniques are currently used for speech coding, including code excited linear prediction (CELP).

[0002] Pitch information is recognized as a reliable indicator and representation of sound for coding purposes. Pitch describes a fundamental characteristic or parameter of a speaker's voice. Because human speech is generally not easily quantified mathematically, a speech estimation model that can effectively estimate speech pitch data provides more accurate and precise coded and decoded speech. In current speech coding models, however, such as certain CELP coders (e.g., vector sum excited linear prediction (VSELP), multi-pulse, regular pulse, and algebraic CELP) and MBE coder/decoders ("codecs"), pitch estimation is often difficult because the pitch estimation algorithm must combine high accuracy with low complexity.

[0003] Several pitch lag estimation schemes are used in conjunction with the codecs mentioned above: time-domain, frequency-domain, and cepstrum-domain approaches. Because pitch lag and speech reproduction are closely related, the accuracy of pitch estimation directly affects speech quality. In CELP coders, speech generation is based on prediction: long-term pitch prediction and short-term linear prediction. FIG. 1 shows a block diagram of speech regeneration by a typical CELP coder.

[0004] To compress speech data, it is desirable to extract only the essential information and avoid transmitting redundancies. Speech can be grouped into short blocks, and representative parameters can be identified in each block. As shown in FIG. 1, to generate good-quality speech a CELP speech coder must extract from the input speech to be coded the LPC parameters 110, the pitch lag parameters 112 (including the lag and its coefficient), and an optimal innovation code vector 114 with its gain parameter 116. The coder quantizes the LPC parameters by implementing an appropriate coding scheme. The quantization indices of each parameter constitute the information to be stored or transmitted to the speech decoder. In CELP codecs the pitch prediction parameters (pitch lag and pitch coefficient) are determined in the time domain, whereas in MBE codecs the pitch parameters are estimated in the frequency domain.

[0005] After LPC analysis, the CELP encoder determines an appropriate LPC filter 110 for the current speech coding frame (usually about 10 to 40 ms long). The LPC filter is represented by the following equation.

[0006]

[Equation 1]

[0007] In this equation, np is the LPC prediction order (usually about 10), y(n) is the sampled speech data, and n is the time index. The LPC equation above describes the estimate of the current sample as a linear combination of past samples. A perceptual weighting filter based on the LPC filter, which models the sensitivity of the human ear, is then defined by the following equation.

[0008]

[Equation 2]
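As a rough illustration of the linear-prediction relation described above, the sketch below predicts each sample from the previous samples and forms the prediction residual. The function names, the coefficient list `a`, and the sign convention (prediction equals the sum of a_i times y(n−i)) are assumptions made for illustration only; Equation 1 itself is not reproduced in this text.

```python
def lpc_predict(y, a, n):
    """Predict y[n] as a linear combination of the len(a) previous samples."""
    # Assumed convention: a[0] multiplies y[n-1], a[1] multiplies y[n-2], ...
    return sum(a_i * y[n - i] for i, a_i in enumerate(a, start=1))


def lpc_residual(y, a):
    """Prediction error for every sample that has a full history of len(a) samples."""
    order = len(a)
    return [y[n] - lpc_predict(y, a, n) for n in range(order, len(y))]


# Toy signal: each sample is half the previous one, so a one-tap predictor
# with coefficient 0.5 predicts it exactly and the residual vanishes.
signal = [1.0, 0.5, 0.25, 0.125]
residual = lpc_residual(signal, [0.5])
```

A real coder would use an order of about 10, as the text states, and would derive the coefficients from the speech frame rather than fix them by hand.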

[0009] To extract the desired pitch parameters, the pitch parameters that minimize the following weighted coding error energy must be calculated for each coding subframe, where one coding frame can be divided into several coding subframes for analysis and coding.

[0010]

[Equation 3]

[0011] In this equation, T is the target signal, which represents the perceptually filtered input signal, and H is the impulse response matrix of the filter W(z)/A(z). P_Lag is the pitch prediction contribution, having pitch lag "Lag" and a prediction coefficient β that is uniquely defined for a given lag, and C_i is the codebook contribution associated with index i in the codebook and its corresponding gain α. The pitch period of human speech typically varies between 2 ms and 20 ms. Thus, when speech is sampled at 8 kHz, the pitch lag corresponds to roughly 20 to 147 samples. In addition, i takes values between 0 and Nc−1, where Nc is the size of the innovation codebook.
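The relation between pitch period and lag in samples is simple arithmetic, sketched below. The helper names are hypothetical. The values 2.5 ms and 18.375 ms are the exact periods that the description later associates with lags of 20 and 147 samples at 8 kHz; the "2 ms to 20 ms" range quoted above is only approximate.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate quoted in the text


def period_ms_to_lag(period_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a pitch period in milliseconds to a lag in samples."""
    return round(period_ms * sample_rate_hz / 1000)


def lag_to_period_ms(lag, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a lag in samples back to a pitch period in milliseconds."""
    return lag * 1000 / sample_rate_hz


min_lag = period_ms_to_lag(2.5)      # 20 samples
max_period = lag_to_period_ms(147)   # 18.375 ms
```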

[0012] A one-tap pitch predictor and a single innovation codebook are assumed here. Typically, however, the general form of the pitch predictor is multi-tap, and the general form of the innovation codebook is multi-level vector quantization or uses multiple innovation codebooks. More specifically, in speech coding a one-tap pitch predictor indicates that the current speech sample can be predicted from one past speech sample, while a multi-tap predictor means that the current speech sample can be predicted from several past speech samples.

[0013] Because of complexity concerns, sub-optimal approaches have been used in speech coding schemes. For example, pitch lag estimation may be performed by simply evaluating the possible lag values in the range between L₁ and L₂ samples, covering 2.5 ms to 18.5 ms. The estimated pitch lag value is then determined by maximizing the following expression.

[0014]

[Equation 4]

[0015] This time-domain approach can determine the true pitch lag, but for female speech, which has a high pitch frequency, the pitch lag found by equation (1) may be a multiple of the true lag rather than the true lag itself. Avoiding this estimation error requires an additional process that corrects it (e.g., lag smoothing), at the cost of undesirable complexity.
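The time-domain search can be sketched as an exhaustive scan over candidate lags. The expression maximized by equation (1) is not reproduced in this text, so the normalized-correlation score used below (squared cross-correlation divided by the energy of the delayed segment) is an assumption: it is a criterion commonly used for open-loop pitch search, not necessarily the one in the patent. The function name and the toy signal are likewise illustrative.

```python
def open_loop_lag(y, start, length, l_min=20, l_max=147):
    """Return the lag in [l_min, l_max] with the highest normalized correlation."""
    best_lag, best_score = l_min, float("-inf")
    window = range(start, start + length)
    for lag in range(l_min, l_max + 1):
        num = sum(y[n] * y[n - lag] for n in window)   # cross-correlation
        den = sum(y[n - lag] ** 2 for n in window)     # energy of delayed segment
        score = num * num / den if den > 0 and num > 0 else 0.0
        if score > best_score:  # strict '>' keeps the smallest lag among exact ties
            best_lag, best_score = lag, score
    return best_lag


# Integer sawtooth with a true period of 40 samples.
signal = [n % 40 for n in range(320)]
lag = open_loop_lag(signal, start=160, length=80)
```

For this perfectly periodic toy signal, lags of 80 and 120 score exactly the same as 40, and only the tie-breaking rule selects the true lag. With real speech, small fluctuations can push a multiple ahead, which is the ambiguity described above.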

[0016] This excessive complexity is a significant drawback of the time-domain approach. For example, determining the lag using integer lags only requires at least 3 million operations per second (3 MOPs). If pitch lag smoothing and fractional pitch lags are used, the complexity rises to approximately 4 MOPs. In practice, about 6 million digital signal processing machine instructions per second (6 DSP MIPs) are required to perform full-range pitch lag estimation with acceptable accuracy. It is therefore generally accepted that pitch estimation requires 4 to 6 DSP MIPs. Other approaches can reduce the complexity of pitch estimation, but they often sacrifice quality.

[0017] In MBE coders, an important member of the class of sinusoidal coders, the coding parameters are extracted and quantized in the frequency domain. The MBE speech model is shown in FIGS. 2 to 4. In the MBE speech encoder/decoder ("vocoder") illustrated in FIGS. 2 and 3, the fundamental frequency (or pitch lag) 210, the voiced/unvoiced decisions 212, and the spectral envelope 214 are extracted from the input speech in the frequency domain. The parameters are then quantized and encoded into a bit stream that can be stored or transmitted.

[0018] In the MBE vocoder, the fundamental frequency must be estimated with high accuracy to achieve good speech quality. The fundamental frequency is estimated in two stages. First, an initial pitch lag is searched within the range of 21 to 114 samples, covering 2.6 ms to 14.25 ms at a sampling rate of 8000 Hz, by minimizing the weighted mean square error equation 310 (FIG. 3) between the input speech 216 and the synthesized speech 218 in the frequency domain. The mean square error between the original speech and the synthesized speech is given by the following equation.

[0019]

[Equation 5]

[0020] In this equation, S(ω) is the original speech spectrum, Ŝ(ω) is the synthesized speech spectrum, and G(ω) is a frequency-dependent weighting function. As shown in FIG. 4, a pitch tracking algorithm 410 is used to update the initial pitch lag estimate 412 using the pitch information of neighboring frames.

[0021] This approach is used on the assumption that the fundamental frequency should not change abruptly between neighboring frames. The pitch estimates of the two past and two future neighboring frames are used for pitch tracking. The mean square error (including the two past and two future frames) is then minimized to find a new pitch lag value for the current frame. After the initial pitch lag has been tracked, a pitch lag multiple checking scheme 414 is applied to eliminate multiples of the pitch lag, thereby smoothing the pitch lag.

[0022] Referring to FIG. 4, in the second stage of fundamental frequency estimation, pitch lag refinement 416 is used to increase the accuracy of the pitch estimate. Candidate pitch lag values are formed from the initial pitch lag estimate (i.e., new candidates are formed by adding or subtracting some fractional amount to or from the initial pitch lag estimate). The refined pitch lag estimate 418 is then determined among the candidates by minimizing the mean square error function.

[0023] Frequency-domain pitch estimation has certain drawbacks, however. First, its complexity is very high. Second, the pitch lag must be searched within a range of 20 to 114 samples, covering only 2.5 ms to 14.25 ms, so that the window size is limited to 256 samples to accommodate a 256-point FFT. For speakers with a very low pitch frequency, or for speech with a pitch lag longer than 14.25 ms, it is impossible to gather a sufficient number of samples within a 256-sample window. Moreover, only an averaged pitch lag is estimated over the speech frame.

[0024] A modified approach has also been proposed that uses the cepstrum-domain pitch lag estimation (FIG. 5) proposed by A. M. Noll in 1967. In cepstrum-domain pitch lag estimation, approximately 37 ms of speech is sampled at 510, so that at least two periods of the maximum possible pitch lag (e.g., 18.5 ms) are covered. A 512-point FFT is then applied to the windowed speech frame (at block 512) to obtain the frequency spectrum. The logarithm 514 of the amplitude of the frequency spectrum is taken, and another 512-point inverse FFT 516 is applied to obtain the cepstrum. A weighting function 518 is applied to the cepstrum, and the cepstral peak is detected at 520 to determine the pitch lag. A tracking algorithm 522 is then executed to eliminate any pitch multiples.

[0025] The cepstrum pitch detection method nevertheless has several drawbacks. For example, its computational requirements are high: to cover the pitch range between 20 and 147 samples at an 8 kHz sampling rate, the 512-point FFT must be performed twice. The accuracy of the estimate is also inadequate, since cepstrum pitch estimation provides only an estimate of the pitch lag averaged over the analysis frame. For low bit rate speech coding, however, it is important that the pitch lag value be estimated over a short time period. As a result, cepstrum pitch estimation is rarely used today for high-quality, low bit rate speech coding. Given the limitations of each of the approaches described above, a means for effective pitch lag estimation that meets the needs of high-quality, low bit rate speech coding is desired.

[0026] SUMMARY OF THE INVENTION: Accordingly, it is an object of the present invention to provide a pitch estimation system incorporating multi-resolution analysis for speech coding, which requires minimal complexity and high accuracy. In particular embodiments, the invention is directed to an apparatus and method of speech coding using CELP techniques as well as a variety of other speech coding and recognition systems. Better results are thus obtained with fewer computational resources while the required high accuracy is maintained.

[0027] These and other objects are achieved, in accordance with an embodiment of the invention, by a pitch lag estimation scheme that enables accurate reproduction and regeneration of speech quickly and efficiently. The pitch lag is extracted for a given speech frame and then refined for each subframe. After a minimum number of speech samples have been obtained by direct sampling of the speech, a discrete Fourier transform (DFT) is applied and the resulting amplitude is squared. A second DFT is then performed. In this way, an accurate initial pitch lag for the speech samples within the frame can be determined between the possible minimum of 20 samples and the maximum lag value of 147 samples at an 8 kHz sampling rate. After the initial pitch lag estimate has been obtained, time-domain refinement is performed for each subframe to further improve the accuracy of the estimate.

[0028] DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS: A pitch lag estimation scheme in accordance with a preferred embodiment of the invention is shown generally in FIGS. 6, 7, 8, and 9. First, N speech samples {x(n), n = 0, 1, ..., N−1} are gathered (step 602 of FIG. 6). N may, for example, equal 320 speech samples to accommodate a typical 40 ms speech window at an 8000 Hz sampling rate. The value of N is determined by the roughly estimated speech period, where at least two periods are generally required to generate the speech spectrum. N must therefore be greater than twice the maximum possible pitch lag, where the samples are {x(n), n = 0, 1, ..., N−1}. In addition, a Hamming window 604, or another window that covers at least two pitch periods, is preferably applied.

[0029]

[Equation 6]

[0030] In accordance with this embodiment, C(n) differs from the conventional cepstrum transform, which would use the logarithm of G(f) rather than the function G(f) itself in equation (4). The reason for this difference is mainly complexity: it is desirable to reduce complexity by eliminating the logarithm function, which would otherwise require substantially more computational resources. Furthermore, when pitch lag estimation schemes using the cepstrum and the C(n) function were compared, different results were obtained only for unvoiced or transition segments of speech. For unvoiced or transition speech, the definition of pitch is itself unclear. It has been said that transition speech has no pitch, yet it has also been said that some prediction can always be made to minimize the error.
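The computation of C(n) described above (window the frame, take a DFT, square the amplitude, take a second DFT, with no logarithm in between) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, a plain O(N²) DFT stands in for an FFT, and the toy input is an impulse train rather than speech.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def c_function(x):
    """C(n): DFT of the squared spectral amplitude G(f) = |Y(f)|^2, with no logarithm."""
    windowed = [w * s for w, s in zip(hamming(len(x)), x)]
    g = [abs(v) ** 2 for v in dft(windowed)]
    # G(f) is real and symmetric for a real input, so the second DFT is real
    # up to rounding error; keep only the real part.
    return [v.real for v in dft(g)]


# Toy frame: an impulse train with a period of 32 samples.
frame = [1.0 if n % 32 == 0 else 0.0 for n in range(256)]
c = c_function(frame)
peak = max(range(20, 148), key=lambda n: c[n])
```

Because the second transform is applied to |Y(f)|² directly, C(n) is proportional to the circular autocorrelation of the windowed frame, which is why its largest value in the search range falls at the pitch period.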

[0031] Once C(n) has been determined (step 610), the pitch lag for the given speech frame can be found at step 614 by solving the following equation.

[0032]

[Equation 7]

[0033] In this equation, arg[·] determines the variable n that satisfies the inner optimization function, and L₁ and L₂ are defined as the minimum and maximum possible pitch lag, respectively. For convenience in speech coding, it is desirable that the difference between L₂ and L₁ be a power of 2 for binary representation. In the preferred embodiment, L₁ and L₂ take the values 20 and 147, respectively, covering the typical human speech pitch lag range of 2.5 ms to 18.375 ms, so that the distance between L₁ and L₂ is a power of 2. W(i) is a weighting function, and 2M+1 represents the window size. Preferably, {W(i) = 1, i = 0, 1, ..., 2M} and M = 1.
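A minimal sketch of the lag selection follows. Equation (5) is not reproduced in this text, so the form used here, a weighted sum of C(i) over a window of 2M+1 points centered on each candidate n, is an assumption inferred from the definitions of W(i), M, L₁, and L₂ above. The function name and the hand-made C(n) values are illustrative.

```python
def select_lag(c, l_min=20, l_max=147, m=1, weights=None):
    """Return the n in [l_min, l_max] maximizing sum_j weights[j] * c[n - m + j]."""
    if weights is None:
        weights = [1.0] * (2 * m + 1)  # W(i) = 1, the preferred choice in the text

    def score(n):
        return sum(weights[j] * c[n - m + j] for j in range(2 * m + 1))

    return max(range(l_min, l_max + 1), key=score)


# Hand-made C(n): a broad peak around 40 and a narrow, slightly taller spike at 80.
c = [0.0] * 200
c[39], c[40], c[41] = 2.0, 5.0, 4.0
c[80] = 6.0

windowed_choice = select_lag(c)                           # 40
plain_choice = max(range(20, 148), key=lambda n: c[n])    # 80
```

In this toy case the windowed sum favors the broad peak at 40 over the isolated spike at 80, which a plain maximum of C(n) would pick instead.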

[0034] Although the resulting pitch lag is an averaged value, it has been found to be reliable and accurate. The averaging effect is due to the large analysis window size: for a lag of 147 samples, the window size should be at least twice the lag value. Undesirably, however, with such a large window, signals from some speakers, such as female speakers, who typically exhibit a small pitch lag, may contain 4 to 10 pitch periods. If the pitch lag changes within the window, the proposed pitch lag estimation produces only an averaged pitch lag. As a result, using such an averaged pitch lag in speech coding could cause severe degradation in speech estimation and regeneration.

[0035] Because pitch information in speech changes relatively quickly, most speech coding systems based on the CELP model estimate and transmit the pitch lag once per subframe. Thus, in CELP-type speech coding, in which one speech frame is divided into several speech subframes, typically 2 ms to 10 ms long (16 to 80 samples), the pitch information is updated in each subframe. Accordingly, a correct pitch lag value is needed only for the subframe. The pitch lag estimated according to the above scheme, however, is not accurate enough for precise speech coding because of the averaging effect.

[0036] Thus, in a particular embodiment of the invention, a refined search based on the initial pitch lag estimate is performed in the time domain to improve the estimation accuracy (step 618). A simple autocorrelation method is carried out around the averaged Lag value for the particular coding period, or subframe.

[0037]

[Equation 8]

[0038] In this equation, arg[·] determines the variable n that satisfies the inner optimization function, k denotes the first sample of the subframe, l represents the refinement window size, and m is the search range. To determine an accurate pitch lag value, the refinement window size should be at least one pitch period. The window should not be too large, however, to avoid the effects of averaging. For example, preferably l = Lag + 10 and m = 5. Thus, with the time-domain refinement of equation (6), a more precise pitch lag can be estimated and applied to the coding of the subframe.
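The time-domain refinement can be sketched as a small autocorrelation search around the frame-level estimate. Equation (6) is not reproduced in this text; the correlation used below, the sum of x(i)·x(i−n) over a window of l = Lag + 10 samples starting at k, is an assumption (in particular the direction of the shift), and the function name and toy signal are illustrative.

```python
def refine_lag(x, k, initial_lag, m=5, extra=10):
    """Search initial_lag - m .. initial_lag + m for the highest autocorrelation."""
    window_len = initial_lag + extra  # l = Lag + 10, as preferred in the text

    def correlation(n):
        return sum(x[i] * x[i - n] for i in range(k, k + window_len))

    return max(range(initial_lag - m, initial_lag + m + 1), key=correlation)


# Toy excitation: impulses every 43 samples; the frame-level estimate was 41.
speech = [1.0 if n % 43 == 0 else 0.0 for n in range(320)]
refined = refine_lag(speech, k=160, initial_lag=41)
```

Only 2m + 1 = 11 candidate lags are evaluated per subframe, which is what keeps the refinement cheap compared with a full-range time-domain search.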

[0039] In operation, although the fast Fourier transform (FFT) is sometimes more computationally efficient than the general DFT, a drawback of using an FFT is that the window size must be a power of 2. For example, as noted, the maximum pitch lag of 147 samples is not a power of 2. To include the maximum pitch lag, a window size of 512 samples is necessary. This, however, results in poor pitch lag estimation for female voices because of the averaging effect described above, and requires a large amount of computation. If a window size of 256 samples is used, the averaging effect is reduced and the complexity is lower. With such a window, however, pitch lags larger than 128 samples in the speech cannot be accommodated.

[0040] To overcome some of these problems, an alternative preferred embodiment of the invention uses a 256-point FFT to reduce the complexity and estimates the pitch lag using a modified signal. The signal is modified by a down-sampling process. Referring to FIGS. 7 and 8, N speech samples {x(n), n = 0, 1, ..., N−1} are gathered (step 702), where N is greater than twice the maximum pitch lag. The N speech samples are then down-sampled into 256 new analysis samples using linear interpolation (step 704), according to the following equation.

[0041]

[Equation 9]

[0042] In this equation, λ = N/256, and the value in square brackets, [i·λ], denotes the largest integer not greater than i·λ. A Hamming window or other window is then applied to the interpolated data in step 705.
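The down-sampling step can be sketched as ordinary linear interpolation between the two input samples that bracket each position i·λ. The exact form of the down-sampling equation is not reproduced in this text, so the interpolation formula below is an assumption consistent with the definitions of λ and [i·λ]; the function name and the right-edge guard are illustrative additions.

```python
import math


def downsample_linear(x, target_len=256):
    """Resample x to target_len points by linear interpolation between neighbours."""
    lam = len(x) / target_len  # lambda = N / 256
    y = []
    for i in range(target_len):
        pos = i * lam
        j = math.floor(pos)               # [i * lambda]
        j_next = min(j + 1, len(x) - 1)   # guard the right edge
        y.append(x[j] + (x[j_next] - x[j]) * (pos - j))
    return y


# A linear ramp is reproduced exactly by linear interpolation.
ramp = [float(n) for n in range(320)]
analysis = downsample_linear(ramp)
```

With N = 320 the ratio is λ = 1.25, so the 256 analysis samples span the same 40 ms as the original 320 samples.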

[0043] In step 706, the pitch lag estimation is performed over y(i) using a 256-point FFT to generate the amplitude Y(f). Steps 708 to 710 are then carried out as described with respect to FIG. 6. In addition, however, G(f) is filtered (step 709) to reduce the high-frequency components of G(f), which are not useful for pitch detection. Once the lag of y(i), i.e., Lag_y, has been found according to equation (5) (step 714), it is rescaled in step 716 to determine the pitch lag estimate.

[0044]

[Equation 10]
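The rescaling step presumably multiplies the lag found on the down-sampled signal by the down-sampling ratio λ. The rescaling equation is not reproduced in this text, so the formula and the rounding to an integer lag below are assumptions, and the function name is hypothetical.

```python
def rescale_lag(lag_y, n_samples, fft_len=256):
    """Map a lag measured on the down-sampled signal back to input samples."""
    lam = n_samples / fft_len  # lambda = N / 256, the down-sampling ratio
    return round(lag_y * lam)


lag = rescale_lag(100, 320)  # 100 analysis samples correspond to 125 input samples
```

Under this assumption and with N = 320, a lag of 147 input samples appears as roughly 118 analysis samples (147 / 1.25 = 117.6), which fits within the range a 256-point window can represent.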

[0045] In summary, the above procedure for finding the initial pitch estimate for a coding frame is as follows.

[0046] (1) Subdivide the standard 40 ms coding frame into pitch subframes 802 and 804, each approximately 20 ms long.

[0047] (2) Take N = 320 speech samples such that the pitch analysis window 806 is positioned at the center of the last subframe, and find the lag for that subframe using the proposed algorithm.

[0048] (3) Determine the initial pitch lag values for the pitch subframes. Time-domain refinement is then performed at step 718 over the original speech samples x(n). Thus, in embodiments of the invention, the pitch lag value can be estimated accurately while complexity is reduced and high accuracy is maintained. With the FFT embodiment of the invention, it is not difficult to handle pitch lag values greater than 120.

[0049] More specifically, time-domain refinement is performed over the original speech samples. For example, the 40 ms coding frame is first divided into eight 5 ms coding subframes 808, as shown in FIG. 9. The initial pitch lag estimates lag₁ and lag₂ are the lag estimates for the last coding subframe of each pitch subframe in the current coding frame. lag₀ is the refined lag estimate of the second pitch subframe in the previous coding frame. The relationship among lag₁, lag₂, and lag₀ is shown in FIG. 9.

[0050] The initial pitch lags lag₁ and lag₂ are first refined according to the following equation to improve their accuracy (step 718 of FIG. 8).

[0051]

[Equation 11]

[0052] Here, N_i is the index of the starting sample in the pitch subframe for pitch lag_i. Preferably, M is chosen to be 10 and L is lag_i + 10, where i denotes the index of the pitch subframe.
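A refinement consistent with the definitions of N_i, M, and L above, and with claim 6 (which states that the refinement comprises autocorrelation), can be written as a search for the lag that maximizes the autocorrelation of x(n) near the initial estimate. The form below is an assumption offered for illustration and is not the text of Equation 11:

```latex
\mathrm{lag}_i \;\leftarrow\; \arg\max_{k \,\in\, [\mathrm{lag}_i - M,\ \mathrm{lag}_i + M]}
\ \sum_{n = N_i}^{N_i + L - 1} x(n)\, x(n - k), \qquad i = 1, 2
```

Under this reading, M = 10 means that 21 candidate lags are examined around each initial estimate, and L = lag_i + 10 makes the summation span slightly more than one pitch period.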

[0053] Once refinement of the initial pitch lags is complete, the pitch lags of the coded subframes can be determined. The pitch lags of the coded subframes are estimated by linearly interpolating lag₁, lag₂, and lag₀. The accuracy of the pitch lag estimates for the coded subframes is improved by refining the interpolated pitch lag of each coded subframe according to the following procedure. If {lag_I(i), i = 0, 1, …, 7} denotes the interpolated pitch lags of the coded subframes based on the refined initial pitch estimates lag₁, lag₂, and lag₀, then lag_I(i) is determined by the following equation.

[0054]

[Equation 12]
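The interpolation step can be sketched as follows. This is a hedged illustration rather than the patent's Equation 12: it assumes that lag₀ is anchored at the last coded subframe of the preceding frame, lag₁ at coded subframe 3, and lag₂ at coded subframe 7, and that the lags in between are spaced evenly. The function name `interpolate_lags` and its parameters are hypothetical.

```python
def interpolate_lags(lag0, lag1, lag2, subframes_per_pitch=4):
    """Return the interpolated lags lag_I(i) for i = 0..7.

    Assumption: piecewise-linear interpolation, first from lag0 to lag1
    across the first pitch subframe, then from lag1 to lag2 across the
    second, so that the last coded subframe of each pitch subframe takes
    the refined estimate itself.
    """
    lags = []
    for start, end in ((lag0, lag1), (lag1, lag2)):
        for j in range(1, subframes_per_pitch + 1):
            weight = j / subframes_per_pitch
            lags.append(start + (end - start) * weight)
    return lags


# Example with made-up lag values (in samples).
print(interpolate_lags(40, 48, 44))
```

With this anchoring, lag_I(3) equals lag₁ and lag_I(7) equals lag₂, which matches the statement that lag₁ and lag₂ belong to the last coded subframe of each pitch subframe.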

[0055] Because the accuracy of the pitch lag estimates given by linear interpolation is not sufficient, further refinement may be required. Given the pitch lag estimates {lag_I(i), i = 0, 1, …, 7}, each lag_I(i) is further refined by the following equation (step 722).

[0056]

[Equation 13]

[0057] Here, N_i is the index of the starting sample in the coded subframe for pitch lag(i). In the example, M is chosen to be 3 and L equals 40.
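The refinement of step 722 can be sketched as a small autocorrelation search. This is an illustration under assumptions, not the patent's Equation 13: it assumes the refined lag is the integer k within ±M of the interpolated lag that maximizes the sum of x(n)·x(n−k) over the L samples of the coded subframe starting at N_i. The function name `refine_lag` and its parameter names are hypothetical.

```python
import math


def refine_lag(x, start, length, lag_guess, search):
    """Return the integer lag within +/-search of lag_guess that
    maximizes sum(x[n] * x[n - k]) for n in [start, start + length).

    x         : sequence of speech samples, including past samples
    start     : N_i, index of the first sample of the coded subframe
    length    : L, number of samples in the sum (40 in the example)
    lag_guess : interpolated lag lag_I(i), possibly fractional
    search    : M, half-width of the search range (3 in the example)
    """
    center = int(round(lag_guess))
    best_lag, best_score = center, -math.inf
    for k in range(max(1, center - search), center + search + 1):
        # Skip terms that would need samples before the start of x.
        score = sum(x[n] * x[n - k]
                    for n in range(start, start + length)
                    if n - k >= 0)
        if score > best_score:
            best_lag, best_score = k, score
    return best_lag


# Synthetic test signal: an impulse train with a period of 50 samples.
signal = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
print(refine_lag(signal, start=200, length=40, lag_guess=48.0, search=3))
```

The sketch uses an unnormalized correlation for brevity. On real speech, whose energy varies across the window, a normalized correlation may be a safer choice, but that is a design decision outside what the text specifies.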

[0058] Furthermore, linear interpolation of the pitch lag is important in unvoiced segments of speech. For unvoiced speech, the pitch lag found by any analysis method tends to be randomly distributed. However, because of the relatively large pitch subframe size, if the lag for each subframe is too close to the initially determined subframe lag (found in procedure (2) above), an undesirable artificial periodicity that was not originally in the speech is introduced. Linear interpolation provides a simple solution to the problems associated with poor-quality unvoiced speech. Because the subframe lags tend to be random, the lags for the individual subframes, once interpolated, are also distributed very randomly, which ensures speech quality.

[Brief description of drawings]

FIG. 1 is a block diagram of a CELP speech model.

FIG. 2 is a block diagram of an MBE speech model.

FIG. 3 is a block diagram of an MBE encoder.

FIG. 4 is a block diagram of pitch lag estimation in an MBE vocoder.

FIG. 5 is a block diagram of a cepstrum-based pitch lag detection scheme.

FIG. 6 is an operational flow diagram of pitch lag estimation according to an embodiment of the present invention.

FIG. 7 is a flow diagram of pitch lag estimation according to another embodiment of the present invention.

FIG. 8 is a flow diagram of pitch lag estimation according to another embodiment of the present invention.

FIG. 9 is a diagram of speech coding according to the embodiment of FIG. 6.

[Explanation of symbols]

802 pitch subframe; 804 pitch subframe; 806 pitch analysis window; 808 subframe

Continuation of front page: (72) Inventor: Tom Hong Li, 501 Knollwood Drive, Middletown, New Jersey 07748, USA

Claims


1. A system for estimating pitch lag for speech quantization and compression, wherein the speech is defined by a plurality of speech samples and the estimate of a current speech sample is determined according to a linear combination of past samples, the system comprising: means for applying a first discrete Fourier transform (DFT) to the speech samples, the first DFT having an associated amplitude; means for squaring the amplitude; means for applying a second DFT to the squared amplitude; means for determining an initial pitch lag value according to the time-domain transformed speech samples; and means for encoding the speech samples according to a refined pitch lag value.

2. The system according to claim 1, wherein the initial pitch lag value has an associated prediction error, the system further comprising means for refining the initial pitch lag value such that the associated prediction error is minimized.

3. The system of claim 1, further comprising: means for grouping the plurality of speech samples into a current coded frame; means for dividing the coded frame into a plurality of pitch subframes; means for subdividing the pitch subframes into a plurality of coded subframes; means for estimating initial pitch lag estimates lag₁ and lag₂, each representing a lag estimate for the last coded subframe of a respective pitch subframe in the current coded frame; means for refining the pitch lag estimate lag₀ of the second pitch subframe in the preceding coded frame; means for linearly interpolating lag₁, lag₂, and lag₀ to estimate pitch lag values for the coded subframes; and means for further refining the interpolated pitch lag of each coded subframe.

4. The system of claim 1, further comprising means for downsampling the speech samples to a downsampled value for an approximate representation with a smaller number of samples.

5. The system according to claim 4, wherein the initial pitch lag value is corrected according to the equation Lag_scaled = number of speech samples / downsampled value.

6. The system of claim 1, wherein the means for refining the initial pitch lag value comprises autocorrelation.

7. The system according to claim 1, further comprising: speech input means for receiving the speech samples; a computer for processing the refined pitch lag value and reproducing the input speech as encoded speech; and speech output means for outputting the encoded speech.

8. A speech coding apparatus for reproducing and coding input speech, the speech coding apparatus using linear predictive coding (LPC) parameters and a new codebook representing a plurality of vectors that are referenced to excite speech reproduction and generate speech, the speech coding apparatus comprising: speech input means for receiving the input speech; and a computer for processing the input speech, the computer including: means for delimiting a current coded frame in the input speech; means for dividing the coded frame into a plurality of pitch subframes; means for defining a pitch analysis window having N speech samples, the pitch analysis window extending across the pitch subframes; means for estimating an initial pitch lag value for each pitch subframe; means for dividing each pitch subframe into a plurality of coded subframes, the initial pitch lag estimate for each pitch subframe representing the lag estimate for the last coded subframe of that pitch subframe in the current coded frame; means for linearly interpolating the estimated pitch lag values between the pitch subframes to determine a pitch lag estimate for each coded subframe; and means for refining the linearly interpolated lag value of each coded subframe; the apparatus further comprising speech output means for outputting speech reproduced according to the refined pitch lag values.

9. The apparatus according to claim 8, wherein the computer further includes: means for downsampling the N speech samples to a downsampled value X for representation with a smaller number of samples; and means for correcting the pitch lag value such that the corrected lag value Lag_scaled = N / X.

10. The apparatus of claim 8, further comprising sampling means for sampling the input speech at a sampling rate R, the N speech samples being determined according to the equation N = R * X.

11. The apparatus according to claim 10, wherein X = 25 ms, R = 8000 Hz, and N = 320 samples.

12. The apparatus of claim 8, wherein each coded frame has a length of approximately 40 ms.

13. A method for estimating pitch lag for speech quantization and compression, wherein the speech is defined by a plurality of speech samples and the estimate of a current speech sample is determined in the time domain according to a linear combination of past samples, the method comprising: applying a first discrete Fourier transform (DFT) to the speech samples, the first DFT having an associated amplitude; squaring the amplitude of the first DFT; applying a second DFT to the squared amplitude of the first DFT; determining an initial pitch lag value according to the time-domain transformed speech samples, the initial pitch lag value having an associated prediction error; refining the initial pitch lag value using autocorrelation such that the associated prediction error is minimized; and encoding the speech samples according to the refined pitch lag value.

14. The method of claim 13, further comprising: grouping the plurality of speech samples into a current coded frame; dividing the coded frame into a plurality of pitch subframes; subdividing the pitch subframes into a plurality of coded subframes; estimating initial pitch lag estimates lag₁ and lag₂, each representing a lag estimate for the last coded subframe of a respective pitch subframe in the current coded frame; refining the pitch lag estimate lag₀ of the second pitch subframe in the preceding coded frame; linearly interpolating lag₁, lag₂, and lag₀ to estimate pitch lag values for the coded subframes; and further refining the interpolated pitch lag of each coded subframe.

15. The method of claim 13, further comprising the step of downsampling the speech samples to a downsampled value for an approximate representation with a smaller number of samples.

16. The method of claim 15, further comprising correcting the initial pitch lag value according to the equation Lag_scaled = number of speech samples / downsampled value.

17. The method of claim 13, further comprising: receiving the speech samples; processing the refined pitch lag value to reproduce the input speech as encoded speech; and outputting the encoded speech.

18. A speech coding method for reproducing and coding input speech, the method using linear predictive coding (LPC) parameters and a new codebook representing pseudo-random signals that form a plurality of vectors referenced to excite speech reproduction and generate speech, the speech coding method comprising: receiving the input speech; and processing the input speech, the processing step comprising: determining a speech coded frame in the input speech; subdividing the coded frame into a plurality of pitch subframes; defining a pitch analysis window having N speech samples, the pitch analysis window extending across the pitch subframes; coarsely estimating an initial pitch lag value for each pitch subframe; dividing each pitch subframe into a plurality of coded subframes, the initial pitch lag estimate for each pitch subframe representing the lag estimate for the last coded subframe of that pitch subframe; interpolating the estimated pitch lag values between the pitch subframes to determine a pitch lag estimate for each coded subframe; and refining the linearly interpolated lag values; the method further comprising outputting speech reproduced according to the refined pitch lag values.

19. The method of claim 18, wherein the processing step further comprises: downsampling the N speech samples to a downsampled value X for representation with a smaller number of samples; and correcting the pitch lag value such that the corrected lag value Lag_scaled = N / X.

20. The method according to claim 18, further comprising sampling the input speech at a sampling rate R, wherein the N speech samples are determined according to the equation N = R * X.