JP2940835B2 - Pitch frequency difference feature extraction method - Google Patents

Pitch frequency difference feature extraction method

Info

Publication number
JP2940835B2
JP2940835B2 (application JP5247991A)
Authority
JP
Japan
Prior art keywords
pitch frequency
pitch
autocorrelation function
time
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP5247991A
Other languages
Japanese (ja)
Other versions
JPH04288600A (en)
Inventor
Satoshi Takahashi
Shoichi Matsunaga
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP5247991A priority Critical patent/JP2940835B2/en
Publication of JPH04288600A publication Critical patent/JPH04288600A/en
Application granted granted Critical
Publication of JP2940835B2 publication Critical patent/JP2940835B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Abstract

PURPOSE: To preserve a correct pitch frequency difference feature even when multiple peaks appear in the autocorrelation function of the prediction residual used to detect the pitch frequency. CONSTITUTION: An input voice is converted into a digital signal, and a residual calculation part 3 computes the autocorrelation function of the prediction residual frame by frame. A residual correlation calculation part 4 finds the cross-correlation function between the prediction-residual autocorrelation function obtained in each frame and that of the frame seven frames ahead, and obtains the pitch frequency difference feature from the peak values of the cross-correlation function.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a pitch frequency difference feature extraction method that acquires information about accent and intonation, for use in word speech recognition devices, continuous speech recognition devices, speaker recognition devices, and the like that require difference information of the pitch frequency, one of the elements of the prosodic information of speech.

[0002]

2. Description of the Related Art: Conventionally, pitch frequency difference information has been obtained as follows. The spectral fine structure obtained by normalizing the short-time spectrum f(r)h of the input signal by its spectral envelope f(r) has a harmonic structure. The correlation function obtained by Fourier expansion of this normalized spectrum f(r)h/f(r) is the correlation function of the prediction residual and represents the properties of the sound source (see, for example, Sadaoki Furui, "Digital Speech Processing," Tokai University Press (1985)). Accordingly, the autocorrelation function of the prediction residual in the frame at time t is first computed, the peak position of the correlation values is detected, and the pitch frequency is determined. FIG. 3A shows an example of the prediction-residual autocorrelation function in a steady voiced segment; the horizontal axis is the lag and the vertical axis is the correlation value. The sharp peak in the figure represents the pitch frequency component: the lag at the peak position is the pitch period, from which the pitch frequency is determined. The pitch frequency in the frame at time s is determined in the same way. Finally, the difference information is obtained from the difference between the pitch frequency at time t and the pitch frequency at time s.
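The conventional per-frame scheme just described can be sketched in a few lines of Python: pick the largest autocorrelation peak inside the plausible lag range, convert lag to pitch, and difference two frames. The toy autocorrelation arrays below are hypothetical stand-ins for real residual data, not values from the patent.

```python
def pitch_from_autocorr(r, fs=12000, f_min=50, f_max=300):
    """Pitch implied by the largest peak of r between the lags of f_max and f_min."""
    lo, hi = fs // f_max, fs // f_min            # lag search range, 40..240 samples
    lags = range(lo, min(hi, len(r) - 1) + 1)
    return fs / max(lags, key=lambda k: r[k])    # pitch = fs / peak lag

r_t = [0.0] * 241; r_t[100] = 1.0                # frame at time t: peak at lag 100
r_s = [0.0] * 241; r_s[95] = 1.0                 # frame at time s: peak at lag 95
delta = pitch_from_autocorr(r_s) - pitch_from_autocorr(r_t)
print(round(delta, 1))   # 6.3  (12000/95 - 12000/100 Hz)
```

The fragility the next paragraph describes lies exactly in the `max` step: one wrong peak and the true pitch is gone.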

[0003] However, in voiced-unvoiced transitions and at sentence endings, the pitch structure of the speech is disturbed: the peak indicating the true pitch frequency component in the autocorrelation function does not stand out, and a peak indicating a double-pitch or half-pitch frequency component may exceed the true peak. FIG. 3B shows an example in which the peak of the double-pitch component exceeds the peak of the true pitch component and an incorrect pitch frequency is computed. Once the wrong peak has been selected and a pitch frequency determined, the true pitch information is completely lost. To obtain correct pitch frequency difference information, therefore, whenever the obtained pitch frequency was discontinuous with the pitch frequency values of the preceding and following frames, the second and third peaks of the prediction-residual autocorrelation function had to be detected and examined to verify whether they contained a pitch frequency continuous with that of the surrounding frames. It is difficult, however, to write general rules for this verification, and there was the problem that manual correction by inspection was needed in addition to automatic correction.

[0004]

MEANS FOR SOLVING THE PROBLEM: According to the present invention, the step of uniquely determining a pitch frequency in each frame is avoided. Instead, the cross-correlation function between the autocorrelation function of the speech prediction residual at a time t (frame) and the autocorrelation function of the prediction residual at a time s several frames away is computed and taken as the pitch frequency difference feature vector. Computing this cross-correlation means sliding the two prediction-residual autocorrelation sequences past each other one point (one basic lag step of the autocorrelation function) at a time and using each correlation value as an element of the feature vector; an element therefore takes a large value whenever the peak positions of the two sequences coincide, and the difference information is obtained from those peak values. In this way the peak-position information of both residual autocorrelation functions is faithfully reflected in the pitch frequency difference feature vector.

[0005]

EMBODIMENT: FIG. 1 shows an embodiment that computes the pitch frequency difference feature according to the present invention. Speech entering input terminal 1 is converted into a digital signal by A/D converter 2. Residual calculation unit 3 then computes the autocorrelation function of the prediction residual for each frame (for example, every 10 ms). This prediction-residual autocorrelation function may be computed by Fourier expansion of the normalized spectrum f(r)h/f(r), obtained by normalizing the short-time spectrum f(r)h by the spectral envelope f(r), or it may be obtained by first computing the prediction residual waveform and then computing its correlation function directly.
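The "direct route" mentioned above, computing the residual waveform's correlation function directly, can be sketched as follows. The residual frame e(t) is assumed to come from some LPC front end (not shown); here an idealized impulse train stands in for it.

```python
def autocorr(e, max_lag):
    """Normalized autocorrelation r(k), k = 0..max_lag, of a residual frame."""
    r0 = sum(v * v for v in e) or 1.0              # frame energy; guard against all-zero
    return [sum(e[t] * e[t + k] for t in range(len(e) - k)) / r0
            for k in range(max_lag + 1)]

# Residual frame idealized as an impulse train with a 100-sample pitch period:
e = [0.0] * 300
for i in (0, 100, 200):
    e[i] = 1.0
r = autocorr(e, 240)
print(r[0], round(r[100], 3))   # unit value at lag 0, clear pitch peak at lag 100
```

A real residual is noisier, which is precisely why the peak picking of the conventional method can fail.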

[0006] Next, residual correlation calculation unit 4 computes the cross-correlation function between the prediction-residual autocorrelation function obtained in each frame and that of a frame several frames ahead (for example, 7 frames ahead). Consider, for example, male speech sampled at 12 kHz. One basic lag step of the autocorrelation function is then 1/12000 second. Taking the range of male pitch frequencies to be 50 Hz to 300 Hz, the peak in the prediction-residual autocorrelation function a(n) representing a 50 Hz pitch component appears at the 240th point (12000/50 = 240), and the peak representing a 300 Hz component at the 40th point (12000/300 = 40). Accordingly, the 201 values from a(40) to a(240) are cut out and relabeled x(n) (n = 0, 1, ..., 200). In the same way, y(n) (n = 0, 1, ..., 200) is obtained from the frame 7 frames ahead, and the cross-correlation function Rxy(τ) between x(n) and y(n), i.e. the feature vector, is computed.
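The lag bookkeeping above can be sketched as follows; `a` is a dummy stand-in for a real residual autocorrelation, with its pitch peak placed at lag 100 for illustration.

```python
fs = 12000
f_min, f_max = 50, 300
lo = fs // f_max                 # 12000 // 300 = 40, lag of the 300 Hz peak
hi = fs // f_min                 # 12000 // 50  = 240, lag of the 50 Hz peak

a = [0.0] * (hi + 1)             # dummy residual autocorrelation, lags 0..240
a[100] = 1.0                     # pretend the pitch peak sits at lag 100
x = a[lo:hi + 1]                 # x[n] == a(n + 40), n = 0..200 (201 points)
print(len(x), x.index(max(x)) + lo)   # 201 100
```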

[0007]

Rxy(τ) = (1/T) · Σt x(t)y(t+τ),

where Σt denotes summation from t = 0 to t = T and T = 200. The shift τ is restricted to −40 ≤ τ ≤ 40 in view of how far the peak positions can move between frames 7 apart, that is, how much the pitch frequency can change over 7 frames; in the low-frequency region this covers a change from 50 Hz to 60 Hz. The 81-dimensional feature vector Rxy(τ) obtained in each frame is taken as the pitch difference feature of that frame and is output from result output unit 5. Naturally, τ may be changed according to the nature of the speech data concerned. In a system for which 81 dimensions are too many to handle conveniently, the dimensionality may be reduced, for example by examining the feature vector two elements at a time and merging each pair while keeping the larger value, yielding a 41-dimensional feature vector.
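Under a zero-padding boundary convention (assumed here; the patent does not spell one out), the formula above can be sketched as:

```python
def rxy(x, y, tau_max=40):
    """81-element cross-correlation Rxy(tau), tau = -tau_max..tau_max,
    of two 201-point autocorrelation slices; out-of-range samples are zero."""
    T = len(x)
    return [sum(x[t] * y[t + tau] for t in range(T) if 0 <= t + tau < T) / T
            for tau in range(-tau_max, tau_max + 1)]

x = [0.0] * 201
y = [0.0] * 201
x[60] = 1.0          # pitch peak at lag 100 (index 60 = lag 100 - 40)
y[55] = 1.0          # pitch peak at lag 95 in the frame 7 frames ahead
R = rxy(x, y)        # R[i] corresponds to Rxy(i - 40)
print(len(R), R.index(max(R)) - 40)   # 81 -5
```

The feature vector peaks at τ = −5, the shift that brings the two pitch peaks into alignment, without ever committing to a single pitch value per frame.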

[0008]

EFFECTS OF THE INVENTION: As described above, the present invention captures the pitch information appearing in the prediction-residual autocorrelation function faithfully in the feature, so a feature with little loss of information is obtained. Verification by inspection after automatic pitch frequency correction also becomes unnecessary. The method is particularly effective at sentence endings and at the ends of long words and phrases, where it is difficult to determine the correct pitch frequency uniquely.

[0009] Suppose, for example, that at time t the prediction-residual autocorrelation function x(n) shown in FIG. 3A is obtained, with its peak at the 100th point, and that at a time s several frames later the autocorrelation function y(n) shown in FIG. 3B is obtained, with two peaks, at the 95th and 110th points. Both are shown together in FIG. 2A. If the peak at the 110th point has become larger even though the peak representing the true pitch component is the one at the 95th point (i.e. a pitch frequency of 126 Hz), then determining a pitch frequency at each time in the conventional way gives a pitch frequency of 109 Hz at time s, and at that stage the correct pitch information is completely lost. With the feature of this invention, however, the cross-correlation function Rxy of the two autocorrelation functions is computed. In this example Rxy takes large values both at Rxy(−5), where the 100th-point peak at time t overlaps the 95th-point peak at time s as shown in FIG. 2B, and at Rxy(10), where the 100th-point peak at time t overlaps the 110th-point peak at time s as shown in FIG. 2C; the result is a pitch difference feature that loses the information of neither peak position.
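The example above can be replayed numerically: x(n) has one peak (lag 100), y(n) has two (lags 95 and 110), and the cross-correlation keeps both candidates as peaks at τ = −5 and τ = +10. The peak heights here are illustrative, not taken from the patent.

```python
def slice_with_peaks(peaks, size=201, offset=40):
    """201-point lag slice (index 0 = lag 40) with the given (lag, height) peaks."""
    v = [0.0] * size
    for lag, h in peaks:
        v[lag - offset] = h
    return v

x = slice_with_peaks([(100, 1.0)])
y = slice_with_peaks([(95, 1.0), (110, 0.8)])   # two candidate pitch peaks

T = len(x)
R = {tau: sum(x[t] * y[t + tau] for t in range(T) if 0 <= t + tau < T) / T
     for tau in range(-40, 41)}

top2 = sorted(sorted(R, key=R.get, reverse=True)[:2])
print(top2)   # [-5, 10] -- both peak offsets survive in the feature vector
```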

[0010] Pitch difference information, which conveys accent and intonation, improves recognition performance when used in combination with phonemic information in word speech recognition and continuous speech recognition. In speaker recognition, too, pitch difference information is effective as a feature expressing individuality. As noted above, the pitch difference information obtained contains the information of both peak positions in y(n); when the subsequent processing is statistical, the correct pitch difference information is thereby preserved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing an example of a pitch frequency difference feature extraction device to which the present invention is applied.

FIG. 2 is a diagram showing an example of computing the cross-correlation function of the two prediction-residual autocorrelation functions shown in FIGS. 3A and 3B.

FIG. 3A is a diagram showing an example of the prediction-residual autocorrelation function in a steady voiced segment, and FIG. 3B shows the prediction-residual autocorrelation function in a frame at a voiced-unvoiced transition, an example in which a peak indicating an incorrect pitch frequency (the half-pitch frequency) exceeds the peak indicating the true pitch frequency.

Continuation of front page: (58) Fields searched (Int.Cl.6, DB name): G10L 3/00 - 9/20; JICST file (JOIS)

Claims (1)

(57) [Claims]

Claim 1. A pitch frequency difference feature extraction method for obtaining difference information between the pitch frequency of speech at a time t and the pitch frequency of said speech at a time s later than time t, wherein a cross-correlation function between the residual autocorrelation function of said speech at time t and the residual autocorrelation function of said speech at time s is computed, and said difference information is obtained from a peak value of said residual cross-correlation function.
JP5247991A 1991-03-18 1991-03-18 Pitch frequency difference feature extraction method Expired - Fee Related JP2940835B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP5247991A JP2940835B2 (en) 1991-03-18 1991-03-18 Pitch frequency difference feature extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP5247991A JP2940835B2 (en) 1991-03-18 1991-03-18 Pitch frequency difference feature extraction method

Publications (2)

Publication Number Publication Date
JPH04288600A JPH04288600A (en) 1992-10-13
JP2940835B2 true JP2940835B2 (en) 1999-08-25

Family

ID=12915860

Family Applications (1)

Application Number Title Priority Date Filing Date
JP5247991A Expired - Fee Related JP2940835B2 (en) 1991-03-18 1991-03-18 Pitch frequency difference feature extraction method

Country Status (1)

Country Link
JP (1) JP2940835B2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009047831A (en) * 2007-08-17 2009-03-05 Toshiba Corp Feature quantity extracting device, program and feature quantity extraction method
US8073686B2 (en) 2008-02-29 2011-12-06 Kabushiki Kaisha Toshiba Apparatus, method and computer program product for feature extraction
JP2010078990A (en) * 2008-09-26 2010-04-08 Toshiba Corp Apparatus, method and program for extracting fundamental frequency variation amount
JP4585590B2 (en) * 2008-09-26 2010-11-24 Kabushiki Kaisha Toshiba Basic frequency variation extraction device, method and program
US8554546B2 (en) 2008-09-26 2013-10-08 Kabushiki Kaisha Toshiba Apparatus and method for calculating a fundamental frequency change

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100393899B1 (en) * 2001-07-27 2003-08-09 어뮤즈텍(주) 2-phase pitch detection method and apparatus
JP4839970B2 (en) * 2006-06-09 2011-12-21 ソニー株式会社 Prosody identification apparatus and method, and speech recognition apparatus and method
JP4264841B2 (en) 2006-12-01 2009-05-20 ソニー株式会社 Speech recognition apparatus, speech recognition method, and program
JP4882899B2 (en) 2007-07-25 2012-02-22 ソニー株式会社 Speech analysis apparatus, speech analysis method, and computer program


Also Published As

Publication number Publication date
JPH04288600A (en) 1992-10-13

Similar Documents

Publication Publication Date Title
Vergin et al. Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition
EP0764937A2 (en) Method for speech detection in a high-noise environment
US20150302845A1 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
JPS63285598A (en) Phoneme connection type parameter rule synthesization system
US4937871A (en) Speech recognition device
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
US20030187651A1 (en) Voice synthesis system combining recorded voice with synthesized voice
JP2940835B2 (en) Pitch frequency difference feature extraction method
Matsumoto et al. Evaluation of Mel-LPC cepstrum in a large vocabulary continuous speech recognition
US5806031A (en) Method and recognizer for recognizing tonal acoustic sound signals
Saratxaga et al. Using harmonic phase information to improve ASR rate.
Matsumoto et al. An efficient Mel-LPC analysis method for speech recognition
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
EP0192898A1 (en) Speech recognition
Zhao et al. A processing method for pitch smoothing based on autocorrelation and cepstral F0 detection approaches
Sarma et al. Consonant-vowel unit recognition using dominant aperiodic and transition region detection
Rahman et al. Formant frequency estimation of high-pitched speech by homomorphic prediction
Chowdhury et al. Formant estimation from speech signal using the magnitude spectrum modified with group delay spectrum
JP3358139B2 (en) Voice pitch mark setting method
Thompson et al. Within class optimization of cepstra for speaker recognition.
KR19980037190A (en) Pitch detection method by frame in voiced sound section
JP3030869B2 (en) Sound source data generation method for speech synthesizer
Gu et al. A Voice Conversion Method Combining Segmental GMM Mapping with Target Frame Selection.
Lea et al. Algorithms for acoustic prosodic analysis
JPH0225199B2 (en)

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees