JP2569470B2

JP2569470B2 - Formant extractor

Info

Publication number: JP2569470B2
Application number: JP60222144A
Authority: JP
Inventors: 哲田口
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1985-10-04
Filing date: 1985-10-04
Publication date: 1997-01-08
Anticipated expiration: 2012-01-08
Also published as: JPS6280700A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a formant extractor.

[Prior art]

Speech formants are used as highly effective parameters in fields such as speech analysis, synthesis, and recognition.

Such formants are usually determined by a known method that uses pole frequencies: a higher-order equation whose constants are the LPC (Linear Prediction Coding) coefficients of a preset order n is solved, and the formant frequencies are determined from the resulting n/2 complex-conjugate solutions, that is, the pole frequencies.
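The conventional root-solving approach described above can be sketched as follows. This is a minimal illustration rather than the method of any particular prior-art system: the function name `pole_frequencies`, the sign convention A(z) = 1 + a1 z^-1 + ... + an z^-n, and the test values are all assumptions introduced here.

```python
import numpy as np

def pole_frequencies(lpc, fs):
    """Return pole frequencies in Hz for the all-pole model 1/A(z).

    Assumed convention: A(z) = 1 + a1 z^-1 + ... + an z^-n,
    with `lpc` holding [a1, ..., an] and `fs` the sampling frequency in Hz.
    """
    # Multiplying A(z) by z^n gives z^n + a1 z^(n-1) + ... + an.
    roots = np.roots(np.concatenate(([1.0], lpc)))
    # Keep one root from each complex-conjugate pair (positive imaginary part).
    upper = roots[roots.imag > 1e-9]
    # The angle of a root on the z-plane maps linearly to frequency.
    freqs = np.angle(upper) * fs / (2.0 * np.pi)
    return np.sort(freqs)

# Hypothetical check: build a 6th-order polynomial from three known resonances.
fs = 8000.0
true_freqs = np.array([500.0, 1500.0, 2500.0])
poles = 0.95 * np.exp(2j * np.pi * true_freqs / fs)
a = np.poly(np.concatenate((poles, poles.conj()))).real  # [1, a1, ..., a6]
est = pole_frequencies(a[1:], fs)
```

With n = 6 the equation has n/2 = 3 conjugate pairs, one per resonance, which matches the n/2 solutions mentioned in the text.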

[Problems to be solved by the invention]

However, the conventional formant extraction means described above has the drawback that the formant frequencies extracted for each LPC analysis frame tend to lose continuity from frame to frame. Fundamentally, this is because frame-to-frame variation in the speech, and variation in the cut-out phase when the speech is segmented by a window function during quantization, cannot be avoided in the formant extraction process and show up as formant extraction errors.

An object of the present invention is to eliminate the above drawback and to provide a formant extractor that greatly reduces formant extraction errors by holding formant standard patterns and determining formants through pattern matching between those patterns and the analysis data.

[Means for solving the problem]

The formant extractor of the present invention comprises: formant trajectory obtaining means for obtaining a formant trajectory through pattern matching, using time normalization, between a standard pattern and analysis data of the input speech signal, where the vector elements of the standard pattern are two items of angular relationship information between the formant frequencies, obtained by expressing in polar coordinates the first to third formant frequencies obtained by analyzing an input speech signal, and the analysis data concerns the same angular relationship information; and conversion means for converting the ratio of the first to third formants into absolute values of the formant frequencies.

[Example]

Next, the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing one embodiment of the present invention. It comprises an A/D converter 1, an autocorrelation coefficient calculator 2, a formant determiner 3, a polar coordinate converter 4, a pattern collator (1) 5, a standard pattern file 6, a pattern collator (2) 7, and an orthogonal coordinate converter 8.

The input speech signal supplied via input line 101 is passed through an LPF (low-pass filter) in A/D converter 1 to cut off unnecessary high-frequency components, sampled at a predetermined sampling frequency, and quantized with a predetermined number of bits to produce a quantized speech signal. In this embodiment, the sampling frequency is 8 kHz and the quantization uses 12 bits.

The quantized speech signal is temporarily stored in internal memory one window length at a time, for example 30 ms, that is, 240 samples. It is multiplied by the weights of a window function such as a Hamming or rectangular function at a predetermined repetition period of 20 ms. Each windowed segment forms an analysis frame, and the quantized speech signal of each analysis frame is supplied to autocorrelation coefficient calculator 2.
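The framing step above can be sketched as follows, using the figures given in the text (8 kHz sampling, a 30 ms window of 240 samples, a 20 ms frame period). The function name `frame_signal` and its parameters are hypothetical, and dropping a trailing partial window is an assumption of this sketch.

```python
import numpy as np

def frame_signal(x, fs=8000, win_ms=30, hop_ms=20, window="hamming"):
    """Cut `x` into overlapping, windowed analysis frames.

    Returns an array of shape (n_frames, win_len). Samples that do not
    fill a complete window at the end are dropped (an assumption).
    """
    win_len = fs * win_ms // 1000   # 240 samples at 8 kHz and 30 ms
    hop = fs * hop_ms // 1000       # 160 samples at 8 kHz and 20 ms
    x = np.asarray(x, dtype=float)
    if len(x) < win_len:
        return np.empty((0, win_len))
    # Hamming or rectangular weighting, as named in the text.
    w = np.hamming(win_len) if window == "hamming" else np.ones(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop:k * hop + win_len] for k in range(n_frames)])
    return frames * w

# One second of a constant signal yields 49 frames of 240 samples each.
frames = frame_signal(np.ones(8000))
```

Because the window (30 ms) is longer than the frame period (20 ms), consecutive frames overlap by 80 samples.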

For the quantized speech signal input in each analysis frame, autocorrelation coefficient calculator 2 extracts the autocorrelation coefficients over the required range of time lags up to a predetermined order, the 12th order in this embodiment, and supplies them to formant determiner 3.
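A minimal sketch of the autocorrelation step, assuming the coefficients are the unnormalized short-time autocorrelation values at lags 0 through 12. The text does not say whether the coefficients are normalized by the lag-0 value, so that choice and the function name `autocorr` are assumptions.

```python
import numpy as np

def autocorr(frame, order=12):
    """Return short-time autocorrelation values r[0], ..., r[order] of one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # r[k] is the sum over t of frame[t] * frame[t + k].
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

# Small hand-checkable case: r = [30, 20, 11, 4] for the frame [1, 2, 3, 4].
r = autocorr([1, 2, 3, 4], order=3)
```

Dividing the result by `r[0]` would give normalized coefficients if that form is preferred.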

Formant determiner 3 determines the first to third formant frequencies from the 12th-order autocorrelation coefficients of each input analysis frame, using a known formant extraction method such as AbS (Analysis by Synthesis) in the autocorrelation domain. Formant frequencies are determined only up to the third formant for two reasons: it is known that the first to third formant frequencies are largely sufficient to represent voiced speech for purposes such as speech recognition and synthesis, and the fourth and higher formants, whose appearance is highly variable, are thereby excluded. It is also well known that these formant frequencies nearly coincide with the pole frequencies, which are the resonance points of the vocal tract.

It is known that, although the absolute values of the three formant frequencies F1, F2, and F3 determined by formant determiner 3 differ between individuals, the distribution of their ratio F1 : F2 : F3 is nearly constant and independent of the speaker.

Polar coordinate converter 4 receives these first to third formant frequencies and converts them to polar coordinates.

FIG. 2 is a three-dimensional polar coordinate representation of the first to third formant frequencies.

The three coordinate axes, orthogonal at the origin 0, represent the formant frequencies F1, F2, and F3. The two angles that determine the spatial direction of the composite vector V(F1, F2, F3) of these formant frequencies, denoted here θ and φ, are the inter-formant relationship angles. The two angles θ and φ are identical whenever the ratio F1 : F2 : F3 is the same; viewed another way, they are an angular representation of the frequency spectrum of the speech signal in each analysis frame. For formant frequencies represented in this angular form, only the magnitude l of the composite vector differs from speaker to speaker, while the relationship angles θ and φ are nearly constant regardless of the speaker. Using them for pattern matching in place of feature parameters such as LPC coefficients therefore allows matching with speaker dependence removed.
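The polar representation can be sketched as below. The text does not reproduce FIG. 2, so the angle convention is an assumption: this sketch uses ordinary spherical coordinates, with one angle measured from the F3 axis and the other measured in the F1-F2 plane. The function name `formants_to_polar` and the formant values are hypothetical.

```python
import math

def formants_to_polar(f1, f2, f3):
    """Convert three formant frequencies to (l, theta, phi).

    Assumed convention (not taken from FIG. 2):
    l is the length of the vector (f1, f2, f3),
    theta is the angle from the F3 axis,
    phi is the angle in the F1-F2 plane measured from the F1 axis.
    """
    l = math.sqrt(f1 * f1 + f2 * f2 + f3 * f3)
    theta = math.acos(f3 / l)
    phi = math.atan2(f2, f1)
    return l, theta, phi

# Two hypothetical speakers with the same ratio 1 : 3 : 5 but different scale.
a = formants_to_polar(500.0, 1500.0, 2500.0)
b = formants_to_polar(600.0, 1800.0, 3000.0)
```

Under this convention the two speakers share the same pair of angles and differ only in the magnitude l, which is the property the passage relies on.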

The two relationship angles θ and φ, given by the composite vector of the polar-coordinate formant frequencies F1, F2, and F3, are stored in internal memory per conversational term and then read out and supplied to pattern collator (1) 5 and pattern collator (2) 7. The composite vector magnitude l is supplied to orthogonal coordinate converter 8.

Pattern collator (1) 5 performs pattern matching on the supplied inter-formant relationship angles θ and φ, using standard patterns provided from standard pattern file 6 via output line 601.

Standard pattern file 6 stores standard patterns of θ and φ in analysis-frame units. These are obtained from speech material by a specific speaker, which in this embodiment is the limited set of conversational terms used for communication with a formant vocoder in narrowband voice communication, analyzed offline by the formant extractor of this embodiment or by a separately prepared computer system. The file supplies these patterns to pattern collator (1) 5.

Pattern collator (1) 5 matches the standard patterns against the inter-formant relationship angles θ and φ for each analysis frame, and selects as the matched standard pattern the one whose θ and φ are closest in distance to the analysis data of the input speech signal. It sends the label number of that pattern to standard pattern file 6 via output line 501, causing standard pattern file 6 to output the θ and φ data of that standard pattern to pattern collator (2) 7. The matching of the standard and analysis patterns through θ and φ is carried out by measuring the city-block distance or the Euclidean distance between the θ and φ of the two patterns. In this embodiment, the standard pattern that minimizes the city-block distance dij given by equation (1) is selected.

In equation (1), K denotes θ and φ of the input pattern and S those of the standard pattern; i = 1, 2, ..., n, where n is the total number of standard patterns; and j = 1, 2, ..., m, where m is the total number of input patterns.
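Equation (1) itself is not reproduced in this text. Assuming it has the usual city-block form over the two angles, dij = |θ(S,i) - θ(K,j)| + |φ(S,i) - φ(K,j)|, the selection of the nearest standard pattern for one input frame can be sketched as follows. The function name `nearest_standard` and the angle values are hypothetical.

```python
import numpy as np

def nearest_standard(standard, frame):
    """Pick the standard pattern closest to one input frame.

    `standard` has shape (n, 2): one (theta, phi) pair per standard pattern.
    `frame` is a single (theta, phi) pair from the input pattern.
    Uses the assumed city-block form of the distance.
    Returns (label, distance), where label is the row index in `standard`.
    """
    standard = np.asarray(standard, dtype=float)
    frame = np.asarray(frame, dtype=float)
    d = np.abs(standard - frame).sum(axis=1)  # |d_theta| + |d_phi| per pattern
    label = int(np.argmin(d))
    return label, float(d[label])

# Three hypothetical standard patterns and one input frame (radians).
label, dist = nearest_standard([[0.5, 1.2], [0.6, 1.3], [0.9, 1.0]], [0.62, 1.28])
```

Replacing the sum of absolute differences with the square root of the sum of squares would give the Euclidean alternative that the text also mentions.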

The number of speakers that can be used to create the standard patterns for pattern matching is limited for economic and many other reasons, so it is fundamentally impossible to create standard patterns that fit every unspecified speaker from training (registration data) by such a limited number of speakers. This is because the spectral distribution of speech differs from speaker to speaker, which in turn results from differences in vocal tract characteristics and glottal source characteristics. The vocal tract characteristics differ because vocal tract length differs between speakers, which means that the formant frequencies, as resonance points of the vocal tract, differ between speakers. Differences in glottal source characteristics, on the other hand, affect the overall slope of the spectral envelope. To perform pattern matching that adapts well to unspecified speakers, the vocal tract characteristics and glottal source characteristics that differ between speakers must therefore each be normalized by some means, or their influence must be removed.

Since the two characteristics are normally convolved with each other, they must be separated before they can be normalized or speaker dependence removed. The present invention addresses this point: by letting θ and φ represent the formant frequencies, it performs pattern matching with speaker dependence removed.

In this embodiment, the vocal tract length is thus normalized by θ and φ, and the spectral envelope is represented by the ratio of the first to third formants, so that the glottal source characteristics are also regarded as largely removed.

Pattern collator (2) 7 is supplied with θ and φ of the standard pattern matched to the input pattern from standard pattern file 6, and with θ and φ of the input speech signal from polar coordinate converter 4.

The standard patterns registered and stored in standard pattern file 6 are themselves high-precision patterns, created so that formant extraction errors are almost zero, but they come from a specific speaker speaking at a specific rate. Unless time normalization is performed by compressing or expanding along the time axis, for input patterns from that specific speaker as well as from unspecified speakers, the formant frequencies of an unspecified speaker cannot be reproduced.

Pattern collator (2) 7 performs time normalization, compressing or expanding θ and φ of the selected standard pattern in time relative to θ and φ of the input pattern. This normalization is carried out by finding a mapping function that maps the standard pattern and the input pattern, represented on mutually orthogonal time axes i and j, onto each other. It can readily be implemented by known techniques, such as performing dynamic programming with the city-block distance dij of equation (1) as the evaluation measure and taking the mapping function to be the monotonically increasing dynamic programming path that minimizes the sum of the distances dij, or by other processing techniques.
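The time normalization described above corresponds to what is commonly called dynamic time warping. The sketch below is one plain version of it, not necessarily the variant the embodiment uses: the allowed steps (advance i, advance j, or advance both), the city-block local distance over (θ, φ), and the function name `dtw_align` are assumptions.

```python
import numpy as np

def dtw_align(standard, inp):
    """Align two (theta, phi) sequences by dynamic programming.

    `standard` has shape (n, 2) and `inp` has shape (m, 2).
    Returns (total_cost, path), where path is a monotonically
    non-decreasing list of (i, j) index pairs from (0, 0) to (n-1, m-1).
    """
    s = np.asarray(standard, dtype=float)
    k = np.asarray(inp, dtype=float)
    n, m = len(s), len(k)
    # Local city-block distance between every standard and input frame.
    d = np.abs(s[:, None, :] - k[None, :, :]).sum(axis=2)
    g = np.full((n, m), np.inf)  # accumulated cost
    g[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                g[i - 1, j] if i > 0 else np.inf,
                g[i, j - 1] if j > 0 else np.inf,
                g[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            g[i, j] = d[i, j] + best
    # Trace the cheapest path back from the end point.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((g[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((g[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((g[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return float(g[-1, -1]), path

# A hypothetical input that is a time-stretched copy of the standard pattern.
std = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
stretched = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
cost, path = dtw_align(std, stretched)
```

Reading the standard frame index i for each input frame index j along `path` gives the standard pattern stretched or compressed onto the input time axis.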

The time-normalized θ and φ of the standard pattern are then supplied to orthogonal coordinate converter 8.

Orthogonal coordinate converter 8 converts the polar coordinate data, consisting of the time-normalized θ and φ and the composite vector magnitude l input from polar coordinate converter 4, into orthogonal coordinates. It provides the formant trajectory, with continuity preserved, as time-series data in an orthogonal coordinate system of time versus frequency, and sends this out as the output formant data. Treating the formant trajectory as a pattern on the time axis in this way greatly reduces formant extraction errors. Moreover, when the extractor is used for narrowband voice communication as in this embodiment, the conversational terms to be prepared as standard patterns are tightly limited, so the system is easy to configure and formant extraction for unspecified speakers can be achieved with high operational reliability.
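The conversion back to absolute formant frequencies can be sketched as follows. The angle convention is an assumption, since FIG. 2 is not reproduced in the text: one angle is measured from the F3 axis and the other in the F1-F2 plane from the F1 axis. The function names are hypothetical.

```python
import math

def polar_to_formants(l, theta, phi):
    """Convert (l, theta, phi) back to (f1, f2, f3) under the assumed convention."""
    f1 = l * math.sin(theta) * math.cos(phi)
    f2 = l * math.sin(theta) * math.sin(phi)
    f3 = l * math.cos(theta)
    return f1, f2, f3

def formant_trajectory(angles, magnitudes):
    """Build a per-frame formant trajectory.

    `angles` is a sequence of time-normalized (theta, phi) pairs taken from
    the standard pattern; `magnitudes` holds the input speaker's l per frame.
    """
    return [polar_to_formants(l, th, ph) for (th, ph), l in zip(angles, magnitudes)]

# Round-trip check with hypothetical formant values of 500, 1500, and 2500 Hz.
l0 = math.sqrt(500.0 ** 2 + 1500.0 ** 2 + 2500.0 ** 2)
th0 = math.acos(2500.0 / l0)
ph0 = math.atan2(1500.0, 500.0)
track = formant_trajectory([(th0, ph0), (th0, ph0)], [l0, 1.2 * l0])
```

In this sketch the angles carry the speaker-independent shape and the magnitude l restores the scale of the individual speaker, so the second frame comes out 1.2 times higher in all three formants.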

In the embodiment of FIG. 1, pattern matching is performed by pattern collator (1) 5 and pattern collator (2) 7, but these may equally well be built as a single unit combining both functions.

[Effect of the invention]

As described above, the present invention comprises formant trajectory obtaining means, which extracts formants with a secured formant trajectory through pattern matching, using time normalization, between the analysis data and a standard pattern file whose vector elements are the inter-formant relationship angle information obtained by expressing in polar coordinates the first to third formants obtained by analyzing the input speech signal, and conversion means, which converts the ratio of the first to third formants into absolute values of the formant frequencies. This has the effect of realizing a formant extractor that greatly reduces formant extraction errors and is adapted to unspecified speakers.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention. FIG. 2 is a three-dimensional polar coordinate representation of the first to third formant frequencies.

1: A/D converter, 2: autocorrelation coefficient calculator, 3: formant determiner, 4: polar coordinate converter, 5: pattern collator (1), 6: standard pattern file, 7: pattern collator (2), 8: orthogonal coordinate converter.

Claims

(57) [Claims]

1. A formant extractor comprising: formant trajectory obtaining means for obtaining a formant trajectory through pattern matching, using time normalization, between a standard pattern whose vector elements are two items of angular relationship information between formant frequencies, obtained by expressing in polar coordinates the first to third formant frequencies obtained by analyzing an input speech signal, and analysis data concerning the angular relationship information of the input speech signal; and conversion means for converting the ratio of the first to third formants into absolute values of the formant frequencies.