JPH04122999A

JPH04122999A - Phoneme discriminating method

Info

Publication number: JPH04122999A
Application number: JP2242897A
Authority: JP
Inventors: Takashi Miki; 三木　敬
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1990-09-13
Filing date: 1990-09-13
Publication date: 1992-04-23
Anticipated expiration: 2015-02-28
Also published as: JP3012994B2

Abstract

PURPOSE: To obtain efficient, high-performance phoneme discrimination by capturing the static information and power variation of speech as power vectors and the spectral variation as time-series feature vectors, and by performing hierarchical recognition. CONSTITUTION: Power vectors and time-series feature vectors are calculated and stored as power vector data and time-series feature data, and a power vector codebook 105 is generated from the power vector data. The power vector data are then vector-quantized using the codebook 105 to generate power code data. The time-series feature vector data corresponding to frames given the same power code number in the power code data are clustered to generate a time-series feature codebook 108, and this operation is repeated for each power code. The time-series feature vector data are vector-quantized using the codebook 108 determined by the corresponding power code, yielding time-series feature code data. The code data and the time-series feature code data are used to generate a phoneme table 110 showing the correspondence between the two codes, thereby obtaining efficient, high discrimination performance.

Description

[Detailed Description of the Invention]

[Prior Art]

A conventional technique of this kind is disclosed in "Evaluation of frame-level phonemic information based on mutual information," Institute of Electronics, Information and Communication Engineers, Speech Research Group, SP-103 (1988).

One of the approaches currently being studied most actively in the field of speech recognition is recognition based on phoneme discrimination. Phoneme discrimination converts input speech into a sequence of phonemes (roughly equivalent to phonetic symbols). Once the speech has been converted into a phoneme sequence, it is converted into the most plausible character string (sentence) using, for example, a word dictionary and grammar rules.

The advantage of phoneme discrimination is that, by separating acoustic-level processing from character-string processing, the vocabulary and the set of recognizable sentence patterns can be extended freely.

FIG. 8 is a block diagram showing the configuration of an apparatus that implements a conventional phoneme discrimination method. In the figure, 201 is a speech input terminal, 202 is an analysis unit, 203 is a power vector calculation unit, 204 is a power vector VQ unit, 205 is a power vector codebook, 206 is a feature vector VQ unit, 207 is a feature vector codebook, 209 is a phoneme conversion unit, and 210 is a phoneme table.

Speech input from the speech input terminal 201 is converted by the analysis unit 202 into feature vectors that represent its characteristics. The feature vector $\mathbf{S}_i$ consists of the in-band frequency components extracted by a bank of $J$ band-pass filters with different center frequencies, sampled at short time intervals called frames.

$$\mathbf{S}_i = (S_{i1}, S_{i2}, \ldots, S_{i,J-1}, S_{iJ})$$

Here $i$ is the frame number and $j$ is the band-pass filter number.

The analysis unit 202 also calculates the power $P_i$ of each frame at the same time. To simplify the explanation, the frame number at the start of the speech is taken to be 0 and the frame number at the end of the speech to be $I$ ($i = \ldots, -2, -1, 0, 1, 2, \ldots, I, I+1, I+2, \ldots$).

The power vector calculation unit 203 calculates a power vector $\mathbf{p}_i$ by concatenating the powers $P_i$ of adjacent frames.

$$\mathbf{p}_i = (P_{i-n}, P_{i-n+1}, \ldots, P_{i-1}, P_i, \ldots, P_{i+n-1}, P_{i+n})$$

Here $n$ is the width of the adjacent-frame window.
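The construction of the power vector can be sketched in a few lines of Python. This is only an illustration: the function name `power_vector` is hypothetical, and clamping indices at the edges of the recording is an assumption, since the text instead presumes that frames exist before frame 0 and after frame $I$.

```python
def power_vector(powers, i, n):
    """Return (P[i-n], ..., P[i], ..., P[i+n]) as a list of 2n+1 values."""
    last = len(powers) - 1
    # Assumption: indices outside the recording are clamped to the nearest
    # valid frame; the source does not say how boundaries are handled.
    return [powers[min(max(t, 0), last)] for t in range(i - n, i + n + 1)]


powers = [0.1, 0.4, 0.9, 0.7, 0.2]   # per-frame powers P_0 .. P_4 (made-up values)
print(power_vector(powers, 2, 1))    # [0.4, 0.9, 0.7]
print(power_vector(powers, 0, 2))    # [0.1, 0.1, 0.1, 0.4, 0.9]
```

With window width $n$, every power vector has $2n+1$ elements, so its shape describes how the power rises or falls around frame $i$.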

The power vector VQ unit 204 vector-quantizes (VQ) the power vector with reference to the power vector codebook 205 and obtains the power code $C_i$.

$$C_i = \arg\min_{m} d(\mathbf{p}_i, \mathbf{y}^m)$$

Here $d(\mathbf{p}_i, \mathbf{y}^m)$ denotes the distance between the vector $\mathbf{p}_i$ and the vector $\mathbf{y}^m$.

Power vector quantization coarsely classifies the input speech by the shape of its power contour, and the result of this coarse classification is the power code $C_i$. $\mathbf{y}^m$ is a power code vector and $m$ is the power code number.

$$\mathbf{y}^m = (y^m_{-n}, y^m_{-n+1}, \ldots, y^m_{-1}, y^m_{0}, y^m_{1}, \ldots, y^m_{n-1}, y^m_{n}) \qquad (m = 1, 2, \ldots, M)$$

Here $M$ is the size of the power vector codebook.
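The argmin above is a nearest-neighbour search over the codebook. The sketch below shows one way to write it; the squared Euclidean distance is an assumption (the text only speaks of a distance $d$), and the names `squared_distance`, `quantize` and the three code vectors are made up for illustration.

```python
def squared_distance(a, b):
    # Assumption: the source does not specify d(.,.); squared Euclidean is used here.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return the 1-based code number m that minimises d(vector, y^m)."""
    best = min(range(len(codebook)),
               key=lambda m: squared_distance(vector, codebook[m]))
    return best + 1


# Hypothetical power vector codebook with M = 3 and n = 1.
power_codebook = [
    [0.0, 0.0, 0.0],   # m = 1: flat, low power
    [0.1, 0.5, 0.9],   # m = 2: rising power
    [0.9, 0.5, 0.1],   # m = 3: falling power
]
print(quantize([0.2, 0.6, 0.8], power_codebook))  # 2
```

The returned code number is what the text calls the power code $C_i$.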

The feature vector VQ unit 206 vector-quantizes the feature vector $\mathbf{S}_i$ on the basis of the feature vector codebook 207. The feature vector codebook 207 stores a separate codebook for each power code; in other words, it holds $M$ codebooks, one for each entry of the power vector codebook.

In contrast to the coarse classification performed by power vector quantization, feature vector quantization performs a detailed classification based on the fine characteristics of the speech.

First, the feature vector codebook corresponding to the power code $C_i$ is selected from the feature vector codebook 207. In other words, the dictionary used for detailed discrimination is switched to an appropriate one in light of the coarse classification result. Vector quantization (VQ) is then performed using the selected codebook. Letting $c = C_i$, the feature code $Z_i$ is given by the following equation.

$$Z_i = \arg\min_{r} d(\mathbf{S}_i, \mathbf{x}(c)^r) \qquad (2)$$

Here $\mathbf{x}(c)^r$ is a feature code vector corresponding to the power code $c$, and $r$ is the feature vector code number.

$$\mathbf{x}(c)^r = (x(c)^r_1, x(c)^r_2, \ldots, x(c)^r_{J-1}, x(c)^r_J) \qquad (r = 1, 2, \ldots, R(c))$$

$R(c)$ is the size of the feature vector codebook corresponding to the power code $c$.
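The codebook switching described above can be illustrated as a two-stage search: the power code chosen in the first stage selects which feature codebook is searched in the second. The following is a minimal sketch under assumed names (`nearest`, `two_stage_quantize`), an assumed squared Euclidean distance, and made-up codebook contents.

```python
def nearest(vector, codebook):
    """1-based index of the code vector closest to `vector` (squared Euclidean, assumed)."""
    def dist(code):
        return sum((x - y) ** 2 for x, y in zip(vector, code))
    return 1 + min(range(len(codebook)), key=lambda r: dist(codebook[r]))


def two_stage_quantize(p_i, s_i, power_codebook, feature_codebooks):
    c = nearest(p_i, power_codebook)          # coarse class: power code C_i
    z = nearest(s_i, feature_codebooks[c])    # fine class: feature code Z_i within codebook c
    return c, z


# Hypothetical codebooks: M = 2 power codes, with R(1) = 1 and R(2) = 3.
power_codebook = [[0.0, 0.0, 0.0], [0.5, 0.9, 0.5]]
feature_codebooks = {
    1: [[0.0, 0.0]],
    2: [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
}
print(two_stage_quantize([0.4, 0.8, 0.6], [0.1, 0.9],
                         power_codebook, feature_codebooks))  # (2, 2)
```

Because each power code has its own codebook, the codebook sizes $R(c)$ may differ from one power code to another, as in the example.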

The phoneme conversion unit 209 converts $C_i$ and $Z_i$ into a phoneme symbol $L_i$. Various methods are conceivable for this conversion; the simplest, table lookup, is described here. FIG. 10 shows an example of the configuration of the phoneme table 210. According to this table, when the power code $C_i = 1$ and the feature code $Z_i = 1$, the phoneme symbol $L_i$ is a. Likewise, $C_i = 2$ and $Z_i = 3$ are converted to $L_i$ = e.

In this way, the input speech is converted into a phoneme symbol string.
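The table lookup amounts to a mapping from the pair (power code, feature code) to a phoneme symbol. A minimal sketch follows; only the two entries quoted in the text are filled in, and the name `to_phoneme` and the `"?"` placeholder for pairs missing from the table are assumptions.

```python
# Only the two entries mentioned in the text; a real table would cover every (C, Z) pair.
PHONEME_TABLE = {
    (1, 1): "a",
    (2, 3): "e",
}


def to_phoneme(c_i, z_i, table=PHONEME_TABLE, unknown="?"):
    # Assumption: pairs missing from the table map to a placeholder symbol.
    return table.get((c_i, z_i), unknown)


codes = [(1, 1), (1, 1), (2, 3)]               # per-frame (C_i, Z_i) pairs
print([to_phoneme(c, z) for c, z in codes])    # ['a', 'a', 'e']
```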

[Problem to be solved by the invention]

In phoneme discrimination, high discrimination ability cannot be obtained unless the various cues in speech are captured effectively. Various experiments have demonstrated that, when humans distinguish speech sounds, not only the static information of the speech, that is, the intensity and timbre (spectrum) of the sound at a given instant, but also its dynamic information, that is, the temporal changes in intensity and spectrum, serve as important cues.

The conventional phoneme discrimination method described above uses the power vector $\mathbf{p}$ to capture changes in power, one of the cues for phoneme discrimination. It also takes static spectral information into account through the feature vector (that is, the spectrum).

However, the conventional method takes no account of changes in the speech spectrum, which are the most important cue for distinguishing similar phonemes, and therefore cannot achieve sufficient phoneme discrimination ability.

The present invention has been made in view of the above, and its object is to provide a phoneme discrimination method that achieves high discrimination performance by taking into account changes in the speech spectrum, the most important cue for distinguishing similar phonemes.

[Means to solve the problem]

To solve the above problem, the phoneme discrimination method of the present invention adopts the following means (a) to (g).

(a) an analysis process that frequency-analyzes the input speech and calculates, at fixed time intervals called frames, a feature vector whose elements are the frequency components of the input speech and a power representing the intensity of the input speech;

(b) a process of calculating a power vector by concatenating the powers of adjacent frames;

(c) a process of calculating a time-series feature vector by concatenating the feature vectors of adjacent frames;

(d) a process of creating an input power pattern and an input time-series feature pattern by applying processes (a) to (c) to the input speech to be discriminated;

(e) a process of vector-quantizing the input power pattern using a power vector codebook created in advance from a large amount of speech data, to obtain input power codes;

(f) a process of vector-quantizing the input time-series feature pattern using the time-series feature vector codebook that corresponds to the input power code of the corresponding frame, to obtain input time-series feature codes; and

(g) a process of obtaining the likelihood of each phoneme for every frame from the input power code and the input time-series feature code.
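Step (g) asks for a per-frame phoneme likelihood rather than a single symbol. The text does not say how the likelihood is computed; one plausible reading, sketched below purely as an assumption, is to use the relative frequency with which each phoneme co-occurred with a given code pair in labelled training data. The function name `build_likelihoods` and the training tuples are hypothetical.

```python
from collections import Counter, defaultdict


def build_likelihoods(labelled_frames):
    """labelled_frames: iterable of (power_code, feature_code, phoneme) tuples.

    Assumption: the score of a phoneme for a code pair is its relative
    frequency among the training frames that received that code pair.
    """
    counts = defaultdict(Counter)
    for c, z, phoneme in labelled_frames:
        counts[(c, z)][phoneme] += 1
    return {
        key: {ph: n / sum(ctr.values()) for ph, n in ctr.items()}
        for key, ctr in counts.items()
    }


training = [(1, 1, "a"), (1, 1, "a"), (1, 1, "o"), (2, 2, "e")]
likelihoods = build_likelihoods(training)
print(likelihoods[(2, 2)])   # {'e': 1.0}
print(likelihoods[(1, 1)])   # 'a' scores 2/3 and 'o' scores 1/3
```

Taking the highest-scoring phoneme for each code pair reduces this to a plain lookup table.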

[Effect]

According to the present invention, not only the static information of speech, that is, differences in power and spectrum at a given instant, but also its dynamic information is captured: power changes by the power vector and spectral changes by the time-series feature vector. Furthermore, by performing hierarchical discrimination that uses the power vector in the first stage and the time-series feature vector in the second stage, phoneme discrimination performance that is both more efficient and higher than when each feature is used separately can be obtained.

[Embodiment]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an apparatus that implements the phoneme discrimination method of the present invention. In the figure, 101 is a speech input terminal, 102 is an analysis unit, 103 is a power vector calculation unit, 104 is a power vector VQ unit, 105 is a power vector codebook, 106 is a time-series feature vector generation unit, 107 is a time-series feature vector quantization unit, 108 is a time-series feature vector codebook, 109 is a phoneme conversion unit, and 110 is a phoneme table.

FIG. 2 shows an example of input speech signal power, FIG. 3 an example of an input speech signal pattern, FIG. 4 an example of temporal change in the frequency spectrum, FIG. 5 an example of a time-series pattern of the spectrum, FIG. 6 an example of a power vector codebook, and FIG. 7 an example of a time-series feature vector codebook.

The speech signal input from the speech input terminal 101 in FIG. 1 is converted by the analysis unit 102 into a time series of feature vectors that represent its characteristics. Feature vectors can be derived in various ways, for example with a bank of band-pass filters whose center frequencies differ slightly from one another, or by spectral analysis using the FFT (fast Fourier transform); the band-pass filter bank is used as the example here. The feature vector $\mathbf{S}_i$ is obtained by logarithmically transforming the in-band frequency components extracted by a bank of $J$ band-pass filters with different center frequencies and sampling them at short time intervals called frames.

$$\mathbf{S}_i = (S_{i1}, S_{i2}, \ldots, S_{i,J-1}, S_{iJ})$$

Here $i$ is the frame number and $j$ is the band-pass filter number. The analysis unit 102 also calculates the power $P_i$ of each frame at the same time. The power $P_i$ is calculated by the following equation.

$$P_i = J^{-1} \sum_{j=1}^{J} S_{ij} \qquad (3)$$

For simplicity, the frame number at the start of the speech is taken to be 0 and the frame number at the end of the speech to be $I$. The power vector calculation unit 103 concatenates the powers $P_i$ of adjacent frames to calculate the power vector $\mathbf{p}_i$.

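$$\mathbf{p}_i = (P_{i-n}, P_{i-n+1}, \ldots, P_{i-1}, P_i, P_{i+1}, \ldots, P_{i+n-1}, P_{i+n})$$

Here $n$ is the width of the adjacent-frame window. The power vector VQ unit 104 vector-quantizes (VQ) the power vector with reference to the power vector codebook 105 and obtains the power code $C_i$.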

$$C_i = \arg\min_{m} d(\mathbf{p}_i, \mathbf{y}^m) \qquad (4)$$

Here $d(\mathbf{p}_i, \mathbf{y}^m)$ denotes the distance between the vector $\mathbf{p}_i$ and the vector $\mathbf{y}^m$. Power vector quantization coarsely classifies the input speech by power shapes such as those shown in FIG. 2, and the result of this coarse classification is the power code $C_i$, as shown in FIG. 3. $\mathbf{y}^m$ is a power code vector and $m$ is the power code number.

$$\mathbf{y}^m = (y^m_{-n}, y^m_{-n+1}, \ldots, y^m_{-1}, y^m_{0}, y^m_{1}, \ldots, y^m_{n-1}, y^m_{n}) \qquad (m = 1, 2, \ldots, M)$$

Here $M$ is the size of the power vector codebook.
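Equation (3) defines the frame power as the mean of the $J$ log filter-bank outputs of that frame. As a small worked illustration (the function name `frame_power` and the sample values are made up):

```python
def frame_power(s_i):
    """P_i = (1/J) * sum_j S_ij, with J = len(s_i) filter-bank channels."""
    return sum(s_i) / len(s_i)


# Two hypothetical frames of log filter-bank outputs with J = 3 channels.
frames = [[1.0, 2.0, 3.0], [0.0, 0.5, 1.0]]
print([frame_power(s) for s in frames])  # [2.0, 0.5]
```

The resulting per-frame powers are the values that are concatenated over $2n+1$ adjacent frames to form the power vector.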

The time-series feature vector generation unit 106 generates a time-series feature vector $\mathbf{T}_i$ by concatenating the feature vectors $\mathbf{S}_i$ of adjacent frames.

$$\mathbf{T}_i = (\mathbf{S}_{i-k}, \mathbf{S}_{i-k+1}, \ldots, \mathbf{S}_{i-1}, \mathbf{S}_i, \ldots, \mathbf{S}_{i+k-1}, \mathbf{S}_{i+k})$$

Here $k$ is the width of the adjacent-frame window.
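Concatenating $2k+1$ feature vectors of $J$ channels each gives a single vector of $(2k+1)J$ elements, which is what lets the second stage see spectral change over time. A minimal sketch, with the hypothetical name `time_series_feature` and edge clamping as an assumption not taken from the text:

```python
def time_series_feature(frames, i, k):
    """Concatenate S[i-k] ... S[i+k] into one vector of (2k+1)*J values."""
    last = len(frames) - 1
    stacked = []
    for t in range(i - k, i + k + 1):
        # Assumption: frames beyond either end are replaced by the nearest valid frame.
        stacked.extend(frames[min(max(t, 0), last)])
    return stacked


frames = [[1, 2], [3, 4], [5, 6], [7, 8]]        # 4 frames, J = 2 channels
print(time_series_feature(frames, 1, 1))          # [1, 2, 3, 4, 5, 6]
print(len(time_series_feature(frames, 3, 2)))     # 10, i.e. (2*2 + 1) * 2
```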

The time-series feature vector quantization unit 107 vector-quantizes (VQ) the time-series feature vector $\mathbf{T}_i$ on the basis of the time-series feature vector codebook 108, an example of which is shown in FIG. 7. The time-series feature vector codebook 108 stores a separate codebook for each power code; in other words, it holds $M$ codebooks, one for each entry of the power vector codebook. In contrast to the power vector quantization described above, time-series feature vector quantization performs a detailed classification based on the fine characteristics of the speech and on how they change.

First, the time-series feature vector codebook corresponding to the power code $C_i$ is selected from the time-series feature vector codebook 108.

In other words, the dictionary used for detailed discrimination is switched to an appropriate one in light of the coarse classification result. Vector quantization (VQ) is then performed using the selected codebook.

Letting $c = C_i$, the feature code $Z_i$ is given by the following equation.

$$Z_i = \arg\min_{r} d(\mathbf{T}_i, \mathbf{u}(c)^r)$$

Here $\mathbf{u}(c)^r$ is a time-series feature code vector corresponding to the power code $c$, and $r$ is the time-series feature code number.

$$\mathbf{u}(c)^r = (x(c)^r_1, x(c)^r_2, \ldots, x(c)^r_{(2k+1)J-1}, x(c)^r_{(2k+1)J}) \qquad (r = 1, 2, \ldots, R(c))$$

$R(c)$ is the size of the time-series feature vector codebook corresponding to the power code $c$.

The phoneme conversion unit 109 converts the power code $C_i$ and the feature code $Z_i$ into a phoneme symbol $L_i$. Various methods are conceivable for this conversion; the simplest, table lookup, is described here. In the phoneme table 110 shown in FIG. 9, when the power code $C_i = 1$ and the feature code $Z_i = 1$, the phoneme symbol $L_i$ is a. Likewise, when the power code $C_i = 2$ and the feature code $Z_i = 2$, the phoneme symbol $L_i$ is e.

In this way, the input speech is converted into a phoneme symbol string.

The phoneme table 110 can be created in various ways; one method is outlined in the steps below.

(a) For a large amount of speech data, calculate the power vectors and time-series feature vectors in advance and store them as power vector data and time-series feature data, respectively.

(b) Cluster the power vector data to create the power vector codebook.

(c) Vector-quantize the power vector data using the power vector codebook to create power code data.

(d) Cluster the time-series feature vector data corresponding to the frames that were assigned the same power code number in the power code data to create a time-series feature codebook; repeat this for each power code.

(e) Vector-quantize the time-series feature vector data using the time-series feature vector codebook determined by the corresponding power code to obtain time-series feature code data.

(f) From the phoneme code data assigned to the speech data in advance by inspection or the like, and from the time-series feature code data, create the phoneme table 110, which represents the correspondence between the two codes.
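Steps (b) to (f) can be sketched as follows. The text does not name a clustering algorithm or say how the table entries are chosen, so this sketch assumes plain k-means (Lloyd iterations seeded with the first vectors) and a majority vote per code pair; all function names and the toy data are hypothetical, and code numbers are 0-based here for brevity.

```python
from collections import Counter, defaultdict


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v, codebook):
    return min(range(len(codebook)), key=lambda m: dist2(v, codebook[m]))


def kmeans(data, size, iters=20):
    """Plain Lloyd k-means; the first `size` vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in data[:size]]
    for _ in range(iters):
        groups = defaultdict(list)
        for v in data:
            groups[nearest(v, centroids)].append(v)
        for m, members in groups.items():   # empty clusters keep their old centroid
            centroids[m] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train(power_vecs, ts_vecs, phonemes, M, R):
    power_cb = kmeans(power_vecs, M)                                  # step (b)
    power_codes = [nearest(p, power_cb) for p in power_vecs]          # step (c)
    ts_cbs = {}
    for m in sorted(set(power_codes)):                                # step (d)
        subset = [t for t, c in zip(ts_vecs, power_codes) if c == m]
        ts_cbs[m] = kmeans(subset, min(R, len(subset)))
    ts_codes = [nearest(t, ts_cbs[c])                                 # step (e)
                for t, c in zip(ts_vecs, power_codes)]
    votes = defaultdict(Counter)                                      # step (f)
    for c, z, ph in zip(power_codes, ts_codes, phonemes):
        votes[(c, z)][ph] += 1
    # Assumption: each table entry is the most frequent phoneme for that code pair.
    table = {key: ctr.most_common(1)[0][0] for key, ctr in votes.items()}
    return power_cb, ts_cbs, table


# Toy data: four frames with 2-element power vectors and 1-element time-series features.
power_vecs = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]]
ts_vecs = [[0.0], [5.0], [0.1], [9.0]]
phonemes = ["a", "e", "a", "i"]
power_cb, ts_cbs, table = train(power_vecs, ts_vecs, phonemes, M=2, R=2)
print(table)
```

The second-stage clustering only ever sees frames that share a power code, so each time-series feature codebook can spend its entries on distinctions that matter within that coarse class.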

[Effect of the invention]

As explained above, according to the present invention, not only the static information of speech, that is, differences in power and spectrum at a given instant, but also its dynamic information is captured: power changes by the power vector and spectral changes by the time-series feature vector. Furthermore, by performing hierarchical discrimination that uses the power vector in the first stage and the time-series feature vector in the second stage, the excellent effect is obtained that phoneme discrimination is both more efficient and more accurate than when each feature is used separately.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of an apparatus that implements the phoneme discrimination method of the present invention; FIG. 2 shows an example of input speech signal power; FIG. 3 shows an example of an input speech signal pattern; FIG. 4 shows an example of temporal change in the frequency spectrum; FIG. 5 shows an example of a time-series pattern of the spectrum; FIG. 6 shows an example of a power vector codebook; FIG. 7 shows an example of a time-series feature vector codebook; FIG. 8 is a block diagram showing the configuration of an apparatus that implements a conventional phoneme discrimination method; and FIG. 9 shows an example of a phoneme table.

In the figures: 101, speech input terminal; 102, analysis unit; 103, power vector calculation unit; 104, power vector VQ unit; 105, power vector codebook; 106, time-series feature vector generation unit; 107, time-series feature vector quantization unit; 108, time-series feature vector codebook; 109, phoneme conversion unit; 110, phoneme table.

Patent applicant: Oki Electric Industry Co., Ltd.
Agent: Patent attorney Takashi Kumagai (and one other)

Claims

[Scope of Claims]

A phoneme discrimination method comprising:

(a) an analysis process that frequency-analyzes input speech and calculates, at fixed time intervals called frames, a feature vector whose elements are the frequency components of the input speech and a power representing the intensity of the input speech;

(b) a process of calculating a power vector by concatenating the powers of adjacent frames;

(c) a process of calculating a time-series feature vector by concatenating the feature vectors of adjacent frames;

(d) a process of creating an input power pattern and an input time-series feature pattern by applying processes (a) to (c) to the input speech to be discriminated;

(e) a process of vector-quantizing the input power pattern using a power vector codebook created in advance from a large amount of speech data, to obtain an input power code;

(f) a process of vector-quantizing the input time-series feature pattern using the time-series feature vector codebook that corresponds to the input power code of the corresponding frame, to obtain an input time-series feature code; and

(g) a process of obtaining the likelihood of each phoneme for every frame from the input power code and the input time-series feature code.