JPH0634176B2

JPH0634176B2 - Voice segmentation device

Info

Publication number: JPH0634176B2
Application number: JP62016879A
Authority: JP
Inventors: 隆夫渡辺
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1987-01-27
Filing date: 1987-01-27
Publication date: 1994-05-02
Anticipated expiration: 2009-05-02
Also published as: JPS63183500A

Description

(Field of Industrial Application) The present invention relates to a speech segmentation apparatus for speech recognition, and in particular to a speech segmentation apparatus applied to systems that recognize speech using fine-grained recognition units such as phonemes.

(Prior Art) Recognition methods that use the phoneme as the unit do not require standard patterns for whole words and are therefore well suited to large-vocabulary recognition, so many such methods have been tried. The most common of them uses phoneme standard patterns to match input speech, represented as a time series of vectors, against a word dictionary represented as sequences of phoneme symbols.

(Problems to be Solved by the Invention) The characteristics of a phoneme can be described not only by standard patterns expressed as vectors but also by local features observed in various acoustic parameters. Such local features are not necessarily captured by a standard-pattern description. Standard patterns do not necessarily contain information about phenomena specific to a phoneme, or specific to a transition between phonemes, such as abrupt changes in energy, the presence of energy peaks and valleys, and abrupt changes in the spectrum. For this reason, phoneme recognition based on phoneme standard patterns alone has performed inadequately. By taking into account phenomena specific to phonemes or phoneme transitions, the present invention improves the accuracy of matching with phoneme standard patterns, that is, it realizes a highly accurate phoneme segmentation means.

By providing such a high-accuracy segmentation means, the present invention aims to make phoneme-level recognition possible and thereby to realize large-vocabulary recognition, which is not feasible with word-level recognition.

(Means for Solving the Problems) To solve the problems described above, the present invention provides a speech segmentation apparatus that divides input speech into a known phoneme sequence, characterized by comprising: an input pattern storage unit that stores an input speech pattern A = {a(j)} represented as a sequence of vectors; a standard pattern storage unit that stores a set of phoneme standard patterns B = {b(k, n), n = 1, ..., N(k), k = 1, ..., K}, each represented as a vector; a distance calculation unit that, as a measure of the similarity of the vector a(j) at each time of the input pattern A to phoneme category k, calculates a frame distance d(k, j) for phoneme category k from the distances between a(j) and the vectors of the standard pattern set B; a phoneme boundary feature description table that describes, for each phoneme pair, the acoustic features that appear at the boundary point from one phoneme to the next; and means for dividing the input into sections so as to minimize the quantity (D + P), where D is the frame distance accumulated over the entire input within the section corresponding to each phoneme, and P expresses as a penalty the certainty that the feature specified by the phoneme boundary feature description table is present at each phoneme boundary point.

(Operation) Let the vector sequence of the input speech be A = {a(j), j = 1, ..., J}, and let the set of phoneme standard patterns be B = {b(k, n)}. Here k is the phoneme category number, k = 1, ..., K. The index n reflects the fact that each category has several standard patterns, n = 1, ..., N(k).

The frame distance for each phoneme category k at each input frame j is computed with the nearest-neighbor rule for multiple standard patterns:

d(k, j) = min_n [ dist(a(j), b(k, n)) ]   (1)

Here dist(·, ·) denotes the distance between two vectors; any distance, such as the Euclidean distance or the city-block distance, may be used.
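Equation (1) can be illustrated with a minimal sketch. The function name `frame_distances`, the array layout, and the choice of the Euclidean distance are assumptions made for this example only; the text allows any vector distance.

```python
import numpy as np

def frame_distances(A, B):
    """Nearest-neighbor frame distance in the sense of equation (1).

    A: array of shape (J, D), the input vectors a(j).
    B: list of K arrays; B[k] has shape (N_k, D), the standard patterns b(k, n).
    Returns d of shape (K, J), where d[k, j] = min over n of ||a(j) - b(k, n)||.
    (Layout and Euclidean distance are illustrative assumptions.)
    """
    J = A.shape[0]
    d = np.empty((len(B), J))
    for k, templates in enumerate(B):
        # Pairwise differences between every frame and every template of category k.
        diff = A[:, None, :] - templates[None, :, :]   # shape (J, N_k, D)
        dist = np.sqrt((diff ** 2).sum(axis=2))        # shape (J, N_k)
        d[k] = dist.min(axis=1)                        # nearest template per frame
    return d
```

Replacing the Euclidean norm with `np.abs(diff).sum(axis=2)` would give the city-block distance mentioned in the text.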

To divide the input into a known phoneme sequence using phoneme standard patterns, the following dynamic programming method can be used. Let M be the length of the phoneme sequence and d(m, j) the frame distance for the m-th phoneme. The segmentation that minimizes the frame distance accumulated over the entire input can then be obtained from the following recurrence.

Initial conditions: g(0, 0) = 0; g(0, j) = ∞ for j ≠ 0   (2)

Recurrence: N(m, j) = the m′ that satisfies the recurrence for g(m, j)

Here g(m, j) is the accumulated distance, and N(m, j) is a buffer of back pointers that records the optimal path. Following the pointers back from N(m, j) at the end point yields the optimal path, that is, the optimal segmentation.
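The dynamic programming step can be sketched as follows. The exact recurrence is not reproduced in this text, so the sketch assumes the common left-to-right form g(m, j) = d(m, j) + min(g(m, j-1), g(m-1, j-1)), with N(m, j) holding the minimizing m′. The function name `segment` and the list-of-lists layout are likewise assumptions for illustration.

```python
import math

def segment(d):
    """Divide J frames among M phonemes in order, minimizing accumulated distance.

    d[m][j]: frame distance of the (m+1)-th phoneme at frame j+1 (0-based lists).
    Assumes J >= M so that every phoneme receives at least one frame.
    Returns (total_cost, labels), where labels[j] is the 0-based phoneme index.
    The recurrence used here is an assumed standard form, not quoted from the source.
    """
    M, J = len(d), len(d[0])
    INF = math.inf
    # Row 0 / column 0 hold the initial condition g(0,0) = 0, everything else infinity.
    g = [[INF] * (J + 1) for _ in range(M + 1)]
    N = [[0] * (J + 1) for _ in range(M + 1)]
    g[0][0] = 0.0
    for j in range(1, J + 1):
        for m in range(1, M + 1):
            stay, advance = g[m][j - 1], g[m - 1][j - 1]
            if stay <= advance:
                best, N[m][j] = stay, m          # remain in the same phoneme
            else:
                best, N[m][j] = advance, m - 1   # enter phoneme m at frame j
            g[m][j] = d[m - 1][j - 1] + best
    # Trace the back pointers from the end point (M, J).
    labels = [0] * J
    m = M
    for j in range(J, 0, -1):
        labels[j - 1] = m - 1
        m = N[m][j]
    return g[M][J], labels
```

For example, `segment([[0, 0, 5, 5], [5, 5, 0, 0]])` assigns the first two frames to the first phoneme and the last two to the second.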

To perform segmentation that exploits local features at the transition from one phoneme to the next, that is, at phoneme boundaries, the present invention introduces a new data structure called the phoneme boundary feature description table.

T = {t(k1, k2)}

Here t denotes the feature observed at the phoneme boundary from phoneme k1 to phoneme k2, and is given for each combination of k1 and k2.

So that a penalty is incurred when the feature specified by this table is absent, the following recurrence (3) can be used in place of the recurrence of (2).

Here P[t, j] is a function that takes the value 0 if feature t is present at frame j and a predetermined penalty value otherwise.
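A minimal sketch of a penalized recurrence follows. The source formula is not reproduced in this text, so the sketch assumes that the penalty P[t, j] is added only on the transition into a new phoneme at frame j: g(m, j) = d(m, j) + min(g(m, j-1), g(m-1, j-1) + P[t, j]). The names `segment_with_penalty`, `boundary_feature`, and `has_feature`, and the default penalty value, are hypothetical.

```python
import math

def segment_with_penalty(d, boundary_feature, has_feature, penalty=1.0):
    """Segmentation that minimizes accumulated distance plus boundary penalties.

    d[m][j]: frame distances for M phonemes over J frames (0-based lists).
    boundary_feature[m]: name of the feature expected at the boundary into
        phoneme m (entry 0 is unused, since no boundary precedes the first phoneme).
    has_feature: dict mapping a feature name to the set of 0-based frames
        where that feature was detected.
    All names and the placement of the penalty are illustrative assumptions.
    """
    M, J = len(d), len(d[0])
    INF = math.inf

    def P(t, j):
        # 0 if feature t is present at frame j, otherwise the fixed penalty.
        return 0.0 if j in has_feature.get(t, set()) else penalty

    g = [[INF] * (J + 1) for _ in range(M + 1)]
    N = [[0] * (J + 1) for _ in range(M + 1)]
    g[0][0] = 0.0
    for j in range(1, J + 1):
        for m in range(1, M + 1):
            stay = g[m][j - 1]
            # Entering phoneme m at frame j; the first phoneme has no boundary penalty.
            pen = 0.0 if m == 1 else P(boundary_feature[m - 1], j - 1)
            advance = g[m - 1][j - 1] + pen
            if stay <= advance:
                best, N[m][j] = stay, m
            else:
                best, N[m][j] = advance, m - 1
            g[m][j] = d[m - 1][j - 1] + best
    labels = [0] * J
    m = M
    for j in range(J, 0, -1):
        labels[j - 1] = m - 1
        m = N[m][j]
    return g[M][J], labels
```

When the frame distances alone cannot decide where the boundary lies, the penalty term pulls the boundary to the frame where the expected feature was detected.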

FIG. 1 illustrates the principle of phoneme segmentation according to the present invention. The figure shows the case of the word /saN/. For the two phoneme boundaries s-a and a-N, the phoneme boundary feature description table is assumed to contain the following entries.

t(/s/, /a/) = "local maximum of low-band energy change"
t(/a/, /N/) = "local maximum of spectral change"

Each frame is checked for the presence of these features, which yields the penalty value sequence p(m, j) shown in the figure. As a result, the recurrence calculation described above performs phoneme segmentation so that the phoneme boundaries fall where the penalty value is 0.
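How a penalty sequence might be derived from one acoustic parameter can be sketched as below. The source does not specify how local maxima are detected, so the strict three-point peak test, the `min_value` threshold, and the function names are assumptions for illustration.

```python
def local_maxima(x, min_value=0.0):
    """Return the 0-based frames where x has a strict local maximum above min_value.

    x: per-frame values of one parameter, e.g. low-band energy change.
    The three-point peak test is an illustrative assumption.
    """
    return {
        j for j in range(1, len(x) - 1)
        if x[j] > x[j - 1] and x[j] > x[j + 1] and x[j] > min_value
    }

def penalty_sequence(x, penalty=1.0):
    """Penalty per frame: 0 where the feature (a local maximum) is present."""
    peaks = local_maxima(x)
    return [0.0 if j in peaks else penalty for j in range(len(x))]
```

For a trace such as `[0.1, 0.2, 0.9, 0.3, 0.1]`, only the third frame carries zero penalty, so a boundary that expects this feature is drawn toward that frame.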

(Embodiment) FIG. 2 is a block diagram showing one embodiment of an apparatus that implements the present invention. The input pattern storage unit 1 and the phoneme standard pattern storage unit 2 store the input pattern A and the set of phoneme standard patterns B, respectively. The phoneme sequence storage unit 3 stores the phoneme sequence description of the input speech, and the table storage unit 4 stores the phoneme boundary feature table. The distance calculation unit 5 evaluates equation (1) and outputs the distance values {d(k, j)}, which are stored in the distance buffer 6. The local feature extraction unit 7 extracts the frames at which quantities such as full-band energy, band-limited energy, and spectral change reach a local maximum or minimum, and stores them in the feature buffer 8. The recurrence calculation unit 9 evaluates recurrence (3) using the distances and features read from the distance buffer 6 and the feature buffer 8, and writes the resulting accumulated distance g, the back-pointer value N used to trace back the optimal path, and the penalty value p to buffers 10, 11, and 12, respectively. This operation is repeated from j = 1 to j = J. Finally, at j = J, the traceback unit 13 reads the contents of buffer 11 and outputs the segmentation result.

(Effects of the Invention) As described above, in segmentation by dynamic programming using phoneme standard patterns, the present invention makes it possible to use local feature information about phonemes that is not necessarily described adequately by the standard patterns, and thus achieves highly accurate phoneme segmentation. The invention can therefore contribute to realizing large-vocabulary recognition with the phoneme as the recognition unit.

[Brief description of drawings]

FIG. 1 is a diagram explaining the principle of the present invention, and FIG. 2 is a block diagram showing one embodiment of the present invention. In the figures, 1, 2, 3, and 4 are storage units; 5 is the distance calculation unit; 6, 8, 10, 11, and 12 are buffers; 7 is the local feature extraction unit; 9 is the recurrence calculation unit; and 13 is the traceback unit.

Claims

[Claims]

1. A speech segmentation apparatus for dividing input speech into a known phoneme sequence, characterized by comprising: an input pattern storage unit that stores an input speech pattern A = {a(j)} represented as a sequence of vectors; a standard pattern storage unit that stores a set of phoneme standard patterns B = {b(k, n), n = 1, ..., N(k), k = 1, ..., K}, each represented as a vector; a distance calculation unit that, as a measure of the similarity of the vector a(j) at each time of the input pattern A to phoneme category k, calculates a frame distance d(k, j) for phoneme category k from the distances between a(j) and the vectors of the standard pattern set B; a phoneme boundary feature description table in which the acoustic features that appear at the boundary point from one phoneme to the next are described for each phoneme pair; and means for dividing the input into sections so as to minimize the quantity (D + P), where D is the frame distance accumulated over the entire input within the section corresponding to each phoneme, and P expresses as a penalty the certainty that the feature specified by the phoneme boundary feature description table is present at each phoneme boundary point.