JPS62160499A

JPS62160499A - Voice recognition equipment

Info

Publication number: JPS62160499A
Application number: JP61001601A
Authority: JP
Inventors: 浮田　輝彦
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1986-01-08
Filing date: 1986-01-08
Publication date: 1987-07-16

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to a speech recognition device capable of recognizing continuously uttered input speech with high accuracy.

[Conventional technology]

Automatic speech recognition is very important as an interface technology for direct information input from humans to machines. It can be regarded as the process of converting a speech pattern, produced by continuously uttering a sequence of linguistic symbols such as phonemes or syllables, into discrete linguistic symbols and recognizing each of those symbols.

Speech recognition that uses the phoneme as the recognition unit has long been studied. For example, a method has been proposed in which the input speech wave is divided into phoneme units (segmentation) and the content of each phoneme is determined (labeling) in order to recognize the input speech. This method detects, from temporal changes in speech power, spectrum, and the like, the points where the speech content changes greatly and treats them as phoneme boundaries. However, the accuracy of this boundary detection is easily affected by speaking rate and speaking manner, so no improvement in recognition accuracy could be expected.

On the other hand, a method has been proposed in which a phoneme label is assigned to the input speech at every fixed interval (each frame of the speech analysis) and the input speech is recognized from how those labels connect. However, how to consolidate the sequentially obtained phoneme symbols is a major problem. In practice, this has generally been handled by composing ad hoc rules.

The present inventor therefore previously proposed, in Japanese Patent Application No. 60-59880, a recognition method that can segment and recognize linguistic symbols such as phonemes and syllables with high accuracy and thereby improve recognition accuracy.

This recognition method imposes constraints on the linguistic symbols (phonemes or syllables), such as the duration of each phoneme and the conditions under which phonemes may be connected, and finds the phoneme symbol sequence that attains the maximum similarity sum under those constraints.

That is, as outlined in FIG. 5, the input speech is first fed to the acoustic analysis unit 1 and analyzed at every predetermined analysis interval, and feature parameters consisting of spectral data are obtained in sequence, for example as the outputs of a filter bank.

The similarity calculation unit 3 then computes the similarity between the feature parameters (spectral data) obtained at each analysis interval and the phoneme symbols (phonemes or syllables) registered in advance in the phoneme dictionary 2, for example by the composite similarity method, which absorbs pattern deformation well.

The sequence of similarities calculated for each feature parameter is stored in the similarity storage unit 4 in correspondence with the sequence of phoneme symbols, and the maximum similarity sequence calculation unit 5 finds the phoneme symbol sequence with the largest similarity sum. This calculation refers to the connectability information between phoneme symbols stored in the connection information table 6 and to the duration information for each phoneme stored in the duration table 7, extracts only the phoneme symbol strings that satisfy these constraints, and applies dynamic programming to them.

That is, let S(t; p) be the similarity to phoneme symbol p (1 ≤ p ≤ P) obtained at frame t, and let Lmax and Lmin be the maximum and minimum durations of phoneme symbol p. Then, for every frame t' satisfying t - Lmax ≤ t' ≤ t - Lmin, the similarity sum B(t'; q) is calculated. Here q is a phoneme symbol that can precede p, and T(t'; q) is the cumulative similarity.

This processing is performed for all phoneme symbols p and all frames t, and the sequence of phoneme symbols for which the similarity sum B(t'; q) is maximal is obtained.
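The formula evaluated at this step is not reproduced in the text. One form consistent with the surrounding definitions, offered here only as an assumption, adds the similarities of p over the frames that follow t' to the cumulative similarity of the preceding symbol q, and then keeps the best candidate:

```latex
B(t'; q) = T(t'; q) + \sum_{\tau = t'+1}^{t} S(\tau; p),
\qquad t - L_{\max} \le t' \le t - L_{\min},
\qquad
T(t; p) = \max_{t',\, q} B(t'; q)
```

Under this reading, symbol p occupies frames t'+1 through t, which is why t' is restricted by the duration limits of p.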

FIG. 6 shows an example of the routes along which this calculation accumulates the similarity sum from several O points to the X point.

This processing is repeated over the whole interval of the input speech to find the phoneme symbol sequence with the maximum similarity sum. As a result, the input speech can be recognized with considerably high accuracy.

[Problem that the invention seeks to solve]

However, the speech recognition method proposed in Japanese Patent Application No. 60-59880 treats only phonemes or syllables, the recognition units, as phoneme symbols. It was found that when recognizing speech uttered quickly, as in conversation, an incorrect phoneme symbol may be assigned to the transitional part between phonemes.

For example, a single frame of spectral data taken from the transition between two vowels may happen to resemble another phoneme. Specifically, the phoneme symbol sequence "haei" may be obtained for the utterance "hai" (はい, "yes").

The present invention was made in view of these circumstances. Its object is to provide a speech recognition device that effectively prevents incorrect phoneme symbols from being assigned to the transitional parts between phonemes and can thus recognize continuously uttered input speech with high accuracy.

[Means for solving problems]

The present invention analyzes input speech, calculates the similarity between the feature parameters obtained at fixed intervals and phoneme symbols, obtains candidate phoneme symbol strings that are permissible as phoneme sequences according to the connectability information between phoneme symbols, the information on their durations, and the similarities obtained for the phoneme symbols, and recognizes the input speech by finding, among those candidates, the phoneme symbol sequence with the maximum similarity sum. It is characterized in that this processing is carried out with two kinds of phoneme symbols prepared: phoneme symbols Lp, which represent phonemes or syllables, and phoneme symbols Lt, which represent the transitional connections between phonemes or syllables.

The recognition result is then obtained as a sequence consisting only of phoneme symbols Lp, formed by removing the phoneme symbols Lt from the phoneme symbol sequence that gives the maximum similarity sum.

[Action and its effects]

Each phoneme in continuously uttered speech changes gradually from frames having its typical spectral pattern toward the following phoneme, so the similarity to that phoneme increases gradually toward the center of the phoneme and then decreases with distance from the center. Therefore, according to the present invention, finding the phoneme symbol string with the maximum similarity sum from this movement of the similarities, as described above, yields the optimal phoneme segmentation of the input speech.

Some transitional parts between phonemes lie far from any typical phoneme spectrum. Because a phoneme symbol is also defined for such transitional parts, and the sequence with the maximum similarity sum is sought among sequences that include these symbols, spurious phoneme labels are prevented from appearing between phonemes. The phoneme symbols representing the transitional parts are then removed to obtain the recognition result, so continuously uttered input speech can be recognized with high accuracy as a phoneme symbol sequence free of mislabeled transitional frames. This is of great practical benefit.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic block diagram of the device of the embodiment. It differs from the device shown in the previously proposed Japanese Patent Application No. 60-59880 in two respects. First, as phoneme dictionaries it has an Lp dictionary 2a, which stores the phoneme symbols Lp for phonemes or syllables together with their standard patterns, and an Lt dictionary 2b, which stores the phoneme symbols Lt for the transitional parts between phonemes or syllables together with their standard patterns. Second, it has a result determination unit 8, which makes an overall determination on the phoneme symbol sequence that the maximum similarity sequence calculation unit 5 obtains according to the information registered in the Lp dictionary 2a and the Lt dictionary 2b.

The acoustic analysis unit 1 analyzes the continuously uttered input speech at every predetermined analysis interval, for example every 10 msec, and obtains its feature parameters frame by frame. Specifically, it uses a filter bank consisting of a plurality of band-pass filters, a well-known spectrum analysis technique, and obtains the feature parameters as a spectrum made up of the output energy of each channel. The feature parameters are stored in an input pattern memory in the acoustic analysis unit 1 and used to calculate the similarities to the phoneme symbols Lp and Lt.

The similarity calculation unit 3 calculates, for example by the composite similarity method, the similarity between the spectral data output at each analysis interval and the standard patterns of the phoneme symbols Lp and Lt registered in the Lp dictionary 2a and the Lt dictionary 2b, respectively.

When the composite similarity method is used, several mutually orthogonal vectors must of course be prepared as the standard pattern for each phoneme class.
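The text does not give the similarity formula itself. As a rough sketch, one common formulation of the composite (multiple) similarity method projects the input vector onto each orthonormal reference vector of a class, weights the squared projections by the eigenvalue ratio, and normalizes by the input energy. The function name and the weighting by `eigvals[m] / eigvals[0]` are assumptions for illustration, not details taken from this document.

```python
def composite_similarity(x, eigvecs, eigvals):
    """Similarity of feature vector x to one phoneme class.

    eigvecs : orthonormal reference vectors of the class (its standard pattern)
    eigvals : matching eigenvalues, largest first
    Assumed form: sum_m (eigvals[m] / eigvals[0]) * (x . eigvecs[m])^2 / |x|^2
    """
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0.0:
        return 0.0  # a silent (all-zero) frame matches nothing
    total = 0.0
    for phi, lam in zip(eigvecs, eigvals):
        proj = sum(a * b for a, b in zip(x, phi))  # inner product with one axis
        total += (lam / eigvals[0]) * proj * proj
    return total / norm_sq


# Hypothetical two-axis class: the first axis dominates, the second counts half.
axes = [[1.0, 0.0], [0.0, 1.0]]
weights = [2.0, 1.0]
on_axis = composite_similarity([1.0, 0.0], axes, weights)
off_axis = composite_similarity([0.0, 2.0], axes, weights)
mixed = composite_similarity([3.0, 4.0], axes, weights)
```

Because the score is normalized by the input energy, scaling the input frame leaves the similarity unchanged, which is one reason such a measure tolerates level differences between utterances.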

When phonemes are the target, the Lp dictionary 2a is built by collecting, for each phoneme, many single-frame spectral data taken at the phoneme center, applying the KL expansion to their covariance matrix, and holding the resulting eigenvectors and eigenvalues as the dictionary. When the basic unit is instead a syllable, for example a consonant-vowel (CV) syllable, the CV part is extracted over several frames and, combined with the channels in the frequency direction, expressed as a single one-dimensional vector; the Lp dictionary 2a can then be built in the same way as for phonemes.

The Lt dictionary 2b, on the other hand, holds entries defined as special phoneme symbols that represent the transitional parts between phonemes or between syllables. Each is represented by a vector whose number of dimensions corresponds to the time length best suited to expressing the characteristics of the transition. Specifically, for a transition between phonemes, the data of the n frames before and after the center of the transitional part are extracted and the dictionary is created in the same way as the Lp dictionary. The Lt dictionary may differ from the Lp dictionary in the number of dimensions.

The Lt dictionary need not cover every combination of phonemes; it may cover only those between vowels or between voiced sounds, which have an important effect on recognition performance. Moreover, since the information is concentrated at the center of the transition, pairs whose connection is reversed, such as "ai" and "ia", may be prepared as a single entry, for example with n = 1.

The similarity calculation unit 3 thus calculates the similarity between the standard patterns of the phoneme symbols Lp and Lt registered in the Lp dictionary 2a and the Lt dictionary 2b and the spectral data obtained at each analysis interval.

The similarity information is stored in the similarity storage unit 4 for each frame in correspondence with the phoneme symbols. The maximum similarity sequence calculation unit 5 then refers to the connection information table 6 and the duration table 7 and extracts, from the phoneme symbol sequences that are valid as phoneme sequences, the one with the maximum similarity sum.

These tables 6 and 7 store, for each of the phoneme symbols Lp and Lt, its duration information and the phoneme symbols that may precede it, as shown for example in FIG. 2. Specifically, the phoneme symbol (h) is shown to have a duration of 32 to 1Bo msec (the upper bound is garbled in the source) and to be connectable to the phoneme symbols (a, i, u, e, o, N, ai, au, ...).

Likewise, the phoneme symbol (a) is shown to have a duration of 32 to 400 msec and to be connectable to the phoneme symbols (i, u, e, o, N, ..., ia, ua, ea, ...).

Similarly, the phoneme symbol (ai) for the transitional part between phonemes is shown to have a duration of 32 to 100 msec and to allow (a) as its preceding phoneme symbol.
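The duration and predecessor tables described above might be held in memory roughly as follows. This is only a sketch: the class and function names are hypothetical, only entries that are legible in the text are included, the transitional symbol printed as (at) is taken to be the a-to-i transition "ai" (an assumption), and rounding the millisecond limits inward to whole 10 msec frames is likewise an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolEntry:
    min_ms: int                 # shortest allowed duration
    max_ms: int                 # longest allowed duration
    predecessors: frozenset     # symbols that may come immediately before
    transitional: bool = False  # True for Lt symbols, False for Lp symbols


# Partial table; garbled entries from the source are deliberately left out.
TABLE = {
    "a": SymbolEntry(32, 400, frozenset({"i", "u", "e", "o", "N", "ia", "ua", "ea"})),
    "ai": SymbolEntry(32, 100, frozenset({"a"}), transitional=True),
}


def frame_bounds(entry, frame_ms=10):
    """Convert the millisecond limits to (min_frames, max_frames), rounding inward."""
    return math.ceil(entry.min_ms / frame_ms), entry.max_ms // frame_ms


def can_follow(prev, cur, table=TABLE):
    """True if symbol `cur` is allowed to come directly after symbol `prev`."""
    return prev in table[cur].predecessors
```

With a 10 msec frame, `frame_bounds(TABLE["a"])` gives a range of 4 to 40 frames, and `can_follow("a", "ai")` holds while `can_follow("i", "ai")` does not.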

The maximum similarity sequence calculation unit 5 then finds, in the same manner as the method proposed in Japanese Patent Application No. 60-59880, the phoneme symbol string with the maximum similarity sum among the strings that satisfy both the duration condition of each phoneme symbol and the connection conditions between phoneme symbols.

Because the quantity being maximized is expressed as a sum, the maximization can be carried out by, for example, dynamic programming.

Let S(t; p) be the similarity to phoneme symbol p (1 ≤ p ≤ P) obtained for each frame t. The calculation unit 5 is provided with a storage area T(t) for the cumulative similarity, a storage area L(t) for the corresponding label name, and a pointer storage area F(t) used to recover the optimal result. Each of these storage areas is initialized to zero (0).

Consider now the t-th frame, and let the maximum duration Lmax and minimum duration Lmin be given for each phoneme symbol p. The following processing is executed: for every frame t' and phoneme symbol p satisfying t - Lmax ≤ t' ≤ t - Lmin and 1 ≤ p ≤ P, the similarity sum B(t'; p) is calculated. This is done for all such frames t' and all phoneme symbols p, and the largest of the resulting values B(t'; p) is selected.

The maximum value B(t'; p)max of B(t'; p) is stored in the storage area T(t), and the pmax and t'max that give it are stored in the storage areas L(t) and F(t), respectively.

By repeating this processing over the interval of the input speech, the maximum similarity sum T(tend; p) for each phoneme symbol p is obtained at the end of the input speech (frame tend).
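The forward computation described above can be sketched as a small dynamic program. The code below is an illustration under stated assumptions rather than the patented implementation: the tables are indexed by both frame and symbol (as the later use of T(tend; p) and F(TO; P) suggests), unreachable cells are marked with negative infinity instead of zero, any symbol is allowed to start the utterance, and the similarities, symbol names, and duration limits are made-up toy values.

```python
NEG = float("-inf")  # marks "no valid sequence ends here" (the text initializes to 0)


def forward_pass(S, symbols, preds, min_len, max_len):
    """Fill the tables T, L, F for frames 1..n.

    S[t][p]  : similarity of frame t to symbol p (S[0] is an unused placeholder)
    preds[p] : symbols allowed to precede p
    min_len[p], max_len[p] : duration limits of p, in frames
    """
    n = len(S) - 1
    # cum[p][t] = S[1][p] + ... + S[t][p], so a segment sum is one subtraction
    cum = {p: [0.0] * (n + 1) for p in symbols}
    for p in symbols:
        for t in range(1, n + 1):
            cum[p][t] = cum[p][t - 1] + S[t][p]

    T = [{p: NEG for p in symbols} for _ in range(n + 1)]   # best cumulative similarity
    L = [{p: None for p in symbols} for _ in range(n + 1)]  # label of the preceding symbol
    F = [{p: 0 for p in symbols} for _ in range(n + 1)]     # last frame of the preceding symbol

    for t in range(1, n + 1):
        for p in symbols:
            # p occupies frames t0+1..t, hence t - Lmax <= t0 <= t - Lmin
            for t0 in range(max(0, t - max_len[p]), t - min_len[p] + 1):
                seg = cum[p][t] - cum[p][t0]
                if t0 == 0:
                    candidates = [(seg, None)]  # assumption: any symbol may come first
                else:
                    candidates = [(T[t0][q] + seg, q) for q in preds[p]
                                  if T[t0].get(q, NEG) > NEG]
                for score, q in candidates:
                    if score > T[t][p]:
                        T[t][p], L[t][p], F[t][p] = score, q, t0
    return T, L, F


# Toy input: six frames that move from "a" through a transition "ai" to "i".
symbols = ["a", "ai", "i"]
S = [None,
     {"a": 0.9, "ai": 0.2, "i": 0.1},
     {"a": 0.9, "ai": 0.3, "i": 0.1},
     {"a": 0.4, "ai": 0.8, "i": 0.3},
     {"a": 0.2, "ai": 0.8, "i": 0.5},
     {"a": 0.1, "ai": 0.3, "i": 0.9},
     {"a": 0.1, "ai": 0.2, "i": 0.9}]
preds = {"a": set(), "ai": {"a"}, "i": {"a", "ai"}}
min_len = {"a": 2, "ai": 1, "i": 2}
max_len = {"a": 4, "ai": 2, "i": 4}
T, L, F = forward_pass(S, symbols, preds, min_len, max_len)
```

In this toy run the best sequence ending in "i" at the last frame passes through the transitional symbol, so `L[6]["i"]` points back to "ai" and `F[6]["i"]` records that "ai" ended at frame 4.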

The result determination unit 8 determines the correspondence of the phoneme symbols from the results obtained by the maximum similarity sequence calculation unit 5, that is, from the contents of the storage areas T(t), L(t), and F(t). Because the phoneme symbol string obtained here contains the phoneme symbols Lt that indicate transitional parts, those symbols must be removed to produce the final recognition result.

FIG. 3 shows the flow of the result determination processing applied to the phoneme symbol sequence with the maximum similarity sum, including the removal of the phoneme symbols Lt.

In this result determination processing, the phoneme symbol p that maximizes the maximum similarity sum T(tend; p) is first searched for among the phoneme symbols Lp and is denoted pmax (step a).

Next, the processing variables are initialized as TO ← tend and P ← pmax (step b), the assignment FROM ← F(TO; P) + 1 is executed (step c), and the data (P, FROM, TO) representing the phoneme symbol and its start and end positions are obtained (step d).

Then the assignments

    P ← L(TO; P)
    TO ← FROM - 1
    FROM ← F(TO; P) + 1

are executed to obtain the next symbol (step e). This is repeated until TO becomes 0 (step f), so that the data (P, FROM, TO) representing each phoneme symbol and its start and end positions are obtained one after another in reverse order.
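Steps a through f amount to a backtrace over the stored tables. The sketch below follows the assignments as written; the list-of-dictionaries layout for T, L, and F and the hand-filled toy entries are assumptions made only so the example is self-contained.

```python
def backtrace(T, L, F, final_symbols):
    """Recover (symbol, first_frame, last_frame) segments in time order.

    T[t][p], L[t][p], F[t][p] : tables produced by the forward computation
    final_symbols             : the Lp symbols allowed to end the utterance (step a)
    """
    t_end = len(T) - 1
    P = max(final_symbols, key=lambda p: T[t_end][p])  # step a: best final symbol
    TO = t_end                                         # step b
    FROM = F[TO][P] + 1                                # step c
    segments = [(P, FROM, TO)]                         # step d
    while True:
        # step e: the right-hand side is evaluated with the old TO and P
        P, TO = L[TO][P], FROM - 1
        if TO == 0:                                    # step f: reached the start
            break
        FROM = F[TO][P] + 1
        segments.append((P, FROM, TO))
    segments.reverse()                                 # found last-to-first
    return segments


# Hand-filled toy tables for a six-frame utterance (only visited cells are set).
T = [{} for _ in range(7)]
L = [{} for _ in range(7)]
F = [{} for _ in range(7)]
T[6] = {"a": 2.9, "i": 5.2}
L[6]["i"], F[6]["i"] = "ai", 4
L[4]["ai"], F[4]["ai"] = "a", 2
L[2]["a"], F[2]["a"] = None, 0
path = backtrace(T, L, F, final_symbols=["a", "i"])
```

Here `path` lists the segments from first to last, with the transitional symbol still present between its neighbours.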

After that, each phoneme symbol p obtained is examined to determine whether it is a phoneme symbol Lt defined for a transitional part between phonemes (step h). If it is, the frame at which its similarity S(t; p) is largest is detected as the center of the transition (step i). The phoneme symbol Lt is divided in two at this center, its two halves are merged into the preceding and following phoneme symbols Lp, and the phoneme symbol Lt is removed (step j). This processing is executed for all phoneme symbols p (step g), yielding the sequence of phoneme symbols Lp with the maximum similarity sum.

If the phoneme boundaries need not be determined precisely, simply deleting the phoneme symbols Lt is sufficient.
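Both ways of removing the transitional symbols can be sketched in a few lines. The function names are hypothetical, and two details are assumptions not settled by the text: the center frame itself is given to the preceding symbol, and a transitional segment at the very start or end of the utterance is absorbed whole by its only neighbour.

```python
def merge_transitions(segments, S, transitional):
    """Split each Lt segment at its best-matching frame and merge the halves.

    segments     : (symbol, first_frame, last_frame) tuples in time order
    S[t][p]      : similarity of frame t to symbol p
    transitional : set of Lt symbol names
    """
    out = []
    carry_start = None  # start frame to hand to the next Lp segment
    for sym, first, last in segments:
        if sym in transitional:
            if out:
                # steps i and j: frame of maximum similarity is the transition center
                center = max(range(first, last + 1), key=lambda t: S[t][sym])
                prev_sym, prev_first, _ = out[-1]
                out[-1] = (prev_sym, prev_first, center)
                carry_start = center + 1
            else:
                carry_start = first  # no predecessor: next symbol takes it all
        else:
            start = first if carry_start is None else carry_start
            out.append((sym, start, last))
            carry_start = None
    if carry_start is not None and out:
        # trailing Lt segment: extend the last Lp symbol to the end
        sym, first, _ = out[-1]
        out[-1] = (sym, first, segments[-1][2])
    return out


def drop_transitions(segments, transitional):
    """Simpler variant: delete Lt segments and leave the boundaries untouched."""
    return [seg for seg in segments if seg[0] not in transitional]


# Toy data: the transition "ai" matches best at frame 3.
segments = [("a", 1, 2), ("ai", 3, 4), ("i", 5, 6)]
S = [None, {}, {}, {"ai": 0.9}, {"ai": 0.6}, {}, {}]
merged = merge_transitions(segments, S, {"ai"})
dropped = drop_transitions(segments, {"ai"})
```

The merged result keeps contiguous frame ranges, whereas the simple deletion leaves a gap where the transition was, which is acceptable when only the symbol sequence matters.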

With this processing, as illustrated in FIG. 4, where (a) shows the speech power of the input speech "hai" and (b) and (c) show the series of similarities to the phoneme symbols and the phoneme symbol strings with the maximum similarity sum, the optimal phoneme segmentation can be obtained from the similarities to the phoneme symbols calculated at each interval by bringing in knowledge of phoneme durations and their connection relationships.

Here FIG. 4(b) shows the case in which only the phoneme symbols Lp are used, as proposed in Japanese Patent Application No. 60-59880, and FIG. 4(c) shows the phoneme symbol string obtained when the phoneme symbols Lp and Lt according to the present invention are used.

As this comparison in FIG. 4 shows, the present invention defines phoneme symbols Lt for the transitional parts between phonemes or syllables and takes these parts into account when obtaining the phoneme symbol string of the input speech, so the junction between phonemes is no longer misrecognized as another phoneme symbol. Unwanted phoneme symbols therefore no longer arise in the transitional parts between phonemes or syllables, and the input speech can be recognized with high accuracy.

Although phonemes have been used here as the example of phoneme symbols, the device can be configured in the same way for syllables, with the same effect. The processing that finds the phoneme symbol sequence with the maximum similarity sum can also be modified. The present invention can otherwise be carried out with various modifications without departing from its gist.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram of a device according to an embodiment of the present invention; FIG. 2 shows an example of the structure of the table storing phoneme durations and permissible preceding phonemes; FIG. 3 shows the flow of the result determination processing; FIG. 4 shows the phoneme symbol sequences detected for an input speech; FIG. 5 is a schematic block diagram of the device previously proposed by the present inventor; and FIG. 6 illustrates the concept of the processing for calculating the maximum similarity sum.

1: acoustic analysis unit; 2: phoneme dictionary; 2a: Lp dictionary; 2b: Lt dictionary; 3: similarity calculation unit; 4: similarity storage unit; 5: maximum similarity sequence calculation unit; 6: connection information table; 7: duration table; 8: result determination unit.

Applicant's agent: Patent attorney Takehiko Suzue

Claims

[Claims]

(1) A speech recognition device comprising: means for analyzing input speech to obtain feature parameters of the input speech at regular intervals; means for calculating the similarity of each feature parameter; means for obtaining candidate phoneme symbol strings that are permissible as phonemes according to connectability information between the phoneme symbols, information on the duration of each phoneme, and the similarity of each feature parameter to the phoneme symbols; and means for obtaining, among these phoneme symbol strings, the phoneme symbol sequence having the maximum sum of similarities.

(2) The speech recognition device according to claim 1, wherein the recognition result for the input speech is obtained as the phoneme symbol sequence formed by removing the phoneme symbols for transitional connections between phonemes or syllables from the phoneme symbol sequence that gives the maximum similarity sum.

(3) The speech recognition device according to claim 1, wherein the maximum similarity sum of phoneme symbol sequences is calculated using dynamic programming.