JPS60164798A

JPS60164798A - Monosyllabic voice recognition equipment

Info

Publication number: JPS60164798A
Application number: JP59017263A
Authority: JP
Inventors: 狩野　光彦; 安弘松田; 集手塚
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1984-02-03
Filing date: 1984-02-03
Publication date: 1985-08-27
Also published as: JPH0246957B2

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Field of Application] The present invention relates to a monosyllabic speech recognition device, and in particular to one that improves the recognition rate while keeping a simple configuration.

[Prior Art] In recent years, speech recognition devices have been entering practical use as input devices for computers and various control devices. A speech recognition device that recognizes human speech as spoken has several advantages: it requires no special training to use and places no constraint on the user's line of sight or limbs. However, most speech recognition devices in practical use today are word speech recognition devices that recognize speech word by word, and despite these advantages they suffer from a limited vocabulary.

For these reasons, monosyllabic speech recognition systems, which recognize speech syllable by syllable, have recently been attracting attention. As is well known, Japanese is written with phonetic characters, so each syllable corresponds almost one-to-one to a kana character, and a monosyllabic speech recognition system can therefore serve as an input device for many purposes. In particular, with the spread of Japanese word processors and workstations, many attempts have been made to use monosyllabic speech recognition as the input means for such equipment.

In a monosyllabic speech recognition system, the consonant part of the syllable's feature parameter time series (hereinafter simply called a pattern) is usually cut out from the whole syllable, and matching is performed between consonant parts. This is because syllable patterns offer fewer distinguishing features between patterns than word patterns do, and because the vowel part is generally far longer than the consonant part, so that small differences in similarity between consonant parts are masked by fluctuations in the pattern of the following vowel part.

The paper by Yoshida et al., "Japanese Monosyllabic Speech Recognition Experiment" (Proceedings of the Acoustical Society of Japan, 3-2-16, 1979), gives one example of such consonant extraction. In this example, the consonant part is cut out by an algorithm that takes a predetermined position among the points where the syllable's feature vector changes greatly as the consonant/vowel boundary, and consonant information is obtained by matching this consonant part against registered standard consonant patterns using endpoint-free dynamic programming matching (hereinafter DP matching).

Furui's paper, "Monosyllable Recognition and Its Application to Large-Vocabulary Word Speech Recognition" (Transactions of the Institute of Electronics and Communication Engineers (A), 65-A, 2, pp. 175-182, 1982), shows another method of cutting out the consonant part. In this example, the beginning of the syllable is cut out at a predetermined ratio of the total syllable length and compared with similarly cut registered beginnings to obtain consonant information, for example by linear matching.

The paper by Nakagawa et al., "Large-Vocabulary Word Speech Recognition by Monosyllable-Unit Input from Unspecified Speakers" (Transactions of the Institute of Electronics and Communication Engineers (D), 65-D, 12, pp. 1558-1565, 1982), discloses yet another extraction method. Because the target is unspecified speakers, the registered patterns can be fixed, so the consonant/vowel boundary point j of each registered pattern is determined in advance by visual inspection. This point is estimated by considering, for example, the waveform of the speech signal and the formants. For an unknown input pattern, DP matching is performed between the registered pattern of the whole syllable and the unknown input pattern, and the point i at which the optimal path passes through the boundary point j is taken as the consonant/vowel boundary point of the unknown input pattern. Various methods, including endpoint-free DP matching, have been proposed for obtaining the consonant information. This paper also discloses consonant extraction by an algorithm similar to that of Furui's paper above.

For identifying vowel information, all of the above papers use methods simpler than DP matching. The vowel information and the consonant information are then combined to identify the syllable.

In these conventional configurations, however, a complicated algorithm had to be employed to separate the consonant part from the vowel part. Otherwise, the fixed registered patterns had to be inspected visually in advance, which demands tedious work and considerable experience. Errors in identifying the consonant/vowel separation point could also increase misrecognition of the syllable, and especially of its consonant information. Furui's method cuts out the consonant part mechanically at a fixed ratio of the total syllable length and so avoids these problems, but because it ignores the variation of the consonant/vowel boundary point from syllable to syllable, the recognition rate is expected to differ from one syllable to another.

For the final syllable recognition step, one method obtains consonant information and vowel information separately and determines the syllable from their combination. For example, the syllable "キ" (ki) is identified from the vowel information "i" and the consonant information "k". Another known method obtains the vowel information first, takes the syllables having that vowel as their following vowel as recognition candidates, and performs DP matching or the like on those candidates. In the latter method the final matching can take full account of coarticulation between consonant and vowel, so discrimination is good. Such two-stage evaluation is described in JP-A-54-145409, JP-A-58-52694, and JP-A-58-59498.

In this case, however, carrying out the two-stage evaluation is cumbersome.

As described later, one embodiment of the present invention resolves this problem.

[Problems to Be Solved by the Invention] The present invention has been made in view of the above circumstances, and its object is to provide a monosyllabic speech recognition device that can obtain consonant information simply and reliably without separating the consonant from the vowel.

[Means for Solving the Problems] To achieve this object, the invention performs distance calculation by DP matching between the unknown input pattern and a registered standard pattern, and identifies the consonant information of the unknown input pattern based on the minimum cumulative distance at a predetermined intermediate point between the word beginning and word end of the unknown input pattern or the registered standard pattern.

In one aspect of the invention, a minimum cumulative distance is also obtained at a second intermediate point located closer to the word end than the intermediate point used for consonant information, that is, a point covering more of the vowel information source. Vowel information is identified from this distance, and consonant information is then identified only for candidate standard patterns whose following vowel is the identified vowel. In other words, two-stage matching may be performed. In this case the minimum cumulative distance at the consonant intermediate point is obtained as a by-product of the distance calculation for vowel discrimination, so a single pass of distance calculation suffices.

In another aspect of the invention, the DP matching calculation for obtaining vowel information may be performed from the word end back to a predetermined intermediate point. In this case the vowel of the unknown input pattern can be identified reliably with standard patterns for only the five vowels, so the amount of calculation is far smaller than when all syllables are referenced.

[Embodiment] An embodiment in which the invention is applied to a speaker-dependent speech recognition device is described below with reference to the drawings.

FIG. 1 shows the embodiment as a whole. In FIG. 1, the speaker's voice is fed to a microphone 1, converted into an audio signal, and supplied to an A/D converter 2. The A/D converter has, for example, a sampling frequency of 20 kHz and a data length of 12 bits. The data from the A/D converter 2 is supplied to a feature parameter extraction unit 3, which forms a feature parameter time series (pattern) from it.

In this example a logarithmic spectrum is used as the feature parameter, as detailed later.

Because this example is speaker-dependent, training precedes recognition. The speaker utters a predetermined number of syllables to be identified, for example 68, into the microphone 1; these are processed in turn by the A/D converter 2 and the feature parameter extraction unit 3, and the standard pattern of each syllable is supplied to a recognition unit 4. During training, a switching circuit 5 of the recognition unit 4 is set to side a so that the patterns are registered in a store unit 6 of the recognition unit 4. After this preparatory stage, the speaker inputs speech syllable by syllable at a rate of, for example, 100 syllables per minute. Each syllable is derived as an unknown input pattern through the feature parameter extraction unit 3 and stored in the store unit 6 of the recognition unit 4. At this time the switching circuit 5 is set to side b. Each unknown input pattern is compared in turn with the 68 standard patterns, and the best match is sent through an output circuit 7 to an output device 8 such as a printer or monitor. Second and third candidates may of course be output in addition to the best match.

The feature parameter extraction unit 3 of FIG. 1 forms the time series of logarithmic spectra as shown in FIGS. 2 to 6. The digital data from the A/D converter 2 is pre-emphasized, and from the pre-emphasized data the energy Ei is obtained for each time frame i, here every 10 msec, where

Ei = 10 log10 (mean of the squared amplitude).

A normalized energy Ein is then obtained from the maximum energy Emax and the minimum energy Emin, so that Ei is normalized to a value from, for example, 0 to 32.

Here,

Ein = 32 (Ei − Emin) / (Emax − Emin).

An appropriate threshold is set from the time distribution of the normalized energy Ein obtained in this way, and the boundaries between syllables are determined.
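The frame energy and its normalization described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names, the non-overlapping framing, and the small floor that guards against the logarithm of zero are assumptions.

```python
import math


def frame_log_energy(samples, frame_len):
    """Return Ei = 10*log10(mean squared amplitude) for each full frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Floor avoids log10(0) on silent frames (the floor value is an assumption).
        energies.append(10.0 * math.log10(max(mean_sq, 1e-10)))
    return energies


def normalize_energy(energies, scale=32.0):
    """Map each Ei to Ein = scale * (Ei - Emin) / (Emax - Emin)."""
    e_min, e_max = min(energies), max(energies)
    if e_max == e_min:
        # Degenerate case: constant energy, nothing to normalize.
        return [0.0] * len(energies)
    return [scale * (e - e_min) / (e_max - e_min) for e in energies]
```

At the 20 kHz sampling rate mentioned in the text, a 10 msec frame corresponds to frame_len = 200. A syllable boundary detector would then compare the normalized values against a threshold chosen from their distribution.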

Short-time spectrum analysis is also performed on the pre-emphasized digital data. A Winograd fast Fourier transform with a Hamming window as the time function is executed over a 20 msec range (400 points) while the data is advanced one frame of 10 msec at a time. The resulting power spectrum is converted to logarithmic form, and the spectrum from 100 Hz to 7900 Hz is divided into 19 frequency bands. Specifically, 100 Hz and 200 Hz form one band, 300 Hz and 400 Hz likewise form one band, and so on, up to the band from 7000 Hz to 7900 Hz.

The average power spectrum of the unvoiced portions is then taken as background noise and subtracted from the power spectrum of the voiced portions (each syllable).
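The background noise subtraction above amounts to a per-band mean over the unvoiced frames followed by a per-band difference. The following is a minimal sketch; the function name and the list-of-lists representation (one list of band values per frame) are assumptions for illustration.

```python
def subtract_noise(voiced_frames, unvoiced_frames):
    """Subtract the per-band mean of the unvoiced frames from each voiced frame."""
    bands = len(unvoiced_frames[0])
    count = len(unvoiced_frames)
    # Background noise estimate: average spectrum over the unvoiced frames.
    noise = [sum(f[b] for f in unvoiced_frames) / count for b in range(bands)]
    return [[f[b] - noise[b] for b in range(bands)] for f in voiced_frames]
```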

The power spectrum with the background noise subtracted, that is, the feature pattern, is normalized in the time direction and then subjected to a nonlinear transformation of its frequency components. Consider a feature pattern over times 1 to np and frequency bands 1 to m, as shown in FIG. 3. Here np is the number of frames of each syllable and m is the number of bands, m = 19 in this example. The logarithmic power spectrum at each time and band is shown simply as a circle. For normalization in the time direction, both the standard pattern (length TR) and the unknown input pattern (length Tz) are converted to a predetermined pattern length (Tn) by linear interpolation, as shown in FIG. 4. The pattern length Tn is determined separately by experiment; for example, it may be set to 1.2 times the average syllable length of the standard patterns.
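Time normalization by linear interpolation can be sketched as below. This is an illustrative version under stated assumptions: the function name is hypothetical, a pattern is a list of frames with each frame a list of band values, and the first and last frames are mapped onto the first and last output frames.

```python
def resample_pattern(pattern, target_len):
    """Linearly interpolate a list of feature frames to target_len frames."""
    n = len(pattern)
    if n == 1 or target_len == 1:
        # Nothing to interpolate between; repeat the first frame.
        return [list(pattern[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        # Fractional position of output frame k on the original time axis.
        pos = k * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(pattern[lo], pattern[hi])])
    return out
```

Both the standard patterns and the unknown input pattern would be passed through the same function with target_len = Tn, so that every pattern entering the matching stage has the same number of frames.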

The nonlinear transformation of the frequency components is performed as shown in FIGS. 5 and 6. As shown in FIG. 5, the time-normalized feature pattern is denoted Vij and its maximum energy ViMAX = MAX(Vij). As shown in FIG. 6, Vij is then converted to a value from 0 to 255 according to the expression below.

If Vij < ViMAX − Vb, the converted value is 0. If Vij > ViMAX − Vb, the converted value is given by an expression that is not legible in this text. Vb is determined separately by experiment. When Vb was varied over 30, 40, and 50, the optimum value was Vb = 40.

The optimum value of Vb is thought to be related to the ratio of the peak speech signal to the noise power. This nonlinear transformation mitigates the adverse effect of the noise power.

Next, the recognition unit 4 of FIG. 1 is described with reference also to FIGS. 7 and 8. The recognition unit 4 comprises the store unit 6, a cumulative distance calculation unit 9, and other elements.

The store unit 6 holds the input pattern PI and the standard patterns PRi, and the cumulative distance calculation unit 9 performs the DP matching calculation between the input pattern PI and each standard pattern PRi. The cumulative distance calculation unit 9 does not need to carry the matching calculation over the whole pattern. It is enough that the calculation is carried at least up to time tv on the time axis of the input pattern (FIG. 7). In FIG. 7, time tc is the intermediate point for obtaining consonant information and time tv is the intermediate point for obtaining vowel information. These are detailed below.

First, a given unknown input pattern PI is matched against the 68 standard patterns PRi. As shown in FIG. 8, the DP matching calculation between the unknown input pattern PI and the first standard pattern PR1 is carried forward, and the minimum cumulative distance Dc at input time tc is obtained. There are various paths from the starting point (the word beginning in FIG. 7) to the grid points on the line at time tc (because of the matching window, the grid points within the range marked Wc), and the smallest of the cumulative distances of these paths is taken.

The DP matching calculation is then continued, and a minimum cumulative distance Dv is obtained in the same way at input time tv. The minimum cumulative distances Dc and Dv are stored in the store unit 6. The DP matching calculation is performed in the same way between the unknown input pattern and the remaining 67 standard patterns PRi (i = 2 to 68), and the minimum cumulative distances Dc and Dv are obtained for each.
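The single-pass computation of Dc and Dv can be sketched as follows. This is a minimal illustration, not the patent's circuit: the city-block frame distance, the three-way step pattern, the fixed band window, and all names are assumptions, and tc, tv, and window are treated as parameters to be tuned by experiment.

```python
INF = float("inf")


def frame_distance(a, b):
    """City-block distance between two feature frames (an assumed local distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def midpoint_distances(inp, ref, tc, tv, window):
    """Run DP matching from the word start and return (Dc, Dv).

    Dc is the minimum cumulative distance over the window at input frame tc,
    Dv the same at input frame tv; the calculation stops at tv.
    """
    assert 0 <= tc <= tv < len(inp)
    prev = None
    d_c = None
    for i in range(tv + 1):
        cur = [INF] * len(ref)
        lo = max(0, i - window)
        hi = min(len(ref) - 1, i + window)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                best = 0.0  # the path starts at the word beginning of both patterns
            else:
                best = INF
                if prev is not None:
                    best = min(best, prev[j])
                    if j > 0:
                        best = min(best, prev[j - 1])
                if j > 0:
                    best = min(best, cur[j - 1])
            cur[j] = best + frame_distance(inp[i], ref[j])
        if i == tc:
            d_c = min(cur)  # free end point within the window at tc
        prev = cur
    return d_c, min(prev)
```

Because the column at tc is visited on the way to tv, Dc costs nothing extra, which is the by-product the text describes. Calling the function once per standard pattern yields the 68 pairs (Dc, Dv) used in the decision stage.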

Once the minimum cumulative distances Dc and Dv have been obtained for all standard patterns PRi, the vowel is determined from the 68 minimum cumulative distances Dv at input time tv. The syllable whose Dv is smallest is found, and the vowel of that syllable is taken as the detected vowel information.

Next, for the syllables whose following vowel is the detected vowel (for example, "あ", "か", "さ", and so on when the vowel is "あ"), the minimum cumulative distances Dc at time tc are selected as the candidate syllable data, and the candidate with the smallest Dc is output as the syllable detection result. These operations are carried out by the vowel detection unit 10, the selection circuit 11, and the syllable detection unit 12 of FIG. 1.
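The two-stage decision above can be sketched in a few lines. This is an illustrative version: the function name, the dictionary layout, and the romanized syllable labels are assumptions, and the ranked list is returned so that second and third candidates are also available.

```python
def recognize(distances, vowel_of):
    """Pick the vowel from Dv, then rank same-vowel syllables by Dc.

    distances: {syllable: (Dc, Dv)}
    vowel_of:  {syllable: vowel}
    Returns (detected vowel, candidate syllables sorted by increasing Dc).
    """
    # Stage 1: the vowel of the syllable with the smallest Dv.
    best_by_dv = min(distances, key=lambda s: distances[s][1])
    vowel = vowel_of[best_by_dv]
    # Stage 2: only syllables sharing that vowel compete on Dc.
    candidates = [s for s in distances if vowel_of[s] == vowel]
    ranked = sorted(candidates, key=lambda s: distances[s][0])
    return vowel, ranked
```

For example, with distances {"ka": (5.0, 2.0), "sa": (3.0, 2.5), "ki": (1.0, 9.0)} the detected vowel is "a" and the ranked candidates are ["sa", "ka"]; "ki" has the smallest Dc but is excluded because its vowel does not match.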

The intermediate points tc and tv are determined by experiment.

For example, good results were obtained with tc at a fixed fraction of Tn (the factor is not legible in this text) and tv at about Tn × 0.7.

The experimental results of applying this example to 68 syllables are shown in Table 1. One standard pattern is used per syllable (single-template method). In the table, the second candidate is the syllable with the second smallest minimum cumulative distance in the syllable detection unit 12 of FIG. 1.

Table 1. 68-syllable recognition results of this example

As is clear from Table 2, this example yields better recognition results than the conventional methods described above.

Table 2. Comparison of this example with conventional examples

As described above, this embodiment needs no consonant/vowel segmentation, so monosyllables can be recognized with a very simple configuration. In addition, consonant information is obtained as a by-product of the DP matching calculation for vowel information, which reduces the amount of calculation. The configuration is also well suited to hardware implementation.

This embodiment can also improve the recognition rate, for the following reason. As noted above, the feature vectors of a syllable carry consonant information at the word beginning and vowel information over a wide range from the middle of the word to its end. This is shown schematically in FIG. 9. By visual inspection or experience, the boundary between the two would be placed as shown by the broken line. However, a consonant influences the following vowel, and the transition region in particular is thought to contain elements important for determining the consonant. Detecting the consonant/vowel boundary and cutting out the consonant therefore makes the consonant information less certain rather than more. In this example the intermediate point tc for obtaining consonant information is determined by experiment without regard to the consonant/vowel boundary, for example as shown in FIG. 9, so better recognition is possible. Furthermore, the possible end points of the path used for consonant determination are the group of grid points marked Wc in FIG. 7, so the corresponding end point on the standard pattern can be chosen within this range. The fine matching between the input pattern and the standard pattern in the consonant region can thus be carried out with more freedom, which is thought to improve the recognition rate.

The invention is not limited to the embodiment described above and can be modified in various ways without departing from its spirit. For example, it may be applied to 101 monosyllables including contracted sounds (yōon), and multiple templates may be used. The 101-monosyllable recognition results obtained with the same configuration as the embodiment are shown in Table 3, and the comparison with conventional methods using multiple templates is shown in Table 4.

Table 3. Recognition results of this invention for 101 syllables

Table 4. Recognition results with multiple templates

In the embodiment above, DP matching was performed from the word beginning even when obtaining vowel information, but DP matching for vowel information may instead be performed from the word end. In that case, taking the intermediate point at 20 to 30% of the word length is enough to determine the vowel. Only five standard vowel patterns are needed: "ア", "イ", "ウ", "エ", and "オ". Recognition errors caused by the influence of consonants are rare, and unlike the embodiment above there is no need to obtain 68 or 101 items of consonant information as a by-product of the vowel calculation. The amount of calculation can therefore be made very small. The candidate syllables are of course narrowed down from the vowel recognition result. Separately, 68 or 101 standard patterns are prepared for identifying consonant information; the candidate syllables are referenced for consonant information according to the invention, and identification of the input syllable is then complete.

In the DP matching calculation for obtaining consonant information, the starting point and the intermediate point tc may be made variable. Syllable recognition may also be performed for each of several pairs of starting point and intermediate point, with the final syllable decided by majority vote.
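The majority vote over several pairs of starting point and intermediate point can be sketched as below. The function name is hypothetical, and the tie-breaking rule (the result seen first among equally frequent ones wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def majority_vote(results):
    """Return the syllable recognized most often across the settings tried."""
    counts = Counter(results)
    # most_common keeps first-seen order among equal counts (Python 3.7+).
    return counts.most_common(1)[0][0]
```

Each entry of results would be the syllable recognized with one pair of starting point and intermediate point tc.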

The intermediate point may of course be chosen on the time axis of the registered standard pattern instead, with the consonant information obtained from the minimum cumulative distance at that point.

[Effects of the Invention] As described above, according to the invention, distance calculation is performed by DP matching between the unknown input pattern and the registered standard pattern, and the consonant information of the unknown input pattern is obtained from the minimum cumulative distance at a predetermined intermediate point between the word beginning and word end of the unknown input pattern or the registered standard pattern. There is therefore no need to determine the consonant/vowel boundary point, which simplifies the configuration and eliminates the errors that accompany boundary determination. Moreover, the matching used to obtain consonant information is given freedom in its end point, so the recognition rate can be improved.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of the invention. FIGS. 2, 3, 4, 5, and 6 are diagrams for explaining the feature parameter extraction unit 3 of the embodiment of FIG. 1. FIGS. 7 and 8 are diagrams for explaining the recognition unit 4 of the embodiment of FIG. 1. FIG. 9 is a diagram for explaining the effect of the embodiment of FIG. 1.

1: microphone, 2: A/D converter, 3: feature parameter extraction unit, 4: recognition unit, 8: output device.

[Figures: axis labels include frequency, time, ViMAX, and input pattern PI.]

Procedural amendment (formal), dated April 24, 1984 (Showa 59). Case: 1984 (Showa 59) Patent Application No. 17263. Title of the invention: Monosyllabic speech recognition device. Person making the amendment: patent applicant. Subject of the amendment: full text of the specification. Content of the amendment: as per the attached sheet.

Claims

[Claims]

A monosyllabic speech recognition device that analyzes the speech signal of an unknown input monosyllable and recognizes the unknown input monosyllable by comparing the input feature parameter time series extracted from the speech signal with registered feature parameter time series registered in advance, characterized in that distance calculation is performed according to the dynamic programming matching method from the word beginnings of the input feature parameter time series and the registered feature parameter time series, and consonant information of the unknown input monosyllable is obtained based on the minimum cumulative distance at a predetermined intermediate point between the word beginning and the word end of the input feature parameter time series or the registered feature parameter time series.