JPS63236095A

JPS63236095A - Voice recognition

Info

Publication number: JPS63236095A
Application number: JP62070923A
Authority: JP
Inventors: 哲夫小坂; 康弘小森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1987-03-25
Filing date: 1987-03-25
Publication date: 1988-09-30

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a word speech recognition method for a word speech recognition device that uses a word dictionary in phonemic notation, and in particular to a top-down speech recognition method that performs recognition after hypothesizing a word.

[Conventional technology]

Conventional speech recognition methods fall broadly into the following two types.

(1) Recognition methods that use the word as the unit.

(2) Recognition methods that use a unit smaller than the word, such as the phoneme.

The first method holds the feature parameters of each whole word as a standard pattern and recognizes input speech by DP-based pattern matching against those patterns, achieving a high recognition rate for vocabularies of up to several hundred words. However, every word to be recognized must be registered by uttering it, which is laborious. In addition, because whole words serve as the standard patterns, the word dictionary becomes large, so the method is not suited to recognizing large vocabularies of 1,000 words or more.

The second method, by contrast, holds a standard pattern for each phoneme, so the word dictionary only needs to be registered in phonemic notation. Registration takes little effort and the dictionary stays small, so this method is better suited to large vocabularies than the first.

In the second method it is common to perform segmentation first, to determine the interval in which each phoneme exists, and then to recognize the phonemes. A top-down recognition method is an alternative. In this method a word to be recognized is hypothesized, the standard patterns of the phonemes contained in that word are matched against the feature time series of the input speech, and a score is calculated. Recognition is then performed by obtaining a score for the whole word from the scores of the individual phonemes. Because only the standard patterns of the phonemes required for the hypothesized word are used, spurious insertions and deletions of phonemes are less likely than with methods that segment first, and recognition accuracy improves.

An example is Sugiyama et al., "Top-down acoustic processing in a speech recognition system" (Acoustical Society of Japan, Speech Study Group Materials 582-62).

As an example that applies DP matching to top-down processing in order to reduce the amount of computation, there is Moriai et al., "Automatic labeling of isolated uttered words using a phoneme discrimination filter" (Acoustical Society of Japan, Speech Study Group Materials 585-54). Although this work by Moriai et al. does not deal with speech recognition directly, the method can be applied as is to top-down speech recognition.

Top-down speech recognition using this DP matching is described below. In this method the similarity is obtained by DP matching, for example with the following recurrence.

$g(i, 0) = g(0, j) = -\infty \quad (i \neq 0,\ j \neq 0)$ ... (4)

where
$I$: length of the phoneme symbol string,
$J$: number of frames of the input speech,
$V$: phoneme symbol string $(v_1, v_2, \ldots, v_i, \ldots, v_I)$,
$X$: feature time series of the input speech $(x_1, x_2, \ldots, x_j, \ldots, x_J)$,
$g(i, j)$: cumulative DP similarity,
$s(v, x)$: similarity between $v$ and $x$,
$A(i, j)$: similarity between $v_i$ and $x_j$.

As the window restriction when calculating $g(i, j)$, for example

$i/2 \le j \le (i - I - 1)/2 + J + 1$ ... (5)

is used.
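The following Python sketch illustrates a windowed DP of this general kind. The step rule (each frame either stays in the current phoneme or advances from the previous one) is an assumption made for illustration, since only the boundary condition (4) and an example window (5) are given in the text; `windowed_dp`, `sim`, and `in_window` are hypothetical names.

```python
import math

def windowed_dp(sim, in_window):
    """Cumulative DP similarity over phonemes i=1..I and frames j=1..J.

    sim[i-1][j-1] is the similarity between phoneme v_i and frame x_j.
    in_window(i, j) returns True when cell (i, j) may be computed.
    The step rule (stay in phoneme i, or advance from i-1) is an assumed
    recurrence, not one taken from the text.
    """
    I, J = len(sim), len(sim[0])
    NEG = -math.inf
    # Boundary condition (4): g(i, 0) = g(0, j) = -inf for i, j != 0.
    g = [[NEG] * (J + 1) for _ in range(I + 1)]
    g[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if not in_window(i, j):
                continue  # cells outside the window stay at -inf
            best_prev = max(g[i][j - 1], g[i - 1][j - 1])
            if best_prev > NEG:
                g[i][j] = best_prev + sim[i - 1][j - 1]
    return g[I][J]
```

Passing the window as a predicate keeps the recurrence independent of the window shape, so a linear band and a table-derived window can be compared with the same routine.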

Word recognition is performed on the basis of the similarity obtained in this way.

[Problems to be solved by the invention] The window restriction above is effective not only because it limits the region to be computed and so reduces the amount of calculation, but also because it prevents extreme matching. In an isolated uttered word, however, the duration of each phoneme is not constant within the word; phonemes at the end of a word are generally longer than those at the beginning. This property is described in detail in Sagisaka et al., "Phoneme duration control for speech synthesis by rule" (Transactions of the Institute of Electronics and Communication Engineers, '84/7, Vol. J67-A, No. 7).

For this reason, a linear window restriction such as equation (5) cannot always cope with the variation in phoneme duration according to position within the word. In particular, when the window is narrowed to avoid extreme matching, the input speech cannot be mapped accurately onto each phoneme.

[Means (and operation) for solving the problems] The present invention holds, for each number of moras, a table recording the temporal position of each phoneme in isolated uttered words. The values obtained from the table are adapted to the input word by stretching or compressing them to match its duration, and a weighted DP window is then set from the adapted values and a weighting function, thereby achieving the above object.

[Effect]

By using this DP window, even for speech in which phoneme durations differ within a word, as in isolated uttered words, the stretching and compression of those durations can be kept within the window, and the DP similarity calculation becomes more precise.

[Example]

FIG. 1-1 is an explanatory diagram of one embodiment of the present invention.

In FIG. 1-1, 1 is an acoustic analysis unit that performs frequency analysis of the input utterance; 2 is a feature extraction unit that extracts the features needed to identify phonemes; 3 is a preliminary selection unit that narrows down the candidate words in the word dictionary on the basis of the extracted features; 4 is a word dictionary registered as phoneme symbol strings; 5 is a mora count unit that counts the number of moras in a selected word; 6 is a mora start/end time table that stores the start and end times of the moras for each number of moras; 7 is a phoneme estimated interval calculation unit that calculates the estimated existence interval of each phoneme from the selected mora start and end times, the weighting function, the speech start time Ts (ms), and the speech end time Te (ms); 8 is a weighting function table that stores weighting functions; 9 is a phoneme standard pattern table that stores the standard pattern for each phoneme; and 10 is a word recognition unit that matches the feature time series obtained by feature extraction from the input speech against the phoneme standard patterns by DP under the window restriction obtained from 7.

FIG. 1-2 is a block diagram of a specific configuration for carrying out the recognition method of the present invention. In FIG. 1-2, 11 is an input unit for inputting speech, 12 is a disk that stores various data, and 13 is a control unit that controls the device (system) and includes a ROM storing a control program such as that shown in FIG. 1-3. 14 is a RAM that stores various data for the units shown in FIG. 1-1.

15 is an output unit. Each unit shown in FIG. 1-1 may have its own CPU, RAM, and ROM.

Next, the present invention is described in detail with reference to FIG. 1-3.

In the above configuration, the speech is frequency-analyzed by the acoustic analysis unit 1 (S1). The speech start time Ts and speech end time Te are also obtained at this point (S2). From the speech spectrum obtained by frequency analysis, the feature extraction unit 2 extracts the features needed for phoneme recognition (S3). Using the global characteristics of the resulting feature time series, the preliminary selection unit 3 selects from the word dictionary 4 several words whose features appear close to those of the input speech (S4). Because the word dictionary 4 is registered as phoneme symbol strings, the number of moras N of a selected word can be obtained by counting its vowels, moraic nasals, long vowels, and geminate consonants (sokuon). The mora count unit 5 obtains the number of moras N of the selected word in this way (S5), and the values indicating the temporal positions of the start and end of each mora for that number of moras are selected from the mora start/end time table 6 (S6). The phoneme estimated interval calculation unit 7 calculates the estimated existence interval of each phoneme from the values selected from 6 (S7).
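The mora count in S5 can be sketched in Python as follows. The symbol inventory is an assumption for illustration: the text does not state which symbols the dictionary uses, so `N` (moraic nasal), `Q` (geminate consonant), and `R` (long-vowel mark) are hypothetical, as is the function name `count_moras`.

```python
# Assumed phoneme symbols; the actual dictionary notation is not given in the text.
VOWELS = {"a", "i", "u", "e", "o"}
SPECIAL_MORAS = {"N", "Q", "R"}  # moraic nasal, geminate consonant, long-vowel mark

def count_moras(phonemes):
    """Count moras by counting vowels, moraic nasals, long vowels and geminates."""
    return sum(1 for p in phonemes if p in VOWELS or p in SPECIAL_MORAS)

# /kyanoN/ has the moras kya, no, N.
print(count_moras(["k", "y", "a", "n", "o", "N"]))  # prints 3
```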

The operations of 6 and 7 are explained in more detail with reference to FIGS. 2 to 5.

The mora start/end time table 6 stores the start and end times of the moras for words of each number of moras. Separate sets of times are provided for different numbers of moras because the way phoneme durations stretch and compress depends on the number of moras in the word. The set to use is selected according to the count produced by the mora count unit 5.

FIG. 2 shows, as an example, the table entry for three-mora words. The horizontal axis of the graph represents time and the vertical axis represents the mora. ts(mi) and te(mi) denote the start and end times of the i-th mora, respectively. The phoneme estimated interval calculation unit 7 linearly stretches or compresses the horizontal axis of FIG. 2 so that its start and end coincide with the speech start time Ts and the speech end time Te, respectively, which fixes the positions of ts(mi) and te(mi) adapted to the input speech. Once the positions of ts(mi) and te(mi) have been fixed, each mora interval is divided into phoneme units, as shown in (1) to (3) of FIG. 2. For a mora consisting of a vowel alone with no consonant, a moraic nasal, or a geminate consonant (sokuon), the mora interval is used as the phoneme interval as is. A mora containing a consonant other than a moraic nasal or a palatalized sound (yōon) is divided into halves, giving a consonant interval and a vowel interval. A mora containing a palatalized sound is divided into thirds, giving a consonant interval, a palatalized-sound interval, and a vowel interval.
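A minimal Python sketch of the stretching and splitting steps follows. It assumes the table stores mora boundaries normalized to the range 0 to 1, and the table values shown are made up for illustration (they are not taken from FIG. 2); `scale_mora_table` and `split_moras` are hypothetical names. Dividing each mora equally among its phonemes reproduces the whole, half, and third rules described above.

```python
def scale_mora_table(table, t_start_ms, t_end_ms):
    """Linearly map normalized (start, end) pairs onto [t_start_ms, t_end_ms]."""
    span = t_end_ms - t_start_ms
    return [(t_start_ms + s * span, t_start_ms + e * span) for s, e in table]

def split_moras(moras, mora_intervals):
    """Divide each mora interval equally among the phonemes of that mora."""
    result = []
    for phonemes, (start, end) in zip(moras, mora_intervals):
        step = (end - start) / len(phonemes)
        for k, p in enumerate(phonemes):
            result.append((p, start + k * step, start + (k + 1) * step))
    return result

# Illustrative three-mora entry (made-up values): the last mora is the longest.
table_3 = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]
intervals = scale_mora_table(table_3, 0.0, 1200.0)
phoneme_intervals = split_moras([["k", "y", "a"], ["n", "o"], ["N"]], intervals)
```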

The weighting function stored in the weighting function table 8 is applied to each phoneme interval determined in this way to form a weighted DP window. FIG. 3 shows an example of a weighting function. The horizontal axis of the figure represents time and the vertical axis represents weight. The horizontal axis is again stretched or compressed linearly so that tps and tpe in the figure coincide with the start and end times of each phoneme, which converts the function into a weighting function suited to each phoneme interval. This state is shown in FIG. 4.
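The stretching of a weighting function onto a phoneme interval can be sketched as below. The shape of the function in FIG. 3 is not recoverable from the text, so the trapezoid used here (weight 1 inside the interval, falling linearly to 0 outside it) is purely an assumption, as are the name `make_weight` and the `margin` parameter.

```python
def make_weight(tps, tpe, margin=0.5):
    """Return w(t) for a phoneme estimated to span [tps, tpe].

    Assumed trapezoid: weight 1 inside the interval, decreasing linearly
    to 0 at a distance of margin * (tpe - tps) outside it.
    """
    length = tpe - tps

    def w(t):
        u = (t - tps) / length  # normalized position: 0 at tps, 1 at tpe
        if 0.0 <= u <= 1.0:
            return 1.0
        dist = -u if u < 0.0 else u - 1.0
        return max(0.0, 1.0 - dist / margin)

    return w

w = make_weight(100.0, 200.0)
print(w(150.0), w(225.0), w(300.0))  # prints 1.0 0.5 0.0
```

Because the function is defined on a normalized axis, the same template serves phonemes of any estimated length.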

Next, in S8 of FIG. 1-3, the word recognition unit 10 uses the phoneme estimated intervals obtained by the phoneme estimated interval calculation unit 7 as the window restriction and performs DP matching between the feature time series of the input speech obtained in 2 and, among the phoneme standard patterns registered in the phoneme standard pattern table 9, the patterns contained in the words chosen by preliminary selection.

In the DP matching, the scores are weighted according to the weighting function described above. The score of each candidate word is calculated, and the word with the highest score is taken as the recognized word (S9, S10).
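A hedged Python sketch of weighted DP scoring and candidate selection follows. The text says only that scores are weighted according to the weighting function; multiplying each local similarity by the weight, treating zero-weight cells as outside the window, and the stay-or-advance step rule are all assumptions, and `weighted_dp` and `recognize` are hypothetical names.

```python
import math

def weighted_dp(sim, weight):
    """sim[i][j]: similarity of phoneme i and frame j; weight[i][j]: window weight.

    Cells whose weight is zero are treated as lying outside the window.
    """
    I, J = len(sim), len(sim[0])
    NEG = -math.inf
    g = [[NEG] * (J + 1) for _ in range(I + 1)]
    g[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            w = weight[i - 1][j - 1]
            if w <= 0.0:
                continue
            prev = max(g[i][j - 1], g[i - 1][j - 1])
            if prev > NEG:
                g[i][j] = prev + w * sim[i - 1][j - 1]
    return g[I][J]

def recognize(candidates):
    """candidates maps word -> (sim, weight); returns (best word, all scores)."""
    scores = {word: weighted_dp(s, w) for word, (s, w) in candidates.items()}
    return max(scores, key=scores.get), scores
```

A candidate whose weighted window admits no complete path receives a score of negative infinity and can never be selected over a candidate with a valid path.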

Because this embodiment includes a preliminary word selection unit, the candidate words for matching are narrowed down, and recognition can be performed at high speed even for a large vocabulary.

Furthermore, by tracing the DP path described above and attaching a start symbol and an end symbol to the beginning and end of each phoneme, the method can be used as an automatic speech labeling system.
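Labeling by tracing the DP path can be sketched in Python as below. This is an illustration under assumptions: the stay-or-advance step rule is not taken from the text, the window and weights are omitted for brevity, and `align` is a hypothetical name. The function returns the first and last frame assigned to each phoneme.

```python
import math

def align(sim):
    """Return (start_frame, end_frame) per phoneme, 1-based and inclusive."""
    I, J = len(sim), len(sim[0])
    NEG = -math.inf
    g = [[NEG] * (J + 1) for _ in range(I + 1)]
    came_from_prev = [[False] * (J + 1) for _ in range(I + 1)]
    g[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            stay, advance = g[i][j - 1], g[i - 1][j - 1]
            best = max(stay, advance)
            if best > NEG:
                g[i][j] = best + sim[i - 1][j - 1]
                came_from_prev[i][j] = advance >= stay
    if g[I][J] == NEG:
        return None  # no complete path exists
    # Trace back from (I, J), recording where each phoneme begins and ends.
    labels = [None] * I
    i, j, end = I, J, J
    while i >= 1:
        if came_from_prev[i][j]:
            labels[i - 1] = (j, end)
            i, j = i - 1, j - 1
            end = j
        else:
            j -= 1
    return labels

print(align([[1, 1, 0, 0], [0, 0, 1, 1]]))  # prints [(1, 2), (3, 4)]
```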

[Effect of the invention]

As explained above, the present invention uses, in the DP calculation, a nonlinear window restriction that takes into account the property that, in isolated uttered words, phonemes are longer at the end of the word than at the beginning. As a result, even when the window is narrowed to avoid increased computation and extreme matching, the optimal DP path stays within the window and an accurate DP value is obtained. In addition, because the DP window is expressed by a weighting function along the time axis, the score decreases with distance from the estimated phoneme interval, so a path with more reliable estimates is selected and a still more accurate DP value is obtained.

As a result, the word recognition rate improves.

FIG. 1-1 is an explanatory diagram of the present invention, FIG. 1-2 is a block diagram of a configuration to which the present invention is applied, FIG. 1-3 is a control flowchart of the present invention, FIG. 2 is a diagram showing the division into phonemes and the mora start and end times for a three-mora word, FIG. 3 is a diagram showing an example of a weighting function, and FIG. 4 is an explanatory diagram of the weighted DP window for the word /kyanoN/.

1: acoustic analysis unit, 2: feature extraction unit, 3: preliminary selection unit, 4: word dictionary, 5: mora count unit, 6: mora start/end time table, 7: phoneme estimated interval calculation unit, 8: weighting function table, 9: phoneme standard pattern table, 10: word recognition unit.


Claims

[Claims]

(1) A speech recognition method having a standard pattern for each phoneme expressed as a time series of feature vectors and a word dictionary expressed in phoneme symbols, wherein, in calculating the similarity between a speech input pattern expressed as a time series of feature vectors and a registered word, a candidate word corresponding to the input speech is hypothesized, and, in order to recognize the word using the standard patterns of the phonemes it contains, the interval in which each phoneme exists is estimated from the speech rate calculated from the duration of the input word and the number of moras of the candidate word, a window-restricted interval is determined, and the similarity is then calculated by dynamic programming (hereinafter abbreviated as DP).

(2) The speech recognition method according to claim 1, characterized in that the window used in DP is expressed by a weighting function along the time axis, and the value of the weighting coefficient is reduced with temporal distance from the center of the estimated existence interval so as to prevent extreme deviation from the estimated interval.