JPH11175087A - Character string matching method for word speech recognition - Google Patents

Character string matching method for word speech recognition

Info

Publication number
JPH11175087A
JPH11175087A JP9339586A JP33958697A
Authority
JP
Japan
Prior art keywords
word
phoneme
series
sequence
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP9339586A
Other languages
Japanese (ja)
Inventor
Tetsutada Sakurai (哲真 桜井)
Yoshio Nakadai (芳夫 中台)
Yoshitake Suzuki (義武 鈴木)
Shunei Kurokawa (俊英 黒川)
Yamato Sato (大和 佐藤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Advanced Technology Corp
Nippon Telegraph and Telephone Corp
Original Assignee
NTT Advanced Technology Corp
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NTT Advanced Technology Corp, Nippon Telegraph and Telephone Corp filed Critical NTT Advanced Technology Corp
Priority to JP9339586A priority Critical patent/JPH11175087A/en
Publication of JPH11175087A publication Critical patent/JPH11175087A/en
Pending legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To reduce the required storage capacity and amount of calculation by judging similarity through symbol-by-symbol matching, from the beginning of the word, between the phoneme series of each dictionary word and the phoneme series of the input speech.

SOLUTION: Words are registered in a word dictionary 8 using vowels, fricatives, and silent parts, so that each entry traces the vowels appearing in the word. The input speech is spectrum-analyzed and converted into a series of these three kinds of phonemes (12); silent parts at the beginning and end of the word are removed, and discontinuous symbols are replaced with the immediately preceding symbol to generate a symbol sequence (13, 14). Each dictionary series, e.g. 'a*uaei', is compared with an input series, e.g. 'a*u*aeie', character by character from the beginning of the word. When a match is obtained, e.g. between 'a' and 'a', the comparison advances to the next character; when a mismatch is obtained, e.g. between the 4th 'a' and '*', the penalty is increased by one and only the input series is advanced by one character, so that the 4th 'a' of the dictionary series and the 5th 'a' of the input series are compared next.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Field of the Invention

The present invention relates to a method of matching a dictionary against input speech when recognizing word speech on a computer or an even smaller system.

[0002]

Description of the Related Art

There are recognition methods that use phoneme information: a dictionary of phoneme series is provided, the speech input signal is analyzed frame by frame, and the similarity between each symbol of the per-frame phoneme series and the symbols of the word phoneme series in the dictionary is computed. There is also a recognition method called the SPLIT method, in which the number of phoneme standard patterns is increased to a number sufficient to represent spectral variation.

[0003]

The speech recognition method using phoneme information mentioned above requires standard phoneme patterns based on spectra. Because distances between spectra must be computed when obtaining the similarity between the input speech and the standard patterns, the amount of calculation is large. Word speech is then recognized from the similarity sums obtained by comparing these similarities against the word dictionary. As a result, few words are missed and the recognition rate is high, but the storage capacity and amount of calculation required for recognition also increase markedly. Moreover, since the word dictionary used in this method consists of sequences of phonemes, it is difficult to create a new dictionary without knowledge of phonemes.

[0004]

The SPLIT method can be regarded as a fusion of three conventional techniques: phoneme-based recognition, word-based recognition, and vector quantization as used in speech coding. The standard patterns of the SPLIT method are called pseudo-phoneme standard patterns; they are obtained, for each speaker, by clustering the short-time spectral patterns of a large number of speech samples. Because they are based only on the distribution of spectral patterns, they have no clear correspondence with phonemes, so orthography cannot be used to generate the word dictionary. The storage capacity and amount of calculation required for recognition are about one tenth of those needed when word-level standard patterns are used.

[0005]

Since the SPLIT method is normally used for speaker-dependent recognition, the matching of a vowel series against a word dictionary in phoneme-based word speech recognition is described here with reference to FIG. 4. The input speech from input terminal 1 passes through A/D converter 2 and is converted into a digital signal, after which it is spectrum-analyzed by analysis unit 3. Similarity calculator 4 compares the analyzed spectrum with the standard pattern 5 (a spectrum) of each phoneme to obtain a similarity to each phoneme. The similarity of each phoneme over time is stored in similarity matrix 6. Similarity-sum calculator 7 obtains similarity sums by DP matching against the phoneme series of each word in word dictionary 8.

[0006]

Based on these similarity sums, word decision unit 9 decides that the word with the highest similarity sum is the recognized word, and passes the result to output terminal 10.

[0007]

Problems to be Solved by the Invention

In the conventional methods described above, there are about forty phonemes alone, and applying DP matching against the word dictionary on top of that requires a considerable amount of calculation. In addition, a user must know phonemes in order to register new words in the dictionary.

[0008]

Means for Solving the Problems

According to the present invention, agreement between the phoneme symbols of each word's phoneme series in the word dictionary and those of the phoneme series of the input speech is checked one symbol at a time from the beginning of the word. On a mismatch, a penalty is increased by +1, and the current phoneme of the dictionary word's series is then compared against the next phoneme symbol in the input series. The smaller a word's penalty, the higher its similarity to the input is judged to be.

[0009]

In particular, each word is expressed with a total of seven symbols, comprising the vowels, a fricative, and a silent part; the input speech is likewise converted into a series of these seven symbols, and the above matching method is applied.

[0010]

Description of the Preferred Embodiments

FIG. 1A shows the processing procedure of a recognition apparatus to which an embodiment of the present invention is applied. Even where the processing differs from FIG. 4, blocks that perform equivalent processing in the overall flow carry the same reference numerals. The input speech from input terminal 1 passes through A/D converter 2 and is converted into a digital signal. It is then LPC-analyzed by analysis unit 3, extracting, for example, cepstra.

[0011]

Series generator (per frame) 12 obtains a time series of phonemes by comparison with the standard patterns (cepstrum vectors) 5. This embodiment expresses each word in terms of vowels, fricatives, and silence, and series generator 12 likewise converts the input speech into a series of vowels, fricatives, and silence. For example, when the word "cup" (カップ) is spoken, the phoneme time series shown in FIG. 2A is obtained: * denotes silence, a is the vowel of "ka", the geminate "ッ" is detected as silence, and u is the vowel of "pu". There is silence * at the beginning and end of the word, a discontinuous vowel o is detected between the a and the silent part, and a discontinuous vowel i between the silent part and "pu".

[0012]

Series processing (A) 13 deletes the silent parts * appearing at the beginning and end of the phoneme time series obtained above, and replaces each discontinuous phoneme symbol with the phoneme symbol immediately preceding it. In the example of FIG. 2A, as shown in FIG. 2B, the leading silence * and the trailing silence * are deleted, the discontinuous phoneme symbol o is replaced with the symbol a immediately before it, and the discontinuous phoneme symbol i is replaced with the symbol * immediately before it.

[0013]

Next, in this example, series processing (B) 14 replaces each run of identical phoneme symbols in the series produced by series processing (A) with a single instance of that symbol. In the example of FIG. 2, as shown in FIG. 2C, the eight consecutive phoneme symbols a become a single a, the eighteen consecutive silence symbols * become a single *, and the eight consecutive u become a single u.
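The two series-processing steps can be sketched as follows. This is an illustrative reconstruction from the text, not the patent's own code; the function name is invented, and treating a "discontinuous" symbol as a single frame that differs from both neighbours is an assumption.

```python
def normalize_sequence(frames):
    """Sketch of series processing (A) and (B) described above.

    frames: per-frame phoneme symbols from the series generator,
    e.g. for "cup": leading silence, a run of 'a', a stray 'o',
    a silent run, a stray 'i', a run of 'u', trailing silence.
    """
    symbols = list(frames)
    # (A) delete silent parts at the beginning and end of the word
    while symbols and symbols[0] == '*':
        symbols.pop(0)
    while symbols and symbols[-1] == '*':
        symbols.pop()
    # (A) replace a discontinuous symbol with the symbol immediately
    # before it ("discontinuous" taken here to mean a single frame
    # differing from both neighbours -- an assumption)
    for k in range(1, len(symbols) - 1):
        if symbols[k] != symbols[k - 1] and symbols[k] != symbols[k + 1]:
            symbols[k] = symbols[k - 1]
    # (B) collapse each run of identical symbols into one symbol
    out = []
    for s in symbols:
        if not out or out[-1] != s:
            out.append(s)
    return ''.join(out)

# The FIG. 2 example for "cup": the raw frame series reduces to 'a*u'
print(normalize_sequence('**aaaaaaaao******************iuuuuuuuu**'))  # prints a*u
```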

[0014]

Similarity-sum calculator 7 matches the series emerging from series processing (B) 14 against word dictionary 8. Based on the similarity sums obtained, word decision unit 9 selects the word considered correct and outputs it from output terminal 10. The similarity-sum calculator 7, i.e. the matching unit, is now described in detail. First, word dictionary 8 is written so as to emphasize only the vowel parts of each word. That is, among the symbols used in the dictionary, "a" stands for あ, "i" for い, "u" for う, "e" for え, "o" for お, "*" for a silent part, and "S" for a fricative. As examples of dictionary entries, as shown in FIG. 3, "cup" is written as the character string "a*u" and "gown" as "Cau". Although "b, d, g" and "m, n" are not used in actual recognition, consonants are here written generically as "C" for convenience; "game", for example, becomes the string "CeCu". Words are registered in dictionary 8 by considering what character strings are likely to result when many different people speak them.
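As an illustration of this notation, the FIG. 3 entries could be held in a simple lookup table. The mapping below is a hypothetical sketch using only the words named in the text; the variable name and the use of English keys are assumptions.

```python
# Hypothetical dictionary entries in the notation described above:
# vowels a/i/u/e/o, '*' for a silent part, 'S' for a fricative, and
# 'C' written informally for consonants, as in the text.
word_dictionary = {
    'cup':  'a*u',   # カップ
    'gown': 'Cau',   # ガウン
    'game': 'CeCu',  # ゲーム
}
```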

[0015]

Furthermore, similarity-sum calculator 7 does not perform conventional DP matching, but uses the simpler method that characterizes the present invention. As is clear from the word notation above, the series emerging from series processing (B) 14 has a form that can be matched directly against each word string in dictionary 8. The matching method of the invention is explained, as shown in FIG. 1B, using as an example the matching of the series "a*uaei" registered in word dictionary 8 against the input phoneme series "a*u*aeie" produced by series processing (B) 14.

[0016]

Step 1: The first symbol of the dictionary series, "a", is compared with the first symbol of the input series, "a". Since the symbols are identical, nothing is done and both series advance by one character.
Step 2: The second symbol of the dictionary series, "*", is compared with the second of the input series, "*"; since they are identical, both series again advance by one character.
Step 3: The third symbols of the two series, "u" and "u", are compared; they are identical, and both series advance by one character.

[0017]

Step 4: The fourth symbols of the two series, "a" and "*", are compared. They do not match, so the penalty is increased by +1 and only the input series advances by one character.
Step 5: The fourth symbol of the dictionary series, "a", is compared with the fifth of the input series, "a"; they match, and both series advance by one character.
Step 6: The fifth symbol of the dictionary series, "e", is compared with the sixth of the input series, "e"; they match, and both series advance by one character.

[0018]

Step 7: The sixth symbol of the dictionary series, "i", is compared with the seventh of the input series, "i"; on this match both series advance by one character.
Step 8: The dictionary series ends at its sixth symbol, while one character, "e", remains in the input series, so this mismatch increases the penalty by +1.
In this way the symbols of the two series are compared one by one from the beginning of the word: if they are the same, both series advance by one character; if they differ, the penalty is increased by +1 and only the input series advances by one character. This continues until the end of either series is reached. When one series ends, if uncompared symbols remain in the other series, their number is added to the penalty.
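Steps 1 to 8 amount to a single linear scan over the two series. A minimal sketch, under the assumption that leftover symbols on either side each count as +1 penalty (the function name is invented):

```python
def penalty_match(dict_seq, input_seq):
    """Compare two symbol series from the word head; on a mismatch,
    add +1 to the penalty and advance only the input series."""
    penalty = 0
    i = j = 0  # i indexes the dictionary series, j the input series
    while i < len(dict_seq) and j < len(input_seq):
        if dict_seq[i] == input_seq[j]:
            i += 1
            j += 1
        else:
            penalty += 1
            j += 1  # advance only the input series
    # when one series ends, any uncompared symbols remaining in the
    # other series are each added to the penalty
    penalty += (len(dict_seq) - i) + (len(input_seq) - j)
    return penalty

# The worked example: dictionary 'a*uaei' vs input 'a*u*aeie'
# (one mismatch at step 4, one leftover 'e' at step 8)
print(penalty_match('a*uaei', 'a*u*aeie'))  # prints 2
```

An exact repetition, e.g. `penalty_match('a*u', 'a*u')`, yields penalty 0, the best possible score.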

[0019]

Matching is performed in this way against every word in word dictionary 8, and words whose penalty is zero, or at or below a predetermined value, are output as recognition results. This character-string matching method is not limited to the word notation described above: the dictionary may instead use a notation such as that shown in FIG. 2B and be matched against the processed series from series processing (A) 13; alternatively, the word dictionary notation and the input phoneme series may both include consonants, with the matching method shown in FIG. 1B applied to these as well.

[0020]

Effects of the Invention

As described above, according to the present invention the matching method consists essentially of comparison and addition only, so the load on the CPU is small and even a small system can run it. When the series written in the word dictionary are limited to the seven kinds of symbols, as in the embodiment above, the dictionary is built simply by tracing the vowels that appear in each word, so it can be updated easily. Because the series are centered on vowels, no special knowledge of phonemes is required. Given basic knowledge, such as the devoicing of vowels or the fact that a slight "e" sound is mixed in during the transition from "a" to "i", the recognition rate can be raised further.

[0021]

A recognition experiment using this matching method, running on a single-chip DSP, was carried out with 30 subjects (15 male, 15 female). The recognition rate was 90.3% for the male subjects, 94.2% for the female subjects, and 92.3% overall.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram showing an example of the functional configuration of a word speech recognition apparatus to which the method of the invention is applied, and FIG. 1B is a diagram illustrating its matching method.

FIG. 2 is a diagram showing a processing example of series processing (A) 13 in FIG. 1A.

FIG. 3 is a diagram showing an example of the contents of word dictionary 8.

FIG. 4 is a block diagram showing the functional configuration of a conventional phoneme-based word speech recognition apparatus.

Continuation of the front page: (72) Inventor Yoshio Nakadai, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo, Nippon Telegraph and Telephone Corporation; (72) Inventor Yoshitake Suzuki, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo, Nippon Telegraph and Telephone Corporation; (72) Inventor Shunei Kurokawa, 1-1-3 Gotenyama, Musashino-shi, Tokyo, NTT Advanced Technology Corporation; (72) Inventor Yamato Sato, 1-1-3 Gotenyama, Musashino-shi, Tokyo, NTT Advanced Technology Corporation

Claims (3)

[Claims]

1. A character-string matching method for word speech recognition in which recognition is performed by matching a word dictionary expressed as series of phoneme symbols against the phoneme time series obtained by short-time analysis of input speech, wherein agreement between the phoneme symbols of the dictionary phoneme series and of the phoneme series of the input speech is checked one symbol at a time from the beginning of the word, and on a mismatch a penalty is increased by +1 while the current phoneme of the dictionary phoneme series is compared against the next phoneme symbol in the phoneme series of the input speech.

2. The character-string matching method for word speech recognition according to claim 1, wherein the input phoneme series and the word phoneme series of the word dictionary are each composed of one or more of vowels, a fricative, and a silent part.

3. The character-string matching method for word speech recognition according to claim 2, wherein the silent parts at the beginning and end of the input phoneme series are removed, the phoneme of each transient discontinuous part is replaced with the immediately preceding phoneme, and each part in which the same phoneme continues is replaced with that single phoneme, before said matching against each word of the word dictionary is performed.
JP9339586A 1997-12-10 1997-12-10 Character string matching method for word speech recognition Pending JPH11175087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP9339586A JPH11175087A (en) 1997-12-10 1997-12-10 Character string matching method for word speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP9339586A JPH11175087A (en) 1997-12-10 1997-12-10 Character string matching method for word speech recognition

Publications (1)

Publication Number Publication Date
JPH11175087A true JPH11175087A (en) 1999-07-02

Family

ID=18328888

Family Applications (1)

Application Number Title Priority Date Filing Date
JP9339586A Pending JPH11175087A (en) 1997-12-10 1997-12-10 Character string matching method for word speech recognition

Country Status (1)

Country Link
JP (1) JPH11175087A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017191166A (en) * 2016-04-12 2017-10-19 富士通株式会社 Voice recognition device, voice recognition method and voice recognition program
US10733986B2 (en) 2016-04-12 2020-08-04 Fujitsu Limited Apparatus, method for voice recognition, and non-transitory computer-readable storage medium
CN111540361A (en) * 2020-03-26 2020-08-14 北京搜狗科技发展有限公司 Voice processing method, device and medium
CN111540361B (en) * 2020-03-26 2023-08-18 北京搜狗科技发展有限公司 Voice processing method, device and medium

Similar Documents

Publication Publication Date Title
CN110211565B (en) Dialect identification method and device and computer readable storage medium
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
KR20060049290A (en) Mixed-lingual text to speech
CN115485766A (en) Speech synthesis prosody using BERT models
JP2005208652A (en) Segmental tonal modeling for tonal language
JPH05265483A (en) Voice recognizing method for providing plural outputs
Anoop et al. Automatic speech recognition for Sanskrit
JP2022133392A (en) Speech synthesis method and device, electronic apparatus, and storage medium
Oo et al. Burmese speech corpus, finite-state text normalization and pronunciation grammars with an application to text-to-speech
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN108109610B (en) Simulated sounding method and simulated sounding system
CN109859746B (en) TTS-based voice recognition corpus generation method and system
Hanifa et al. Malay speech recognition for different ethnic speakers: an exploratory study
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
Abujar et al. A comprehensive text analysis for Bengali TTS using unicode
JPH11175087A (en) Character string matching method for word speech recognition
Kurian et al. Connected digit speech recognition system for Malayalam language
Nga et al. A Survey of Vietnamese Automatic Speech Recognition
Barros et al. Maximum entropy motivated grapheme-to-phoneme, stress and syllable boundary prediction for Portuguese text-to-speech
JP2004021207A (en) Phoneme recognizing method, phoneme recognition system and phoneme recognizing program
Ferreiros et al. Improving continuous speech recognition in Spanish by phone-class semicontinuous HMMs with pausing and multiple pronunciations
CN110610721A (en) Detection system and method based on lyric singing accuracy
JPH11175086A (en) Preprocessing method for character string matching of word speech recognition