JPH0612091A - Japanese speech recognizing method - Google Patents

Japanese speech recognizing method

Info

Publication number
JPH0612091A
Authority
JP
Japan
Prior art keywords
character
likelihood
kana
kanji
japanese
Prior art date
Legal status
Pending
Application number
JP4170898A
Other languages
Japanese (ja)
Inventor
Shoichi Matsunaga
昭一 松永
Toshiaki Tsuboi
俊明 坪井
Tomokazu Yamada
智一 山田
Kiyohiro Kano
清宏 鹿野
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP4170898A priority Critical patent/JPH0612091A/en
Publication of JPH0612091A publication Critical patent/JPH0612091A/en
Pending legal-status Critical Current


Abstract

PURPOSE: To convert input speech into a character string of mixed kana (Japanese syllabary) and kanji (Chinese characters) at a high recognition rate.
CONSTITUTION: For the first character of each word, the similarity between the input speech and the phoneme standard patterns is computed using a Japanese bunsetsu syntax 6 and a reading-annotated word dictionary 7, the likelihood that the character occurs first is obtained, and the sum of the similarity and the likelihood is taken as the total likelihood. Keeping the characters with high total likelihood, a second character with a high sum of similarity and likelihood is searched for in the same way. For the resulting words with high total likelihood, the similarity to the input speech and the occurrence likelihood are obtained for each candidate particle that may follow the noun, based on the noun + particle structure, and the bunsetsu with the highest overall total likelihood is taken as the recognition result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a Japanese speech recognition method that uses standard patterns such as hidden Markov models (e.g., Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," IEICE (1988)) and a statistical language model (e.g., Bahl, L. et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. on PAMI (1983)), together with a Japanese statistical model (e.g., Yamada et al., "A Statistical Language Model Using Kana-Kanji Character Chain Information," IEICE Technical Report, SP91-26 (1991)).

【0002】[0002]

[Prior Art] As a conventional speech recognition method using a hidden Markov model and a statistical language model, the following has been proposed. A statistical language model of phoneme occurrence order and hidden-Markov-model phoneme standard patterns are created in advance from a training text database. For the input speech, the statistical language model is used to select, on the basis of the several phonemes already recognized immediately before, a number of phoneme candidates with a high probability of occurring next; for each selected candidate, its phoneme standard pattern is matched against the input speech, and the phoneme with the highest total likelihood, combining the occurrence likelihood and the similarity to the standard pattern, is output as the recognition result.
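As a rough sketch (not the patent's own implementation), the conventional decoding loop described above could look like the following; the bigram probabilities and the acoustic scoring function are invented stand-ins for a trained language model and HMM standard-pattern matching:

```python
import math

# Hypothetical toy phoneme bigram P(next | prev); values are invented.
PHONEME_BIGRAM = {
    ("t", "o"): 0.4, ("t", "a"): 0.3, ("o", "u"): 0.5, ("o", "k"): 0.2,
}

def acoustic_score(phoneme: str, frame) -> float:
    """Stand-in for matching a phoneme's HMM standard pattern against
    the input speech features (returns a log-similarity; invented)."""
    return {"o": -0.5, "a": -2.0, "u": -0.7, "k": -3.0}.get(phoneme, -5.0)

def recognize_next(prev: str, frame, n_candidates: int = 2) -> str:
    # 1) the language model proposes phonemes likely to occur next
    cands = sorted(
        ((nxt, p) for (pv, nxt), p in PHONEME_BIGRAM.items() if pv == prev),
        key=lambda kv: kv[1], reverse=True)[:n_candidates]
    # 2) each candidate's standard pattern is matched against the input;
    #    total likelihood = occurrence log-likelihood + acoustic similarity
    best = max(cands, key=lambda kv: math.log(kv[1]) + acoustic_score(kv[0], frame))
    return best[0]

print(recognize_next("t", frame=None))  # "o": high bigram prob and good match
```

The output of such a loop is a bare phoneme sequence, which is exactly why the method still needs a separate kana-kanji conversion step afterwards.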

【0003】[0003]

[Problem to Be Solved by the Invention] However, because this recognition method outputs its result as a sequence of phonemes, the phoneme sequence must be converted to kana-kanji text whenever the input speech is to be output as a Japanese sentence. In other words, two conversions are performed, one from the input speech to a phoneme sequence and one from that phoneme sequence to a kana-kanji sequence, so the overall performance of obtaining a correct conversion result is inevitably rather low.

【0004】[0004]

[Means for Solving the Problem] According to the present invention, a statistical language model of the occurrence order of kana and kanji annotated with readings is created from a training text database. In addition, a bunsetsu syntax, a word dictionary annotated with readings, character-phoneme conversion rules for converting reading-annotated characters into phoneme symbols, and phoneme standard patterns are prepared. Using the kana-kanji statistical model and the phoneme standard patterns for the recognition candidates derived from this syntax and word dictionary, the input speech is converted in a single pass into a mixed kana-kanji character sequence.

【0005】[0005]

[Embodiment] FIG. 1 shows an embodiment of the present invention. Speech entering at input terminal 1 is converted into a digital signal in the feature extraction section 2, subjected to LPC cepstrum analysis, and then converted into feature parameters for each frame (one time point, e.g., 10 milliseconds). These feature parameters are, for example, LPC cepstrum coefficients.
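A minimal sketch of the 10 ms frame slicing performed before LPC cepstrum analysis; the 16 kHz sample rate is an assumption for illustration, since the text specifies only the frame period:

```python
# Frame slicing in feature extraction section 2 (sketch).
SAMPLE_RATE = 16_000          # assumed; the patent specifies only ~10 ms frames
FRAME_MS = 10
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 160 samples per frame

def frames(samples):
    """Split digitized speech into consecutive 10 ms frames; each frame
    would then be LPC-cepstrum analyzed into one feature parameter vector."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]

one_second = [0.0] * SAMPLE_RATE
print(len(frames(one_second)))  # 100 frames per second of speech
```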

[0006] From a training speech database, hidden-Markov-model phoneme standard patterns are created in the same format as the feature parameters above and stored in the standard pattern memory 4; from a training text database whose characters are annotated with readings, a statistical language model of the occurrence order of the reading-annotated characters is created and stored in the statistical language model memory 5. Similarly, from the same training text database, a bunsetsu syntax 6 in which the structure of Japanese bunsetsu is described as word transition rules, a reading-annotated word dictionary 7 (e.g., FIG. 2A), and character-phoneme conversion rules 8 (e.g., FIG. 2B) for converting reading-annotated kana and kanji into phoneme symbol strings are created and stored. Annotating a kana with a reading means deciding, for example, whether 「は」 is to be read "ha" or "wa". For the character candidates selected using the bunsetsu syntax 6 and the reading-annotated word dictionary 7, the recognition section 3 reads from the standard pattern memory 4 the standard patterns of the phonemes obtained for each candidate by the character-phoneme conversion rules 8 and computes their similarity (likelihood) to the feature parameters of the input speech. The occurrence likelihood of each character candidate is also obtained from the statistical language model.

[0007] That is, to recognize, for example, the i-th character of the input speech, a trigram over the appearance order of reading-annotated characters (kana and kanji) from the statistical language model is used: based on the recognition results for the (i-2)-th and (i-1)-th characters, the occurrence likelihood of each candidate i-th character is computed, and the sum of that occurrence likelihood and the likelihood expressing the similarity of the character's phoneme string to the standard pattern is taken as the total likelihood.
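A numerical sketch of this total-likelihood computation; the trigram probabilities and acoustic scores below are invented, and the acoustic function stands in for the HMM standard-pattern matching:

```python
import math

# Toy trigram P(c_i | c_{i-2}, c_{i-1}) over reading-annotated characters;
# probabilities are invented for illustration.
TRIGRAM = {
    ("<s>", "東", "京"): 0.6,
    ("<s>", "東", "映"): 0.1,
}

def acoustic_loglik(char_phonemes: str) -> float:
    """Stand-in for matching the character's phoneme string against the
    input speech using the phoneme standard patterns (values invented)."""
    return {"kyou": -1.0, "ei": -4.0}[char_phonemes]

def total_likelihood(c_prev2, c_prev1, cand, phonemes):
    # total likelihood = trigram occurrence log-likelihood
    #                  + similarity likelihood to the standard pattern
    return math.log(TRIGRAM[(c_prev2, c_prev1, cand)]) + acoustic_loglik(phonemes)

q_kyou = total_likelihood("<s>", "東", "京", "kyou")
q_ei = total_likelihood("<s>", "東", "映", "ei")
print(q_kyou > q_ei)  # True: 京 is both more probable and a better match
```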

[0008] The selection of reading-annotated kana/kanji candidates, their matching against the standard patterns, and the derivation of recognized characters from the total likelihood are repeated until the end of the speech interval; finally, the recognized characters obtained so far are sent to the recognition result output section 9 and output, in order, as a kana-kanji sequence. A concrete example of the recognition procedure follows. For the first character of each word in the word dictionary 7, the similarity between the input speech and the standard pattern of its phoneme string is computed. That is:
1. For the first character 東 (tou) of a word in the dictionary 7, the standard pattern of /tou/ is matched against the input speech, giving similarity likelihood P1. From the likelihood L1 that 東 (tou) occurs first, the total likelihood Q1 = P1 + L1 of 東 (tou) is obtained.
2. For the first character 山 (yama) of a word, the standard pattern of /yama/ is matched against the input speech, giving similarity likelihood P2. From the likelihood L2 that 山 (yama) occurs first, the total likelihood Q2 = P2 + L2 is obtained.
3. The total likelihood for the first character of every word is computed in this way, and the, say, ten candidates with the highest total likelihood are retained. For each of these, the second character of the word is processed: for example, with 京 (kyou) as the character following 東, the standard pattern of /kyou/ is matched against the input speech, giving similarity likelihood P3. From the likelihood L3 that 京 (kyou) occurs after 東 (tou), the total likelihood Q3 = Q1 + P3 + L3 of 東京 (tou-kyou) is obtained.
4. With 形 (gata) as the character following 山, the standard pattern of /gata/ is matched against the input speech, giving similarity likelihood P4. From the likelihood L4 that 形 (gata) occurs after 山 (yama), the total likelihood Q4 = Q2 + P4 + L4 of 山形 (yama-gata) is obtained.
5. Keeping the candidates with high total likelihood obtained in this way, candidate particles are drawn from the dictionary using the noun + particle structure of the Japanese syntax 6. For example, for は (wa), the standard pattern of /wa/ is matched against the input speech, giving similarity likelihood P5. From the likelihood L5 that は (wa) occurs after 東京, the total likelihood Q5 = Q3 + P5 + L5 of the bunsetsu 東京は is obtained.
6. Similarly, for the candidate の (no), matching the standard pattern of /no/ against the input speech gives similarity likelihood P6. From the likelihood L6 that の occurs after 東京, the total likelihood Q6 = Q3 + P6 + L6 of the bunsetsu 東京の is obtained.
7. Similarly, the candidate が (ga) gives similarity likelihood P7 against /ga/. From the likelihood L7 that が occurs after 山形, the total likelihood Q7 = Q4 + P7 + L7 of the bunsetsu 山形が is obtained.
8. The similarity likelihood P8 of の against /no/ is also obtained. From the likelihood L8 that の occurs after 山形, the total likelihood Q8 = Q4 + P8 + L8 of the bunsetsu 山形の is obtained.
9. The total likelihoods Q5, Q6, Q7, and Q8 of the bunsetsu candidates obtained in this way, 東京は, 東京の, 山形が, and 山形の, are compared.
10. If, for example, Q6 is the highest, the recognition result is 東京の, and 東京の (Tokyo no) is output as the recognition result.
The feature extraction section 2, the recognition section 3, and the recognition result output section 9 can each be implemented with a dedicated or shared microprocessor.
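The ten steps above can be traced numerically. All P (similarity) and L (occurrence) log-likelihood values below are invented for illustration; only the accumulation scheme Q = Q_prev + P + L is taken from the text:

```python
# Invented log-likelihoods: P = acoustic similarity, L = occurrence.
P = {"tou": -1.0, "yama": -1.5, "kyou": -0.8, "gata": -1.2,
     "wa": -2.0, "no_after_tokyo": -0.9, "ga": -1.8, "no_after_yamagata": -1.6}
L = {"東": -0.5, "山": -0.7, "京|東": -0.3, "形|山": -0.6,
     "は|東京": -1.0, "の|東京": -0.4, "が|山形": -0.8, "の|山形": -0.9}

Q1 = P["tou"] + L["東"]                 # step 1: first character 東
Q2 = P["yama"] + L["山"]                # step 2: first character 山
Q3 = Q1 + P["kyou"] + L["京|東"]        # step 3: 東京
Q4 = Q2 + P["gata"] + L["形|山"]        # step 4: 山形
candidates = {                          # steps 5-8: noun + particle bunsetsu
    "東京は": Q3 + P["wa"] + L["は|東京"],
    "東京の": Q3 + P["no_after_tokyo"] + L["の|東京"],
    "山形が": Q4 + P["ga"] + L["が|山形"],
    "山形の": Q4 + P["no_after_yamagata"] + L["の|山形"],
}
result = max(candidates, key=candidates.get)  # steps 9-10: compare and pick
print(result)  # 東京の has the highest total likelihood with these values
```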

【0009】[0009]

[Effects of the Invention] As described above, according to the present invention, high recognition performance is expected because a statistical language model of the appearance order of reading-annotated kana and kanji is used together with the phoneme standard patterns corresponding to those readings. Moreover, the input speech can be converted into a kana-kanji sequence in a single pass, so higher conversion performance is expected than when the conversion is split into two stages.

[0010] Evaluating the conversion rate on 500 bunsetsu uttered one bunsetsu at a time: with the conventional method, in which phonemes are recognized using syllable statistics (a syllable trigram) and kana-kanji conversion is then applied to the recognized phoneme string, 70% (number of correct bunsetsu / total number of bunsetsu x 100) were converted correctly. With the method of the embodiment, at the same phoneme recognition rate, the effect of the reading-annotated kana-kanji statistics (a trigram over reading-annotated kana/kanji groups) raises the conversion performance to 94%. In addition, using the bunsetsu syntax makes processing ten times faster.
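The reported percentages imply the following absolute counts; the counts of 350 and 470 correct bunsetsu are derived here from the stated rates, not given in the text:

```python
# Conversion rate as defined in the text: correct bunsetsu / total x 100.
total = 500
conventional = round(0.70 * total)   # 350 bunsetsu converted correctly
proposed = round(0.94 * total)       # 470 bunsetsu converted correctly
print(100 * conventional // total, 100 * proposed // total)  # 70 94
```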

[0011] The present invention is not limited to the above embodiment. For example, the unit of the recognition standard pattern may be not only a phoneme but also a syllable or a syllable sequence. The recognition technique is not limited to hidden Markov models; DP matching may also be used. The statistical language model is likewise not limited to trigrams; bigram statistics may be used.

[Brief Description of the Drawings]

[FIG. 1] A block diagram showing an example of an apparatus implementing the speech recognition method according to the present invention.

[FIG. 2] A is a diagram showing a description example of the word dictionary 7 in FIG. 1, and B is a diagram showing an example of the character-phoneme conversion rules 8 in FIG. 1.

(Continuation of front page) (72) Inventor: Kiyohiro Kano, 1-1-6 Uchisaiwaicho, Chiyoda-ku, Tokyo, within Nippon Telegraph and Telephone Corporation

Claims (1)

[Claims]

[Claim 1] A Japanese speech recognition method in which the input speech is taken as a time series of feature parameters; a plurality of recognition candidates are selected for the feature parameter time series of the input speech using a Japanese syntax, a word dictionary, character-phoneme conversion rules, and a statistical language model of occurrence order created from a text database; each recognition candidate is matched against standard patterns and the feature parameter time series of the input speech; and a candidate with a high total likelihood, combining the occurrence likelihood and the similarity likelihood, is taken as the recognition result, characterized in that:
as the statistical language model, a statistical language model of the occurrence order of reading-annotated kana and kanji, created from a text database whose kanji are annotated with readings, is used;
as the Japanese syntax, a grammar describing the structure of Japanese bunsetsu as word transition rules is used;
as the word dictionary, a set of words consisting of reading-annotated kana and kanji sequences is used;
as the character-phoneme conversion rules, rules for converting reading-annotated kana and kanji into phoneme symbol strings are used; and
as the standard patterns, phoneme standard patterns are used.
JP4170898A 1992-06-29 1992-06-29 Japanese speech recognizing method Pending JPH0612091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP4170898A JPH0612091A (en) 1992-06-29 1992-06-29 Japanese speech recognizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP4170898A JPH0612091A (en) 1992-06-29 1992-06-29 Japanese speech recognizing method

Publications (1)

Publication Number Publication Date
JPH0612091A true JPH0612091A (en) 1994-01-21

Family

ID=15913373

Family Applications (1)

Application Number Title Priority Date Filing Date
JP4170898A Pending JPH0612091A (en) 1992-06-29 1992-06-29 Japanese speech recognizing method

Country Status (1)

Country Link
JP (1) JPH0612091A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015094848A (en) * 2013-11-12 2015-05-18 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Information processor, information processing method and program
CN112927676A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Method, device, equipment and storage medium for acquiring voice information
GB2610807A (en) * 2021-09-09 2023-03-22 Richard Curtis Matthew Improvements in child training rulers


Similar Documents

Publication Publication Date Title
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
CN108305634B (en) Decoding method, decoder and storage medium
US5878390A (en) Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US9978364B2 (en) Pronunciation accuracy in speech recognition
US20070219777A1 (en) Identifying language origin of words
US20070112569A1 (en) Method for text-to-pronunciation conversion
US20150073796A1 (en) Apparatus and method of generating language model for speech recognition
US20050197838A1 (en) Method for text-to-pronunciation conversion capable of increasing the accuracy by re-scoring graphemes likely to be tagged erroneously
Bianne-Bernard et al. Variable length and context-dependent HMM letter form models for Arabic handwritten word recognition
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111898342A (en) Chinese pronunciation verification method based on edit distance
JP3364631B2 (en) Statistical language model generation apparatus and speech recognition apparatus
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
JPH0612091A (en) Japanese speech recognizing method
JP2938865B1 (en) Voice recognition device
JPH09134192A (en) Statistical language model forming device and speech recognition device
US20050203742A1 (en) System and method for computer recognition and interpretation of arbitrary spoken-characters
JP3240691B2 (en) Voice recognition method
JP3430265B2 (en) Japanese speech recognition method
JP3009709B2 (en) Japanese speech recognition method
Nakagawa et al. Spoken language identification by ergodic HMMs and its state sequences
JPH04291399A (en) Voice recognizing method
JP2968792B1 (en) Statistical language model generation device and speech recognition device
JP2001188556A (en) Method and device for voice recognition
JPH1185183A (en) Speech recognition system and its apparatus as well as storage medium recording speech recognition processing program

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20040113

RD01 Notification of change of attorney

Effective date: 20040123

Free format text: JAPANESE INTERMEDIATE CODE: A7426

RD03 Notification of appointment of power of attorney

Effective date: 20040123

Free format text: JAPANESE INTERMEDIATE CODE: A7423

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20040123

FPAY Renewal fee payment (prs date is renewal date of database)

Year of fee payment: 4

Free format text: PAYMENT UNTIL: 20080206

FPAY Renewal fee payment (prs date is renewal date of database)

Year of fee payment: 5

Free format text: PAYMENT UNTIL: 20090206

FPAY Renewal fee payment (prs date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090206

Year of fee payment: 5

FPAY Renewal fee payment (prs date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100206

Year of fee payment: 6

LAPS Cancellation because of no payment of annual fees