JP2000075885A

JP2000075885A - Voice recognition device

Info

Publication number: JP2000075885A
Application number: JP10241416A
Authority: JP
Inventors: Schuster Mike; マイク・シュスター; Atsushi Nakamura; 篤中村
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1998-08-27
Filing date: 1998-08-27
Publication date: 2000-03-14
Anticipated expiration: 2018-08-27
Also published as: JP2938865B1

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that recognizes speech at a high recognition rate and at high speed by using a more accurately approximated language likelihood. SOLUTION: A word dictionary initialization processing section 10 generates a tree-structured word dictionary based on learning text data held in a memory, computes a look-ahead probability, which is an approximate language likelihood, and assigns it to each node of the tree. For every word hypothesis input from a phoneme matching section 4, a word matching section 6 computes the look-ahead probability, i.e. the approximate language likelihood given to the non-terminal states of words in the tree-structured word dictionary in a memory 22, based on the N-gram probability data of the statistical language model in a memory 23, thereby updating the dictionary in the memory 22, and recognizes the input speech signal using the updated tree-structured word dictionary.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech recognition apparatus that performs speech recognition using a tree-structured word dictionary.

【０００２】[0002]

2. Description of the Related Art In recent years, methods that use a statistical language model have been studied as a way to improve the performance of continuous speech recognition apparatuses. The aim is to improve the recognition rate and reduce computation time by using the statistical language model to predict the next word and shrink the search space. A statistical language model that has been widely used recently is the N-gram (where N is a natural number of 2 or more). It is trained on large-scale text data and statistically gives the transition probability from the preceding N-1 words to the next word. The generation probability P(w_1^L) of a string of L words w_1^L = w_1, w_2, ..., w_L is given by the following equation.

【０００３】[0003]

(Equation 1)
$$ P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid w_{t+1-N}^{t-1}) $$

[0004] Here, w_t denotes the t-th word of the word string w_1^L, and w_i^j denotes the word string from the i-th word to the j-th word. In Equation 1, the probability P(w_t | w_{t+1-N}^{t-1}) is the probability that the word w_t is uttered after the word string w_{t+1-N}^{t-1}, which consists of N-1 words, has been uttered. Likewise, in what follows, the probability P(A | B) means the probability that the word A is uttered after the word or word string B has been uttered. The symbol "Π" in Equation 1 denotes the product of the probabilities P(w_t | w_{t+1-N}^{t-1}) from t = 1 to L, and the same notation is used below.
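The product in Equation 1 can be illustrated with a minimal Python sketch. It assumes a bigram model (N = 2) stored as a plain dictionary, a hypothetical sentence-start symbol `<s>` used to pad the first history, and made-up probabilities; none of these names or values come from the patent. Logarithms are summed rather than multiplying probabilities, which is a common way to avoid numerical underflow.

```python
import math


def sentence_log_prob(words, ngram_prob, n=2, bos="<s>"):
    """Return log P(w_1^L) = sum over t of log P(w_t | w_{t+1-N}^{t-1}).

    ngram_prob maps (history_tuple, word) -> probability, where the
    history tuple holds the preceding N-1 words (assumed layout).
    """
    if n < 2:
        raise ValueError("the N-gram order N must be 2 or more")
    history = [bos] * (n - 1)          # pad so the first word has a history
    total = 0.0
    for w in words:
        context = tuple(history[-(n - 1):])
        total += math.log(ngram_prob[(context, w)])
        history.append(w)
    return total


# Toy bigram table (illustrative values only).
bigram = {
    (("<s>",), "a"): 0.5,
    (("a",), "b"): 0.4,
    (("b",), "c"): 0.25,
}
log_p = sentence_log_prob(["a", "b", "c"], bigram, n=2)
prob = math.exp(log_p)   # 0.5 * 0.4 * 0.25
```

A real model would also need smoothing or back-off for unseen word pairs, which this sketch omits.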

[0005] In recent years, many techniques have been proposed for improving the performance of continuous speech recognition using the statistical N-gram language model (see, for example, Prior Art Document 1: L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 179-190, 1983; and Prior Art Document 2: Shimizu et al., "Spontaneous speech recognition using word graphs", IEICE Technical Report, SP95-88, pp. 49-54, 1995).

[0006] However, an N-gram has a large number of parameters, and an enormous amount of text data is needed to estimate each value accurately. Many methods have been proposed to address this problem. One group consists of smoothing techniques that assign a transition probability even to word transitions that do not appear in the learning text data (see, for example, Prior Art Document 3: F. Jelinek et al., "Interpolated estimation of Markov Source Parameters from Sparse Data", Proceedings of Workshop Pattern Recognition in Practice, pp. 381-387, 1980; Prior Art Document 4: S. M. Katz et al., "Estimation of Probabilities from Sparse Data for the Language model Component of a Speech Recognizer", IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 400-401, 1987; and Prior Art Document 5: Kawabata et al., "Back-off smoothing of N-gram statistical language models based on the binomial posterior distribution", IEICE Technical Report, SP95-93, pp. 1-6, 1995). Another group consists of techniques that reduce the number of parameters, such as class-based and variable-length N-grams (see, for example, Prior Art Document 6: P. F. Brown et al., "Class-Based n-gram models of natural language", Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992; Prior Art Document 7: T. R. Niesler et al., "A Variable-Length Category-Based N-gram Language Model", Proceedings of ICASSP '96, Vol. 1, pp. 164-167, 1996; and Prior Art Document 8: Masataki et al., "Variable-length chain statistical language model for continuous speech recognition", IEICE Technical Report, SP95-73, pp. 1-6, 1995). Even with these techniques, however, a considerable amount of data is still considered necessary to build an accurate statistical language model.

[0007] To solve the above problems, a speech recognition method using a tree-structured word dictionary (hereinafter referred to as the conventional example) is disclosed in Prior Art Document 9: Volker Steinbiss et al., "Improvements in beam search", ICSLP 94, Yokohama, Japan, pp. 2143-2146; and Prior Art Document 10: Stefan Ortmanns et al., "A word graph algorithm for large vocabulary continuous speech recognition", Computer Speech & Language, 1997, 11, pp. 43-72. In this conventional example, the approximate language likelihood for a non-terminal state (non-terminal node) of the tree-structured dictionary is the largest of the unigram probabilities of all the words to which that node belongs. Here, the unigram probability of a word is the probability that the single word occurs.

[0008] The unigram look-ahead method based on the statistical language model used in this conventional example is described below. The procedure for setting p_lookahead for each node in the tree-structured word dictionary is as follows.
(1) For each leaf node in the tree-structured word dictionary, compute the maximum of the unigram probabilities P(w) of all words ending at that leaf node (the word set denoted W_leafnode), as shown in the following equation, and set it as the look-ahead probability p_lookahead(leafnode) of that leaf node. Because of homophones and multiple pronunciations, several words may end at a single leaf node.

【０００９】[0009]

(Equation 2)
$$ p_{\mathrm{lookahead}}(\mathrm{leafnode}) = \max_{w \in W_{\mathrm{leafnode}}} P(w) $$

[0010] (2) For every non-leaf node, set its look-ahead probability p_lookahead to the maximum of the look-ahead probabilities p_lookahead(child-node) of all child nodes branching from it toward the leaf nodes.

【００１１】[0011]

(Equation 3)
$$ p_{\mathrm{lookahead}}(\mathrm{non\text{-}leafnode}) = \max \{ p_{\mathrm{lookahead}}(\mathrm{child\text{-}node}) \} $$
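The two steps of the unigram look-ahead (Equations 2 and 3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent describes a binary-tree dictionary, whereas this sketch uses a general prefix tree keyed by phoneme; the `Node` class, the function names, the toy lexicon, and the unigram values are all hypothetical. The sketch also tolerates a word ending at an internal node, a case the text does not discuss.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # phoneme -> child Node
    words: list = field(default_factory=list)     # words ending at this node
    lookahead: float = 0.0                        # p_lookahead(node)


def build_tree(lexicon):
    """Build a prefix tree from {word: [phoneme, ...]}."""
    root = Node()
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, Node())
        node.words.append(word)
    return root


def set_unigram_lookahead(node, unigram):
    """Assign look-ahead probabilities bottom-up and return the node's value."""
    # Equation 2: maximum unigram probability of the words ending here.
    best = max((unigram[w] for w in node.words), default=0.0)
    # Equation 3: maximum over the look-ahead values of the child nodes.
    for child in node.children.values():
        best = max(best, set_unigram_lookahead(child, unigram))
    node.lookahead = best
    return best


# Toy data (illustrative only).
lexicon = {"w1": ["a", "b"], "w2": ["a", "c"], "w3": ["d"]}
unigram = {"w1": 0.2, "w2": 0.5, "w3": 0.3}
tree = build_tree(lexicon)
set_unigram_lookahead(tree, unigram)
```

Because the values depend only on the unigram table, this pass needs to run once before recognition starts.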

[0012] Note that the conventional unigram look-ahead method does not depend on the word hypotheses expanded at the current point; it is therefore a statistical procedure that normally needs to be computed only once, in advance. An example of the conventional method follows. An example of the unigram statistical language model used in this example is shown in the following table, and FIG. 4 shows the tree-structured word dictionary obtained by the above processing.

【００１３】[0013]

【表１】 [Table 1]

【００１４】[0014]

However, because the approximate language likelihood used in the conventional example is based on word unigram probabilities, the approximation accuracy is low and the reduction in the computation time required for recognition is insufficient. Consequently, the computational cost is high, and the memory capacity needed to store the tree-structured word dictionary is relatively large.

[0015] SUMMARY OF THE INVENTION An object of the present invention is to solve the above problems and to provide a speech recognition apparatus that can recognize speech at a higher recognition rate and at higher speed than the conventional example by using a more accurately approximated language likelihood.

【００１６】[0016]

The speech recognition apparatus according to claim 1 of the present invention comprises speech recognition means that generates a tree-structured word dictionary based on learning text data, computes and assigns to each node of the tree a look-ahead probability that is an approximate language likelihood, and recognizes an input speech signal using the tree-structured word dictionary. The apparatus is characterized by comprising storage means for storing a statistical language model containing N-gram probability data of words, N being a natural number of 2 or more, wherein the speech recognition means, for each generated word hypothesis, computes the look-ahead probability, i.e. the approximate language likelihood given to the non-terminal states of words in the tree-structured word dictionary, based on the N-gram probability data of the statistical language model stored in the storage means, thereby updating the tree-structured word dictionary, and recognizes the input speech signal using the updated tree-structured word dictionary.

[0017] The speech recognition apparatus according to claim 2 is the speech recognition apparatus according to claim 1, wherein the speech recognition means comprises: generating means for generating a tree-structured word dictionary based on learning text data; first assigning means for computing and assigning to each leaf node of the tree-structured word dictionary, as its look-ahead probability, the maximum of the unigram probabilities of all words ending at that leaf node; second assigning means for setting and assigning, as the look-ahead probability of every non-leaf node of the tree-structured word dictionary, the maximum probability of all child nodes branching toward the leaf nodes, and storing the resulting tree-structured word dictionary in another storage means; third assigning means for, each time a word hypothesis is generated, computing for each set of word hypotheses the look-ahead probability of each leaf node by extending it to the maximum N-gram probability over all N-gram entries, excluding word unigrams, present in the statistical language model stored in the storage means, and assigning it to the tree-structured word dictionary stored in the other storage means; fourth assigning means for updating the tree-structured word dictionary stored in the other storage means by setting and assigning, as the look-ahead probability of every non-leaf node, the maximum probability of all child nodes branching toward the leaf nodes; and search recognition means for searching for and determining the maximum-likelihood word hypothesis for the input speech signal using the updated tree-structured word dictionary and the statistical language model stored in the storage means, and outputting it as the recognition result.

【００１８】[0018]

Embodiments of the present invention are described below with reference to the drawings.

[0019] FIG. 1 is a block diagram of a continuous speech recognition apparatus according to one embodiment of the present invention. In the continuous speech recognition apparatus of this embodiment, a word dictionary initialization processing unit 10 generates a tree-structured word dictionary in binary-tree form based on the learning text data held in a memory, computes and assigns to each node of the tree a look-ahead probability that is an approximate language likelihood, and stores the dictionary in tree-structured word dictionary memories 21 and 22. A statistical language model memory 23 is also provided, which stores a statistical language model containing N-gram probability data of words, N being a natural number of 2 or more. The apparatus is characterized in that, for each word hypothesis input from a phoneme matching unit 4, a word matching unit 6 computes the look-ahead probability, i.e. the approximate language likelihood given to the non-terminal states of words in the tree-structured word dictionary in the memory 22, based on the N-gram probability data of the statistical language model in the memory 23, thereby updating the tree-structured word dictionary in the memory 22, and recognizes the input speech signal using the updated tree-structured word dictionary.

[0020] The search for the most likely hypothesis in a speech recognition apparatus is performed on the basis of a word dictionary containing all recognizable words. Conventional speech recognition apparatuses usually use a tree-structured word dictionary, in which the recognizable words are represented in memory as a tree rather than as a simple linear list. When a tree-structured word dictionary is used, a procedure called statistical language model look-ahead is used to incorporate the statistical language model probabilities as early as possible during the search of the tree. One frequently used procedure is the unigram look-ahead method described above for the conventional example. In contrast, this embodiment uses an on-demand N-gram look-ahead method that extends the conventional method and can improve the search speed by about 20%.

[0021] First, the statistical language model look-ahead method is described. Look-ahead with a statistical language model is used in many speech recognition apparatuses that use a tree-structured word dictionary. When the search enters the tree-structured word dictionary, the identity of the word is not known until a leaf node (the node at which a word ends, i.e. the terminal state of the word) is reached, so the exact language model probability inside the tree is also unknown. To obtain good high-speed search performance, the language model probability must be incorporated as early as possible while passing through the tree-structured word dictionary. Many speech recognition systems that use a tree-structured dictionary therefore use a procedure called statistical language model look-ahead to incorporate estimates of the language model probability into the tree. A statistical language model look-ahead probability (p_lookahead) is attached to every node of the tree-structured dictionary. Assuming these probabilities have already been set, they are used during the search as follows.

[0022] (a) On entering a node, add p_lookahead(node) to the current overall score.
(b) On leaving a node, subtract p_lookahead(node) from the current overall score.
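Rules (a) and (b) can be sketched as below. The sketch assumes that scores are kept in the log domain, so that "adding" a look-ahead probability means adding its logarithm; the text does not state this explicitly. The function name, the per-node acoustic log scores, and the numbers are hypothetical.

```python
import math


def score_along_path(acoustic_log_scores, lookahead_probs):
    """Walk one path through the tree, applying rules (a) and (b).

    Returns the final score and the score seen at each node. The per-node
    score, which temporarily includes that node's look-ahead term, is the
    value a beam search would compare against its pruning threshold.
    """
    score = 0.0
    trace = []
    for acoustic, p in zip(acoustic_log_scores, lookahead_probs):
        score += math.log(p)    # (a) entering the node
        score += acoustic       # acoustic evidence gathered at this node
        trace.append(score)
        score -= math.log(p)    # (b) leaving the node
    return score, trace


final_score, trace = score_along_path([-1.0, -2.0], [0.5, 0.2])
```

After the last node the look-ahead terms have all been removed again, so only the acoustic part remains; the exact language model probability of the recognized word would be applied separately once the word identity is known.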

[0023] Compared with using no statistical language model look-ahead at all, this method prunes nodes with weak language model probabilities earlier, which speeds up the search. The method for setting p_lookahead for each node in the dictionary is described next.

[0024] The word dictionary initialization processing unit 10 of this embodiment generates a tree-structured word dictionary in binary-tree form based on the text data (corpus) of a plurality of spoken sentences stored in the learning text data memory 20 and on the word unigram probability data in the statistical language model memory 23. It then computes and assigns a look-ahead probability p_lookahead to each node using the conventional method, thereby generating the initial tree-structured word dictionary, stores it in the memory 21, and copies it to the memory 22. Then, by the following processing, each time a word hypothesis is input from the phoneme matching unit 4 to the word matching unit 6 via the buffer memory 5 (on demand), the tree-structured word dictionary in the memory 22 is updated, and the maximum-likelihood word hypothesis is searched for and determined using the tree-structured word dictionary in the memory 22 and the statistical language model in the memory 23 and is output as the recognition result.

[0025] The on-demand N-gram look-ahead method of this embodiment is a novel statistical language model look-ahead procedure that incorporates the constraints of the hypotheses expanded at that point in the processing. Compared with the conventional unigram look-ahead procedure, this improves the estimate of the actual language model probability, which in turn improves pruning accuracy and hence yields a faster search. The speed increase is about 20%.

[0026] The on-demand N-gram look-ahead method proceeds as follows.
(1) Before the search starts, initialize the look-ahead probabilities p_lookahead of all nodes by the unigram look-ahead procedure described above.
(2) For each set of words, compute the hypotheses H_i and extend to the maximum N-gram probability P(w | H_i) over all N-gram entries (H_i, w) present in the statistical language model, excluding the unigrams already set during the unigram initialization. Identify the relevant leaf nodes belonging to the word w (there may be several because of homophones and multiple pronunciations) and set their look-ahead probability p_lookahead to the maximum of the computed probability and the already-set unigram look-ahead probability, as shown in the following equation.

【００２７】[0027]

(Equation 4)
$$ p_{\mathrm{lookahead}}(\mathrm{leafnode}) = \max \{ P(w \mid H_i) \}, \quad \forall H_i \ \text{and} \ \forall w \in \{ (H_i, w) \ \text{present in the N-gram} \} $$

[0028] (3) For every non-leaf node (i.e. a node that is not a leaf node, corresponding to a non-terminal state of a word), set its look-ahead probability p_lookahead to the maximum of the look-ahead probabilities p_lookahead(child-node) of all child nodes branching toward the leaf nodes, as in the following equation.

【００２９】[0029]

(Equation 5)
$$ p_{\mathrm{lookahead}}(\mathrm{non\text{-}leafnode}) = \max \{ p_{\mathrm{lookahead}}(\mathrm{child\text{-}node}) \} $$

[0030] This procedure must be executed each time a new set of word hypotheses is expanded, so it cannot be executed in advance as the ordinary unigram look-ahead method can. Despite this additional work, the more accurate the language model probability, the more accurate the pruning, which leads to a faster overall search.
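The on-demand update in steps (1) to (3) can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: it uses a general prefix tree rather than a binary tree, recomputes the whole tree in one recursive pass, and represents each hypothesis H_i as a tuple of preceding words. The class, function names, toy lexicon, and probabilities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # phoneme -> child Node
    words: list = field(default_factory=list)     # words ending at this node
    lookahead: float = 0.0                        # p_lookahead(node)


def build_tree(lexicon):
    """Build a prefix tree from {word: [phoneme, ...]}."""
    root = Node()
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, Node())
        node.words.append(word)
    return root


def update_lookahead(node, unigram, ngram, hypotheses):
    """Recompute p_lookahead for the subtree given the active hypotheses."""
    best = 0.0
    for w in node.words:
        best = max(best, unigram[w])              # step (1): unigram value
        for h in hypotheses:                      # step (2): raise it using
            if (h, w) in ngram:                   # N-gram entries that exist
                best = max(best, ngram[(h, w)])
    for child in node.children.values():          # step (3): max over children
        best = max(best, update_lookahead(child, unigram, ngram, hypotheses))
    node.lookahead = best
    return best


# Toy data (illustrative only).
lexicon = {"w1": ["a", "b"], "w2": ["a", "c"], "w3": ["d"]}
unigram = {"w1": 0.2, "w2": 0.5, "w3": 0.3}
bigram = {(("w3",), "w1"): 0.7, (("w2",), "w3"): 0.6}
root = build_tree(lexicon)
update_lookahead(root, unigram, bigram, hypotheses=[("w3",)])
```

With the single active hypothesis ending in `w3`, only the entry for `w1` applies, so the leaf of `w1` and its ancestors rise to 0.7 while the other leaves keep their unigram values. The entry with history `w2` is ignored because that hypothesis is not in the list.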

[0031] Next, an example of the on-demand N-gram look-ahead method is described. Note that all N-gram probabilities are used, depending on the list of hypotheses to be expanded. An example of a statistical language model is shown in Table 2, and an example of the list of hypotheses to be expanded, which is stored in the memory 7, is shown in Table 3. FIG. 5 shows the tree-structured word dictionary created from them. The words w_1, w_2, w_3, w_4, ... in the tables are words expressed, for example, as phoneme strings.

【００３２】[0032]

【表２】 [Table 2]

【００３３】[0033]

【表３】 [Table 3]

[0034] The memory 7 for the list of hypotheses to be expanded temporarily stores the history of the word hypotheses produced by the processing of the word matching unit 6. As shown in FIG. 5, in the tree-structured word dictionary the tree grows in binary-tree form from the root node RN toward the leaf nodes LN, and a look-ahead probability p_lookahead is assigned to each node. Each time a word hypothesis is input, the processing of the word matching unit 6 updates the look-ahead probabilities p_lookahead assigned to the nodes, and word matching is performed. Here, the direction from the root node RN toward the leaf nodes LN is the direction toward the child nodes.

[0035] FIG. 2 is a flowchart showing the word dictionary initialization process executed by the word dictionary initialization processing unit 10 of FIG. 1. The statistical language model memory 23 stores in advance N-gram word concatenation probability data of trigram order or higher, based on learning text data that is a corpus containing a plurality of spoken sentences.

[0036] In FIG. 2, in step S1, a tree-structured word dictionary in binary-tree form is generated based on the learning text data in the memory 20 and the unigram probability data of the statistical language model in the memory 23. Next, in step S2, for each leaf node LN of the tree, the maximum of the unigram probabilities of all words ending at that leaf node LN is computed and assigned as the look-ahead probability p_lookahead(leafnode). Then, in step S3, for every node that is not a leaf node LN, the look-ahead probability p_lookahead(non-leafnode) is set to the maximum probability of all child nodes branching toward the leaf nodes. Finally, in step S4, the generated tree-structured word dictionary with probabilities is stored in the memory 21 and also copied to and stored in the memory 22, and the word dictionary initialization process ends.

[0037] FIG. 3 is a flowchart showing the word matching process executed by the word matching unit 6 of FIG. 1. In FIG. 3, first, in step S11, it is determined whether a word hypothesis has been input; the process waits until one is input, and each time one is input the following steps S12 to S14 are executed. In step S12, for each set of word hypotheses H_i, the look-ahead probability p_lookahead(leafnode) of each leaf node is computed and assigned by extending it to the maximum N-gram probability P(w | H_i) over all N-gram entries (H_i, w) present in the statistical language model, excluding word unigrams. In step S13, for every non-leaf node, the look-ahead probability p_lookahead(non-leafnode) is set to the maximum probability of all child nodes branching toward the leaf nodes, and the tree-structured word dictionary in the memory 22 is updated. Finally, in step S14, the maximum-likelihood word hypothesis is searched for and determined using the updated tree-structured word dictionary in the memory 22 and the statistical language model in the memory 23, and is output as the recognition result.

[0038] Next, the configuration and operation of the continuous speech recognition apparatus shown in FIG. 1 are described. In FIG. 1, the phoneme HMMs in the phoneme hidden Markov model memory 11 (hereinafter, a hidden Markov model is referred to as an HMM), which is connected to the phoneme matching unit 4, are represented as a set of states, and each state has the following information:
(a) a state number;
(b) the acceptable context classes;
(c) lists of the preceding and succeeding states;
(d) the parameters of the output probability density distribution; and
(e) the self-transition probability and the transition probabilities to the succeeding states.
Because it is necessary to identify which speaker each distribution derives from, the phoneme HMMs used in this embodiment are generated by converting a predetermined speaker-mixture HMM. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices.
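The per-state information (a) to (e) could be held in a record like the one sketched below, together with the log output density of a diagonal-covariance Gaussian mixture. The field names, the class layout, and the example parameter values are assumptions made for illustration; the patent specifies only which items each state carries and that the features are 34-dimensional.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianComponent:
    weight: float
    mean: List[float]       # one value per feature dimension (34 here)
    variance: List[float]   # diagonal of the covariance matrix


@dataclass
class HmmState:
    state_id: int                       # (a) state number
    context_classes: List[str]          # (b) acceptable context classes
    predecessors: List[int]             # (c) preceding states
    successors: List[int]               # (c) succeeding states
    mixture: List[GaussianComponent]    # (d) output density parameters
    self_transition: float              # (e) self-transition probability
    next_transitions: List[float]       # (e) one probability per successor


def log_output_density(state, x):
    """Log of the diagonal-covariance Gaussian mixture density at x."""
    total = 0.0
    for c in state.mixture:
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in c.variance)
        quad = -0.5 * sum((xi - m) ** 2 / v
                          for xi, m, v in zip(x, c.mean, c.variance))
        total += c.weight * math.exp(log_norm + quad)
    return math.log(total)


# A single-component state with zero mean and unit variance (toy values).
state = HmmState(0, ["any"], [], [1],
                 [GaussianComponent(1.0, [0.0] * 34, [1.0] * 34)],
                 0.6, [0.4])
value = log_output_density(state, [0.0] * 34)
```

A production decoder would normally combine the components with a log-sum-exp to stay numerically stable; the direct sum here is kept for readability.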

[0039] In FIG. 1, a speaker's utterance is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 performs A/D conversion on the input speech signal, then performs, for example, LPC analysis and extracts 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3. Using the one-pass Viterbi decoding method, the phoneme matching unit 4 detects word hypotheses of phoneme sequences with the phoneme HMMs 11 based on the feature parameter data input via the buffer memory 3, calculates their likelihoods, and outputs them to the word matching unit 6 via the buffer memory 5. The word matching unit 6 executes the word matching process of FIG. 3: it updates the tree-structured word dictionary in the memory 22 and, referring to the statistical language model in the memory 23 and the list of hypotheses to be expanded in the memory 7, searches for and determines the maximum-likelihood word hypothesis and outputs it as the recognition result.
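
To illustrate how a 34-dimensional feature vector of the kind described above could be assembled, the following sketch stacks log power and 16 cepstrum coefficients with their Δ counterparts. The publication does not state how the Δ parameters are computed; the symmetric two-frame difference used here, and the names `delta` and `stack_features`, are assumptions for illustration only.

```python
def delta(frames):
    """Approximate time derivative per frame as (next - previous) / 2,
    repeating the first and last frames at the edges."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(log_power, cepstra):
    """Build [log power, 16 cepstra, delta log power, 16 delta cepstra] per frame."""
    static = [[p] + list(c) for p, c in zip(log_power, cepstra)]  # 17 dimensions
    dynamic = delta(static)                                       # 17 dimensions
    return [s + d for s, d in zip(static, dynamic)]               # 34 dimensions


# Hypothetical three-frame input: rising log power, flat cepstrum.
log_power = [0.0, 1.0, 2.0]
cepstra = [[0.0] * 16 for _ in range(3)]
features = stack_features(log_power, cepstra)
```

Each resulting frame has 17 static and 17 dynamic components, matching the 34 dimensions of the output densities of the phoneme HMMs.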

[0040] In the above embodiment, the feature extraction unit 2, the phoneme matching unit 4, the word matching unit 6, and the word dictionary initialization processing unit 10 are implemented by, for example, a computer such as a digital computer, and the buffer memories 3 and 5, the memory 7 for the list of hypotheses to be expanded, the phoneme HMM memory 11, the learning text data memory 20, the tree-structured word dictionary memories 21 and 22, and the statistical language model memory 23 are implemented by a storage device such as a hard disk memory. The statistical language model in the memory 23 is preferably an N-gram statistical language model in which N is a natural number of 2 or more, and more preferably a trigram statistical language model.

[0041] According to this embodiment of the present invention, using the on-demand N-gram look-ahead method described above makes it possible to calculate an accurate approximation of the language likelihood with a smaller storage area than in the conventional example, to recognize speech with a higher recognition rate than in the conventional example, and to greatly reduce the computation time required for recognition.

[0042] In the above embodiment, the word dictionary initialization process of FIG. 2 may also be executed by the word matching unit 6, and may be configured to be executed on demand each time a word hypothesis is input from the phoneme matching unit 4 to the word matching unit 6 via the buffer memory 5.
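
One way to picture on-demand execution is to build the look-ahead values for a word history only when a hypothesis with that history first arrives. The sketch below does this with a per-history table; the caching itself, the class name `OnDemandLookahead`, the prefix-keyed table, and the toy data are assumptions for illustration and are not stated in the publication.

```python
class OnDemandLookahead:
    """Builds look-ahead tables lazily, one per word history actually seen."""

    def __init__(self, lexicon, unigram, ngram):
        self.lexicon = {w: tuple(p) for w, p in lexicon.items()}  # word -> phonemes
        self.unigram = unigram  # word -> probability
        self.ngram = ngram      # (history, word) -> probability
        self.tables = {}        # history -> {phoneme prefix: look-ahead}
        self.builds = 0         # number of tables built so far

    def _build(self, history):
        table = {}
        for word, phones in self.lexicon.items():
            # Use the N-gram probability if present, else the unigram value.
            p = self.ngram.get((history, word), self.unigram[word])
            # Every prefix of the word is a tree node on the path to it.
            for k in range(len(phones) + 1):
                prefix = phones[:k]
                if p > table.get(prefix, 0.0):
                    table[prefix] = p
        return table

    def lookahead(self, history, prefix):
        if history not in self.tables:
            self.tables[history] = self._build(history)
            self.builds += 1
        return self.tables[history].get(tuple(prefix), 0.0)


# Hypothetical toy data.
lexicon = {"cat": "k a t".split(), "cab": "k a b".split(), "dog": "d o g".split()}
unigram = {"cat": 0.2, "cab": 0.1, "dog": 0.3}
ngram = {("the", "cab"): 0.5}

la = OnDemandLookahead(lexicon, unigram, ngram)
first = la.lookahead("the", ["k", "a"])
second = la.lookahead("the", ["k"])  # same history: no new table is built
other = la.lookahead("a", ["k"])     # new history: one more table is built
```

Only histories that actually occur during decoding consume storage in this arrangement, which is the intuition behind the smaller storage area claimed for the on-demand method.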

[0043] In the above embodiment, a tree-structured word dictionary in binary-tree form is generated; however, the present invention is not limited to this, and a tree-structured word dictionary in N-ary tree form, with a plural number N of branches, may be used instead.

[0044]

[Effects of the Invention] As described in detail above, the speech recognition apparatus according to claim 1 of the present invention is a speech recognition apparatus comprising speech recognition means that generates a tree-structured word dictionary based on learning text data, calculates and assigns a look-ahead probability, which is an approximate language likelihood, to each node of the tree structure, and recognizes an input speech signal using the tree-structured word dictionary. The apparatus comprises storage means for storing a statistical language model including N-gram probability data of words, N being a natural number of 2 or more. For each generated word hypothesis, the speech recognition means calculates the look-ahead probability, which is the approximate language likelihood given to the non-terminal states of words in the tree-structured word dictionary, based on the N-gram probability data of the statistical language model stored in the storage means, thereby updating the tree-structured word dictionary, and recognizes the input speech signal using the updated tree-structured word dictionary. Therefore, compared with the conventional example, an accurate approximation of the language likelihood can be calculated with a smaller storage area, speech can be recognized with a higher recognition rate, and the computation time required for recognition can be greatly reduced.

[0045] Further, according to the speech recognition apparatus of claim 2, in the speech recognition apparatus of claim 1, the speech recognition means comprises: generation means for generating a tree-structured word dictionary based on learning text data; first assigning means for calculating and assigning to each leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all unigrams of the words ending at that leaf node; second assigning means for setting and assigning to every non-leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all child nodes branching toward the leaf nodes, and thereby storing the tree-structured word dictionary in another storage means; third assigning means for calculating, for each generated word hypothesis and for each set of word hypotheses, the look-ahead probability of each leaf node by extending it to the maximum N-gram probability of all N-gram input data present in the statistical language model stored in the storage means, excluding word unigrams, and assigning it to the tree-structured word dictionary stored in the other storage means; fourth assigning means for setting and assigning to every non-leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all child nodes branching toward the leaf nodes, thereby updating the tree-structured word dictionary stored in the other storage means; and search and recognition means for searching for and determining the maximum-likelihood word hypothesis for the input speech signal using the updated tree-structured word dictionary and the statistical language model stored in the storage means, and outputting it as the recognition result. Therefore, compared with the conventional example, an accurate approximation of the language likelihood can be calculated with a smaller storage area, speech can be recognized with a higher recognition rate, and the computation time required for recognition can be greatly reduced.

[Brief description of the drawings]

FIG. 1 is a block diagram of a continuous speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the word dictionary initialization process executed by the word dictionary initialization processing unit 10 of FIG. 1.

FIG. 3 is a flowchart showing the word matching process executed by the word matching unit 6 of FIG. 1.

FIG. 4 is a structural diagram showing an example of the tree structure of a conventional tree-structured word dictionary.

FIG. 5 is a structural diagram showing an example of the tree structure of the tree-structured word dictionary of the present embodiment.

[Explanation of symbols]

1 ... microphone, 2 ... feature extraction unit, 3, 5 ... buffer memories, 4 ... phoneme matching unit, 6 ... word matching unit, 7 ... memory for the list of hypotheses to be expanded, 10 ... word dictionary initialization processing unit, 11 ... phoneme HMM memory, 21, 22 ... tree-structured word dictionary memories, 23 ... statistical language model memory.

Continuation of the front page: (72) Inventor: Atsushi Nakamura, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Koaza, Inuidani, Oaza, Seika-cho, Soraku-gun, Kyoto, Japan. F-term (reference): 5D015 BB01 GG01 GG05 HH11

Claims

[Claims]

1. A speech recognition apparatus comprising speech recognition means that generates a tree-structured word dictionary based on learning text data, calculates and assigns a look-ahead probability, which is an approximate language likelihood, to each node of the tree structure, and recognizes an input speech signal using the tree-structured word dictionary, the apparatus further comprising storage means for storing a statistical language model including N-gram probability data of words, N being a natural number of 2 or more, characterized in that the speech recognition means, for each generated word hypothesis, calculates the look-ahead probability, which is the approximate language likelihood given to the non-terminal states of words in the tree-structured word dictionary, based on the N-gram probability data of the statistical language model stored in the storage means, thereby updating the tree-structured word dictionary, and recognizes the input speech signal using the updated tree-structured word dictionary.

2. The speech recognition apparatus according to claim 1, characterized in that the speech recognition means comprises: generation means for generating a tree-structured word dictionary based on learning text data; first assigning means for calculating and assigning to each leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all unigrams of the words ending at that leaf node; second assigning means for setting and assigning to every non-leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all child nodes branching toward the leaf nodes, and thereby storing the tree-structured word dictionary in another storage means; third assigning means for calculating, for each generated word hypothesis and for each set of word hypotheses, the look-ahead probability of each leaf node by extending it to the maximum N-gram probability of all N-gram input data present in the statistical language model stored in the storage means, excluding word unigrams, and assigning it to the tree-structured word dictionary stored in the other storage means; fourth assigning means for setting and assigning to every non-leaf node in the tree-structured word dictionary, as its look-ahead probability, the maximum probability of all child nodes branching toward the leaf nodes, thereby updating the tree-structured word dictionary stored in the other storage means; and search and recognition means for searching for and determining the maximum-likelihood word hypothesis for the input speech signal using the updated tree-structured word dictionary and the statistical language model stored in the storage means, and outputting it as the recognition result.