JPH08123472A - Speech recognition device and method for generating syntax control graph for the device - Google Patents

Speech recognition device and method for generating syntax control graph for the device

Info

Publication number
JPH08123472A
Authority
JP
Japan
Prior art keywords
phoneme
syllable
graph
phonological
speech
Prior art date
Legal status
Granted
Application number
JP6265278A
Other languages
Japanese (ja)
Other versions
JP3668992B2 (en)
Inventor
Yoshiharu Abe
芳春 阿部
Current Assignee
Mitsubishi Electric Corp
Original Assignee
Mitsubishi Electric Corp
Priority date
Filing date
Publication date
Application filed by Mitsubishi Electric Corp filed Critical Mitsubishi Electric Corp
Priority to JP26527894A priority Critical patent/JP3668992B2/en
Publication of JPH08123472A publication Critical patent/JPH08123472A/en
Application granted granted Critical
Publication of JP3668992B2 publication Critical patent/JP3668992B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current


Abstract

PURPOSE: To prevent phoneme strings not prescribed in the syntax control graph from being recognized, and to improve the estimation precision of phoneme boundaries, by providing a boundary likelihood calculating means that computes phoneme boundary likelihoods from the input speech and a model arithmetic means that selects the optimum phoneme model.

CONSTITUTION: The device is equipped with a boundary likelihood calculating means 3, which computes the boundary likelihood of each phoneme from the input speech, and a model arithmetic means, which selects the optimum phoneme model. An HMM arithmetic means 5 based on the Viterbi algorithm, in which the summation of the HMM computation based on the ordinary trellis algorithm is replaced by maximization, is used as the model arithmetic means. The boundary likelihood calculating means 3 computes the boundary likelihood of each phoneme from the input speech, and the model arithmetic means (HMM arithmetic means) 5 selects the optimum phoneme model only when the boundary likelihood computed by the boundary likelihood calculating means 3 is larger than a specific value and the selected phoneme model satisfies the constraints on phoneme strings of length three or more.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition apparatus that recognizes continuous speech and converts it into a phoneme sequence.

[0002]

2. Description of the Related Art

Continuous speech can be regarded as a concatenation of phonemes. A speech recognition apparatus of this kind analyzes the input speech according to phoneme model sequences, which are concatenations of phoneme models, obtains by a model arithmetic means the optimum phoneme model sequence fitting the input speech, and converts the input speech into the phoneme sequence of that optimum model sequence. A speech recognition apparatus that, separately from the fitting of the phoneme model sequence, detects phoneme boundaries directly in the input speech and improves recognition accuracy by restricting transitions between phoneme models to the vicinity of the detected boundaries during the fitting is described in "Phoneme Description by State-Transition-Bound HMM" (Proceedings of the Acoustical Society of Japan, 1-8-5, October 1993). Furthermore, "State-Transition-Bound HMM Considering the Reliability of Boundary Likelihood" (Proceedings of the Acoustical Society of Japan, 2-P-11, March 1994) describes a method that uses a common codebook to detect phoneme boundaries.

[0003] FIG. 10 shows the configuration of a speech recognition apparatus based on this kind of state-transition-bound HMM (HMM/BT). Each part of FIG. 10 is described below. A speech section detection means 1 detects a speech section from the shape of the short-term power contour of the input speech, cuts out the speech signal R1 within this section, and sends it to a feature extraction means 2. The feature extraction means 2 extracts from the speech signal R1, every 10 ms, a feature parameter time series R2 consisting of the 0th- to 10th-order mel-cepstral coefficients obtained by 15th-order linear-predictive mel-cepstrum analysis with a 25.6 ms time window, and sends it to a boundary likelihood calculating means 3 and to an HMM arithmetic means 5a serving as the model arithmetic means.

[0004] The boundary likelihood calculating means 3 refers to the phoneme boundary parameters R4 stored in a phoneme boundary parameter storage means 4. From the feature parameter time series R2, for each time t = 1, 2, ..., T, it extracts the 0th- to 7th-order mel-cepstral coefficients over a 10-frame range centred on time t, a total of 80 (= 10 frames x 8 dimensions) values, as one 80-dimensional vector (hereinafter called a fixed-length segment), and computes the likelihood (boundary likelihood) that a phoneme boundary in the input speech lies at the centre of the segment. The boundary likelihood that the boundary ij between phonemes i and j lies at the centre of the fixed-length segment at time t (hereinafter written Bt) is denoted Cij(Bt) and is computed according to equation (1). The denominator of equation (1) is the likelihood that no boundary of the phoneme pair ij lies at the centre of Bt, and the numerator corresponds to the likelihood that such a boundary does lie there, so equation (1) as a whole expresses the boundary likelihood of ij at time t. In the equation, Mb is the number of common element distributions (the size of the common codebook), N(Bt|μm, Σm) is the multidimensional normal probability density function with the mean μm and covariance Σm of the m-th element distribution, and Pmij and Qmij are polynomial coefficients learned in advance for the boundary ij.

[0005]

[Equation 1]
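Equation (1) itself is an image not reproduced in the text, so the sketch below assumes the form the surrounding description implies: a ratio of the boundary-present likelihood (coefficients Pmij) to the boundary-absent likelihood (coefficients Qmij), each a weighted combination of the Mb shared codebook Gaussian densities. Diagonal covariances are a further simplification.

```python
import numpy as np

def gaussian_density(x, mu, var):
    """Diagonal-covariance multivariate normal density N(x | mu, var)."""
    d = x - mu
    return float(np.exp(-0.5 * np.sum(d * d / var + np.log(2.0 * np.pi * var))))

def boundary_likelihood(B_t, mus, variances, P_ij, Q_ij):
    """C_ij(B_t): ratio of the boundary-present likelihood (numerator,
    coefficients P_ij) to the boundary-absent likelihood (denominator,
    coefficients Q_ij), both combinations of the Mb codebook Gaussians.
    The exact combination rule is an assumption about equation (1)."""
    dens = np.array([gaussian_density(B_t, mu, v) for mu, v in zip(mus, variances)])
    return float(P_ij @ dens) / float(Q_ij @ dens)
```

When the two coefficient vectors coincide the ratio is 1, i.e. the segment gives no evidence either way about the boundary.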

[0006] Next, the HMM arithmetic means 5a is described. FIG. 11 schematically shows the structure of the phoneme sequence HMM on which the HMM arithmetic means 5a operates. This HMM consists of exactly as many states as there are phonemes (say n), and each state corresponds to one phoneme. The transition probability from state i to state j is denoted aij, and the output probability of the feature parameter xt at time t in state j is denoted bj(xt). The output probability bj(xt) is represented by a mixture Gaussian distribution of M element distributions common to all phonemes, and is computed by equation (2) from the probability density function N(xt|μm, Σm) of the m-th element Gaussian with mean vector μm and covariance matrix Σm, and the branch probability λmj of phoneme j.

[0007]

[Equation 2]
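The tied-mixture output probability of equation (2) can be sketched as the sum over the M shared Gaussians weighted by the branch probabilities of phoneme j; diagonal covariances are assumed here for simplicity.

```python
import numpy as np

def output_prob(x_t, mus, variances, lam_j):
    """b_j(x_t) = sum_m lam_j[m] * N(x_t | mu_m, Sigma_m): tied-mixture
    output probability built from the M shared element Gaussians and the
    branch probabilities lam_j of phoneme j (diagonal covariances assumed)."""
    total = 0.0
    for mu, var, lam in zip(mus, variances, lam_j):
        d = x_t - mu
        total += lam * float(np.exp(-0.5 * np.sum(d * d / var + np.log(2.0 * np.pi * var))))
    return total
```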

[0008] The HMM arithmetic means 5a refers to the boundary likelihood R3 output by the boundary likelihood calculating means 3 and to the HMM parameters R6 stored in an HMM parameter storage means 6, and computes the recurrence relations of equations (3) and (4), which are based on the Viterbi algorithm, under the initial conditions given by equation (5). Here, α(j, t) is the probability of being in state j at time t (the forward probability), and β(j, t) is a back pointer holding the optimum state number immediately before reaching state j at time t.

[0009]

[Equation 3]

[0010] As the recurrence relations show, at a transition from state i to state j at time t, corresponding to a transition between phoneme models, the HMM arithmetic means 5a refers to the boundary likelihood Cij(Bt), compares it with a threshold θij that depends on the phoneme boundary ij, and permits the inter-state transition only when the boundary likelihood exceeds this threshold (Cij(Bt) > θij). As a result, inter-state transitions occur only at phoneme boundaries estimated in the input speech, which reduces insertion errors. Transitions within the same phoneme (when i = j) are not restricted by the boundary likelihood Cij(Bt).
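The gated Viterbi recurrence can be sketched as follows, in the log domain. Since equations (3) to (5) are images not reproduced in the text, the initial condition is assumed here to start every state at t = 0; the gating rule (self-loops always allowed, cross-state moves only where Cij(Bt) > θij) follows the description above.

```python
import numpy as np

def viterbi_bt(log_b, log_a, C, theta):
    """Viterbi recurrence with boundary-bound transitions (HMM/BT).
    log_b[j, t]: log output probability; log_a[i, j]: log transition
    probability; C[i, j, t]: boundary likelihood C_ij(B_t); theta[i, j]:
    boundary threshold.  A cross-state move i -> j at time t is considered
    only when C[i, j, t] > theta[i, j]; self-loops are always allowed.
    The initial condition (equation (5) in the text) is assumed."""
    n, T = log_b.shape
    alpha = np.full((n, T), -np.inf)
    beta = np.zeros((n, T), dtype=int)
    alpha[:, 0] = log_b[:, 0]
    for t in range(1, T):
        for j in range(n):
            best_i, best = j, alpha[j, t - 1] + log_a[j, j]  # self-loop
            for i in range(n):
                if i != j and C[i, j, t] > theta[i, j]:       # gated move
                    s = alpha[i, t - 1] + log_a[i, j]
                    if s > best:
                        best_i, best = i, s
            alpha[j, t] = best + log_b[j, t]
            beta[j, t] = best_i
    return alpha, beta
```

Raising θij above the boundary scores blocks the cross-state path entirely, which is exactly the mechanism the text uses to confine transitions to estimated boundaries.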

[0011] An optimum state sequence detection means 7a, serving as the phoneme sequence conversion means, outputs the optimum state sequence R7 (hereinafter written β^(1), β^(2), ..., β^(T)) from the values of the forward probabilities α(j, t) and back pointers β(j, t) obtained as the HMM computation result R5. The optimum state sequence R7 is obtained by computing the recurrence relation of equation (6) under the initial condition of equation (7). The sequence R7 represents the phoneme sequence of the recognition result as a sequence of state numbers.

[0012]

[Equation 4]
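The backtracking of equations (6) and (7) can be sketched directly: the final state maximizes the last column of α, and each earlier state is read from the back pointer of its successor.

```python
import numpy as np

def backtrack(alpha, beta):
    """Recover the optimum state sequence beta^(1..T) from the forward
    scores alpha[j, t] and back pointers beta[j, t]."""
    T = alpha.shape[1]
    path = [int(np.argmax(alpha[:, T - 1]))]   # initial condition, eq. (7)
    for t in range(T - 1, 0, -1):              # recurrence, eq. (6)
        path.append(int(beta[path[-1], t]))
    return path[::-1]
```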

[0013] The conventional state-transition-bound HMM (HMM/BT) speech recognition apparatus described above restricts the inter-state transitions corresponding to transitions between phonemes to the vicinity of phoneme boundaries statistically estimated from the input speech. The boundary information obtained directly from the input speech suppresses state transitions away from phoneme boundaries, and insertion errors are suppressed as a result. Relatively high recognition accuracy is therefore obtained, but some recognition errors still occur. Analysing these errors shows that the recognition results contain phoneme strings that are linguistically impossible as speech data, for example [k, ts, sh, ts, sh] (hereinafter a phoneme string is written by enclosing the phonemes in square brackets in this way). If such phoneme strings could be suppressed by linguistic knowledge, the recognition performance could therefore be improved further.

[0014]

Problems to Be Solved by the Invention

In the conventional HMM/BT speech recognition apparatus it is easy to impose such a restriction on a string of two phonemes: if, for example, the phoneme string [p, q] resulting from a transition from state i of the model of a phoneme p to state j of the model of another phoneme q is linguistically impossible as speech data, the transition can simply be prohibited by setting the threshold θij for the boundary likelihood Cij(Bt) to infinity. With this method, however, phoneme strings such as [k, ts], [k, sh] and [sh, ts] must remain recognizable, because they occur in utterances such as "kutsu", "kushi" and "shitsu" when the first vowel is devoiced. Since the inter-state transitions corresponding to phoneme strings such as [k, ts], [ts, sh] and [sh, ts] therefore cannot be prohibited, a phoneme string such as the one above, [k, ts, sh, ts, sh], can still appear as a recognition result. The conventional HMM/BT speech recognition apparatus thus has the problem that, for phoneme strings of length three or more, it cannot prevent strings that are linguistically impossible as speech data from being recognized.

[0015] Conventionally, there are speech recognition methods that use, as the language model, a statistical language model or automaton control over strings of linguistic symbols such as phonemes or syllables, together with acoustic models of recognition units such as phonemes or syllables. Methods of this kind include: a method that uses a statistical language model of the order of occurrence of phonemes as linguistic symbols together with phoneme HMMs as acoustic models; a method that uses a statistical language model of the order of occurrence of kana and kanji as linguistic symbols together with, as acoustic models, syllable HMMs corresponding to kana and HMMs corresponding to kanji readings (for example, Japanese Patent Laid-Open No. 4-73694); and a method that uses, as the language model, a directed graph in which a syllable automaton designed in advance to accept Japanese syllable strings defines the order of occurrence of phonemes as linguistic symbols, together with phoneme HMMs as acoustic models (for example, "Phoneme Recognition by HMM Using a Syllable Automaton and Speaker Adaptation", Proceedings of the Acoustical Society of Japan, 2-P-26, March 1990). A method using a statistical language model of syllable chains as linguistic symbols together with phoneme HMMs as acoustic models has also been proposed (for example, "Use of Syllable-Chain Statistics in HMM Phoneme Recognition", Proceedings of the Acoustical Society of Japan, 3-3-9, March 1990). The method using the syllable-chain language model in particular has little task dependency, and a strong constraint can be expected from it.

[0016] In the above techniques, phoneme HMMs, syllable HMMs, or HMMs for kanji readings are used as acoustic models, together with language models over the strings of linguistic symbols corresponding to these HMMs. The HMMs used as acoustic models have a number of states and an inter-state transition structure that are fixed in advance. In continuous speech, however, when a syllable as a linguistic symbol is realized as a time-series structure of acoustic-phonetic phoneme features, the spectrum of each phoneme feature section varies with the speech environment and with individual speakers, and it is known that the phoneme structure of the syllable itself, i.e. the phoneme feature series within the syllable, also varies through the loss of phoneme feature sections such as vowel devoicing or the dropping of the buzz bar (for example, "A Study of Syllable Variation for Continuous Syllable Recognition", Technical Report of the Speech Study Group, S84-69, 1984). If we hereafter call such a phoneme feature section a phoneme section and its label a phoneme symbol, it is therefore considered more appropriate to represent the acoustic model of a syllable or phoneme by a phoneme network structure that accommodates the speech environment and individual differences than by a predetermined state transition structure as in the conventional examples.

[0017] For example, observing in a speech database what phoneme strings the syllable "tsu" (/cu/ in phonemic notation) in continuous speech is realized as, one finds realizations such as [ts], [ts, u], [cl, ts] and [cl, ts, u], caused by the dropping of the vowel or of the consonant closure section (or its fusion with the preceding phoneme). The above method using syllable HMMs, however, trains an HMM with a graph structure of a predetermined number of states even on data with such structural variation. Thus, even though the syllable "tsu" (phoneme string /cu/) is realized in the speech database as the phoneme strings [ts], [ts, u], [cl, ts] and [cl, ts, u], a syllable HMM with a graph structure of a predetermined number of states is trained on all of them. As a result, the accuracy of the acoustic model (syllable HMM) deteriorates for unknown speech in which the phoneme structure inside the syllable has changed through phoneme dropping and the like, and not only through spectral variation. The same applies when phoneme HMMs are used: even though the phoneme /c/ representing the consonant part of "tsu" is realized in the speech database as phoneme strings such as [ts] and [cl, ts], a phoneme HMM with a graph structure of a predetermined number of states is trained on them, so the accuracy of the acoustic model (phoneme HMM) deteriorates for unknown speech in which the phoneme structure inside the phoneme has been deformed by phoneme dropping and the like, and not only by spectral variation. That is, the conventional speech recognition methods using phoneme or syllable HMMs and their language models employ acoustic models with a predetermined state transition structure, and therefore have the problem that the model accuracy deteriorates for unknown speech in which the phoneme structure inside a syllable or phoneme has varied through phoneme dropping and the like, along with the spectral variation of the phonemes inside it.

[0018] The present invention has been made to solve the above problems. A first object is to provide a speech recognition apparatus with improved recognition accuracy, in which, when a phoneme model sequence is fitted to the input speech and the input speech is converted into the optimum phoneme sequence while the transition times between phoneme models are bound to the vicinity of phoneme boundaries estimated from the input speech, a constraint on the order of occurrence of phonemes is introduced for phoneme strings of length three or more, preventing phoneme strings that are impossible as speech data from being recognized, and the types of phoneme boundaries hypothesized for the input speech are limited, improving the estimation accuracy of the phoneme boundaries. A second object of the present invention is to provide, as a method of generating the syntax control graph of a speech recognition apparatus that fits phoneme model sequences to the input speech according to a syntax control graph and converts the input speech into the optimum phoneme sequence, a method of generating a syntax control graph that models both the spectral variation of the phoneme sections inside phonemes and syllables and the variation of the phoneme structure inside syllables and phonemes caused by phoneme dropping and the like.

[0019]

Means for Solving the Problems

The speech recognition apparatus of claim 1 of the present invention comprises a boundary likelihood calculating means that computes the boundary likelihoods of phonemes from the input speech, and a model arithmetic means that selects the optimum phoneme model only when the boundary likelihood of the phoneme is larger than a predetermined value and the constraint on phoneme strings of length three or more is satisfied.

[0020] The speech recognition apparatus of claim 2 of the present invention comprises a boundary likelihood calculating means that computes the boundary likelihood of a phoneme from the input speech according to the type of the phoneme boundary, and a model arithmetic means that selects the optimum phoneme model only when the boundary likelihood of the phoneme is larger than the value set for that type of phoneme boundary and the phonemes follow the order of occurrence prescribed by a syntax control graph that defines the order of occurrence of phonemes in phoneme strings of length three or more.

[0021] The method of generating the syntax control graph of the speech recognition apparatus of claim 3 of the present invention comprises a step of obtaining, from a text database, a syllable syntax graph that defines the order of occurrence of syllables; a step of obtaining, from a speech database, intra-syllable phoneme graphs that define the order of occurrence of the phonemes within a syllable; and a step of substituting the intra-syllable phoneme graphs into the syllable-equivalent parts of the syllable syntax graph.

[0022] The method of generating the syntax control graph of the speech recognition apparatus of claim 4 of the present invention applies to a speech recognition apparatus that analyses the input speech, regards it as a concatenation of phoneme models, fits phoneme model sequences to it as prescribed by the syntax control graph, and converts it into the optimum phoneme string. The method comprises a step of extracting from a text database syllables together with the syllable contexts surrounding them as syllable data and obtaining a syllable syntax graph that defines the order of occurrence of these syllable data; a step of obtaining from a speech database, for each syllable context of the syllable data, an intra-syllable phoneme graph that defines the order of occurrence of the phonemes within the syllable; and a step of substituting, into each part of the syllable syntax graph corresponding to a syllable datum, the intra-syllable phoneme graph obtained from the syllable context matching that of the syllable datum.

[0023] In the speech recognition apparatus of claim 5 of the present invention, the model arithmetic means uses a syntax control graph generated by the generation method of claim 3 or 4.

[0024]

Operation

In the speech recognition apparatus of claim 1 of the present invention, the boundary likelihood calculating means computes the boundary likelihoods of phonemes from the input speech, and the model arithmetic means selects the optimum phoneme model only when the boundary likelihood computed by the boundary likelihood calculating means is larger than a predetermined value and the phoneme model to be selected satisfies the constraint on phoneme strings of length three or more.

[0025] In the speech recognition apparatus of claim 2 of the present invention, the boundary likelihood calculating means computes from the input speech the boundary likelihood of a phoneme according to the type of the phoneme boundary, and the model arithmetic means selects the optimum phoneme model only when that boundary likelihood is larger than the value set for that type of phoneme boundary and the phoneme model to be selected satisfies the constraint on phoneme strings of length three or more.

[0026] In the method of generating the syntax control graph of the speech recognition apparatus of claim 3 of the present invention, the step of obtaining the syllable syntax graph obtains from a text database a syllable syntax graph corresponding to a syllable automaton that defines the order of occurrence of syllables. The step of obtaining the intra-syllable phoneme graphs extracts the phoneme strings of the syllable sections in a speech database and obtains intra-syllable phoneme graphs corresponding to phoneme automata that define the order of occurrence of the phonemes within a syllable. The final step substitutes the intra-syllable phoneme graphs into the syllable-equivalent parts of the syllable syntax graph.
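The substitution step above can be sketched as follows. For readability the syllable syntax graph is taken to be a linear syllable string, intra-syllable graphs are given as (prev, next) edge sets with hypothetical '<'/'>' entry and exit nodes, and the result is enumerated as the phoneme strings the expanded syntax control graph accepts; the representation and node names are illustrative assumptions, not the patent's data structures.

```python
from itertools import product

def paths(edges):
    """Enumerate the phoneme strings accepted by an intra-syllable phoneme
    graph given as (prev, next) edges with '<'/'>' entry and exit nodes
    (the observed graphs are acyclic, so enumeration terminates)."""
    out, stack = [], [("<", [])]
    while stack:
        node, prefix = stack.pop()
        for a, b in edges:
            if a == node:
                if b == ">":
                    out.append(prefix)
                else:
                    stack.append((b, prefix + [b]))
    return out

def substitute(syllables, intra_graphs):
    """Replace each syllable node of a (here linear) syllable syntax graph
    by its intra-syllable phoneme graph and return every phoneme-level
    realization the resulting syntax control graph accepts."""
    per_syllable = [paths(intra_graphs[s]) for s in syllables]
    return [sum(combo, []) for combo in product(*per_syllable)]
```

With an intra-syllable graph for "tsu" built from the variants [ts] and [ts, u], the syllable string "ku tsu" expands to the phoneme strings [k, u, ts] and [k, u, ts, u], so a devoiced realization is accepted while an unobserved string such as [k, ts, sh, ts, sh] is not.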

[0027] In the method of generating the syntax control graph of the speech recognition apparatus of claim 4 of the present invention, the step of obtaining the syllable syntax graph extracts from a text database syllables together with the syllable contexts surrounding them as syllable data and obtains a syllable syntax graph corresponding to an automaton over these syllable data. The step of obtaining the intra-syllable phoneme graphs extracts from a speech database, for each syllable context of the syllable data, the phoneme strings of the syllable sections and obtains an intra-syllable phoneme graph for each syllable context. The final step substitutes, into each part of the syllable syntax graph corresponding to a syllable datum, the intra-syllable phoneme graph obtained from the syllable context matching that of the syllable datum.

【0028】本発明の請求項5の音声認識装置におい
て、前記モデル演算手段は、請求項3又は4に記載の構
文制御グラフの生成方法に基づいて生成された構文制御
グラフを用いる。
In the speech recognition apparatus according to claim 5 of the present invention, the model calculation means uses a syntax control graph generated based on the syntax control graph generation method according to claim 3 or 4.

【0029】[0029]

【実施例】以下この発明の一実施例を説明する。この実
施例においては、音響モデルとして、1音韻を1状態で
表す音韻のHMM(音韻HMM)を用い、ある入力音声
はこれらの音韻HMMの列で表される。また、音韻は、
子音閉鎖区間を子音破裂部とは別音韻と見なした、図2
に示す29種類の音韻からなる体系を用いており、各音
韻はこの図のように番号付けされている。以後、音韻は
この番号で参照される。また、本実施例では、音響モデ
ルとして、1音韻を1状態で表す音韻HMMを用いてい
るため、音韻境界の生成は、音韻HMMの状態間の遷移
として現れる。なお、1音韻に複数状態を有する音韻H
MMを用いるときでも本発明は適用可能であることは言
うまでもなく、この場合、音韻境界の生成は、音韻HM
M間の遷移に対応する状態間の遷移として現れる。ま
た、本実施例では、モデル演算手段として、通常のトレ
リスアルゴリズムに基づくHMM演算における和の演算
を最大化の演算に置き換えたビタビのアルゴリズムに基
づくHMM演算手段を用いている。なお、通常のトレリ
スアルゴリズムに基づくHMM演算においても本発明が
適用できることは言うまでもない。
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the present invention will be described below. In this embodiment, a phoneme HMM that represents one phoneme with one state is used as the acoustic model, and an input speech signal is represented by a sequence of these phoneme HMMs. The phoneme inventory is the system of 29 phoneme types shown in FIG. 2, in which the consonant closure interval is treated as a phoneme distinct from the consonant burst; each phoneme is numbered as shown in that figure and is referred to by this number hereafter. Since the acoustic model represents one phoneme with one state, a phoneme boundary appears as a transition between states of the phoneme HMMs. Needless to say, the present invention is also applicable when phoneme HMMs with multiple states per phoneme are used; in that case a phoneme boundary appears as a transition between the states that correspond to transitions between phoneme HMMs. Further, in this embodiment, the model calculation means is an HMM calculation means based on the Viterbi algorithm, in which the summation in the HMM calculation based on the ordinary trellis algorithm is replaced by maximization. Needless to say, the present invention is also applicable to HMM calculation based on the ordinary trellis algorithm.

【0030】図1は、この発明の一実施例の音声認識装
置の構成図である。以下図1の各部を説明する。音声区
間検出手段1は、入力音声の短区間パワーの変化形状に
より音声区間を検出し、この音声区間内の音声信号R1
を切り出して特徴抽出手段2に送る。特徴抽出手段2
は、音声区間内の音声信号R1中から長さ25.6ms
の時間窓を用いた15次線形予測メルケプストラム分析
によって10ms毎に0〜10次のメルケプストラム係
数からなる特徴パラメータ時系列R2を抽出し境界尤度
計算手段3、及び、モデル演算手段としてのHMM演算
手段5に送る。
FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. Each part of FIG. 1 is described below. The speech interval detection means 1 detects a speech interval from the change in short-interval power of the input speech, cuts out the speech signal R1 within this interval, and sends it to the feature extraction means 2. The feature extraction means 2 extracts, from the speech signal R1 in the speech interval, a feature parameter time series R2 consisting of the 0th- to 10th-order mel-cepstrum coefficients every 10 ms by 15th-order linear predictive mel-cepstrum analysis using a 25.6 ms time window, and sends it to the boundary likelihood calculation means 3 and to the HMM calculation means 5 serving as the model calculation means.

【0031】境界尤度計算手段3は、音韻境界パラメー
タ記憶手段4に記憶されている音韻境界パラメータR4
を参照して、特徴パラメータ時系列R2より時刻t=
1,2,…,Tについて、時刻tを中心に時間幅10フ
レームの範囲の0〜7次のメルケプストラム係数合計8
0(=10フレーム×8次元)個を1つの80次元ベク
トル(以後、固定長セグメントと呼ぶ)として抽出し、
これら固定長セグメントの中心に入力音声中の音韻境界
が存在する尤度(境界尤度)を計算する。中心時刻tの
固定長セグメント(以後、Btと記す)の中央に音韻i
とjの間の音韻境界ijが存在する境界尤度(以後、Cij
(Bt)と記す)は式(1)に従って計算される。ここ
で、式(1)の分母は固定長セグメントBtの中央に音
韻ijの境界が存在しないとする時の尤度で、分子は固定
長セグメントBtの中央に音韻ijの境界が存在するとす
る時の尤度に対応し、式(1)は全体として、音韻ijの
時刻tにおける境界尤度を表す。但し式中、Mbは共通要
素分布の数(共通コードブックのサイズ)、N(Bt|μ
m,Σm)は第m番目の要素分布の平均μm及び分散Σmの
多次元正規確率密度関数、Pmij及びQmijは音韻境界ij
について予め学習によって求められた多項式係数であ
る。
The boundary likelihood calculation means 3 refers to the phoneme boundary parameters R4 stored in the phoneme boundary parameter storage means 4 and, for each time t = 1, 2, …, T of the feature parameter time series R2, extracts the 0th- to 7th-order mel-cepstrum coefficients over a 10-frame window centered on time t, a total of 80 (= 10 frames × 8 dimensions) values, as one 80-dimensional vector (hereafter called a fixed-length segment), and computes the likelihood (boundary likelihood) that a phoneme boundary of the input speech exists at the center of each fixed-length segment. The boundary likelihood that the phoneme boundary ij between phonemes i and j exists at the center of the fixed-length segment at center time t (hereafter Bt) is denoted Cij(Bt) and is computed according to equation (1). Here, the denominator of equation (1) is the likelihood under the hypothesis that no boundary of the phoneme pair ij exists at the center of the fixed-length segment Bt, and the numerator corresponds to the likelihood under the hypothesis that such a boundary does exist there, so that equation (1) as a whole represents the boundary likelihood of the phoneme pair ij at time t. In the equation, Mb is the number of shared component distributions (the size of the shared codebook), N(Bt|μm, Σm) is the multidimensional normal probability density function with the mean μm and covariance Σm of the m-th component distribution, and Pmij and Qmij are polynomial coefficients obtained in advance by training for the phoneme boundary ij.

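As a concrete illustration, the fixed-length segment Bt described above can be sketched in Python. This is not code from the patent: the exact window alignment (a 10-frame window cannot be perfectly centered) and the clamping of frames at the utterance edges are assumptions made for this sketch.

```python
import numpy as np

def fixed_length_segment(features, t, width=10, max_order=7):
    """Extract the 80-dimensional fixed-length segment B_t.

    features: (T, 11) array of 0th-10th order mel-cepstrum coefficients.
    Takes a `width`-frame window around time t and keeps the 0th..`max_order`
    coefficients, flattened to width * (max_order + 1) = 80 dimensions.
    Frames outside [0, T) are clamped to the edges (an assumption; the
    patent does not specify edge handling).
    """
    T = features.shape[0]
    half = width // 2
    idx = np.clip(np.arange(t - half, t + half), 0, T - 1)
    return features[idx, : max_order + 1].reshape(-1)
```

Equation (1) itself (the numerator/denominator likelihood ratio) is not reproduced here, since its exact polynomial form is given only in the patent's figure.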
【0032】次に、HMM演算手段5について説明す
る。図3はHMM演算手段5が演算対象とするHMMの
構造を模式的に示したものである。本HMM全体は、1
音韻当たり1状態で表された音響モデルとしての音韻H
MM(本例では全部で29種類ある)を構文制御グラフ
としての音韻構文グラフに従って状態間遷移で連結した
ものである。即ち、本HMM中の状態を遷移して得られ
る状態の系列は音韻構文グラフに従って生成され得るあ
る音韻の列に対応している。(音韻構文グラフの生成方
法は後で説明する。) 特に、本図は、HMM演算手段5の漸化式計算を説明す
るため、状態pから状態qへの状態遷移の様子を示す。
音韻構文グラフの各状態は、ある1つの音韻に対応づけ
られていて、例えば図3では状態qは音韻jに対応付け
られていて、状態qにおける時刻tの特徴パラメータx
tの出力確率は音韻jのパラメータを用いてbj(xt)と
して計算される。状態pから状態qへの遷移は、音韻i
から音韻jの音韻境界パラメータに基づく境界尤度Cij
(Bt)が閾値θijより大きく、かつ、構文制御グラフで
状態pから状態qへの状態間の遷移が許される(これ
は、漸化式中δpq=1で示される)時、可能である。ま
た、状態pから状態qへの遷移確率は、apqで示されて
いる。出力確率bj(xt)は、全部でM個の共通ガウス分
布の混合分布で表されており、第m番目の共通ガウス分
布の平均ベクトルμm及び共分散行列Σm、及び、音韻j
の分岐確率λmjをパラメータとして、式(8)で計算さ
れる。式中、N(xt|μm,Σm)は平均μm、分散Σmの正
規確率密度関数を表す。上記出力確率計算用のパラメー
タはHMMパラメータ記憶手段6に記憶されている。
Next, the HMM calculation means 5 is described. FIG. 3 schematically shows the structure of the HMM on which the HMM calculation means 5 operates. The HMM as a whole is obtained by connecting the phoneme HMMs (29 types in total in this example), each an acoustic model with one state per phoneme, by inter-state transitions according to the phoneme syntax graph serving as the syntax control graph. That is, any sequence of states obtained by transitioning through the states of this HMM corresponds to some phoneme sequence that can be generated according to the phoneme syntax graph. (The method of generating the phoneme syntax graph is described later.) In particular, the figure shows the state transition from state p to state q in order to explain the recurrence calculation of the HMM calculation means 5. Each state of the phoneme syntax graph is associated with one phoneme; for example, in FIG. 3 state q is associated with phoneme j, and the output probability of the feature parameter xt at time t in state q is computed as bj(xt) using the parameters of phoneme j. The transition from state p to state q is possible when the boundary likelihood Cij(Bt), based on the boundary parameters of phonemes i and j, is greater than the threshold θij, and the transition from state p to state q is permitted by the syntax control graph (indicated by δpq = 1 in the recurrence formula). The transition probability from state p to state q is denoted apq. The output probability bj(xt) is represented by a mixture of M shared Gaussian distributions in total, and is computed by equation (8) using as parameters the mean vector μm and covariance matrix Σm of the m-th shared Gaussian and the branch probability λmj of phoneme j. In the equation, N(xt|μm, Σm) denotes the normal probability density function with mean μm and covariance Σm. The parameters for this output probability computation are stored in the HMM parameter storage means 6.

【0033】[0033]

【数5】 (Equation 5)

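The tied-mixture output probability bj(xt) described above, a weighted sum of M shared Gaussians with the phoneme-dependent branch probabilities λmj as weights, can be sketched as follows. This is an illustrative sketch in the spirit of equation (8), not the patent's own code; diagonal covariances are assumed here for simplicity, a restriction the patent does not state.

```python
import numpy as np

def output_probability(x, means, covs, branch_probs):
    """Tied-mixture output probability b_j(x_t):
    sum_m lambda_mj * N(x | mu_m, Sigma_m) over M shared Gaussians.

    means, covs: (M, d) arrays of per-Gaussian means and diagonal variances
    branch_probs: (M,) branch probabilities lambda_mj for phoneme j
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    diff = x - means                                   # (M, d)
    # log N(x | mu_m, diag(covs_m)) for each shared Gaussian m
    log_norm = -0.5 * (d * np.log(2 * np.pi)
                       + np.sum(np.log(covs), axis=1)
                       + np.sum(diff * diff / covs, axis=1))
    return float(np.sum(branch_probs * np.exp(log_norm)))
```

Because the Gaussians are shared across all phonemes (the semi-continuous design), only the branch probabilities differ per phoneme, which keeps the per-frame computation cheap.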
【0034】HMM演算手段5は境界尤度計算手段3出
力の境界尤度R3及びHMMパラメータR6及び構文制
御手段8に記憶された構文制御情報R8を参照しビタビ
アルゴリズムに基づくHMM演算を行う。構文制御情報
R8は構文制御グラフとしての音韻構文グラフを変換し
た結果として、各状態に対応する音韻番号の情報と、状
態間の接続を表す情報からなる。状態は全部で、n状態
あり、状態qに対応する音韻番号jはqの関数としてj
=f(q)のごとく与えられる。また、状態pから状態q
への遷移の可能性はδpq=1で示される。HMM演算手
段5におけるHMM演算は、式(9)および式(10)
の漸化式を式(11)の初期条件の下で計算する。ここ
で、nは音韻構文グラフの状態数、α(q,t)は、時刻t
において、状態qに留まる確率(前向き確率)を表し、β
(q,t)は時刻tに状態qに至る一つ前の最適な状態番号
を表すバックポインタである。
The HMM calculation means 5 performs HMM calculation based on the Viterbi algorithm, referring to the boundary likelihood R3 output by the boundary likelihood calculation means 3, the HMM parameters R6, and the syntax control information R8 stored in the syntax control means 8. The syntax control information R8, obtained by converting the phoneme syntax graph serving as the syntax control graph, consists of the phoneme number associated with each state and information representing the connections between states. There are n states in total, and the phoneme number j corresponding to state q is given as a function of q, j = f(q). The possibility of a transition from state p to state q is indicated by δpq = 1. The HMM calculation in the HMM calculation means 5 computes the recurrence formulas of equations (9) and (10) under the initial condition of equation (11). Here, n is the number of states of the phoneme syntax graph, α(q, t) is the probability of being in state q at time t (the forward probability), and β(q, t) is a back pointer giving the optimum preceding state number on the path reaching state q at time t.

【0035】[0035]

【数6】 (Equation 6)

【0036】上記漸化式で示されたように、HMM演算
手段5は、時刻tで音韻モデル間の遷移に対応する状態
iから状態jへの状態間遷移に際して、音韻の境界尤度
Cij(Bt)を参照して、音韻境界ijに依存した閾値θi
jと比較し、音韻の境界尤度が本閾値θijより大きく
(Cij(Bt)>θijであり)、かつ、音韻構文グラフ
中の遷移として許される(δpq=1である)時だけ、状
態間の遷移を許すようにしたため、状態間の遷移が入力
音声中に推定される音韻境界でだけ状態遷移が起こるよ
うになり、非音韻境界での状態遷移が減少するため認識
結果中の音韻の挿入誤りを減少すると共に、音韻構文グ
ラフ中の遷移として許されない音韻の列が状態系列とし
て選択されることが防止され、言語的に音声データとし
てあり得ない音韻列が認識されることが防止される。な
お、同一音韻内の状態の遷移(i=jのとき)について
は、境界尤度Cij(Bt)および音韻構文グラフによる
選択の制限は設けていない。
As shown by the above recurrence formulas, when making an inter-state transition from state i to state j corresponding to a transition between phoneme models at time t, the HMM calculation means 5 refers to the phoneme boundary likelihood Cij(Bt), compares it with the threshold θij that depends on the phoneme boundary ij, and permits the transition only when the boundary likelihood exceeds this threshold (Cij(Bt) > θij) and the transition is permitted in the phoneme syntax graph (δpq = 1). As a result, inter-state transitions occur only at phoneme boundaries estimated in the input speech, and transitions at non-phoneme boundaries decrease, so that phoneme insertion errors in the recognition result are reduced. Furthermore, phoneme sequences not permitted as transitions in the phoneme syntax graph are prevented from being selected as state sequences, so that phoneme sequences that are linguistically impossible as speech data are prevented from being recognized. Note that for transitions between states within the same phoneme (when i = j), no restriction by the boundary likelihood Cij(Bt) or the phoneme syntax graph is imposed.

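One time step of the gated Viterbi recurrence described above can be sketched in the log domain as follows. This is an illustrative sketch in the spirit of equations (9)-(10), not the patent's equations themselves: the boundary-likelihood test Cij(Bt) > θij is abstracted into a boolean `gate` matrix (always True on the diagonal, since same-phoneme transitions are unrestricted), combined with the syntax-graph permission δpq.

```python
import numpy as np

NEG_INF = -np.inf

def viterbi_step(log_alpha_prev, log_a, delta, log_b_t, gate):
    """One Viterbi time step with boundary-likelihood gating.

    log_alpha_prev: (n,) forward log-probabilities alpha(., t-1)
    log_a:          (n, n) log transition probabilities a_pq
    delta:          (n, n) 0/1 matrix, 1 where the syntax graph allows p -> q
    log_b_t:        (n,) log output probabilities b_f(q)(x_t)
    gate:           (n, n) boolean, True where C_ij(B_t) > theta_ij holds
    Returns the new log alpha(., t) and the back pointers beta(., t).
    """
    n = log_alpha_prev.shape[0]
    allowed = (delta == 1) & gate
    scores = log_alpha_prev[:, None] + np.where(allowed, log_a, NEG_INF)
    back = np.argmax(scores, axis=0)          # best predecessor p per state q
    log_alpha = scores[back, np.arange(n)] + log_b_t
    return log_alpha, back
```

Blocking a transition in `gate` drives its score to negative infinity, which is how the gating suppresses state transitions at non-phoneme boundaries in this sketch.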
【0037】音韻系列変換手段としての最適状態系列検
出手段7は、HMM演算結果R5として得られる前向確
率α(q,t)及びバックポインタβ(q,t)の値から、最適状
態系列R7(以後、β^(1),β^(2),…,β^(T)と記す)
を出力する。最適状態系列R7は漸化式である式(1
2)を初期条件を示す式(13)の設定の下で計算する
ことで得る。なお、最適状態系列R7は認識結果の音韻
列を音韻構文グラフ中の音韻の状態の番号の系列で表し
たものであり、最適状態系列R7から音韻列への変換は
簡単な1対1の関数関係により実現される。
The optimum state sequence detection means 7, serving as the phoneme sequence conversion means, outputs the optimum state sequence R7 (hereafter written β^(1), β^(2), …, β^(T)) from the values of the forward probability α(q, t) and the back pointer β(q, t) obtained as the HMM calculation result R5. The optimum state sequence R7 is obtained by computing the recurrence formula of equation (12) under the initial condition given by equation (13). The optimum state sequence R7 represents the phoneme sequence of the recognition result as a sequence of state numbers of phonemes in the phoneme syntax graph, and the conversion from the optimum state sequence R7 to the phoneme sequence is realized by a simple one-to-one functional relation.

【0038】[0038]

【数7】 (Equation 7)

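The backtracking step described above can be sketched as follows, in the spirit of equations (12)-(13): start from the best final state (the initial condition) and follow the back pointers backwards in time. The list layout of the back pointers is an assumption made for this sketch.

```python
def backtrack(final_scores, back):
    """Recover the optimum state sequence from Viterbi back pointers.

    final_scores: per-state forward scores alpha(q, T) at the final time
    back:         back[t-1][q] = best predecessor of state q at time t,
                  for t = 1 .. T-1
    Returns the optimum state sequence of length T (earliest time first).
    """
    # Initial condition: best state at the final time
    q = max(range(len(final_scores)), key=lambda s: final_scores[s])
    path = [q]
    # Recurrence: follow the back pointers down to t = 1
    for bp in reversed(back):
        q = bp[q]
        path.append(q)
    path.reverse()
    return path
```

Mapping each state number through f(q) (hypothetically, a lookup list indexed by state) would then yield the recognized phoneme sequence.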
【0039】以上で、音声認識装置の構成の説明を終わ
り、以下、本実施例の音声認識装置で用いられている音
韻構文グラフの作成方法について説明する。図4は、本
実施例における音韻構文グラフの生成過程の説明図であ
る。音韻構文グラフの生成過程は、全体として図のよう
に過程I〜IIIからなる。
This concludes the description of the configuration of the speech recognition apparatus; the method of creating the phoneme syntax graph used in the speech recognition apparatus of this embodiment is described below. FIG. 4 is an explanatory diagram of the phoneme syntax graph generation process in this embodiment. As shown in the figure, the generation process consists of steps I to III as a whole.

【0040】過程Iでは、図中音節連鎖抽出において、
大量のテキストデータベースから音節の列を抽出し、抽
出されたすべての音節の列を受け入れるような音節を枝
とする有向グラフ(音節構文グラフ)を作成する。この音
節構文グラフは、言語制約を強く表現し、しかも、でき
るだけタスクに依存せず任意の文を受理するようにする
ため、例えば音節のトライグラム(三つ組)の列を受理
するように構成する。図5は音節のテキストデータか
ら、3音節連鎖を受理するような音節構文グラフを生成
する過程を例示したものである。テキストデータが「ε
#サイタサイタ#サクラガサイタ#ε」とあるとき、こ
のテキストデータから、前後1つの音節環境に依存する
三つ組として、(ε)#(サ),(#)サ(イ),(サ)イ(タ),
(イ)タ(サ),(タ)サ(イ),(サ)イ(タ),(イ)タ(#),(タ)#
(サ),(#)サ(ク),(サ)ク(ラ),(ク)ラ(ガ),(ラ)ガ(サ),
(ガ)サ(イ),(サ)イ(タ),(イ)タ(#),(タ)#(ε)が抽出
でき、これらの中で共通な三つ組を除くことで、図の中
央に示すような音節の三つ組の集合が得られる。これら
の三つ組の集合を、音声環境(音節の前後の文脈)の一
致を条件として、接続することにより、音節の有向グラ
フ(音節構文グラフ)として、図の下のようなグラフが
生成される。ここで、「ε」及び「#」はそれぞれ空白
文字、及び、文または単語の境界を示す。また、音節の
三つ組の表記で左右の()内の音節は中央の音節の環境
(音節の文脈)を示す。
In step I, in the syllable chain extraction shown in the figure, syllable sequences are extracted from a large text database, and a directed graph whose branches are syllables (a syllable syntax graph) is created so as to accept all the extracted syllable sequences. This syllable syntax graph is constructed so as to accept, for example, sequences of syllable trigrams (triples), in order to express linguistic constraints strongly while accepting arbitrary sentences as task-independently as possible. FIG. 5 illustrates the process of generating, from syllable text data, a syllable syntax graph that accepts three-syllable chains. Given the text data "ε#saitasaita#sakuragasaita#ε", triples depending on one syllable of context before and after can be extracted from it: (ε)#(sa), (#)sa(i), (sa)i(ta), (i)ta(sa), (ta)sa(i), (sa)i(ta), (i)ta(#), (ta)#(sa), (#)sa(ku), (sa)ku(ra), (ku)ra(ga), (ra)ga(sa), (ga)sa(i), (sa)i(ta), (i)ta(#), (ta)#(ε). Removing the duplicate triples among these yields the set of syllable triples shown in the middle of the figure. Connecting these triples under the condition that their syllable environments (the contexts before and after a syllable) match produces the directed graph of syllables (syllable syntax graph) shown at the bottom of the figure. Here, "ε" and "#" denote the empty symbol and a sentence or word boundary, respectively. In the triple notation, the syllables in parentheses on the left and right indicate the environment (syllable context) of the central syllable.

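The triple extraction of step I can be sketched as follows, using romanized syllables for the example text of FIG. 5. This is an illustrative sketch; keeping the first-occurrence order of the deduplicated triples is an assumption, since the patent only requires that common triples be removed.

```python
def syllable_triples(syllables):
    """Extract (left-context) syllable (right-context) triples from a
    syllable sequence and remove duplicates, as in step I / FIG. 5."""
    seen, triples = set(), []
    for i in range(1, len(syllables) - 1):
        t = (syllables[i - 1], syllables[i], syllables[i + 1])
        if t not in seen:
            seen.add(t)
            triples.append(t)
    return triples
```

Connecting these triples wherever their left and right contexts match then yields the directed syllable syntax graph.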
【0041】過程IIでは、まず、音韻単位にラベル付け
された既知の大量の音声データから、音節に対応する区
間の音韻ラベルの列(音韻列)を抽出し、音節と音韻列
の対応関係を求める。次に、この対応関係から音節毎に
音節の内部がどのような音韻列で実現されるかを網羅し
た表現として音韻を枝とする有向グラフ(音節内音韻グ
ラフ)を作成する。ここで、音節の文脈毎に音節と音韻
系列の対応関係を求めることで、音節の環境(音節文
脈)に依存した音節内音韻グラフが得られる。図6は例
えば「#ウツクシク#ツツム#」という文章発声に対す
る音声データベース中の記述から音節内音韻グラフを抽
出する様子を示したものである。まず、図上段の「音声
データベース」の枠内の「音節列」で示される各音節の
区間と、その下の「音韻列」の部分音韻列との対応を求
め、音節毎に対応する部分音韻列の集合を求める(図中
段)。つぎに、これら音節内音韻列集合中の共通部分を
共通の枝とするなどして、各音節を音韻を枝とする有向
グラフ(音節内音韻グラフ)に変換する(図下段)。音
節文脈を考慮しない場合、音節「ツ」の音節内音韻グラ
フは、図下段中央に示すような4状態5枝の有向グラフ
として抽出される。また、前後1音節の音節文脈を考慮
した場合、例えば、音節「(ウ)ツ(ク)」は図最下段左に
示すような3状態2枝の音節内音韻グラフとして抽出さ
れる。
In step II, first, from a large amount of known speech data labeled in phoneme units, the sequences of phoneme labels (phoneme sequences) in the intervals corresponding to syllables are extracted, and the correspondence between syllables and phoneme sequences is obtained. Next, from this correspondence, for each syllable, a directed graph whose branches are phonemes (an intra-syllable phoneme graph) is created as an expression covering all the phoneme sequences by which the inside of that syllable is realized. Here, by obtaining the correspondence between syllables and phoneme sequences for each syllable context, intra-syllable phoneme graphs that depend on the syllable environment (syllable context) are obtained. FIG. 6 shows how intra-syllable phoneme graphs are extracted from the description in the speech database for, e.g., the sentence utterance "#utsukushiku#tsutsumu#". First, the correspondence between each syllable interval indicated by the "syllable sequence" in the "speech database" frame at the top of the figure and the partial phoneme sequences of the "phoneme sequence" below it is found, and the set of partial phoneme sequences corresponding to each syllable is obtained (middle of the figure). Next, each syllable is converted into a directed graph with phonemes as branches (an intra-syllable phoneme graph), for example by merging the common parts of these intra-syllable phoneme sequence sets into common branches (bottom of the figure). When the syllable context is not considered, the intra-syllable phoneme graph of the syllable "tsu" is extracted as a directed graph with four states and five branches, as shown in the bottom center of the figure. When a context of one syllable before and after is considered, the syllable "(u)tsu(ku)", for example, is extracted as an intra-syllable phoneme graph with three states and two branches, as shown at the bottom left of the figure.

【0042】過程IIIでは、過程Iで得られた音節構文グ
ラフ中のすべての音節の枝に対して、過程IIで得られた
音節内音韻グラフを代入することで、音韻構文グラフを
得る。図7は前後1音節の音節文脈を考慮した音節構文
グラフの一部分の枝について、音節内音韻グラフを代入
する様子を示したものである。この例では、状態s1と
s2を結ぶ枝に新たに状態s12が挿入された音韻構文グ
ラフが生成される。この音韻構文グラフは音韻列とし
て、(ウ)ツ(ク)という音節文脈中の音節「ツ」に対
応して音韻列として、[cl,ts]だけが許される。一
方、前後の音節文脈を考慮しない場合、音節構文グラフ
中の音節「ツ」に対応する枝に、図6の下段中央に示し
た音節「ツ」の音節内音韻グラフが代入され、生成され
る音韻構文グラフは、音節「ツ」に対して、[ts]、
[cl,ts]、[cl,ts,u]、等の音韻列が許される
ことになる。このように、音韻構文グラフの生成に当た
り、音節文脈を考慮する方が、同じ音節に対して、認識
すべき音韻列の種類が少なくなるため、より認識性能を
向上するという効果が期待できる。(この効果は後で述
べる実験で示される。)
In step III, a phoneme syntax graph is obtained by substituting the intra-syllable phoneme graphs obtained in step II for all the syllable branches of the syllable syntax graph obtained in step I. FIG. 7 shows the substitution of an intra-syllable phoneme graph for one branch of a syllable syntax graph that considers a context of one syllable before and after. In this example, a phoneme syntax graph is generated in which a new state s12 is inserted on the branch connecting states s1 and s2. This phoneme syntax graph permits only [cl, ts] as the phoneme sequence corresponding to the syllable "tsu" in the syllable context (u)tsu(ku). On the other hand, when the surrounding syllable context is not considered, the intra-syllable phoneme graph of the syllable "tsu" shown in the bottom center of FIG. 6 is substituted for the branch corresponding to the syllable "tsu" in the syllable syntax graph, and the resulting phoneme syntax graph permits phoneme sequences such as [ts], [cl, ts], and [cl, ts, u] for the syllable "tsu". Thus, in generating the phoneme syntax graph, considering the syllable context reduces the number of phoneme sequence types that must be recognized for the same syllable, so an effect of further improved recognition performance can be expected. (This effect is shown in the experiments described later.)

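The branch substitution of step III can be sketched as follows, with fresh intermediate states playing the role of state s12 in FIG. 7. The edge-list representation of the graph is an assumption made for this sketch; a fuller implementation would also merge shared branches as in FIG. 6.

```python
def substitute(syllable_edges, realizations):
    """Replace each syllable branch (u, v, syllable) of the syllable syntax
    graph by the phoneme sequences realizing that syllable, introducing
    fresh intermediate states for multi-phoneme realizations.

    syllable_edges: list of (from_state, to_state, syllable)
    realizations:   dict mapping syllable -> set of phoneme tuples
    Returns a list of phoneme edges (from_state, phoneme, to_state).
    """
    next_state = 1 + max(s for u, v, _ in syllable_edges for s in (u, v))
    phoneme_edges = []
    for u, v, syl in syllable_edges:
        for phones in sorted(realizations[syl]):
            prev = u
            for k, ph in enumerate(phones):
                is_last = (k == len(phones) - 1)
                nxt = v if is_last else next_state
                if not is_last:
                    next_state += 1
                phoneme_edges.append((prev, ph, nxt))
                prev = nxt
    return phoneme_edges
```

For the context-dependent "tsu" branch of FIG. 7, only the single realization [cl, ts] is substituted, creating one intermediate state between s1 and s2.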
【0043】以上の過程I〜IIIで得られた音韻構文グラ
フを、前記構成の音声認識装置に基づく連続音声認識シ
ステムに適用することで、この装置に未知音声を入力し
た場合認識結果の音韻系列としての音韻記述中に現れる
音韻列は、テキストデータベース中の音節の生起順序に
従い、かつ、音声データベース中の音節内部の音韻列と
して観測されたものだけに限定される。この結果、言語
的に音節列としてあり得ず、かつその音声データとして
あり得ない音韻列、例えば、[k,ts,sh,ts,sh]
が認識されることを防止できる。なお、本実施例の音声
認識装置は、音韻HMMを音響モデルとした、One Pass
DP法(例えば、中川聖一著「確率モデルによる音声認
識」)を構文制御に用いた音声認識装置と構成上類似し
ている。しかし、本実施例の音声認識装置では、構文制
御のために、テキストデータベース中の音節の生起順序
に従って生成された音節構文グラフにつき、この音節構
文グラフ中の音節相当部分に、音声データベース中で観
測された音節内部の音韻列の変動を表現する音節内音韻
グラフを代入して生成された音韻構文グラフが用いられ
ている。また、音響モデルとしての音韻モデル間の状態
遷移において、その遷移時刻が入力音声から直接得られ
る音韻境界の推定値情報(境界尤度)に基づいて束縛さ
れている。従って、本実施例の音声認識装置は、音節列
の言語知識と音節内の音韻構造の変動の知識の作用で、
入力音声に対して仮定する音韻境界の種類が削減され、そ
の結果の音韻境界の推定精度が向上する。また、逆に、
入力音声の音韻境界は前後の音韻の種類に依存している
ため、音韻列の生成において音韻境界の前後の音韻の種
類が考慮された音韻列が認識されるという特長があり、
認識精度が向上するという効果を有する。
By applying the phoneme syntax graph obtained in steps I to III above to a continuous speech recognition system based on the speech recognition apparatus configured as described, the phoneme sequences appearing in the phoneme description of the recognition result when unknown speech is input are limited to those that follow the order of occurrence of syllables in the text database and that have been observed as intra-syllable phoneme sequences in the speech database. As a result, phoneme sequences that are impossible both linguistically as syllable sequences and acoustically as speech data, for example [k, ts, sh, ts, sh], are prevented from being recognized. The speech recognition apparatus of this embodiment is structurally similar to a speech recognition apparatus that uses the One Pass DP method (see, e.g., Seiichi Nakagawa, "Speech Recognition by Probabilistic Models") for syntax control with phoneme HMMs as acoustic models. In the apparatus of this embodiment, however, the syntax is controlled by a phoneme syntax graph generated by substituting, into the syllable-equivalent parts of a syllable syntax graph generated according to the order of occurrence of syllables in the text database, intra-syllable phoneme graphs expressing the variation of intra-syllable phoneme sequences observed in the speech database. In addition, in state transitions between the phoneme models serving as acoustic models, the transition time is constrained by estimated phoneme boundary information (boundary likelihood) obtained directly from the input speech. Consequently, through the combined effect of the linguistic knowledge of syllable sequences and the knowledge of the variation of phoneme structure within syllables, the speech recognition apparatus of this embodiment reduces the number of phoneme boundary types hypothesized for the input speech, and as a result the estimation accuracy of phoneme boundaries improves. Conversely, since the phoneme boundaries of the input speech depend on the types of the phonemes before and after them, the apparatus has the feature that the recognized phoneme sequences take into account the types of phonemes on both sides of each boundary, which has the effect of improving recognition accuracy.

【0044】次に本実施例の音声認識装置について行っ
た評価実験について述べる。ここでは、音韻および音韻
境界に対してセミ連続分布モデルを用いたHMM/BT
を用いて不特定話者の音韻記述実験を行った。共通の実
験条件を図8に示す。言語データは、一般のテキストデ
ータを用いることもできるが、ここでは、学習用音声デ
ータである音韻バランス文からなる4024文の発声テ
キスト(音声記述)を用いた。音節構文グラフの生成に
おける音節文脈としては文脈を考慮しない場合、音節の
バイグラム(2つ組)を用いる場合、および、音節のト
ライグラム(三つ組)を用いる場合について実験を行っ
た。音節内音韻グラフの抽出およびそれの音節構文グラ
フへの代入時の音節文脈に対する依存性を変えた複数の
言語制約付き音韻グラフを作成し、それぞれについて音
韻ベースの構文制御付きHMM/BT連続音声認識シス
テムにより音韻認識性能を求めた。また、HMM/BT
の音韻境界束縛をしない従来のHMM(1音韻1状態)
を用いる場合についても実験を行った。
Next, an evaluation experiment performed on the speech recognition apparatus of this embodiment is described. Here, a phoneme description experiment for unspecified speakers was conducted using an HMM/BT with semi-continuous distribution models for phonemes and phoneme boundaries. The common experimental conditions are shown in FIG. 8. Although general text data could be used as the language data, here the utterance texts (speech transcriptions) of the 4024 phonetically balanced sentences constituting the training speech data were used. As the syllable context in the generation of the syllable syntax graph, experiments were performed for the cases of no context, syllable bigrams (pairs), and syllable trigrams (triples). Several linguistically constrained phoneme graphs were created by varying the dependence on syllable context in the extraction of the intra-syllable phoneme graphs and in their substitution into the syllable syntax graph, and for each of them the phoneme recognition performance was obtained with a phoneme-based HMM/BT continuous speech recognition system with syntax control. Experiments were also performed with a conventional HMM (one phoneme, one state) without the phoneme boundary constraint of HMM/BT.

【0045】図9は実験結果を示す。図では音韻境界の
束縛を行うHMM(HMM/BT)と、音韻境界の束縛
を行わないHMM(HMM without BT)につい
て、各種の実験条件における音韻誤り率が示されてい
る。(HMM/BTにおいて、境界尤度の閾値(θij)
を音韻境界の種類に無関係に一定値(θ)にした場合に
ついて示す)。音韻誤り率は合計の誤り率と共に、内訳
として置換、脱落、挿入の各誤り率が示されている。音
韻誤り率は入力の音韻数に対して、それぞれの誤り形態
の音韻認識の誤りが発生した割合として算出されてい
る。また、音節内音韻グラフの抽出の際考慮した音節文
脈依存性は先行音節数及び後続音節数の欄に示されてい
る。さらに、音節構文グラフをテキストデータから抽出
する際の音節構文グラフの音節を囲む音節文脈として
は、図の左の第1欄に示されるように音節のバイグラム
(bigram)および音節のトライグラム(trigram)の場
合について実験結果が示されている。またさらに、参考
のために、音節構文グラフ及び音韻構文グラフのテスト
セットパープレキシティが示されている。なお、一般
に、テストセットパープレキシティが大きいほど構文に
よる限定が小さい(構文の自由度が大きい)ことを意味
する。実験結果から、音節構文グラフの生成において、
音節構文グラフの音節を囲む音節文脈として音節バイグ
ラム及び音節トライグラムのいずれを用いても、音節文
脈に依存しない(即ち先行音節数及び後続音節数が共に
0の)文脈独立の音節内音韻グラフよりも、音節文脈に
依存したの音節内音韻グラフを用いた方が音韻認識の誤
りが少なくなっており、音節文脈に依存して音節内音韻
グラフを用いる方法の方が認識性能がよい。これは音節
内音韻グラフを音節文脈依存とすることで音節が音韻系
列として実現される変動の幅が狭まるため、認識対象と
して仮定される音韻列の数が実質的に削減され、認識性
能が向上したことによると考えられる。この考え方は、
実際、音節文脈依存の場合の音韻パープレキシティが音
節文脈に依存しない場合よりも小さく、従って構文自由
度が減少していることからも説明される。また、HMM
/BTとHMM(BTなし)との比較では、HMM/B
Tの方が圧倒的に認識誤りが少なく、音節構文グラフの
音節を囲む音節文脈として後続音節数を2とした音節ト
ライグラムを用いたとき、最小の音韻誤り率合計10.
0%(最下行)が得られている。これは、HMM/BT
に従来の音韻トライグラムによる構文グラフを用いる場
合の54.0%(上から2行目)に対して、大幅な認識
誤りの改善である。
FIG. 9 shows the experimental results: the phoneme error rates under various experimental conditions for the HMM with phoneme boundary constraint (HMM/BT) and the HMM without it (HMM without BT). (For HMM/BT, results are shown for the case where the boundary likelihood threshold θij is set to a constant value θ regardless of the type of phoneme boundary.) The phoneme error rate is shown as a total together with its breakdown into substitution, deletion, and insertion error rates, each computed as the ratio of phoneme recognition errors of that type to the number of input phonemes. The syllable context dependence considered in extracting the intra-syllable phoneme graphs is shown in the columns for the numbers of preceding and following syllables. Further, as the syllable context surrounding the syllables when extracting the syllable syntax graph from the text data, results are shown for syllable bigrams and syllable trigrams, as indicated in the first column on the left of the figure. In addition, the test-set perplexities of the syllable syntax graph and the phoneme syntax graph are shown for reference; in general, a larger test-set perplexity means a weaker syntactic restriction (greater syntactic freedom). The results show that, in generating the syllable syntax graph, whether syllable bigrams or syllable trigrams are used as the surrounding syllable context, using syllable-context-dependent intra-syllable phoneme graphs yields fewer phoneme recognition errors than using context-independent intra-syllable phoneme graphs (i.e., with zero preceding and zero following syllables); the method using intra-syllable phoneme graphs that depend on the syllable context thus performs better. This is presumably because making the intra-syllable phoneme graphs syllable-context dependent narrows the range of variation by which a syllable is realized as a phoneme sequence, substantially reducing the number of phoneme sequences hypothesized as recognition candidates and thereby improving recognition performance. This interpretation is also supported by the fact that the phoneme perplexity in the syllable-context-dependent case is smaller than in the context-independent case, i.e., the syntactic freedom is reduced. Comparing HMM/BT with the HMM without BT, HMM/BT has overwhelmingly fewer recognition errors, and the minimum total phoneme error rate of 10.0% (bottom row) is obtained when syllable trigrams with two following syllables are used as the syllable context surrounding the syllables of the syllable syntax graph. This is a large improvement in recognition error over the 54.0% (second row from the top) obtained when a conventional syntax graph based on phoneme trigrams is used with HMM/BT.

【0046】[0046]

【発明の効果】以上のように請求項1の発明によれば、
入力音声を分析し、前記入力音声を音韻モデルの連結と
見なして、前記入力音声に音韻モデルの系列を当ては
め、前記入力音声を最適な音韻列に変換する音声認識装
置において、前記入力音声から音韻の境界尤度を算出す
る境界尤度計算手段と、前記音韻の境界尤度が所定の値
より大きく、かつ、音韻列中の長さ3以上の音韻の生起
順序を規定する構文制御グラフに規定される音韻の生起
順序に従う時に限り、最適な音韻モデルを選択するモデ
ル演算手段とを備えたので、構文制御グラフに規定され
ていない音韻列が認識されることを防止すると共に、入
力音声に対して仮定する音韻境界の種類が構文制御グラ
フに規定された音韻列中の音韻境界に限定される結果、
音韻境界の推定精度が向上し、認識精度が向上した音声
認識装置を提供するという効果がある。
As described above, according to the invention of claim 1, in a speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech, and converts the input speech into an optimum phoneme sequence, there are provided boundary likelihood calculation means for computing phoneme boundary likelihoods from the input speech, and model calculation means for selecting an optimum phoneme model only when the phoneme boundary likelihood is greater than a predetermined value and the phoneme occurrence order follows that defined by a syntax control graph defining the order of occurrence of phoneme sequences of length three or more. This prevents phoneme sequences not defined in the syntax control graph from being recognized and, since the phoneme boundary types hypothesized for the input speech are limited to the phoneme boundaries in the phoneme sequences defined by the syntax control graph, the estimation accuracy of phoneme boundaries improves, with the effect of providing a speech recognition apparatus with improved recognition accuracy.

【0047】請求項2の発明によれば、入力音声を分析
し、前記入力音声を音韻モデルの連結と見なして、前記
入力音声に音韻モデルの系列を当てはめ、前記入力音声
を最適な音韻列に変換する音声認識装置において、前記
入力音声から音韻境界の種類に応じて、音韻の境界尤度
を算出する境界尤度計算手段と、前記音韻の境界尤度が
その音韻境界の種類に応じて設定された値より大きく、
かつ、音韻列中の長さ3以上の音韻の生起順序を規定す
る構文制御グラフに規定される音韻の生起順序に従う時
に限り、最適な音韻モデルを選択するモデル演算手段と
を備えたので、構文制御グラフに規定されていない音韻
列が認識されることを防止すると共に、入力音声に対し
て仮定する音韻境界の種類が構文制御グラフに規定され
た音韻列中の音韻境界に限定され、さらに、音韻境界の
種類に応じて適切な音韻境界推定が行える結果、音韻境
界の推定精度が向上し、認識精度が向上した音声認識装
置を提供するという効果がある。
According to the invention of claim 2, in a speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech, and converts the input speech into an optimum phoneme sequence, there are provided boundary likelihood calculation means for computing phoneme boundary likelihoods from the input speech according to the type of phoneme boundary, and model calculation means for selecting an optimum phoneme model only when the phoneme boundary likelihood is greater than a value set according to the type of that phoneme boundary and the phoneme occurrence order follows that defined by a syntax control graph defining the order of occurrence of phoneme sequences of length three or more. This prevents phoneme sequences not defined in the syntax control graph from being recognized, limits the phoneme boundary types hypothesized for the input speech to the phoneme boundaries in the phoneme sequences defined by the syntax control graph, and further allows appropriate phoneme boundary estimation according to the type of boundary, so that the estimation accuracy of phoneme boundaries improves, with the effect of providing a speech recognition apparatus with improved recognition accuracy.

[0048] According to the third aspect of the invention, the syntax control graph is generated by a method comprising: obtaining, from a text database, a syllable syntax graph that defines the occurrence order of syllables; obtaining, from a speech database, an intra-syllable phoneme graph that defines the occurrence order of phonemes within a syllable; and substituting the intra-syllable phoneme graph into each branch of the syllable syntax graph. This yields a syntax control graph that accounts for variation of the phoneme structure within syllables as well as the occurrence order of syllables, and thus provides a speech recognition apparatus capable of modeling both the spectral variation of phonemes and the variation of the phoneme structure within syllables.
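As a hypothetical sketch (not part of the patent text), the substitution step above can be illustrated as follows. Each syllable-labeled arc of the syllable syntax graph is expanded into the phoneme-string variants observed for that syllable; the graph encoding and names are invented for illustration.

```python
def expand_syllable_graph(syllable_graph, intra_syllable_graphs):
    """Substitute each syllable arc with its intra-syllable phoneme variants.

    syllable_graph: list of (src, syllable, dst) arcs between syllable nodes.
    intra_syllable_graphs: dict mapping a syllable to the list of phoneme
        strings observed for it in the speech database (pronunciation variants).
    Returns phoneme-level arcs; intermediate nodes are generated fresh.
    """
    phoneme_arcs = []
    next_node = max(max(s, d) for s, _, d in syllable_graph) + 1
    for src, syllable, dst in syllable_graph:
        for variant in intra_syllable_graphs[syllable]:
            node = src
            for i, phoneme in enumerate(variant):
                if i == len(variant) - 1:
                    phoneme_arcs.append((node, phoneme, dst))  # close at dst
                else:
                    phoneme_arcs.append((node, phoneme, next_node))
                    node = next_node
                    next_node += 1
    return phoneme_arcs

# Toy example: the syllable "ka" realized as /k a/ or, with a devoiced
# vowel, as /k/ alone, as might be observed in a speech database.
arcs = expand_syllable_graph([(0, "ka", 1)], {"ka": [["k", "a"], ["k"]]})
print(arcs)  # [(0, 'k', 2), (2, 'a', 1), (0, 'k', 1)]
```

Because every variant path rejoins the original destination node, the expanded graph preserves the syllable-level ordering while admitting the intra-syllable variation seen in the data.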

[0049] According to the fourth aspect of the invention, the syntax control graph is generated by a method comprising: extracting syllables from a text database, together with the syllable contexts surrounding them, as context-dependent syllables, and obtaining a syllable syntax graph that defines the occurrence order of these context-dependent syllables; obtaining, from a speech database, an intra-syllable phoneme graph that defines the occurrence order of phonemes within the syllable for each syllable context of the context-dependent syllables; and substituting, into the portion of the syllable syntax graph corresponding to each context-dependent syllable, the intra-syllable phoneme graph obtained from the syllable context matching that of the context-dependent syllable. This yields a syntax control graph that accounts for the occurrence order of syllables together with the context-dependent variation of the phoneme structure within syllables, and thus provides a speech recognition apparatus capable of modeling both the spectral variation of phonemes and the context-dependent variation of the phoneme structure within syllables.
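As a hypothetical illustration (not part of the patent text), the context-dependent substitution differs from the context-independent case only in how the intra-syllable phoneme graph is selected: the graphs are keyed by syllable context, and the graph whose context matches the occurrence being expanded is chosen. The keying scheme and the fallback to a context-independent graph are assumptions for this sketch.

```python
def pick_intra_syllable_graph(context_graphs, left, syllable, right):
    """Select the intra-syllable phoneme graph whose syllable context
    matches the context of the syllable occurrence being expanded.

    context_graphs: dict keyed by (left_context, syllable, right_context);
    a key of (None, syllable, None) denotes a context-independent graph.
    """
    key = (left, syllable, right)
    if key in context_graphs:
        return context_graphs[key]
    # Assumed fallback: use a context-independent graph for unseen contexts.
    return context_graphs.get((None, syllable, None))

graphs = {
    ("k", "a", "s"): [["a"]],                # "a" between /k/ and /s/
    (None, "a", None): [["a"], ["q", "a"]],  # context-independent fallback
}
print(pick_intra_syllable_graph(graphs, "k", "a", "s"))  # [['a']]
print(pick_intra_syllable_graph(graphs, "t", "a", "m"))  # [['a'], ['q', 'a']]
```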

[0050] According to the fifth aspect of the invention, the model operation means uses a syntax control graph generated by the generation method of the third or fourth aspect. Since only phoneme model sequences that follow both the occurrence order of syllables and the occurrence order of phonemes within each syllable are candidates for recognition, the kinds of phoneme boundaries hypothesized for the input speech are limited. As a result, the estimation accuracy of phoneme boundaries is improved, and a speech recognition apparatus with improved recognition accuracy is provided.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram of a speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of the phoneme system of an embodiment of the present invention.

FIG. 3 is a diagram showing the structure of an HMM in an embodiment of the present invention.

FIG. 4 is a diagram showing the overall generation process of a phoneme syntax graph in an embodiment of the present invention.

FIG. 5 is a diagram illustrating generation process I of the phoneme syntax graph in an embodiment of the present invention.

FIG. 6 is a diagram illustrating generation process II of the phoneme syntax graph in an embodiment of the present invention.

FIG. 7 is a diagram illustrating generation process III of the phoneme syntax graph in an embodiment of the present invention.

FIG. 8 is a diagram showing the conditions used in evaluating an embodiment of the present invention.

FIG. 9 is a diagram showing the evaluation results of an embodiment of the present invention.

FIG. 10 is a configuration diagram of a conventional speech recognition apparatus.

FIG. 11 is a diagram showing the structure of an HMM in a conventional speech recognition apparatus.

[Explanation of Symbols]

1 speech section detection means
2 feature extraction means
3 boundary likelihood calculation means
4 phoneme boundary parameter storage means
5 HMM calculation means
6 HMM parameter storage means
7 optimum state sequence detection means
8 syntax control information storage means

Claims (5)

[Claims]

1. A speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech, and converts the input speech into an optimum phoneme sequence, the apparatus comprising: boundary likelihood calculating means for calculating a phoneme boundary likelihood from the input speech; and model operation means for selecting an optimum phoneme model only when the boundary likelihood of a phoneme is greater than a predetermined value and the phoneme occurrence order defined by a syntax control graph, which defines the occurrence order of phonemes in phoneme sequences of length 3 or more, is followed.
2. A speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech, and converts the input speech into an optimum phoneme sequence, the apparatus comprising: boundary likelihood calculating means for calculating a phoneme boundary likelihood from the input speech according to the type of phoneme boundary; and model operation means for selecting an optimum phoneme model only when the boundary likelihood of a phoneme is greater than a value set according to the type of that phoneme boundary and the phoneme occurrence order defined by a syntax control graph, which defines the occurrence order of phonemes in phoneme sequences of length 3 or more, is followed.
3. A method of generating the syntax control graph for a speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech according to the syntax control graph, and converts the input speech into an optimum phoneme sequence, the method comprising: obtaining, from a text database, a syllable syntax graph that defines the occurrence order of syllables; obtaining, from a speech database, an intra-syllable phoneme graph that defines the occurrence order of phonemes within a syllable; and substituting the intra-syllable phoneme graph into the syllable-equivalent portions of the syllable syntax graph.
4. A method of generating the syntax control graph for a speech recognition apparatus that analyzes input speech, regards the input speech as a concatenation of phoneme models, fits a sequence of phoneme models to the input speech according to the syntax control graph, and converts the input speech into an optimum phoneme sequence, the method comprising: extracting syllables from a text database, together with the syllable contexts surrounding them, as context-dependent syllables, and obtaining a syllable syntax graph that defines the occurrence order of these context-dependent syllables; obtaining, from a speech database, an intra-syllable phoneme graph that defines the occurrence order of phonemes within the syllable for each syllable context of the context-dependent syllables; and substituting, into the portion of the syllable syntax graph corresponding to each context-dependent syllable, the intra-syllable phoneme graph obtained from the syllable context matching that of the context-dependent syllable.
5. The speech recognition apparatus according to claim 1 or 2, wherein the model operation means uses a syntax control graph generated by the method of generating a syntax control graph according to claim 3 or 4.
JP26527894A 1994-10-28 1994-10-28 Method for generating syntax control graph of speech recognition device Expired - Lifetime JP3668992B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP26527894A JP3668992B2 (en) 1994-10-28 1994-10-28 Method for generating syntax control graph of speech recognition device


Publications (2)

Publication Number Publication Date
JPH08123472A (en) 1996-05-17
JP3668992B2 JP3668992B2 (en) 2005-07-06

Family

ID=17415007

Family Applications (1)

Application Number Title Priority Date Filing Date
JP26527894A Expired - Lifetime JP3668992B2 (en) 1994-10-28 1994-10-28 Method for generating syntax control graph of speech recognition device

Country Status (1)

Country Link
JP (1) JP3668992B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011039468A (en) * 2009-08-14 2011-02-24 Korea Electronics Telecommun Word searching device using speech recognition in electronic dictionary, and method of the same


Also Published As

Publication number Publication date
JP3668992B2 (en) 2005-07-06

Similar Documents

Publication Publication Date Title
US10360898B2 (en) Method and system for predicting speech recognition performance using accuracy scores
KR101056080B1 (en) Phoneme-based speech recognition system and method
US7657430B2 (en) Speech processing apparatus, speech processing method, program, and recording medium
US20180137109A1 (en) Methodology for automatic multilingual speech recognition
EP0689192A1 (en) A speech synthesis system
EP2888669B1 (en) Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
US7181391B1 (en) Method, apparatus, and system for bottom-up tone integration to Chinese continuous speech recognition system
KR20060050361A (en) Hidden conditional random field models for phonetic classification and speech recognition
EP1460615B1 (en) Voice processing device and method, recording medium, and program
JPH06167993A (en) Boundary estimating method for speech recognition and speech recognizing device
EP2891147B1 (en) Method and system for predicting speech recognition performance using accuracy scores
KR101424193B1 (en) System And Method of Pronunciation Variation Modeling Based on Indirect data-driven method for Foreign Speech Recognition
Krishna et al. A new prosodic phrasing model for indian language telugu.
JP2004177551A (en) Unknown speech detecting device for voice recognition and voice recognition device
JP3668992B2 (en) Method for generating syntax control graph of speech recognition device
JP3299170B2 (en) Voice registration recognition device
JPH08314490A (en) Word spotting type method and device for recognizing voice
JPH09114482A (en) Speaker adaptation method for voice recognition
WO2008038994A1 (en) Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same
WO2004114279A1 (en) Language modeling method of speech recognition system
Keri et al. Pause prediction from lexical and syntax information
JP2005534968A (en) Deciding to read kanji
Razik et al. Local word confidence measure using word graph and n-best list.
JPH09160586A (en) Learning method for hidden markov model
JPS6180298A (en) Voice recognition equipment

Legal Events

Date Code Title Description
A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20040217

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20040414

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20040721

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20050322

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20050404

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080422

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090422

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100422

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110422

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120422

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130422

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140422

Year of fee payment: 9

EXPY Cancellation because of completion of term