JP2001265383A

JP2001265383A - Voice recognizing method and recording medium with recorded voice recognition processing program

Info

Publication number: JP2001265383A
Application number: JP2000077121A
Authority: JP
Inventors: Yasunaga Miyazawa; 康永宮沢
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2000-03-17
Filing date: 2000-03-17
Publication date: 2001-09-28

Abstract

PROBLEM TO BE SOLVED: To improve speech recognition performance on hardware built around a CPU with little memory and low processing capability. SOLUTION: Tentative phoneme utterance durations T11, T12, ... are obtained for the phonemes constituting an input speech, and the speech section of the input speech data is divided according to these durations. On the path from the first-stage phoneme model to the final-stage phoneme model of the HMM, a restriction is imposed on certain phoneme models in sections (those other than the sections shown by thick-line arrows) that run from one time to another, both determined from the tentative phoneme utterance durations. The path restriction is a control that inhibits the transition from the final state of one phoneme model to the initial state of the next phoneme model during such a section.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition method that improves speech recognition performance on inexpensive hardware built around a small memory and a CPU with low computing power, and to a recording medium on which a speech recognition processing program is recorded.

[0002]

[Prior Art] Speech recognition technology has come to be widely used in many fields. One difficulty in speech recognition is that recognition processing must cope with variation in speech patterns arising from many factors. Even for the same word, the speech pattern varies between speakers and, for a single speaker, with the manner of utterance on a given occasion; it varies with the surrounding phonetic environment (coarticulation); and it varies with the length of time taken to finish uttering the word.

[0003] With these problems in mind, the HMM (Hidden Markov Model) has long been well known as a phoneme model for obtaining high speech recognition performance.

[0004] As a simple technique for further improving the recognition rate of such an HMM, a method is known in which the utterance duration of each phoneme is treated as a duration distribution, and the likelihood obtained from the phoneme utterance duration is factored into the likelihood of that phoneme (the output likelihood of the HMM) during recognition processing.

[0005]

[Problems to Be Solved by the Invention] Although such a method does improve the recognition rate, it requires computation to obtain the likelihood from the phoneme utterance duration, and the hardware must be upgraded accordingly, for example with a CPU and memory capable of that computation. When speech recognition is used in products that must be small, light, and inexpensive, such as toys, the scale of the usable hardware is severely constrained, and at present a CPU and memory capable of the above computation cannot be installed.

[0006] On the other hand, speech recognition that takes phoneme utterance duration into account certainly contributes to a higher recognition rate, so it is desirable to make such recognition possible without requiring a large amount of computation.

[0007] Accordingly, an object of the present invention is to make it possible to improve speech recognition performance on inexpensive hardware built around a small memory and a CPU with low computing power.

[0008]

[Means for Solving the Problems] To achieve this object, the speech recognition method of the present invention forms a phoneme connection model of a recognizable word by combining the phoneme models of the phonemes constituting the word, supplies time-series speech data of an input speech to that phoneme connection model, obtains an output likelihood from the final state of the final-stage phoneme model after passing through a predetermined path in the state transition algorithm running from the first-stage phoneme model to the final-stage phoneme model, and performs speech recognition from the magnitude of that output likelihood. In this method, for each phoneme constituting the input speech, the utterance duration of the phoneme is obtained from the time-series speech data as a tentative phoneme utterance duration based on standard data for the individual phonemes; the speech section of the input speech data is divided according to the tentative phoneme utterance durations of those phonemes; and, on the path of the state transition algorithm from the first-stage phoneme model to the final-stage phoneme model, a path restriction is imposed on the phoneme model of a given phoneme over a section running from one time to another, both determined from the tentative phoneme utterance durations.

[0009] In this speech recognition method, the path restriction applied over the section from one time to another, determined from the tentative phoneme utterance durations, is a control that inhibits the transition from the final state of one phoneme model to the first state of the next phoneme model during that section.

[0010] The recording medium of the present invention stores a speech recognition processing program that forms a phoneme connection model of a recognizable word by combining the phoneme models of the phonemes constituting the word, supplies time-series speech data of an input speech to that phoneme connection model, obtains an output likelihood from the final state of the final-stage phoneme model after passing through a predetermined path in the state transition algorithm running from the first-stage phoneme model to the final-stage phoneme model, and performs speech recognition from the magnitude of that output likelihood. The program includes: a procedure for obtaining, from the time-series speech data of the input speech, the utterance duration of each phoneme constituting the input speech as a tentative phoneme utterance duration based on standard data for the individual phonemes; a procedure for dividing the speech section of the input speech data according to the tentative phoneme utterance durations of those phonemes; and a procedure for imposing, on the path of the state transition algorithm from the first-stage phoneme model to the final-stage phoneme model, a path restriction on the phoneme model of a given phoneme over a section running from one time to another, both determined from the tentative phoneme utterance durations.

[0011] In the speech recognition processing program stored on such a recording medium, the path restriction applied over the section from one time to another, determined from the tentative phoneme utterance durations, is a control that inhibits the transition from the final state of one phoneme model to the first state of the next phoneme model during that section.

[0012] Thus, the present invention determines the tentative phoneme utterance duration of each phoneme constituting the input speech from the standard data for each phoneme, divides the speech section of the input speech data by these tentative durations, and, on the path from the first-stage phoneme model to the final-stage phoneme model of the HMM, imposes a path restriction on a given phoneme model over a section running from one time to another, both determined from the tentative phoneme utterance durations. The path restriction based on the tentative phoneme utterance durations is a control that inhibits the transition from the final state of one phoneme model to the first state of the next phoneme model during that section.

[0013] With the path restricted in this way, the state transition algorithm applied to the time-series speech data of the input speech reaches the final state only through paths in the unrestricted range. An appropriate output likelihood is therefore obtained for the input speech, and misrecognition can be reduced.

[0014] For example, suppose that speech data of a word containing many phonemes similar to those of a certain word is input to an HMM phoneme connection model set up to yield a high output likelihood for that word. Because the speech must pass through paths in the unrestricted range, the final output likelihood can be kept lower than when no restriction is imposed on the path. Without the restriction, a high state probability may be obtained during some tentative phoneme utterance duration in some phoneme model; this influences the output likelihood at the final state of the final-stage phoneme model, raises it, and can cause misrecognition. Restricting the path as in the present invention removes this problem and contributes to a higher recognition rate.

[0015]

[Embodiments of the Invention] Embodiments of the present invention are described below. The description covers the speech recognition method of the present invention and also the specific processing procedure of the speech recognition processing program stored on the recording medium of the present invention.

[0016] To improve the recognition of words with similar speech patterns, the present invention performs speech recognition that takes the utterance duration of each phoneme into account. In this embodiment, speech recognition is performed with an HMM.

[0017] First, the phoneme utterance duration of each phoneme is determined. For each phoneme, an average utterance duration is obtained from standard data for that phoneme. For example, the average utterance duration is 100 msec for the phoneme "a", 120 msec for "i", 100 msec for "u", and 110 msec for "e"; an average is determined in this way for every phoneme in advance.

[0018] Next, for a number of words that the system can recognize, the ratio of the utterance durations of the phonemes constituting each word is determined. For example, if the recognizable words include "sato" (さとう), the ratio of the utterance durations of its phonemes "s", "a", "t", and "o" is determined. Suppose there is speech data for "sato" as shown in FIG. 1(a). If the speech section of this data is 370 msec, and speech feature analysis uses a frame length of 20 msec with a frame shift of 10 msec, the speech data consists of 36 frames.
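The frame count in this paragraph can be checked with a short calculation. The following is only an illustrative sketch: the function name `frame_count` and the convention that only complete frames are counted are assumptions, not something the specification states.

```python
def frame_count(duration_ms: int, frame_ms: int = 20, shift_ms: int = 10) -> int:
    """Number of complete analysis frames that fit into an utterance.

    Assumes the first frame starts at 0 ms and each later frame starts
    `shift_ms` after the previous one; partial frames are not counted.
    """
    if duration_ms < frame_ms:
        return 0
    return (duration_ms - frame_ms) // shift_ms + 1


# 370 msec of speech, 20 msec frames, 10 msec shift, as in the example above.
n_frames = frame_count(370)
print(n_frames)
```

Under these assumptions the 370 msec section yields the 36 frames mentioned in the text.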

[0019] If the utterance durations of the phonemes "s", "a", "t", and "o" obtained from the standard speech data give a duration ratio of 1:2:1:2, then, as shown in FIG. 1(b), the 36 frames can be divided into frames 1 to 6 (6 frames), frames 7 to 18 (12 frames), frames 19 to 24 (6 frames), and frames 25 to 36 (12 frames). The approximate utterance duration of each phoneme can be determined from the number of frames in each division.

[0020] This approximate utterance duration is here called the tentative phoneme utterance duration. The tentative phoneme utterance durations also give tentative boundaries between the phonemes (called tentative phoneme boundaries). In this case, as shown in FIG. 1(b), the tentative phoneme utterance duration is T1 for the phoneme "s", 2T1 for "a", T1 for "t", and 2T1 for "o", and the phoneme boundaries (tentative phoneme boundaries) are obtained as p1, p2, and p3.
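The division of the frames by the duration ratio, and the tentative phoneme boundaries that follow from it, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `split_frames`, the use of rounding for non-integer edges, and the 0-based half-open frame ranges are all choices made for the example (the text itself numbers frames from 1).

```python
def split_frames(n_frames, ratio):
    """Split n_frames into consecutive segments proportional to `ratio`.

    Returns (start, end) pairs with 0-based, half-open frame indices.
    Rounding of non-integer edges is an assumption of this sketch.
    """
    total = sum(ratio)
    edges = [round(n_frames * sum(ratio[:k]) / total) for k in range(len(ratio) + 1)]
    return list(zip(edges[:-1], edges[1:]))


# "sato": phonemes s, a, t, o with duration ratio 1:2:1:2 over 36 frames.
segments = split_frames(36, [1, 2, 1, 2])
# Interior segment ends play the role of the tentative boundaries p1, p2, p3.
boundaries = [end for _, end in segments[:-1]]
print(segments, boundaries)
```

The half-open ranges (0, 6), (6, 18), (18, 24), (24, 36) correspond to frames 1 to 6, 7 to 18, 19 to 24, and 25 to 36 in the 1-based numbering of the text.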

[0021] Next, during the trellis or Viterbi computation in the HMM, a restriction is imposed that inhibits the transition from the final state of one phoneme model to the first state of the next phoneme model over a period from one time to another, both determined from the tentative phoneme utterance durations.

[0022] Suppose that, for the word "sato", the phoneme models of its phonemes "s", "a", "t", and "o" are each represented by four states and three loops, as shown in FIGS. 2(a) to 2(d). Connecting the phoneme models of "s", "a", "t", and "o" gives the model shown in FIG. 2(e).

[0023] As FIG. 2(e) shows, the phoneme models of "s", "a", and "t" are joined after removing their final-stage states, that is, the states without a loop (as shown in FIGS. 2(a) to 2(d), state S14 in the "s" model, state S24 in the "a" model, and state S34 in the "t" model), while the phoneme model of "o" keeps its loopless state S44 as the final stage.

[0024] The phoneme connection model for the word "sato" can therefore be regarded, as shown in FIG. 2(e), as consisting of twelve states with loops, S11, S12, S13, S21, S22, S23, S31, S32, S33, S41, S42, and S43, followed by the final-stage state S44, which has no loop.
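The construction of the connected state sequence described in the last three paragraphs can be sketched as follows. The function name `connect_phoneme_models` and the label format are hypothetical and chosen only to mirror the state names S11 to S44 used in the text.

```python
def connect_phoneme_models(phonemes, states_per_model=4):
    """Return the state labels of the connected model.

    Each phoneme model has `states_per_model` states, of which the last
    has no loop. That loopless state is dropped from every model except
    the final one, as described for FIG. 2(e).
    """
    labels = []
    for k in range(1, len(phonemes) + 1):
        is_last = k == len(phonemes)
        n_keep = states_per_model if is_last else states_per_model - 1
        labels.extend(f"S{k}{j}" for j in range(1, n_keep + 1))
    return labels


states = connect_phoneme_models(["s", "a", "t", "o"])
print(states)
```

For "sato" this gives twelve looping states followed by the single loopless state S44, thirteen states in total.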

[0025] With the Viterbi or trellis algorithm, this phoneme connection model passes through a path from the first state to the final-stage phoneme model, and the output likelihood is obtained as the final state probability at time tn of the final-stage state S44. This final output likelihood is obtained by supplying the time-series speech data of an input speech to the phoneme connection model of FIG. 2(e), and its magnitude determines what the input speech is. Because the model of FIG. 2(e) is the phoneme connection model for "sato", a high output likelihood is obtained when the speech "sato" is input.

[0026] In the present invention, the path from the final state of one phoneme model to the first state of the next phoneme model is restricted over a period from one time to another, both determined from the tentative phoneme utterance durations. In the phoneme connection model of FIG. 2(e), for example, the restriction is applied over such a period to the transition from the final state S13 of the "s" model to the first state S21 of the "a" model, to the transition from the final state S23 of the "a" model to the first state S31 of the "t" model, and so on, with the aim of improving the recognition rate. This is explained below with a simple example.

[0027] The individual phoneme models have so far been described as having four states and three loops. To keep the drawings and the explanation simple, the following description uses a phoneme connection model formed by connecting phoneme models with three states and two loops.

[0028] FIG. 3 illustrates the state transition algorithm of the HMM phoneme connection model for a word composed of three phonemes. As shown in FIG. 4, this phoneme connection model consists of six states with loops, S1, S2, S3, S4, S5, and S6, and a loopless state S7 connected at the final stage. In FIG. 4, S1 and S2 correspond to the first-stage phoneme, S3 and S4 to the second-stage phoneme, and S5, S6, and S7 to the third-stage phoneme. Further, a11 is the transition probability of the loop at state S1, a12 is the transition probability from state S1 to state S2, a22 is the transition probability of the loop at state S2, a23 is the transition probability from state S2 to state S3, a33 is the transition probability of the loop at state S3, a34 is the transition probability from state S3 to state S4, a44 is the transition probability of the loop at state S4, a45 is the transition probability from state S4 to state S5, a55 is the transition probability of the loop at state S5, a56 is the transition probability from state S5 to state S6, a66 is the transition probability of the loop at state S6, and a67 is the transition probability from state S6 to state S7.
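The topology of FIG. 4 as described here, with a loop and a single forward transition per state and a loopless final state, can be written as a transition matrix. The sketch below is illustrative only: the function name `left_to_right_matrix` and the placeholder probability 0.5 are assumptions, since the specification gives no numerical values for a11 to a67.

```python
def left_to_right_matrix(n_loop_states=6, p_loop=0.5):
    """Transition matrix for n_loop_states looping states plus one
    loopless final state (S1..S6 and S7 in the text).

    Row i, column j holds the probability of moving from state i+1 to
    state j+1. The value of p_loop is a placeholder, not source data.
    """
    n = n_loop_states + 1
    a = [[0.0] * n for _ in range(n)]
    for i in range(n_loop_states):
        a[i][i] = p_loop            # loop probabilities a11, a22, ..., a66
        a[i][i + 1] = 1.0 - p_loop  # forward probabilities a12, a23, ..., a67
    return a


A = left_to_right_matrix()
for row in A:
    print(row)
```

Every entry other than the diagonal and the first superdiagonal is zero, which reflects that the model has no skip or backward transitions.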

[0029] Assume that tentative phoneme utterance durations have been determined from the standard data for these phonemes. For example, as shown in FIG. 3, times t0 to t5 are the tentative phoneme utterance duration T1 of the first-stage phoneme of the word, times t6 to t10 are the tentative phoneme utterance duration T2 of the second-stage phoneme, and times t11 to t17 are the tentative phoneme utterance duration T3 of the third-stage phoneme. In FIG. 3, p1 and p2 indicate the tentative phoneme boundaries.

[0030] Suppose that, in this state transition algorithm, a path restriction is applied so that the state probabilities in the hatched portions become 0. Specifically, during the tentative phoneme utterance duration T1 (times t0 to t5), the transition probability a45 from state S4 of the second-stage phoneme to state S5 of the third-stage phoneme is set to 0, and during the tentative phoneme utterance duration T3 (times t11 to t17), the transition probability a23 from state S2 of the first-stage phoneme to state S3 of the second-stage phoneme is set to 0. As a result, the state probability S5(t) of state S5 is 0 during T1 (times t0 to t5), and the state probability S3(t) of state S3 is 0 during T3 (times t11 to t17).
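One way to picture this restriction is a Viterbi pass in which a forward transition is simply skipped at the frames where it is inhibited. The following is a minimal sketch under several assumptions that are not in the specification: scores are kept in the log domain, emission and transition values are uniform toy numbers, a transition counts as happening "at frame t" when the new state is first occupied at frame t, and only the six looping states are modelled (the loopless final state is left out). The name `constrained_viterbi` and the `blocked` mapping are hypothetical.

```python
import math

NEG_INF = float("-inf")


def constrained_viterbi(log_emit, log_self, log_next, blocked=None):
    """Viterbi trellis for a left-to-right model with inhibited transitions.

    log_emit[t][s]: log output probability of frame t in state s.
    log_self[s], log_next[s]: log loop and log forward (s -> s+1) probabilities.
    blocked: {s: set of frames t} at which the move s -> s+1 is inhibited.
    """
    blocked = blocked or {}
    n_frames, n_states = len(log_emit), len(log_emit[0])
    trellis = [[NEG_INF] * n_states for _ in range(n_frames)]
    trellis[0][0] = log_emit[0][0]  # every path starts in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = trellis[t - 1][s] + log_self[s]
            move = NEG_INF
            if s > 0 and t not in blocked.get(s - 1, ()):
                move = trellis[t - 1][s - 1] + log_next[s - 1]
            trellis[t][s] = max(stay, move) + log_emit[t][s]
    return trellis


# Toy setting: 18 frames (t0..t17), states S1..S6 as indices 0..5.
half = math.log(0.5)
log_emit = [[0.0] * 6 for _ in range(18)]
log_self, log_next = [half] * 6, [half] * 6
# a45 inhibited during t0..t5, a23 inhibited during t11..t17.
blocked = {3: set(range(0, 6)), 1: set(range(11, 18))}
free = constrained_viterbi(log_emit, log_self, log_next)
limited = constrained_viterbi(log_emit, log_self, log_next, blocked)
print(free[4][4], limited[4][4], limited[6][4])
```

In this toy run, state S5 (index 4) can be occupied from frame 4 onward without the restriction, but only from frame 6 onward once a45 is inhibited during t0 to t5. Because the restriction only removes candidate paths, it involves a membership test rather than any additional likelihood computation.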

[0031] With this restriction in place, suppose that time-series input speech data spanning times t0 to t17 is supplied to the algorithm and the output likelihood is obtained from the state FS of the final-stage phoneme model (see FIG. 3) after passing through a path from the first-stage phoneme model to the final-stage phoneme model of the phoneme connection model shown in FIG. 4. Because the path is restricted, no path passes through the restricted region, and the state probabilities along restricted paths have no effect on the state probability of the final state FS (the output likelihood of this phoneme model).

[0032] Even without the restriction shown in FIG. 3, the paths are in effect already restricted to some extent, as shown in FIG. 5. That is, there are paths that by nature contribute nothing to the final output likelihood obtained from the final-stage state FS (the shaded portions in FIG. 5). The present invention adds the restriction on the hatched paths, as shown in FIG. 6, on top of these paths of FIG. 5 that never contribute to the final output likelihood.

[0033] The shaded portions in FIG. 6 are the paths that by nature contribute nothing to the final output likelihood (the same as those shown in FIG. 5); they can be regarded as inherently restricted paths. Adding the path restriction of the present invention described with reference to FIG. 3 yields the result in which both the hatched and the shaded portions of FIG. 6 are restricted.
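The inherently restricted region can be characterised for a left-to-right model without skip transitions: a trellis cell can lie on a complete path only if it is reachable from the first state and can still reach the final state in the remaining frames. The sketch below assumes at most one forward step per frame, a start in the first state at the first frame, and an end in the last state at the last frame; the function name `can_contribute` and the 0-based indexing are choices made for the example, and the figures themselves are not reproduced here.

```python
def can_contribute(t, s, n_frames, n_states):
    """True if trellis cell (frame t, state s) can lie on a complete path.

    Assumes a left-to-right model with no skips: the path starts in
    state 0 at frame 0 and must end in the last state at the last frame.
    """
    reachable_from_start = s <= t
    can_reach_end = (n_states - 1 - s) <= (n_frames - 1 - t)
    return reachable_from_start and can_reach_end


# 18 frames and 7 states, matching the sizes in the FIG. 3 example.
dead_cells = [(t, s) for t in range(18) for s in range(7)
              if not can_contribute(t, s, 18, 7)]
print(len(dead_cells))
```

Cells for which this test fails form two triangular corners of the trellis, which is one plausible reading of the shaded portions described above.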

[0034] Examples of paths obtained with the Viterbi algorithm are now described with reference to FIGS. 7 and 8, for the algorithm without the restriction of the present invention (FIG. 5) and for the algorithm with the path restriction of the present invention (FIG. 6). FIG. 7 corresponds to FIG. 5 and has exactly the same restrictions as FIG. 5. FIG. 8 corresponds to FIG. 6 and has exactly the same restrictions as FIG. 6.

[0035] The example of FIG. 5 imposes no deliberate path restriction, so the restriction is loose, and in an extreme case the path shown by the thick line in FIG. 7 may be taken. In this example, the state probabilities of the first-stage phoneme model took large values from around the tentative phoneme utterance duration T2 of the second-stage phoneme to around the tentative phoneme utterance duration T3 of the third-stage phoneme, and misrecognition resulted.

[0036] For example, when there is a phoneme model of the word "aki" (あき) and the speech "aki" is input to it, the final output likelihood of the model is high. That is, if the phoneme connection model of FIG. 7 is the model for recognizing the word "aki", then when the time-series data of the input speech "aki" is supplied, a high output likelihood is obtained from the final state FS, and the input speech is recognized as "aki".

[0037] When the speech "aka" (あか), for example, is supplied to this phoneme connection model, the final state FS of the "aki" model must yield a lower output likelihood than when "aki" is input. With a loosely restricted path such as that of FIG. 7, however, the "a" portions, which are numerous among the phonemes "a", "k", and "a" of "aka", may obtain a high state probability in the "a" phoneme model of "aki", and this may carry over to the output likelihood at the final state FS and raise it.

[0038] To prevent this, the present invention restricts the path as shown in FIG. 3, which results in the restricted paths of FIG. 6. When "aka" is uttered, the speech must then pass through paths in the unrestricted range. Consequently, the state probabilities for the phonemes of the speech data "aka", particularly from around the tentative phoneme utterance duration T2 of the second-stage phoneme to around the tentative phoneme utterance duration T3 of the third-stage phoneme, are lower than in FIG. 5. This carries over to the output likelihood at the final state FS, which can thus be kept low.

[0039] In contrast, when "aki" is uttered to such an "aki" phoneme model, the speech is likely to follow a path like the one shown in FIG. 8 whether or not the restriction is applied. It is therefore not greatly affected by the restriction: optimal state probabilities are obtained in the provisional phoneme utterance duration of each phoneme, and a high output likelihood is thus obtained from the final state FS.

[0040] Restricting the paths in this way improves the recognition rate. As one specific example, suppose that, as shown in FIG. 9, the time-series data (time t0 to tn) of the speech "ohayou" ("good morning") has been divided, as described above and based on the standard data for each phoneme, by the provisional phoneme utterance durations T11, T12, T13, T14, T15 of the phonemes "o", "h", "a", "y", "o", giving the provisional phoneme boundaries p0, p1, p2, ..., p5.

[0041] In this case, among the phoneme models of the phonemes "o", "h", "a", "y", "o", the transition from the final state of the phoneme model for "o" to the first state of the phoneme model for "h" is permitted on paths between the provisional phoneme boundaries p0 and p2 (thick arrows), and the remaining paths (thin arrows) are restricted (the state probability there is set to 0). The transition from the final state of the phoneme model for "h" to the first state of the phoneme model for "a" is permitted on paths between the midpoint of the provisional phoneme boundaries p0 and p1 and the provisional phoneme boundary p3 (thick arrows), and the remaining paths (thin arrows) are restricted (the state probability there is set to 0). The transition from the final state of the phoneme model for "a" to the first state of the phoneme model for "y" is permitted on paths between the midpoint of p1 and p2 and the provisional phoneme boundary p4 (thick arrows), and the remaining paths (thin arrows) are restricted (the state probability there is set to 0). The transition from the final state of the phoneme model for "y" to the first state of the phoneme model for the second "o" is permitted on paths between the midpoint of p2 and p3 and the provisional phoneme boundary p5 (thick arrows), and the remaining paths (thin arrows) are restricted (the state probability there is set to 0).
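As an illustration only, the windows described for the FIG. 9 example can be computed with a short Python sketch. The function names, the use of frame indices, and the duration values are hypothetical. The general rule for transition i is read off the four cases given in the text, and applying it to words of other lengths is an assumption.

```python
from itertools import accumulate


def provisional_boundaries(start_frame, durations):
    """Return boundaries p0..pN obtained by splitting the speech section,
    beginning at start_frame, by the provisional durations in order."""
    return [start_frame] + [start_frame + d for d in accumulate(durations)]


def transition_windows(boundaries):
    """Return one (start, end) window per phoneme-to-phoneme transition.

    Following the FIG. 9 example: the first transition is allowed from p0
    to p2; transition i (i >= 1) is allowed from the midpoint of p[i-1]
    and p[i] up to p[i+2]. Outside its window a transition is prohibited.
    """
    n_phonemes = len(boundaries) - 1
    windows = []
    for i in range(n_phonemes - 1):
        if i == 0:
            start = boundaries[0]
        else:
            start = (boundaries[i - 1] + boundaries[i]) / 2
        windows.append((start, boundaries[i + 2]))
    return windows


# Made-up durations standing in for T11..T15, in frames.
p = provisional_boundaries(0, [10, 6, 12, 8, 14])
for name, window in zip(["o->h", "h->a", "a->y", "y->o"], transition_windows(p)):
    print(name, window)
```

Each window extends well before and after the corresponding provisional boundary, so the restriction removes only transitions that occur far from where the standard data suggests the phoneme boundary should be.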

[0042] Experiments confirmed that good recognition results are obtained when such path restrictions are applied.

[0043] The above description is based on the Viterbi algorithm, but the same approach can also be applied with the trellis algorithm.
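A minimal Python sketch of how the same restriction might carry over, assuming that "trellis algorithm" here means a forward-style recursion that sums over all paths instead of keeping only the best one. The function names, the fixed stay/move probabilities of 0.5, and the window convention (checked against the frame at which the next state is entered) are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)), handling -inf safely."""
    m = max(a, b)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def constrained_forward(log_emit, boundary_windows,
                        log_stay=math.log(0.5), log_move=math.log(0.5)):
    """Log of the summed path probability at the last state.

    Same inputs as a Viterbi-style search: log_emit[t][s] is the log
    emission score, and boundary_windows maps a state s to the (start, end)
    frame window in which the move s -> s+1 is allowed.
    """
    n_states = len(log_emit[0])
    alpha = [NEG_INF] * n_states
    alpha[0] = log_emit[0][0]
    for t in range(1, len(log_emit)):
        new_alpha = [NEG_INF] * n_states
        for s in range(n_states):
            total = alpha[s] + log_stay
            if s > 0:
                window = boundary_windows.get(s - 1)
                if window is None or window[0] <= t <= window[1]:
                    # Sum the incoming path mass instead of taking the max.
                    total = log_add(total, alpha[s - 1] + log_move)
            new_alpha[s] = total + log_emit[t][s]
        alpha = new_alpha
    return alpha[-1]


flat = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # made-up, uninformative scores
print(constrained_forward(flat, {}))
print(constrained_forward(flat, {0: (1, 1)}))
```

The only change from a best-path search is that incoming scores are combined with `log_add` rather than `max`; prohibited transitions simply contribute nothing to the sum.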

[0044] The speech recognition processing program that performs the processing of the present invention described above can be recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also covers such a recording medium. The processing program may also be obtained from a network.

【００４５】[0045]

[Effects of the Invention] As described above, according to the present invention, a provisional phoneme utterance duration is obtained for each phoneme constituting the input speech on the basis of standard data for each phoneme, and the speech section of the input speech data is divided by these provisional phoneme utterance durations. On the paths from the first-stage phoneme model to the final-stage phoneme model of the HMM, a path restriction is then imposed on a given phoneme model over a section from one time to another time determined from the provisional phoneme utterance durations. The path restriction based on the provisional phoneme utterance durations is a control that prohibits the transition from the final state of one phoneme model to the first state of the next phoneme model during a section from one time to another time determined from those durations.

[0046] By restricting the paths in this way, the state transition algorithm applied to the time-series speech data of the input speech reaches the final state only through paths in the unrestricted range. An appropriate output likelihood is therefore obtained for the input speech, and misrecognition is reduced.

[0047] For example, suppose that speech data of a word containing many phonemes similar to those of a given word is input to an HMM phoneme connection model set up to give a high output likelihood for that given word. The speech has no choice but to pass through paths in the unrestricted range, so the final output likelihood is kept lower than when no path restriction is applied. That is, without path restrictions, a high state probability is obtained during the provisional phoneme utterance duration of some phoneme model, which affects the output likelihood at the final state of the final-stage phoneme model, raises that likelihood, and can cause misrecognition. Restricting the paths as in the present invention eliminates this problem and contributes to a higher recognition rate.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating an example of obtaining, for given input speech data, the provisional utterance duration of each phoneme constituting the input speech.

FIG. 2 is a diagram schematically showing the state transitions of a phoneme connection model obtained by connecting phoneme models (four states, three loops) for the phonemes constituting a word.

FIG. 3 is a diagram illustrating an example in which the path restriction of the present invention is applied to a phoneme connection model of a three-phoneme word formed by connecting three-state, two-loop phoneme models.

FIG. 4 is a diagram schematically showing a phoneme connection model of a three-phoneme word formed by connecting three-state, two-loop phoneme models.

FIG. 5 is a diagram illustrating the existence of paths that do not affect the output likelihood at the final state in a phoneme connection model of a three-phoneme word formed by connecting three-state, two-loop phoneme models.

FIG. 6 is a diagram illustrating the paths that, as a result of the path restriction of the present invention shown in FIG. 3, do not affect the output likelihood at the final state.

FIG. 7 is a diagram showing an example of a path obtained by the algorithm in the phoneme connection model of FIG. 5, illustrating a case in which misrecognition occurred.

FIG. 8 is a diagram showing an example of a path obtained by the algorithm in the phoneme connection model of FIG. 6, illustrating a case in which proper recognition was performed.

FIG. 9 is a diagram illustrating an example of a path restriction that enables proper recognition of a certain word.

[Explanation of symbols]

p0, p1, ...: provisional phoneme boundaries
T1, T2, ...: provisional phoneme utterance durations
FS: final state of the final-stage phoneme model

Claims

[Claims]

1. A speech recognition method in which phoneme models for the phonemes constituting a recognizable word are combined to build a phoneme connection model of the word, time-series speech data of an input speech is given to the phoneme connection model, an output likelihood is obtained from the final state of the final-stage phoneme model through a predetermined path of a state transition algorithm running from the first-stage phoneme model to the final-stage phoneme model constituting the phoneme connection model, and speech recognition is performed based on the magnitude of the output likelihood, wherein: for each phoneme constituting the input speech, the utterance duration of the phoneme is obtained from the time-series speech data of the input speech as a provisional phoneme utterance duration based on standard data for each phoneme; the speech section of the input speech data is divided by the provisional phoneme utterance durations of the phonemes constituting the input speech; and, on the paths of the state transition algorithm from the first-stage phoneme model to the final-stage phoneme model, a path restriction is imposed on the phoneme model for a given phoneme in a section from a certain time to a certain time based on the provisional phoneme utterance durations.

2. The speech recognition method according to claim 1, wherein the path restriction imposed in the section from a certain time to a certain time based on the provisional phoneme utterance durations is a control that prohibits the transition from the final state of one phoneme model to the first state of the next phoneme model during a period from a certain time to a certain time based on the provisional phoneme utterance durations.

3. A recording medium on which is recorded a speech recognition processing program in which phoneme models for the phonemes constituting a recognizable word are combined to build a phoneme connection model of the word, time-series speech data of an input speech is given to the phoneme connection model, an output likelihood is obtained from the final state of the final-stage phoneme model through a predetermined path of a state transition algorithm running from the first-stage phoneme model to the final-stage phoneme model constituting the phoneme connection model, and speech recognition is performed based on the magnitude of the output likelihood, wherein the speech recognition processing program includes: a step of obtaining, for each phoneme constituting the input speech, the utterance duration of the phoneme from the time-series speech data of the input speech as a provisional phoneme utterance duration based on standard data for each phoneme; a step of dividing the speech section of the input speech data by the provisional phoneme utterance durations of the phonemes constituting the input speech; and a step of imposing, on the paths of the state transition algorithm from the first-stage phoneme model to the final-stage phoneme model, a path restriction on the phoneme model corresponding to a given phoneme in a section from a certain time to a certain time based on the provisional phoneme utterance durations.

4. The recording medium on which the speech recognition processing program is recorded according to claim 3, wherein the path restriction imposed in the section from a certain time to a certain time based on the provisional phoneme utterance durations is a control that prohibits the transition from the final state of one phoneme model to the first state of the next phoneme model during a period from a certain time to a certain time based on the provisional phoneme utterance durations.