JP2001075596A

JP2001075596A - Voice recognition device, voice recognition method and recording medium storing voice recognition program

Info

Publication number: JP2001075596A
Application number: JP25047099A
Authority: JP
Inventors: Yuzo Maruta; 裕三丸田; Yoshiharu Abe; 芳春阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-09-03
Filing date: 1999-09-03
Publication date: 2001-03-23

Abstract

PROBLEM TO BE SOLVED: To obtain a speech recognition device with high recognition accuracy without increasing computational cost. SOLUTION: A likelihood calculating means 15 calculates the likelihood of hypotheses from acoustic feature vectors output by an acoustic analyzing means 12, acoustic models of the respective phonemes, and acoustic models of words. A simple acoustic model probability calculating means 22 calculates simple acoustic output probabilities from the acoustic models of the respective phonemes and the acoustic feature vectors, a rank fluctuation calculating means 23 calculates the average of the rank fluctuation widths of the simple acoustic output probabilities, and a beam width changing means 24 changes the beam width according to that average. A pruning means 16 rejects hypotheses whose acoustic likelihood is not within the beam width, as set by the beam width changing means 24, of the maximum likelihood.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Technical Field of the Invention: The present invention relates to a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program.

[0002] Description of the Related Art: FIG. 26 is a diagram showing the configuration of a conventional speech recognition device, inferred from Japanese Examined Patent Publication No. H4-22276 and Japanese Unexamined Patent Publication No. H8-248984. In the figure, 11 denotes a speech data storage means that digitizes input speech and stores it as speech data; 12 denotes an acoustic analysis means that reads the speech data stored in the speech data storage means 11 at predetermined time intervals, performs acoustic analysis, and outputs an acoustic feature vector; 13 denotes an acoustic model storage means that stores an acoustic model of each phoneme; and 14 denotes a word dictionary storage means that generates and stores an acoustic model of each word from the phoneme notation of that word in a word dictionary.

[0003] Also in FIG. 26, 15 denotes a likelihood calculating means that calculates the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vector output by the acoustic analysis means 12, the acoustic model of each phoneme stored in the acoustic model storage means 13, and the acoustic models of words stored in the word dictionary storage means 14. 16 denotes a pruning means that finds the maximum likelihood among the hypotheses, keeps the hypotheses that lie within a fixed value (hereinafter, the beam width) of that maximum likelihood, and rejects the hypotheses that fall outside the beam width. 17 denotes a recognition result output means that outputs the recognized words in descending order of likelihood. The acoustic model storage means 13, the word dictionary storage means 14, the likelihood calculating means 15, and the pruning means 16 constitute a frame recognition processing means 18.

[0004] FIG. 27 is a flowchart showing the processing of the conventional speech recognition device, FIG. 28 is a diagram showing a specific example of the word dictionary, FIG. 29 is a diagram showing an example of the structure of the acoustic model of a word, and FIG. 30 is a diagram explaining how hypotheses are expanded as time (hereinafter also referred to as a frame) advances.

[0005] Next, the operation will be described. First, the acoustic model storage means 13 reads the acoustic model of each phoneme from an external storage device (not shown) and stores it. In the following, for simplicity, the acoustic model is described as an HMM (Hidden Markov Model) acoustic model of each phoneme, but the same applies to other acoustic models.

[0006] Next, the word dictionary storage means 14 reads the word dictionary from the external storage device and stores it. As shown in FIG. 28, for example, the word dictionary holds the kanji notation, hiragana notation, and phoneme notation of each word. The word dictionary storage means 14 further generates an HMM acoustic model of each word from its phoneme notation. FIG. 29 is a diagram showing an example of the structure of the HMM acoustic model of the word "橋 (hashi)" ("bridge"). As shown in FIG. 29, each word is decomposed into its phonemes, the corresponding HMM acoustic model is assigned to each phoneme, and these models are concatenated to form the HMM acoustic model of the word.

[0007] Next, the speech data storage means 11 A/D-converts the input speech into digital data and stores it as speech data.

[0008] In step ST1 of FIG. 27, a control means (not shown) sets the time t to t = 0. In step ST2, the acoustic analysis means 12 reads the speech data at time t (t = 0) from the speech data storage means 11. In step ST3, the acoustic analysis means 12 performs acoustic analysis on the acquired speech data at t = 0 and calculates an acoustic feature vector.

[0009] In step ST4, the likelihood calculating means 15 calculates the output probability of each HMM state from the acoustic feature vector calculated in step ST3 and the acoustic model of each phoneme stored in the acoustic model storage means 13.

[0010] In step ST5, the likelihood calculating means 15 reads the hypotheses, which are the recognition candidates for each word, from the word dictionary storage means 14, calculates the acoustic likelihood of each hypothesis from the output probability of each HMM state calculated in step ST4 and the HMM acoustic models of the words stored in the word dictionary storage means 14, and expands the hypotheses. Methods of calculating the acoustic likelihood are well known and are disclosed in, for example, Japanese Examined Patent Publication No. H4-22276. Here, the acoustic likelihood is a measure of how close the speech data is to a given sound.

[0011] FIG. 30 is a diagram showing how hypotheses are expanded at each time (frame). For simplicity, the number of HMM states per phoneme is set to one. In the figure, each box is a hypothesis, and each hypothesis holds the word being recognized, the current phoneme, and the acoustic likelihood as information. If the word being recognized is "記事 (kiji)" ("article"), then as the frames advance the hypothesis is expanded into a hypothesis in which the HMM acoustic model self-loops and the phoneme does not advance, and a hypothesis in which the HMM acoustic model advances and the phoneme advances, so the number of hypotheses grows. In particular, as shown by the thick-framed hypotheses in FIG. 30, when the final phoneme /i/ of "記事 (kiji)" ends, a transition to the next word occurs. Because many different words can follow "記事 (kiji)", such as "が (ga)", "学校 (gakkou)" ("school"), "橋 (hashi)" ("bridge"), and so on, a separate hypothesis is assigned to each of them.

[0012] For a hypothesis immediately after a word transition in step ST5, the likelihood calculating means 15 may add the probability of the word concatenation to the likelihood calculation in the form of a sum or a product. The probability of concatenation between words is expressed as a conditional probability, and bigram models, trigram models, and the like are well known. When the concatenation probability between words is added to the likelihood calculation, the likelihood calculating means 15 obtains, from a language model storage means (not shown), the concatenation probability (language likelihood) of the word currently being recognized and the word immediately before it, or additionally the word before that, and adds it to the acoustic likelihood. This language likelihood indicates the probability of a transition from one word to the word that follows it.

[0013] In that case, as described in equations (4) and (5) of "A Method of Correcting Language Probability Based on Generalized Bernoulli Trials" (IEICE Transactions D-II, Vol. J81-D-II, No. 12, pp. 2703-2711), the language likelihood is generally added to the acoustic likelihood with a certain weight. As also disclosed in that document, the weight is generally set to an optimum value determined in preliminary experiments.
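
The weighted addition described in [0012] and [0013] can be sketched as follows. This is a minimal illustration, not the formula of the cited paper: it assumes the likelihoods are kept in the log domain, so that the "sum" form applies, and the function name, the bigram probability, and the weight value are all hypothetical.

```python
import math


def combined_log_likelihood(acoustic_loglik, bigram_prob, language_weight):
    """Add a weighted language log-likelihood to an acoustic log-likelihood.

    Assumption: both terms are log probabilities, so the language model
    contribution is language_weight * log P(word | previous word).
    """
    return acoustic_loglik + language_weight * math.log(bigram_prob)


# Hypothetical numbers for a hypothesis that has just crossed a word boundary.
score = combined_log_likelihood(acoustic_loglik=-42.0,
                                bigram_prob=0.05,
                                language_weight=8.0)
print(score)
```

With a fixed `language_weight`, the language model has the same influence whether the acoustics are clear or ambiguous, which is the limitation the document raises later.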

[0014] The likelihood calculating means 15 may also perform duration control in step ST5. That is, for a given hypothesis, the duration, which is the time spent in a given phoneme, is measured, and a penalty is applied if it differs from duration data obtained in advance. FIG. 31 is a diagram showing an example of durations. For example, when a hypothesis is expanded as shown in FIG. 31, the duration of /k/ is measured as one frame, the duration of /i/ as two frames, and so on. The duration data contains the average duration of /k/, the average duration of /i/, and so on, calculated in advance during training, and the penalty is determined from the difference.
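
As a rough sketch of the duration control in [0014]: durations are measured by run-length encoding the per-frame phoneme path of a hypothesis, and a penalty grows with the gap from the trained average. The absolute-difference penalty, the `scale` parameter, the sample path, and the average durations are assumptions for illustration; the text only says that the penalty is determined from the difference.

```python
from itertools import groupby


def measure_durations(phoneme_per_frame):
    """Run-length encode a per-frame phoneme path into (phoneme, frames) pairs."""
    return [(p, sum(1 for _ in run)) for p, run in groupby(phoneme_per_frame)]


def duration_penalty(phoneme_per_frame, mean_duration, scale=1.0):
    """Sum scale * |measured - mean| over the phoneme segments of one hypothesis.

    Assumption: the penalty is linear in the absolute difference.
    """
    return sum(scale * abs(frames - mean_duration[p])
               for p, frames in measure_durations(phoneme_per_frame))


# Hypothetical path: /k/ lasts 1 frame and the first /i/ lasts 2 frames.
path = ["k", "i", "i", "j", "i"]
mean_duration = {"k": 2.0, "i": 2.0, "j": 1.0}  # hypothetical trained averages

print(measure_durations(path))
print(duration_penalty(path, mean_duration))
```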

[0015] In this case, when an utterance is spoken faster than the training data, for example, large penalties are applied, which may lead to recognition errors. Correcting for the difference between the speaking rate in the training data and the speaking rate at recognition time is therefore disclosed in, for example, "Evaluation of Keyword Spotting with a Word Rejection Method" (Proceedings of the Acoustical Society of Japan, September 1998, pp. 159-160). There, a fixed period of the utterance is analyzed to obtain the speaking rate, the duration of each phoneme is expressed as a linear function of the speaking rate, and the average duration is changed when the speaking rate changes.

[0016] In step ST6 of FIG. 27, the pruning means 16 finds the maximum likelihood among the hypotheses. In step ST7, the pruning means 16 sorts the hypotheses by acoustic likelihood, keeps only those whose acoustic likelihood lies within a fixed value (hereinafter, the beam width) of the maximum likelihood, and rejects the rest.
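
Steps ST6 and ST7 amount to beam pruning, which might look like the sketch below. It assumes log-likelihood scores and keeps a hypothesis that lies exactly on the beam boundary; the dictionary layout and the sample scores are hypothetical.

```python
def prune(hypotheses, beam_width):
    """Keep hypotheses whose log-likelihood is within beam_width of the best one.

    Returns the survivors sorted from most to least likely.
    """
    if not hypotheses:
        return []
    best = max(h["loglik"] for h in hypotheses)  # ST6: maximum likelihood
    survivors = [h for h in hypotheses if best - h["loglik"] <= beam_width]  # ST7
    return sorted(survivors, key=lambda h: h["loglik"], reverse=True)


# Hypothetical hypotheses after one frame.
hyps = [
    {"word": "hashi", "loglik": -12.5},
    {"word": "kiji", "loglik": -10.0},
    {"word": "gakkou", "loglik": -30.0},
]
kept = prune(hyps, beam_width=5.0)
print([h["word"] for h in kept])
```

A wider `beam_width` keeps more hypotheses alive, which lowers the risk of discarding the correct one but raises the cost of every later frame.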

[0017] In steps ST8 and ST9, the processing from step ST2 to step ST7 is repeated until the speech ends. In step ST10, the recognition result output means 17 outputs, as the recognition result, the recognized words (word sequences) of the hypotheses for which the calculation has been completed over the entire utterance, in descending order of likelihood.

[0018] The above processing is performed at every time (every frame) by replacing t with t + 1 in step ST9, but processing may be skipped for some frames in order to shorten the processing time. In that case, as disclosed in "Study of Feature Vector Interpolation and Frame Decimation Method in Speech Recognition" (Spring 1999 IEICE National Convention, D-14-21), processing is generally skipped once every few frames.
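
Fixed-rate frame decimation as described in [0018] can be illustrated with a small helper. The choice to skip the last frame of every `period` frames is an assumption; the text states only that processing is skipped once every few frames, and it does not say how a skipped frame is handled.

```python
def frames_to_process(num_frames, period):
    """Return the frame indices that receive full recognition processing.

    Assumption: within every block of `period` frames, the last one is skipped.
    """
    if period < 2:
        raise ValueError("period must be at least 2")
    return [t for t in range(num_frames) if (t + 1) % period != 0]


print(frames_to_process(num_frames=7, period=3))
```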

[0019] Problems to be Solved by the Invention: Because the conventional speech recognition device is configured as described above, one way to improve recognition performance is to widen the beam width in step ST7. Widening the beam width reduces the chance that the correct hypothesis is pruned, but it increases the number of hypotheses remaining within the beam and therefore the amount of processing. In particular, when the correct candidate is known to rank high within the beam, widening the beam width merely increases the computational cost and is wasteful.

[0020] In addition, because the weight of the language likelihood relative to the acoustic likelihood is fixed in the processing of step ST5, recognition accuracy may degrade depending on how ambiguous the acoustics are.

[0021] Furthermore, in the duration control in the processing of step ST5, determining the speaking rate itself requires analyzing an utterance section of at least a certain length, which takes processing time.

[0022] Furthermore, regarding the skipping of frame processing in step ST9, whether decimation is applied and its proportion (the decimation rate) are fixed, so recognition errors may occur when the acoustics are ambiguous and precise likelihoods are required.

[0023] Furthermore, the processing of step ST5 does not reflect the fact that, when a sentence is uttered, a slight pause often follows a particle such as "が (ga)", so the correct speaking rate cannot be detected.

[0024] Furthermore, the processing of step ST5 does not reflect the fact that, when words are uttered in succession, a pause is likely at some word boundaries and unlikely at others, so duration control may not be performed correctly.

[0025] Furthermore, when breath or the like hits the microphone, it may be recognized as speech and cause misrecognition.

[0026] The present invention has been made to solve the above problems. An object of the invention is to obtain a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program that achieve high recognition accuracy by suppressing pruning of the correct answer without increasing the computational cost.

[0027] Another object of the invention is to obtain a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program that achieve high recognition accuracy even when the acoustics are ambiguous.

[0028] A further object of the invention is to obtain a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program that achieve high recognition accuracy for both fast and slow utterances.

[0029] A further object of the invention is to obtain a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program that achieve high recognition accuracy even for long utterances, such as sentences, in which the speaking rate varies locally.

[0030] A further object of the invention is to obtain a speech recognition device, a speech recognition method, and a recording medium storing a speech recognition program that achieve high recognition accuracy even when breath or the like hits the microphone.

[0031] Means for Solving the Problems: A speech recognition device according to the present invention comprises: a speech data storage means that converts input speech into digital data and stores it as speech data; an acoustic analysis means that reads the speech data at predetermined time intervals, performs acoustic analysis, and outputs an acoustic feature vector; an acoustic model storage means that stores an acoustic model of each phoneme; a word dictionary storage means that generates and stores an acoustic model of each word from the phoneme notation of that word in a word dictionary; a likelihood calculating means that calculates the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vector output by the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic models of words stored in the word dictionary storage means; a pruning means that finds the maximum likelihood among the likelihoods of the hypotheses calculated by the likelihood calculating means and rejects hypotheses whose likelihood is not within a predetermined beam width of that maximum likelihood; and a recognition result output means that outputs the hypotheses retained by the pruning means as recognition candidates. The device further comprises: a simple acoustic model storage means that stores a simple acoustic model of each phoneme; a simple acoustic model probability calculating means that calculates, from the acoustic feature vector output by the acoustic analysis means and the simple acoustic model of each phoneme stored in the simple acoustic model storage means, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; and a rank fluctuation calculating means that ranks the simple acoustic output probabilities of the HMM states at each time obtained by the simple acoustic model probability calculating means, calculates the rank fluctuation width of each HMM state within the predetermined period spanning the current time, and calculates the average of the rank fluctuation widths of the HMM states. A parameter relating to speech recognition is adjusted based on the average rank fluctuation width calculated by the rank fluctuation calculating means.
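
One possible reading of the rank fluctuation calculation in [0031] is sketched below. It assumes that the fluctuation width of an HMM state is the difference between its worst and best rank over the frames of the window; the text does not define the width more precisely, and the function names and sample probabilities are hypothetical.

```python
def rank_states(probs):
    """Return the rank of each state for one frame (1 = highest probability)."""
    order = sorted(range(len(probs)), key=lambda s: probs[s], reverse=True)
    ranks = [0] * len(probs)
    for rank, state in enumerate(order, start=1):
        ranks[state] = rank
    return ranks


def mean_rank_fluctuation(window):
    """Average, over HMM states, of (max rank - min rank) within the window.

    window: list of frames around the current time; each frame is a list of
    simple acoustic output probabilities, one per HMM state.
    """
    ranks_per_frame = [rank_states(frame) for frame in window]
    n_states = len(window[0])
    widths = []
    for s in range(n_states):
        ranks_of_s = [ranks[s] for ranks in ranks_per_frame]
        widths.append(max(ranks_of_s) - min(ranks_of_s))
    return sum(widths) / n_states


# Hypothetical windows of three frames and three HMM states.
stable = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25], [0.45, 0.35, 0.2]]
unstable = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]

print(mean_rank_fluctuation(stable))
print(mean_rank_fluctuation(unstable))
```

Under this reading, a small average means the ordering of the states barely changes across the window, while a large average means the best-matching states keep changing.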

[0032] The speech recognition device according to the present invention comprises a beam width changing means that sets the beam width as the parameter relating to speech recognition. When the average rank fluctuation width calculated by the rank fluctuation calculating means is smaller than a predetermined value, the beam width changing means sets the beam width small; when the average rank fluctuation width is larger than the predetermined value, the beam width changing means sets the beam width large. The pruning means rejects hypotheses based on the beam width set by the beam width changing means.
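
The beam width rule in [0032] reduces to a threshold comparison, sketched here under stated assumptions. The two candidate widths, the threshold value, and the behavior when the average equals the threshold (keep the current width) are not given in the text and are hypothetical.

```python
def select_beam_width(mean_fluctuation, threshold, narrow, wide, current):
    """Choose a beam width from the average rank fluctuation width.

    Below the threshold the beam is narrowed, above it the beam is widened.
    Assumption: on exact equality the current width is left unchanged.
    """
    if mean_fluctuation < threshold:
        return narrow
    if mean_fluctuation > threshold:
        return wide
    return current


# Hypothetical settings.
print(select_beam_width(0.4, threshold=1.0, narrow=50.0, wide=150.0, current=100.0))
print(select_beam_width(2.3, threshold=1.0, narrow=50.0, wide=150.0, current=100.0))
```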

[0033] The speech recognition device according to the present invention comprises a language weight changing means that sets, as the parameter relating to speech recognition, the weight of the language likelihood added to the likelihood. When the average rank fluctuation width calculated by the rank fluctuation calculating means is smaller than a predetermined value, the language weight changing means sets the weight of the language likelihood small; when the average rank fluctuation width is larger than the predetermined value, the language weight changing means sets the weight of the language likelihood large. The likelihood calculating means calculates the likelihood of the hypotheses based on the weight of the language likelihood set by the language weight changing means.

[0034] The speech recognition device according to the present invention comprises a duration data storage means that stores utterance duration data corresponding to a plurality of speaking rates, and a duration data changing means that sets the duration data as the parameter relating to speech recognition. When the average rank fluctuation width calculated by the rank fluctuation calculating means is smaller than a predetermined value, the duration data changing means selects the duration data for slower utterances stored in the duration data storage means; when the average rank fluctuation width is larger than the predetermined value, the duration data changing means selects the duration data for faster utterances stored in the duration data storage means. The likelihood calculating means calculates the likelihood of the hypotheses based on the duration data selected by the duration data changing means.

[0035] The speech recognition device according to the present invention comprises a decimation decision means that decides, as the parameter relating to speech recognition, whether frame processing, which is the recognition processing at each time, is decimated. When the average rank fluctuation width calculated by the rank fluctuation calculating means is smaller than a predetermined value, the decimation decision means turns decimation of frame processing on; when the average rank fluctuation width is larger than the predetermined value, the decimation decision means turns decimation of frame processing off.
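
The on/off decision in [0035] could drive a per-frame processing plan such as the one below. This is only a sketch: it assumes that, while decimation is on, every other frame is skipped and that a frame is never skipped right after another skipped frame. The threshold and the per-frame averages are hypothetical, and an average exactly equal to the threshold is treated as "decimation off".

```python
def plan_frames(mean_fluctuations, threshold):
    """Return one boolean per frame: True means the frame is fully processed.

    A frame is skipped only when its average rank fluctuation width is below
    the threshold (decimation on) and the previous frame was processed.
    """
    plan = []
    prev_processed = False
    for value in mean_fluctuations:
        process = not (value < threshold and prev_processed)
        plan.append(process)
        prev_processed = process
    return plan


# Hypothetical per-frame averages: stable, then ambiguous, then stable again.
fluct = [0.2, 0.2, 0.2, 3.0, 3.0, 0.2]
print(plan_frames(fluct, threshold=1.0))
```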

[0036] The speech recognition device according to the present invention comprises a decimation rate determining means that determines, as the parameter relating to speech recognition, the decimation rate of frame processing, which is the recognition processing at each time. When the average rank fluctuation width calculated by the rank fluctuation calculating means is smaller than a predetermined value, the decimation rate determining means sets the decimation rate of frame processing high; when the average rank fluctuation width is larger than the predetermined value, the decimation rate determining means sets the decimation rate of frame processing low.

[0037] A speech recognition device according to the present invention comprises: a speech data storage means that converts input speech into digital data and stores it as speech data; an acoustic analysis means that reads the speech data at predetermined time intervals, performs acoustic analysis, and outputs an acoustic feature vector; an acoustic model storage means that stores an acoustic model of each phoneme; a word dictionary storage means that generates and stores an acoustic model of each word from the phoneme notation of that word in a word dictionary; a likelihood calculating means that calculates the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vector output by the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic models of words stored in the word dictionary storage means; a pruning means that finds the maximum likelihood among the likelihoods of the hypotheses calculated by the likelihood calculating means and rejects hypotheses whose likelihood is not within a predetermined beam width of that maximum likelihood; and a recognition result output means that outputs the hypotheses retained by the pruning means as recognition candidates. The device further comprises: a duration data storage means that stores utterance duration data corresponding to a plurality of speaking rates; a word type acquisition means that obtains, from the likelihood calculating means, the word and the position within the word of each hypothesis currently being recognized, and obtains, from the word dictionary storage means, the part of speech of the word of the hypothesis; and a duration data changing means that selects, from the duration data storage means, the duration data corresponding to the part of speech and the position within the word obtained by the word type acquisition means. The likelihood calculating means calculates the likelihood of the hypotheses based on the duration data selected by the duration data changing means.

[0038] A speech recognition device according to the present invention comprises: a speech data storage means that converts input speech into digital data and stores it as speech data; an acoustic analysis means that reads the speech data at predetermined time intervals, performs acoustic analysis, and outputs an acoustic feature vector; an acoustic model storage means that stores an acoustic model of each phoneme; a word dictionary storage means that generates and stores an acoustic model of each word from the phoneme notation of that word in a word dictionary; a likelihood calculating means that calculates the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vector output by the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic models of words stored in the word dictionary storage means; a pruning means that finds the maximum likelihood among the likelihoods of the hypotheses calculated by the likelihood calculating means and rejects hypotheses whose likelihood is not within a predetermined beam width of that maximum likelihood; and a recognition result output means that outputs the hypotheses retained by the pruning means as recognition candidates. The device further comprises: a language model storage means that stores word chain probabilities; a duration data storage means that stores utterance duration data corresponding to speaking rates; a word chain probability acquisition means that obtains, from the likelihood calculating means, the word of each hypothesis currently being recognized and the words recognized before it, and obtains, from the language model storage means, the chain probability of those words; and a duration data changing means that selects, from the duration data storage means, the duration data corresponding to the word chain probability obtained by the word chain probability acquisition means. The likelihood calculating means calculates the likelihood of the hypotheses based on the duration data selected by the duration data changing means.

【００３９】この発明に係る音声認識装置は、入力した
音声をデジタルデータ化して音声データとして記憶する
音声データ記憶手段と、上記音声データを所定時刻ごと
に取り込み音響分析して音響特徴ベクトルを出力する音
響分析手段と、各音素の音響モデルを記憶する音響モデ
ル記憶手段と、単語辞書における各単語の音素表記から
単語の音響モデルを生成して記憶する単語辞書記憶手段
と、上記音響分析手段から出力された音響特徴ベクトル
と、上記音響モデル記憶手段に記憶されている各音素の
音響モデルと、上記単語辞書記憶手段に記憶されている
単語の音響モデルにより、認識候補である仮説の尤度を
演算する尤度演算手段と、上記尤度演算手段が演算した
仮説の尤度から最大尤度を求め、求めた最大尤度から所
定のビーム幅以下の仮説を棄却する枝刈り手段と、上記
枝刈り手段により残された仮説を認識候補として出力す
る認識結果出力手段とを備えたものにおいて、上記枝刈
り手段から仮説の最大尤度を取得し、取得した最大尤度
を持ち、発声最初とする仮説を追加して上記枝刈り手段
に出力する最大尤度仮説追加手段を備えたものである。The speech recognition device according to the present invention comprises: speech data storage means for digitizing input speech and storing it as speech data; acoustic analysis means for taking in the speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; acoustic model storage means for storing an acoustic model of each phoneme; word dictionary storage means for generating and storing acoustic models of words from the phoneme notation of each word in a word dictionary; likelihood calculation means for calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output by the acoustic analysis means, the acoustic models of the phonemes stored in the acoustic model storage means, and the acoustic models of the words stored in the word dictionary storage means; pruning means for obtaining the maximum likelihood from the likelihoods calculated by the likelihood calculation means and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and recognition result output means for outputting the hypotheses left by the pruning means as recognition candidates. The device further comprises maximum-likelihood hypothesis adding means for acquiring the maximum likelihood of the hypotheses from the pruning means, adding a hypothesis that has the acquired maximum likelihood and is treated as the start of the utterance, and outputting it to the pruning means.
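The role of the maximum-likelihood hypothesis adding means can be illustrated with a minimal sketch. The representation of a hypothesis as a `(label, log_likelihood)` pair, the function name, and the label `"<utterance-start>"` are all assumptions made for illustration; the text above specifies only that a hypothesis carrying the current maximum likelihood and treated as the start of the utterance is added and passed to the pruning means.

```python
def add_utterance_start_hypothesis(hypotheses, start_label="<utterance-start>"):
    """Append a hypothesis that restarts the utterance with the current best score.

    hypotheses: list of (label, log_likelihood) pairs (hypothetical representation).
    Returns a new list; the input list is not modified.
    """
    if not hypotheses:
        return []
    # The maximum likelihood among the hypotheses currently alive.
    best = max(score for _, score in hypotheses)
    # The added hypothesis starts from the beginning of an utterance but
    # carries the maximum likelihood, so it survives the next pruning pass.
    return hypotheses + [(start_label, best)]
```

Because the added hypothesis has the maximum likelihood, it is never the one rejected by the beam, whatever beam width is in use.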

【００４０】この発明に係る音声認識方法は、音声デー
タを所定時刻ごとに取り込み音響分析して音響特徴ベク
トルを出力する第１のステップと、上記第１のステップ
で出力された音響特徴ベクトルと、予め記憶されている
各音素の簡易な音響モデルにより、現在時刻をはさむ所
定の時間内における各時刻の各ＨＭＭ状態の簡易音響出
力確率を演算する第２のステップと、上記各時刻の各Ｈ
ＭＭ状態の簡易音響出力確率の順位を求め、現在時刻を
はさむ所定の時間内における各ＨＭＭ状態の順位変動幅
を計算し、ＨＭＭ状態の順位変動幅の平均を計算する第
３のステップと、上記順位変動幅の平均に基づき所定の
ビーム幅を設定する第４のステップと、上記第１のステ
ップで出力された音響特徴ベクトルと、予め記憶されて
いる各音素の音響モデルと単語の音響モデルにより、認
識候補である仮説の尤度を演算する第５のステップと、
演算した仮説の尤度から最大尤度を求め、求めた最大尤
度から上記第４のステップで設定された所定のビーム幅
以下の仮説を棄却する第６のステップと、上記第６のス
テップで残された仮説を認識候補として出力する第７の
ステップとを備えて音声を認識するものである。The speech recognition method according to the present invention recognizes speech by: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating, from the acoustic feature vectors output in the first step and a prestored simple acoustic model of each phoneme, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of the HMM states; a fourth step of setting a predetermined beam width based on the average of the rank variation widths; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step and the prestored acoustic models of the phonemes and of the words; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by the beam width set in the fourth step or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４１】この発明に係る音声認識方法は、音声デー
タを所定時刻ごとに取り込み音響分析して音響特徴ベク
トルを出力する第１のステップと、上記第１のステップ
で出力された音響特徴ベクトルと、予め記憶されている
各音素の簡易な音響モデルにより、現在時刻をはさむ所
定の時間内における各時刻の各ＨＭＭ状態の簡易音響出
力確率を演算する第２のステップと、上記各時刻の各Ｈ
ＭＭ状態の簡易音響出力確率の順位を求め、現在時刻を
はさむ所定の時間内における各ＨＭＭ状態の順位変動幅
を計算し、全ＨＭＭ状態の順位変動幅の平均を計算する
第３のステップと、上記順位変動幅の平均に基づき音声
認識に係るパラメータを調整する第４のステップと、上
記第１のステップで出力された音響特徴ベクトルと、予
め記憶されている各音素の音響モデルと単語の音響モデ
ルと、上記第４のステップで調整した音声認識に係るパ
ラメータにより、認識候補である仮説の尤度を演算する
第５のステップと、演算した仮説の尤度から最大尤度を
求め、求めた最大尤度から所定のビーム幅以下の仮説を
棄却する第６のステップと、上記第６のステップで残さ
れた仮説を認識候補として出力する第７のステップとを
備えて音声を認識するものである。The speech recognition method according to the present invention recognizes speech by: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating, from the acoustic feature vectors output in the first step and a prestored simple acoustic model of each phoneme, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of all the HMM states; a fourth step of adjusting a parameter related to speech recognition based on the average of the rank variation widths; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step, the prestored acoustic models of the phonemes and of the words, and the parameter related to speech recognition adjusted in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４２】この発明に係る音声認識方法は、音声デー
タを所定時刻ごとに取り込み音響分析して音響特徴ベク
トルを出力する第１のステップと、上記第１のステップ
で出力された音響特徴ベクトルと、予め記憶されている
各音素の音響モデルから各ＨＭＭ状態の出力確率を計算
する第２のステップと、現在認識中の各仮説の単語及び
単語内位置を取得し、予め記憶されている仮説の単語の
品詞を取得する第３のステップと、上記第３のステップ
で取得した仮説の単語の品詞及び単語内位置に対応した
継続時間長データを、予め記憶されている発声速度に対
応した発声の継続時間長データの中から選択する第４の
ステップと、上記第２のステップで計算した各ＨＭＭ状
態の出力確率と、予め記憶されている単語の音響モデル
と、上記第４のステップで選択した継続時間長データに
より、認識候補である仮説の尤度を演算する第５のステ
ップと、演算した仮説の尤度から最大尤度を求め、求め
た最大尤度から所定のビーム幅以下の仮説を棄却する第
６のステップと、上記第６のステップで残された仮説を
認識候補として出力する第７のステップとを備えて音声
を認識するものである。The speech recognition method according to the present invention recognizes speech by: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the output probability of each HMM state from the acoustic feature vectors output in the first step and a prestored acoustic model of each phoneme; a third step of acquiring the word and the within-word position of each hypothesis currently being recognized, and acquiring the prestored part of speech of the word of the hypothesis; a fourth step of selecting the duration data corresponding to the part of speech and within-word position acquired in the third step from prestored utterance duration data corresponding to utterance speeds; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the output probability of each HMM state calculated in the second step, the prestored acoustic models of the words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４３】この発明に係る音声認識方法は、音声デー
タを所定時刻ごとに取り込み音響分析して音響特徴ベク
トルを出力する第１のステップと、上記第１のステップ
で出力された音響特徴ベクトルと予め記憶されている各
音素の音響モデルから各ＨＭＭ状態の出力確率を計算す
る第２のステップと、現在認識中の各仮説の単語とそれ
以前に認識した単語を取得し、予め記憶されている単語
の連鎖確率から、それらの単語の連鎖確率を取得する第
３のステップと、上記第３のステップで取得した単語の
連鎖確率に応じた継続時間長データを、予め記憶されて
いる発声速度に対応した発声の継続時間長データから選
択する第４のステップと、上記第２のステップで計算し
た各ＨＭＭ状態の出力確率と、予め記憶されている単語
の音響モデルと、上記第４のステップで選択した継続時
間長データにより、認識候補である仮説の尤度を演算す
る第５のステップと、演算した仮説の尤度から最大尤度
を求め、求めた最大尤度から所定のビーム幅以下の仮説
を棄却する第６のステップと、上記第６のステップで残
された仮説を認識候補として出力する第７のステップと
を備えて音声を認識するものである。The speech recognition method according to the present invention recognizes speech by: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the output probability of each HMM state from the acoustic feature vectors output in the first step and a prestored acoustic model of each phoneme; a third step of acquiring the word of each hypothesis currently being recognized and the word recognized before it, and acquiring the chain probability of those words from prestored word chain probabilities; a fourth step of selecting the duration data corresponding to the word chain probability acquired in the third step from prestored utterance duration data corresponding to utterance speeds; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the output probability of each HMM state calculated in the second step, the prestored acoustic models of the words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４４】この発明に係る音声認識方法は、音声デー
タを所定時刻ごとに取り込み音響分析して音響特徴ベク
トルを出力する第１のステップと、上記第１のステップ
で出力された音響特徴ベクトルと、予め記憶されている
各音素の音響モデルと単語の音響モデルにより、認識候
補である仮説の尤度を演算する第２のステップと、上記
第２のステップで演算した仮説の尤度から最大尤度を求
める第３のステップと、上記第３のステップで求めた最
大尤度を取得し、取得した最大尤度を持ち、発声最初と
する仮説を追加する第４のステップと、上記第３のステ
ップで求めた最大尤度から所定のビーム幅以下の仮説を
棄却する第５のステップと、上記第５のステップで残さ
れた仮説を認識候補として出力する第６のステップとを
備えて音声を認識するものである。The speech recognition method according to the present invention recognizes speech by: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step and the prestored acoustic models of the phonemes and of the words; a third step of obtaining the maximum likelihood from the likelihoods calculated in the second step; a fourth step of acquiring the maximum likelihood obtained in the third step and adding a hypothesis that has the acquired maximum likelihood and is treated as the start of the utterance; a fifth step of rejecting hypotheses whose likelihood is lower than the maximum likelihood obtained in the third step by a predetermined beam width or more; and a sixth step of outputting the hypotheses left in the fifth step as recognition candidates.

【００４５】この発明に係る音声認識プログラムを記録
した記録媒体は、音声データを所定時刻ごとに取り込
み、音響分析して音響特徴ベクトルを出力する第１のス
テップと、上記第１のステップで出力された音響特徴ベ
クトルと、予め記憶されている各音素の簡易な音響モデ
ルにより、現在時刻をはさむ所定の時間内における各時
刻の各ＨＭＭ状態の簡易音響出力確率を演算する第２の
ステップと、上記各時刻の各ＨＭＭ状態の簡易音響出力
確率の順位を求め、現在時刻をはさむ所定の時間内にお
ける各ＨＭＭ状態の順位変動幅を計算し、ＨＭＭ状態の
順位変動幅の平均を計算する第３のステップと、上記順
位変動幅の平均に基づき所定のビーム幅を設定する第４
のステップと、上記第１のステップで出力された音響特
徴ベクトルと、予め記憶されている各音素の音響モデル
と単語の音響モデルにより、認識候補である仮説の尤度
を演算する第５のステップと、演算した仮説の尤度から
最大尤度を求め、求めた最大尤度から上記第４のステッ
プで設定された所定のビーム幅以下の仮説を棄却する第
６のステップと、上記第６のステップで残された仮説を
認識候補として出力する第７のステップをコンピュータ
に実行させるものである。The recording medium on which the speech recognition program according to the present invention is recorded causes a computer to execute: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating, from the acoustic feature vectors output in the first step and a prestored simple acoustic model of each phoneme, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of the HMM states; a fourth step of setting a predetermined beam width based on the average of the rank variation widths; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step and the prestored acoustic models of the phonemes and of the words; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by the beam width set in the fourth step or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４６】この発明に係る音声認識プログラムを記録
した記録媒体は、音声データを所定時刻ごとに取り込み
音響分析して音響特徴ベクトルを出力する第１のステッ
プと、上記第１のステップで出力された音響特徴ベクト
ルと、予め記憶されている各音素の簡易な音響モデルに
より、現在時刻をはさむ所定の時間内における各時刻の
各ＨＭＭ状態の簡易音響出力確率を演算する第２のステ
ップと、上記各時刻の各ＨＭＭ状態の簡易音響出力確率
の順位を求め、現在時刻をはさむ所定の時間内における
各ＨＭＭ状態の順位変動幅を計算し、全ＨＭＭ状態の順
位変動幅の平均を計算する第３のステップと、上記順位
変動幅の平均に基づき音声認識に係るパラメータを調整
する第４のステップと、上記第１のステップで出力され
た音響特徴ベクトルと、予め記憶されている各音素の音
響モデルと単語の音響モデルと、上記第４のステップで
調整した音声認識に係るパラメータにより、認識候補で
ある仮説の尤度を演算する第５のステップと、演算した
仮説の尤度から最大尤度を求め、求めた最大尤度から所
定のビーム幅以下の仮説を棄却する第６のステップと、
上記第６のステップで残された仮説を認識候補として出
力する第７のステップをコンピュータに実行させるもの
である。The recording medium on which the speech recognition program according to the present invention is recorded causes a computer to execute: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating, from the acoustic feature vectors output in the first step and a prestored simple acoustic model of each phoneme, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of all the HMM states; a fourth step of adjusting a parameter related to speech recognition based on the average of the rank variation widths; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step, the prestored acoustic models of the phonemes and of the words, and the parameter related to speech recognition adjusted in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４７】この発明に係る音声認識プログラムを記録
した記録媒体は、音声データを所定時刻ごとに取り込み
音響分析して音響特徴ベクトルを出力する第１のステッ
プと、上記第１のステップで出力された音響特徴ベクト
ルと、予め記憶されている各音素の音響モデルから各Ｈ
ＭＭ状態の出力確率を計算する第２のステップと、現在
認識中の各仮説の単語及び単語内位置を取得し、予め記
憶されている仮説の単語の品詞を取得する第３のステッ
プと、上記第３のステップで取得した仮説の単語の品詞
及び単語内位置に対応した継続時間長データを、予め記
憶されている発声速度に対応した発声の継続時間長デー
タの中から選択する第４のステップと、上記第２のステ
ップで計算した各ＨＭＭ状態の出力確率と、予め記憶さ
れている単語の音響モデルと、上記第４のステップで選
択した継続時間長データにより、認識候補である仮説の
尤度を演算する第５のステップと、演算した仮説の尤度
から最大尤度を求め、求めた最大尤度から所定のビーム
幅以下の仮説を棄却する第６のステップと、上記第６の
ステップで残された仮説を認識候補として出力する第７
のステップをコンピュータに実行させるものである。The recording medium on which the speech recognition program according to the present invention is recorded causes a computer to execute: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the output probability of each HMM state from the acoustic feature vectors output in the first step and a prestored acoustic model of each phoneme; a third step of acquiring the word and the within-word position of each hypothesis currently being recognized, and acquiring the prestored part of speech of the word of the hypothesis; a fourth step of selecting the duration data corresponding to the part of speech and within-word position acquired in the third step from prestored utterance duration data corresponding to utterance speeds; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the output probability of each HMM state calculated in the second step, the prestored acoustic models of the words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４８】この発明に係る音声認識プログラムを記録
した記録媒体は、音声データを所定時刻ごとに取り込み
音響分析して音響特徴ベクトルを出力する第１のステッ
プと、上記第１のステップで出力された音響特徴ベクト
ルと、予め記憶されている各音素の音響モデルから各Ｈ
ＭＭ状態の出力確率を計算する第２のステップと、現在
認識中の各仮説の単語とそれ以前に認識した単語を取得
し、予め記憶されている単語の連鎖確率から、それらの
単語の連鎖確率を取得する第３のステップと、上記第３
のステップで取得した単語の連鎖確率に応じた継続時間
長データを、予め記憶されている発声速度に対応した発
声の継続時間長データから選択する第４のステップと、
上記第２のステップで計算した各ＨＭＭ状態の出力確率
と、予め記憶されている単語の音響モデルと、上記第４
のステップで選択した継続時間長データにより、認識候
補である仮説の尤度を演算する第５のステップと、演算
した仮説の尤度から最大尤度を求め、求めた最大尤度か
ら所定のビーム幅以下の仮説を棄却する第６のステップ
と、上記第６のステップで残された仮説を認識候補とし
て出力する第７のステップをコンピュータに実行させる
ものである。The recording medium on which the speech recognition program according to the present invention is recorded causes a computer to execute: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the output probability of each HMM state from the acoustic feature vectors output in the first step and a prestored acoustic model of each phoneme; a third step of acquiring the word of each hypothesis currently being recognized and the word recognized before it, and acquiring the chain probability of those words from prestored word chain probabilities; a fourth step of selecting the duration data corresponding to the word chain probability acquired in the third step from prestored utterance duration data corresponding to utterance speeds; a fifth step of calculating the likelihood of hypotheses, which are recognition candidates, from the output probability of each HMM state calculated in the second step, the prestored acoustic models of the words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses whose likelihood is lower than the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

【００４９】この発明に係る音声認識プログラムを記録
した記録媒体は、音声データを所定時刻ごとに取り込み
音響分析して音響特徴ベクトルを出力する第１のステッ
プと、上記第１のステップで出力された音響特徴ベクト
ルと、予め記憶されている各音素の音響モデルと単語の
音響モデルにより、認識候補である仮説の尤度を演算す
る第２のステップと、上記第２のステップで演算した仮
説の尤度から最大尤度を求める第３のステップと、上記
第３のステップで求めた最大尤度を取得し、取得した最
大尤度を持ち、発声最初とする仮説を追加する第４のス
テップと、上記第３のステップで求めた最大尤度から所
定のビーム幅以下の仮説を棄却する第５のステップと、
上記第５のステップで残された仮説を認識候補として出
力する第６のステップをコンピュータに実行させるもの
である。The recording medium on which the speech recognition program according to the present invention is recorded causes a computer to execute: a first step of taking in speech data at predetermined times, performing acoustic analysis, and outputting acoustic feature vectors; a second step of calculating the likelihood of hypotheses, which are recognition candidates, from the acoustic feature vectors output in the first step and the prestored acoustic models of the phonemes and of the words; a third step of obtaining the maximum likelihood from the likelihoods calculated in the second step; a fourth step of acquiring the maximum likelihood obtained in the third step and adding a hypothesis that has the acquired maximum likelihood and is treated as the start of the utterance; a fifth step of rejecting hypotheses whose likelihood is lower than the maximum likelihood obtained in the third step by a predetermined beam width or more; and a sixth step of outputting the hypotheses left in the fifth step as recognition candidates.

【００５０】[0050]

【発明の実施の形態】以下、この発明の実施の一形態を
説明する。実施の形態１．図１はこの発明の実施の形態１による音
声認識装置の構成を示す図である。図において、２１は
各音素の簡易な音響モデルを記憶する簡易音響モデル記
憶手段、２２は、音響分析手段１２からの音響特徴ベク
トルと、簡易音響モデル記憶手段２１に記憶されている
各音素の簡易な音響モデルを用いて、全て又は一部のＨ
ＭＭ状態について、現在時刻（現在フレーム）をはさむ
時間内における各時刻の簡易音響出力確率を求める簡易
音響モデル確率演算手段である。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the present invention is described below. Embodiment 1. FIG. 1 is a diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. In the figure, 21 is simple acoustic model storage means for storing a simple acoustic model of each phoneme, and 22 is simple acoustic model probability calculation means that uses the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic models of the phonemes stored in the simple acoustic model storage means 21 to obtain, for all or some of the HMM states, the simple acoustic output probability at each time within a period spanning the current time (current frame).

【００５１】また、図１において、２３は、各ＨＭＭ状
態ｉの各時刻における簡易音響出力確率の順位を求め、
現在時刻ｔをはさむ時間内における順位変動幅を計算
し、全て又は一部のＨＭＭ状態についての順位変動幅の
平均を求める順位変動計算手段、２４は、順位変動計算
手段２３が求めた順位変動幅の平均に基づきビーム幅を
設定し、枝刈り手段１６に出力するビーム幅変更手段で
あり、その他の構成は、従来の図２６に示す構成と同一
である。この実施の形態は、音声認識に係るパラメータ
としてのビーム幅を設定するものである。In FIG. 1, 23 is rank variation calculation means that obtains the rank of the simple acoustic output probability of each HMM state i at each time, calculates the rank variation width within a period spanning the current time t, and obtains the average of the rank variation widths over all or some of the HMM states. 24 is beam width changing means that sets the beam width based on the average rank variation width obtained by the rank variation calculation means 23 and outputs it to the pruning means 16. The other components are the same as in the conventional configuration shown in FIG. 26. In this embodiment, the parameter related to speech recognition that is set is the beam width.

【００５２】次に動作について説明する。まず、簡易音
響モデル記憶手段２１は、外部記憶装置（図示せず）か
ら、認識処理で用いる各音素の音響モデルより簡易な音
響モデル、例えば、認識処理の音響モデルが環境依存Ｔ
ｒｉｐｈｏｎｅモデルの場合であれば、簡易な音響モデ
ルとして、環境独立Ｕｎｉｐｈｏｎｅモデルを取得して
記憶する。このような簡易な音響モデルは、認識処理で
用いる音響モデルより少数であったり、計算コストがか
からないという特徴を持っている。Next, the operation is described. First, the simple acoustic model storage means 21 acquires from an external storage device (not shown), and stores, acoustic models that are simpler than the acoustic models of the phonemes used in the recognition processing. For example, if the acoustic models used in the recognition processing are context-dependent Triphone models, context-independent Uniphone models are acquired as the simple acoustic models. Such simple acoustic models are fewer in number than the acoustic models used in the recognition processing, or cost less to compute.

【００５３】次に、音声データ記憶手段１１は、入力さ
れた音声をＡ／Ｄ変換し、デジタルデータ化して音声デ
ータとして記憶する。Next, the speech data storage means 11 performs A/D conversion on the input speech, digitizes it, and stores it as speech data.

【００５４】図２はこの発明の実施の形態１による音声
認識装置の処理を示すフローチャートである。ステップ
ＳＴ１において、従来と同様に時刻ｔをｔ＝０に設定
し、ステップＳＴ２において、音響分析手段１２は、現
在時刻をはさむｔ₁＝ｔ−Δｔからｔ₂＝ｔ＋Δｔの時
間で、音声データの取り込みを行う。ただし、時刻が負
になる場合は取り込まない。ステップＳＴ３において、
音響分析手段１２は、ｔ ₁＝ｔ−Δｔからｔ₂＝ｔ＋Δ
ｔの時間に取り込んだ音声データを分析し、音響特徴ベ
クトルを計算する。FIG. 2 is a flowchart showing the processing of the speech recognition device according to Embodiment 1 of the present invention. In step ST1, the time t is set to t = 0 as in the conventional device. In step ST2, the acoustic analysis means 12 takes in the speech data for the period from t₁ = t - Δt to t₂ = t + Δt, which spans the current time; data at negative times is not taken in. In step ST3, the acoustic analysis means 12 analyzes the speech data taken in for the period from t₁ = t - Δt to t₂ = t + Δt and calculates the acoustic feature vectors.
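The window used in steps ST2 and ST3 can be sketched as follows. Frame times are treated as integer indices, and clipping at the last available frame is an added assumption; the description states only that negative times are not taken in. The function name and the `num_frames` parameter are hypothetical.

```python
def window_frames(t: int, delta: int, num_frames: int) -> list[int]:
    """Return the frame indices from t - delta to t + delta.

    Negative times are skipped, as the description requires. Clipping at
    num_frames - 1 is an assumption for the end of the recorded data.
    """
    start = max(0, t - delta)
    end = min(num_frames - 1, t + delta)
    return list(range(start, end + 1))
```

At t = 0 the window therefore contains only the current frame and the Δt frames that follow it.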

【００５５】ステップＳＴ２０において、簡易音響モデ
ル確率演算手段２２，順位変動計算手段２３，ビーム幅
変更手段２４は、音響分析手段１２からの音響特徴ベク
トルと、簡易音響モデル記憶手段２１に記憶されている
各音素の簡易な音響モデルに基づき、枝刈り手段１６が
使用するビーム幅を設定する。In step ST20, the simple acoustic model probability calculation means 22, the rank variation calculation means 23, and the beam width changing means 24 set the beam width used by the pruning means 16, based on the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic models of the phonemes stored in the simple acoustic model storage means 21.

【００５６】図３は図２のステップＳＴ２０におけるビ
ーム幅を設定する処理を示すフローチャートである。ス
テップＳＴ２１において、簡易音響モデル確率演算手段
２２は、音響分析手段１２からの音響特徴ベクトルと、
簡易音響モデル記憶手段２１に記憶されている各音素の
簡易な音響モデルを用いて、全て又は一部のＨＭＭ状態
について、現在時刻（現在フレーム）をはさむ時間内に
おける各時刻ｔ_xの簡易音響出力確率を求める。すなわ
ち、フレーム時刻ｔでのＨＭＭ状態ｉの簡易音響出力確
率をＰ（ｔ，ｉ）とすると、全て又は一部のＨＭＭ状態
について、現在時刻ｔをはさむｔ₁＝ｔ−Δｔからｔ₂
＝ｔ＋Δｔの時間内で、簡易音響出力確率Ｐ（ｔ_x，
ｉ）を計算する。ここで、一部のＨＭＭ状態について、
簡易音響出力確率Ｐ（ｔ_x，ｉ）を計算する場合とし
て、例えば、現在、ビーム幅内に存在する仮説について
のみ計算しても良い。FIG. 3 is a flowchart showing the processing for setting the beam width in step ST20 of FIG. 2. In step ST21, the simple acoustic model probability calculation means 22 uses the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic models of the phonemes stored in the simple acoustic model storage means 21 to obtain, for all or some of the HMM states, the simple acoustic output probability at each time t_x within a period spanning the current time (current frame). That is, letting P(t, i) denote the simple acoustic output probability of HMM state i at frame time t, it calculates P(t_x, i) for all or some of the HMM states within the period from t₁ = t - Δt to t₂ = t + Δt spanning the current time t. When P(t_x, i) is calculated for only some of the HMM states, it may, for example, be calculated only for the hypotheses currently within the beam width.
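Step ST21 can be sketched as below. The form of the simple acoustic model is not given in this passage, so a single diagonal-covariance Gaussian per HMM state is assumed purely for illustration, and the output is a log probability. The function names, the `states` dictionary layout, and the `active` argument (standing in for "the hypotheses currently within the beam width") are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed model form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def simple_output_probs(features, states, frames, active=None):
    """Compute log P(t_x, i) for each frame t_x in the window and each state i.

    features: list of feature vectors indexed by frame.
    states:   dict mapping state id -> (mean, var) of the simple model.
    frames:   frame indices t_x with t1 <= t_x <= t2.
    active:   optional set of state ids; if given, only those are scored.
    """
    ids = [i for i in states if active is None or i in active]
    return {
        tx: {i: log_gauss_diag(features[tx], *states[i]) for i in ids}
        for tx in frames
    }
```

Restricting `active` to the states of surviving hypotheses keeps the extra cost of this step small, which is the stated motivation for using a simpler model here.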

【００５７】ステップＳＴ２２において、順位変動計算
手段２３は、各時刻ｔ_xにおいて、簡易音響出力確率Ｐ
（ｔ_x，ｉ）を大きい順に並べた場合のＨＭＭ状態ｉの
順位をＲ１（ｔ_x，ｉ）として求め、ステップＳＴ２３
において、順位変動計算手段２３は、現在時刻ｔをはさ
むｔ₁＝ｔ−Δｔからｔ₂＝ｔ＋Δｔの時間内で順位変
動幅Ｒ２（ｔ，ｉ）＝ｍａｘ（Ｒ１（ｔ_x，ｉ））−ｍ
ｉｎ（Ｒ１（ｔ_x，ｉ））を計算する。ここで、ｍａｘ
とｍｉｎはｔ₁≦ｔ_x≦ｔ₂に渡ってとる。ステップＳ
Ｔ２４において、順位変動計算手段２３は、全て又は一
部のＨＭＭ状態についての順位変動幅Ｒ２（ｔ，ｉ）の
平均を求める。ここで、一部のＨＭＭ状態についての順
位変動幅Ｒ２（ｔ，ｉ）の平均を求める場合として、例
えば、上記ステップＳＴ２２で求めた順位のうち、上位
のＨＭＭ状態のみから求めても良い。In step ST22, the rank variation calculation means 23 obtains, at each time t_x, the rank R1(t_x, i) of HMM state i when the simple acoustic output probabilities P(t_x, i) are sorted in descending order. In step ST23, it calculates the rank variation width R2(t, i) = max(R1(t_x, i)) - min(R1(t_x, i)) over the period from t₁ = t - Δt to t₂ = t + Δt spanning the current time t, where max and min are taken over t₁ ≤ t_x ≤ t₂. In step ST24, it obtains the average of the rank variation widths R2(t, i) over all or some of the HMM states. When averaging over only some of the HMM states, the average may, for example, be taken over only the top-ranked HMM states among the ranks obtained in step ST22.
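Steps ST22 to ST24 can be sketched as follows. The input layout (a dictionary from time t_x to a dictionary from state id to probability) and the function names are assumptions, as is the `subset` argument used to average over only some of the states; ties in probability are broken by insertion order, which the description does not address.

```python
def rank_by_probability(probs):
    """R1: rank 1 goes to the state with the highest output probability."""
    ordered = sorted(probs, key=lambda i: probs[i], reverse=True)
    return {state: rank for rank, state in enumerate(ordered, start=1)}

def mean_rank_variation(window_probs, subset=None):
    """Return (average of R2, R2 per state) over the window t1 <= t_x <= t2.

    window_probs: dict mapping time t_x -> {state id: probability}.
    subset:       optional iterable of state ids to average over.
    """
    r1 = [rank_by_probability(p) for p in window_probs.values()]
    states = list(r1[0]) if subset is None else list(subset)
    # R2(t, i) = max over t_x of R1(t_x, i) - min over t_x of R1(t_x, i)
    r2 = {i: max(r[i] for r in r1) - min(r[i] for r in r1) for i in states}
    return sum(r2.values()) / len(r2), r2
```

With three states whose ranks rotate at every frame, each R2 is 2 and the average is 2.0; if the ranks never change, the average is 0.0.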

【００５８】ステップＳＴ２５において、ビーム幅変更
手段２４は、順位変動計算手段２３が求めた順位変動幅
の平均が所定値より小さい場合にビーム幅を小さく設定
し、順位変動幅の平均が所定値より大きい場合にビーム
幅を大きく設定する。In step ST25, the beam width changing means 24 sets the beam width small when the average rank variation width obtained by the rank variation calculation means 23 is smaller than a predetermined value, and sets it large when the average is larger than the predetermined value.
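The rule of step ST25 reduces to a threshold comparison, sketched below. The two beam width values and the threshold are hypothetical parameters, and assigning the wide beam when the average equals the threshold is an assumption, since the description covers only the strictly smaller and strictly larger cases.

```python
def select_beam_width(mean_variation, threshold, narrow, wide):
    """Pick the beam width from the average rank variation width.

    A small average (acoustically stable) gives the narrow beam; otherwise
    the wide beam is used. Equality falls on the wide side by assumption.
    """
    return narrow if mean_variation < threshold else wide
```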

【００５９】ここで、上記ステップＳＴ２５の意味を説
明する。図４及び図５は３ヶのＨＭＭ状態の各時刻にお
ける出力確率順位Ｒ１を示す図であり、図４は順位変動
幅Ｒ２（ｔ，ｉ）が大きい場合を示し、図５は順位変動
幅Ｒ２（ｔ，ｉ）が小さい場合を示している。The meaning of step ST25 is explained here. FIGS. 4 and 5 show the output probability rank R1 of three HMM states at each time: FIG. 4 shows a case where the rank variation width R2(t, i) is large, and FIG. 5 shows a case where it is small.

【００６０】図４に示すように、順位変動幅の平均が所
定値より大きくなっていると、その付近では音響的に不
安定である可能性があるので、ビーム幅を増やすのが妥
当である。また、図５に示すように、順位変動幅が小さ
い場合には、音響的に安定しており、その付近では、図
２に示すステップＳＴ４以降の認識の本処理において
も、音響尤度の逆転は起こりにくいと考えられるので、
ビーム幅を減らしても正解を枝刈りする可能性は少な
い。As shown in FIG. 4, when the average rank variation width exceeds the predetermined value, the speech around that time may be acoustically unstable, so widening the beam is appropriate. Conversely, as shown in FIG. 5, when the rank variation width is small, the speech is acoustically stable, and reversals of acoustic likelihood are unlikely around that time even in the main recognition processing from step ST4 onward in FIG. 2, so narrowing the beam carries little risk of pruning the correct hypothesis.

【００６１】図２のステップＳＴ４からＳＴ６までは、
従来の図２７に示すステップＳＴ４からＳＴ６までの処
理と同一である。図２のステップＳＴ７において、枝刈
り手段１６は、各仮説を音響尤度順に並べて、最大尤度
から、ビーム幅変更手段２４が設定したビーム幅以内の
音響尤度を持つ仮説を残し、最大尤度からビーム幅以下
の音響尤度を持つ仮説を棄却する。図２のステップＳＴ
８からＳＴ１０までの処理は、従来の図２７に示すステ
ップＳＴ８からＳＴ１０までの処理と同一である。Steps ST4 to ST6 in FIG. 2 are the same as steps ST4 to ST6 of the conventional processing shown in FIG. 27. In step ST7 of FIG. 2, the pruning means 16 sorts the hypotheses by acoustic likelihood, keeps the hypotheses whose acoustic likelihood is within the beam width set by the beam width changing means 24 of the maximum likelihood, and rejects the hypotheses whose acoustic likelihood is lower than the maximum likelihood by the beam width or more. Steps ST8 to ST10 in FIG. 2 are the same as steps ST8 to ST10 of the conventional processing shown in FIG. 27.
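The pruning of step ST7 can be sketched as follows, assuming log-domain acoustic likelihoods and a hypothetical `(label, log_likelihood)` representation of a hypothesis. A hypothesis exactly at the maximum minus the beam width is rejected here, following the "beam width or more below the maximum" reading; that boundary choice is an assumption.

```python
def prune(hypotheses, beam_width):
    """Keep hypotheses whose score is within beam_width of the best score.

    hypotheses: list of (label, log_likelihood) pairs.
    Returns the survivors sorted by descending likelihood.
    """
    if not hypotheses:
        return []
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best = ranked[0][1]
    # Reject anything at or below (best - beam_width).
    return [(h, s) for h, s in ranked if s > best - beam_width]
```

Passing a beam width chosen per frame, rather than a fixed constant, is what distinguishes this embodiment from the conventional processing.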

【００６２】以上のように、この実施の形態１によれ
ば、各音素の簡易な音響モデルを用いて、適切にビーム
幅を設定しているので、計算コストはそれほど増加せず
に、正解が枝刈りされることを抑えて、認識精度を高く
することができるという効果が得られる。As described above, according to Embodiment 1, the beam width is set appropriately using the simple acoustic model of each phoneme, so pruning of the correct hypothesis is suppressed and recognition accuracy is improved without a large increase in computational cost.

【００６３】実施の形態２．図６はこの発明の実施の形
態２による音声認識装置の構成を示す図であり、図にお
いて、３１は、順位変動計算手段２３が求めた順位変動
幅の平均に基づき、言語尤度の重みを設定する言語重み
変更手段である。この実施の形態は、実施の形態１の図
１におけるビーム幅変更手段２４を、言語重み変更手段
３１に置き換えて、言語重み変更手段３１の出力を、尤
度演算手段１５に入力するようにしたものであり、音声
認識に係るパラメータとしての、尤度に追加される言語
尤度の重みを設定するものである。Embodiment 2. FIG. 6 is a diagram showing the configuration of a speech recognition device according to Embodiment 2 of the present invention. In the figure, 31 is language weight changing means that sets the weight of the linguistic likelihood based on the average rank variation width obtained by the rank variation calculation means 23. In this embodiment, the beam width changing means 24 in FIG. 1 of Embodiment 1 is replaced with the language weight changing means 31, whose output is input to the likelihood calculation means 15. The parameter related to speech recognition that is set here is the weight of the linguistic likelihood that is added to the likelihood.

[0064] Next, the operation will be described. FIG. 7 is a flowchart showing the processing of the speech recognition device according to Embodiment 2 of the present invention. Steps ST1 to ST3 are the same as steps ST1 to ST3 in FIG. 2 of Embodiment 1.

[0065] In step ST30, the simple acoustic model probability calculating means 22, the rank variation calculating means 23, and the language weight changing means 31 set the weight of the language likelihood based on the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic model of each phoneme stored in the simple acoustic model storage means 21.

[0066] FIG. 8 is a flowchart showing the process of setting the weight of the language likelihood in step ST30 of FIG. 7. Steps ST31 to ST34 are the same as steps ST21 to ST24 in FIG. 3 of Embodiment 1.

[0067] In step ST35, when the average rank variation width obtained by the rank variation calculating means 23 is smaller than a predetermined value, the language weight changing means 31 regards the input as acoustically stable, gives the acoustic likelihood priority over the language likelihood, and sets the language weight small. When the average rank variation width is larger than the predetermined value, it regards the input as acoustically unstable, gives the language likelihood priority over the less reliable acoustic likelihood, and sets the language weight large.
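
The threshold rule of step ST35 can be sketched in Python as follows. The function name `select_language_weight`, the two numeric weights, and the handling of a value exactly equal to the threshold are assumptions for illustration. The description specifies only "small" and "large" weights relative to a predetermined value.

```python
def select_language_weight(
    mean_rank_variation: float,
    threshold: float,
    small_weight: float = 0.5,
    large_weight: float = 2.0,
) -> float:
    """Choose the language weight from the average rank variation width.

    The default weights are placeholders, not values from the description.
    """
    if mean_rank_variation < threshold:
        # Acoustically stable: trust the acoustic likelihood more.
        return small_weight
    # Acoustically unstable: lean on the language likelihood instead.
    return large_weight


stable_weight = select_language_weight(mean_rank_variation=1.0, threshold=3.0)
unstable_weight = select_language_weight(mean_rank_variation=5.0, threshold=3.0)
```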

[0068] The processing in step ST4 of FIG. 7 is the same as that in step ST4 of the conventional FIG. 27. In step ST36 of FIG. 7, the likelihood calculating means 15 reads the hypothesis for each word from the word dictionary storage means 14. It then calculates the acoustic likelihood of each hypothesis based on the output probability of each HMM state calculated in step ST4, the HMM acoustic model of the word stored in the word dictionary storage means 14, and the language weight set by the language weight changing means 31, and expands the hypotheses.
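
How a language weight can change which hypothesis wins is shown in the sketch below. The description does not give the scoring formula. The log-domain sum `acoustic + weight * language`, the function names, and the two candidate scores are all assumptions chosen for illustration.

```python
def hypothesis_score(
    acoustic_loglik: float, language_logprob: float, language_weight: float
) -> float:
    """Combine the two likelihoods in the log domain (assumed formula)."""
    return acoustic_loglik + language_weight * language_logprob


# Two hypothetical competing hypotheses:
# (acoustic log-likelihood, language log-probability).
candidates = {"word_a": (-100.0, -6.0), "word_b": (-103.0, -2.0)}


def best_candidate(language_weight: float) -> str:
    """Return the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda word: hypothesis_score(*candidates[word], language_weight),
    )
```

With a small weight the acoustically stronger `word_a` wins; with a large weight the linguistically more probable `word_b` wins.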

[0069] Steps ST6 to ST10 in FIG. 7 are the same as steps ST6 to ST10 in the conventional FIG. 27.

[0070] As described above, according to Embodiment 2, the language weight is set appropriately using a simple acoustic model of each phoneme, so recognition accuracy is improved without a large increase in computational cost.

[0071] Embodiment 3. FIG. 9 is a diagram showing the configuration of a speech recognition device according to Embodiment 3 of the present invention. In the figure, reference numeral 41 denotes duration data storage means that stores utterance duration data corresponding to speaking rates. Reference numeral 42 denotes duration data changing means that, based on the average rank variation width obtained by the rank variation calculating means 23, selects duration data stored in the duration data storage means 41 and sets it as the duration data to be used. In this embodiment, the language weight changing means 31 in FIG. 6 of Embodiment 2 is replaced with the duration data storage means 41 and the duration data changing means 42, and the parameter related to speech recognition that is set is the duration data.

[0072] Next, the operation will be described. First, the duration data storage means 41 acquires, from an external storage device (not shown), duration data for each of several speaking rates, such as fast, normal, and slow utterances, and stores it.

[0073] FIG. 10 is a flowchart showing the processing of the speech recognition device according to Embodiment 3 of the present invention. Steps ST1 to ST3 are the same as steps ST1 to ST3 in FIG. 2 of Embodiment 1.

[0074] In step ST40, the simple acoustic model probability calculating means 22, the rank variation calculating means 23, and the duration data changing means 42 select duration data stored in the duration data storage means 41 and set it as the duration data to be used, based on the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic model of each phoneme stored in the simple acoustic model storage means 21.

[0075] FIG. 11 is a flowchart showing the process of setting the duration data in step ST40 of FIG. 10. Steps ST41 to ST44 are the same as steps ST21 to ST24 in FIG. 3 of Embodiment 1.

[0076] In step ST45, when the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the duration data changing means 42 judges that an acoustically stable state is persisting for a long time, selects the duration data for slower utterances stored in the duration data storage means 41, and sets it as the duration data to be used. When the average rank variation width is larger than the predetermined value, it judges that the sound is changing quickly, selects the duration data for faster utterances stored in the duration data storage means 41, and sets it as the duration data to be used.
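
The selection in step ST45 can be sketched as a lookup into pre-stored tables. The table layout (mean phoneme duration in frames), the values, and the names `DURATION_TABLES` and `select_duration_data` are hypothetical. The description states only that data for different speaking rates is stored and that one set is selected.

```python
from typing import Dict

# Hypothetical duration tables: mean phoneme duration in frames,
# one table per speaking rate held by the duration data storage.
DURATION_TABLES: Dict[str, Dict[str, float]] = {
    "fast": {"a": 4.0, "sh": 5.0},
    "normal": {"a": 7.0, "sh": 8.0},
    "slow": {"a": 11.0, "sh": 12.0},
}


def select_duration_data(
    mean_rank_variation: float,
    threshold: float,
    tables: Dict[str, Dict[str, float]] = DURATION_TABLES,
) -> Dict[str, float]:
    """Pick the duration table implied by the average rank variation width."""
    # Small variation: stable sound persists, so assume slower speech.
    # Large variation: sound changes quickly, so assume faster speech.
    rate = "slow" if mean_rank_variation < threshold else "fast"
    return tables[rate]
```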

[0077] The processing in step ST4 of FIG. 10 is the same as that in step ST4 of the conventional FIG. 27. In step ST46 of FIG. 10, the likelihood calculating means 15 reads the hypothesis for each word from the word dictionary storage means 14. It then calculates the acoustic likelihood of each hypothesis based on the output probabilities, calculated in step ST4, of the HMM states contained in each hypothesis, the HMM acoustic model of the word stored in the word dictionary storage means 14, and the duration data set by the duration data changing means 42, and expands the hypotheses.

[0078] Steps ST6 to ST10 in FIG. 10 are the same as steps ST6 to ST10 in the conventional FIG. 27.

[0079] As described above, according to Embodiment 3, duration data matched to the speaking rate is set using a simple acoustic model of each phoneme, so recognition accuracy is improved even when the speaking rate changes, without a large increase in computational cost.

[0080] Embodiment 4. FIG. 12 is a diagram showing the configuration of a speech recognition device according to Embodiment 4 of the present invention. In the figure, reference numeral 51 denotes thinning determining means that determines whether to skip frame processing, based on the average rank variation width obtained by the rank variation calculating means 23. In this embodiment, the language weight changing means 31 in FIG. 6 of Embodiment 2 is replaced with the thinning determining means 51. The parameter related to speech recognition that is determined is whether frame processing, that is, the recognition processing performed at each time, is thinned out.

[0081] Next, the operation will be described. FIG. 13 is a flowchart showing the processing of the speech recognition device according to Embodiment 4 of the present invention. Steps ST1 to ST3 are the same as steps ST1 to ST3 in FIG. 2 of Embodiment 1.

[0082] In step ST50, the simple acoustic model probability calculating means 22, the rank variation calculating means 23, and the thinning determining means 51 determine whether to skip frame processing, based on the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic model of each phoneme stored in the simple acoustic model storage means 21.

[0083] FIG. 14 is a flowchart showing the process of determining whether to skip frame processing in step ST50 of FIG. 13. Steps ST51 to ST54 are the same as steps ST21 to ST24 in FIG. 3 of Embodiment 1.

[0084] In step ST55, when the average rank variation width calculated by the rank variation calculating means 23 is smaller than a predetermined value, the thinning determining means 51 judges that the input is acoustically stable and that skipping computation will cause little loss of accuracy, and enables thinning. When the average rank variation width is larger than the predetermined value, it judges that accurate computation is required and disables thinning.

[0085] In step ST56 of FIG. 13, when thinning has been enabled in step ST55, control means (not shown) checks whether the current frame is one of the predetermined frames to be thinned out. If it is, the frame processing of steps ST4 to ST7 is skipped and the process moves to step ST8. If it is not, the frame processing of steps ST4 to ST7 is executed. When thinning has been disabled in step ST55, the frame processing of steps ST4 to ST7 is likewise executed.
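
Steps ST55 and ST56 together amount to a two-stage decision, sketched below. Which frames count as the "predetermined" thinning targets is not stated in the description. Treating every second frame as a target, along with the function names, is an assumption made for illustration.

```python
def thinning_enabled(mean_rank_variation: float, threshold: float) -> bool:
    """Step ST55: enable thinning only when the input is acoustically stable."""
    return mean_rank_variation < threshold


def should_run_frame_processing(
    frame_index: int, enabled: bool, skip_every: int = 2
) -> bool:
    """Step ST56: decide whether steps ST4 to ST7 run for this frame.

    Assumption: the predetermined thinning targets are every
    `skip_every`-th frame (indices 1, 3, 5, ... when skip_every is 2).
    """
    if not enabled:
        # Thinning disabled: every frame is processed.
        return True
    is_thinning_target = frame_index % skip_every == skip_every - 1
    return not is_thinning_target


stable_frames = [
    i for i in range(6) if should_run_frame_processing(i, thinning_enabled(1.0, 3.0))
]
unstable_frames = [
    i for i in range(6) if should_run_frame_processing(i, thinning_enabled(5.0, 3.0))
]
```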

[0086] Steps ST8 to ST10 in FIG. 13 are the same as steps ST8 to ST10 in the conventional FIG. 27.

[0087] As described above, according to Embodiment 4, whether to thin out frames is determined according to the state of the speech data using a simple acoustic model of each phoneme, so recognition accuracy is improved without a large increase in computational cost.

[0088] Embodiment 5. FIG. 15 is a diagram showing the configuration of a speech recognition device according to Embodiment 5 of the present invention. In the figure, reference numeral 61 denotes thinning rate determining means that determines the thinning rate of frame processing based on the average rank variation width obtained by the rank variation calculating means 23. In this embodiment, the thinning determining means 51 in FIG. 12 of Embodiment 4 is replaced with the thinning rate determining means 61. The parameter related to speech recognition that is determined is the thinning rate of frame processing, that is, of the recognition processing performed at each time.

[0089] Next, the operation will be described. FIG. 16 is a flowchart showing the processing of the speech recognition device according to Embodiment 5 of the present invention. Steps ST1 to ST3 are the same as steps ST1 to ST3 in FIG. 2 of Embodiment 1.

[0090] In step ST60, the simple acoustic model probability calculating means 22, the rank variation calculating means 23, and the thinning rate determining means 61 determine the thinning rate of frame processing based on the acoustic feature vectors from the acoustic analysis means 12 and the simple acoustic model of each phoneme stored in the simple acoustic model storage means 21.

[0091] FIG. 17 is a flowchart showing the process of determining the thinning rate of frame processing in step ST60 of FIG. 16. Steps ST61 to ST64 are the same as steps ST21 to ST24 in FIG. 3 of Embodiment 1.

[0092] In step ST65, when the average rank variation width calculated by the rank variation calculating means 23 is smaller than a predetermined value, the thinning rate determining means 61 judges that the input is acoustically stable and that skipping computation will cause little loss of accuracy, and sets a high thinning rate for frame processing. When the average rank variation width is larger than the predetermined value, it judges that accurate computation is required and sets a low thinning rate for frame processing.

[0093] In step ST66 of FIG. 16, control means (not shown) checks whether the current frame is a frame to be thinned out, as predetermined according to the magnitude of the thinning rate. If it is, the frame processing of steps ST4 to ST7 is skipped and the process moves to step ST8. If it is not, the frame processing of steps ST4 to ST7 is executed.
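
A thinning rate, as opposed to an on/off decision, can be sketched as follows. The description does not say how a rate maps to specific frames. Encoding a rate as "skip k frames out of every cycle of n", the two example rates, and the function names are assumptions for illustration.

```python
from typing import Tuple

# Hypothetical encoding of a thinning rate: (frames skipped, cycle length).
ThinningRate = Tuple[int, int]


def select_thinning_rate(
    mean_rank_variation: float,
    threshold: float,
    high_rate: ThinningRate = (2, 3),
    low_rate: ThinningRate = (1, 4),
) -> ThinningRate:
    """Step ST65: high rate when acoustically stable, low rate otherwise."""
    return high_rate if mean_rank_variation < threshold else low_rate


def is_thinning_target(frame_index: int, rate: ThinningRate) -> bool:
    """Step ST66: true when steps ST4 to ST7 are skipped for this frame."""
    skipped, cycle = rate
    # The last `skipped` frames of each cycle are the thinning targets.
    return frame_index % cycle >= cycle - skipped


stable_rate = select_thinning_rate(mean_rank_variation=1.0, threshold=3.0)
processed_when_stable = [i for i in range(6) if not is_thinning_target(i, stable_rate)]
```

Embodiment 4 corresponds to the special case in which the only choices are a fixed rate and a rate of zero.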

[0094] Steps ST8 to ST10 in FIG. 16 are the same as steps ST8 to ST10 in the conventional FIG. 27.

[0095] As described above, according to Embodiment 5, the thinning rate is determined according to the state of the speech data using a simple acoustic model of each phoneme, so recognition accuracy is improved without a large increase in computational cost.

[0096] Embodiment 6. FIG. 18 is a diagram showing the configuration of a speech recognition device according to Embodiment 6 of the present invention. In the figure, reference numeral 41 denotes duration data storage means that stores utterance duration data corresponding to speaking rates. Reference numeral 71 denotes word type acquisition means that acquires, from the likelihood calculating means 15, the word and the position within the word of each hypothesis currently being recognized, and acquires the part of speech of the hypothesis word from the word dictionary storage means 14. Reference numeral 72 denotes duration data changing means that selects, from the duration data storage means 41, the duration data corresponding to the part of speech and in-word position of the hypothesis word acquired by the word type acquisition means 71, and sets it as the duration data to be used. In this embodiment, the duration data storage means 41, the word type acquisition means 71, and the duration data changing means 72 are added to the conventional configuration shown in FIG. 26.

[0097] Next, the operation will be described. First, the duration data storage means 41 acquires, from an external storage device (not shown), duration data for each of several speaking rates, such as fast, normal, and slow utterances, and stores it. FIG. 19 is a flowchart showing the processing of the speech recognition device according to Embodiment 6 of the present invention. Steps ST1 to ST4 are the same as steps ST1 to ST4 of the conventional FIG. 27.

[0098] In step ST71, the word type acquisition means 71 acquires, from the likelihood calculating means 15, the word and the position within the word of each hypothesis currently being recognized, and acquires the part of speech of the hypothesis word from the word dictionary storage means 14. The duration data changing means 72 then selects, from the duration data storage means 41, the duration data corresponding to the part of speech and in-word position of the hypothesis word acquired by the word type acquisition means 71, and sets it. For example, when the word currently being recognized is the particle "ga" (が) and the phoneme position is its final /a/, speakers often pause at that point, so the duration data for slow utterances is selected and set.
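
The particle example in step ST71 can be sketched as a rule on part of speech and in-word position. The `WordHypothesis` record, the part-of-speech label `"particle"`, and the fallback to `"normal"` for every other case are hypothetical. The description gives only the single particle example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    """Hypothetical view of what the word type acquisition step returns."""

    word: str
    part_of_speech: str
    phoneme_index: int  # zero-based position within the word
    num_phonemes: int


def select_rate_for_hypothesis(hyp: WordHypothesis) -> str:
    """Pick a speaking-rate label for looking up duration data."""
    at_final_phoneme = hyp.phoneme_index == hyp.num_phonemes - 1
    if hyp.part_of_speech == "particle" and at_final_phoneme:
        # A pause often follows a particle, so expect a long final phoneme.
        return "slow"
    # Fallback for all other cases (an assumption, not from the description).
    return "normal"


ga_final = WordHypothesis("ga", "particle", phoneme_index=1, num_phonemes=2)
rate = select_rate_for_hypothesis(ga_final)
```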

[0099] In step ST72, the likelihood calculating means 15 reads the hypothesis for each word from the word dictionary storage means 14. It then calculates the acoustic likelihood of each hypothesis based on the output probability of each HMM state calculated in step ST4, the HMM acoustic model of the word stored in the word dictionary storage means 14, and the duration data set by the duration data changing means 72, and expands the hypotheses.

[0100] Steps ST6 to ST10 are the same as steps ST6 to ST10 in the conventional FIG. 27.

[0101] As described above, according to Embodiment 6, optimal duration data is set according to the type of word, such as its part of speech, and duration control follows the circumstances of the utterance, so recognition accuracy is improved even when the speaking rate changes.

[0102] Embodiment 7. FIG. 20 is a diagram showing the configuration of a speech recognition device according to Embodiment 7 of the present invention. In the figure, reference numeral 41 denotes duration data storage means that stores utterance duration data corresponding to speaking rates. Reference numeral 81 denotes language model storage means that stores word chain probabilities such as bigrams and trigrams. Reference numeral 82 denotes word chain probability acquisition means that acquires, from the likelihood calculating means 15, the word of each hypothesis currently being recognized and the words recognized before it, and acquires the chain probability of those words from the language model storage means 81. Reference numeral 83 denotes duration data changing means that selects, from the duration data storage means 41, duration data according to the word chain probability acquired by the word chain probability acquisition means 82, and sets it.

[0103] In this embodiment, the duration data storage means 41, the language model storage means 81, the word chain probability acquisition means 82, and the duration data changing means 83 are added to the conventional configuration shown in FIG. 26.

[0104] Next, the operation will be described. First, the duration data storage means 41 acquires, from an external storage device (not shown), duration data for each of several speaking rates, such as fast, normal, and slow utterances, and stores it. The language model storage means 81 acquires word chain probabilities such as bigrams and trigrams from the external storage device and stores them.

[0105] FIG. 21 is a flowchart showing the processing of the speech recognition device according to Embodiment 7 of the present invention. Steps ST1 to ST4 are the same as steps ST1 to ST4 of the conventional FIG. 27.

[0106] In step ST81, the word chain probability acquisition means 82 acquires, from the likelihood calculating means 15, the word of each hypothesis currently being recognized and the words recognized before it, and acquires the chain probability of those words from the language model storage means 81. The duration data changing means 83 then selects, from the duration data storage means 41, duration data according to the word chain probability acquired by the word chain probability acquisition means 82, and sets it as the duration data to be used.

[0107] When the word chain probability acquired by the word chain probability acquisition means 82 is close to 1, a pause is unlikely at that point in the utterance, so the duration data changing means 83 selects and sets duration data for fast or normal utterances. When the word chain probability acquired by the word chain probability acquisition means 82 is close to 0, it is likely that weakly related words merely happen to be adjacent and that a pause occurs at that point in the utterance, so the duration data changing means 83 selects and sets duration data for normal or slow utterances.
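
The mapping from word chain probability to duration data can be sketched with two cut-off values. The description says only "close to 1" and "close to 0" and allows a choice between two rates in each case. The cut-offs 0.7 and 0.3, the single rate returned per range, and the function name are assumptions for illustration.

```python
def select_rate_from_chain_probability(
    chain_probability: float, high: float = 0.7, low: float = 0.3
) -> str:
    """Map a word chain probability to a speaking-rate label.

    The cut-offs `high` and `low` are placeholders, not values from
    the description.
    """
    if not 0.0 <= chain_probability <= 1.0:
        raise ValueError("chain_probability must lie in [0, 1]")
    if chain_probability >= high:
        # Strongly linked words: a pause between them is unlikely.
        return "fast"
    if chain_probability <= low:
        # Weakly linked words: a pause between them is likely.
        return "slow"
    return "normal"
```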

[0108] In step ST82, the likelihood calculating means 15 reads the hypothesis for each word from the word dictionary storage means 14. It then calculates the acoustic likelihood of each hypothesis based on the output probability of each HMM state calculated in step ST4, the HMM acoustic model of the word stored in the word dictionary storage means 14, and the duration data set by the duration data changing means 83, and expands the hypotheses.

[0109] Steps ST6 to ST10 are the same as steps ST6 to ST10 in the conventional FIG. 27.

[0110] As described above, according to Embodiment 7, optimal duration data is set according to how readily words connect, and duration control follows the content of the utterance, so recognition accuracy is improved even when the content of the utterance changes.

[0111] Embodiment 8. FIG. 22 is a diagram showing the configuration of a speech recognition device according to Embodiment 8 of the present invention. In the figure, reference numeral 91 denotes maximum likelihood hypothesis adding means that acquires, from the pruning means 16, the maximum likelihood among the surviving hypotheses and adds hypotheses that have the acquired maximum likelihood and treat the current point as the start of the utterance. In this embodiment, the maximum likelihood hypothesis adding means 91 is added to the conventional configuration shown in FIG. 26.

[0112] Next, the operation will be described. FIG. 23 is a flowchart showing the processing of the speech recognition device according to Embodiment 8 of the present invention. Steps ST1 to ST6 are the same as steps ST1 to ST6 of the conventional FIG. 27.

[0113] In step ST91, during the first N frames of the utterance (N is an integer), the maximum likelihood hypothesis adding means 91 acquires, from the pruning means 16, the maximum likelihood among the surviving hypotheses and adds hypotheses that have the acquired maximum likelihood and treat the current point as the start of the utterance.

[0114] Steps ST7 to ST10 are the same as steps ST7 to ST10 shown in the conventional FIG. 27.

[0115] Step ST91 will now be described in detail. FIG. 24 shows the hypotheses expanded in the conventional approach when "ashita" (明日, tomorrow) is uttered but breath noise falls on the first two frames. At time t = 0, two hypotheses are drawn: one for the recognition candidate "ashita" (tomorrow) at its initial phoneme /a/, and one for the recognition candidate "hashi" (橋, bridge) at its initial phoneme /h/. Initial-phoneme hypotheses also exist for every other recognition candidate. As time advances, these hypotheses are expanded as in the conventional example, and those with low likelihood are pruned. In the figure, a hypothesis in a bold frame is the one giving the maximum likelihood in that frame. In this figure, the initial breath noise is recognized as "ha", so the word "hashi" (bridge) ends up with the highest likelihood and the correct answer "ashita" is pruned along the way.

[0116] FIG. 25 shows an example of hypothesis addition and hypothesis expansion in this embodiment. At time t = 0, besides the hypothesis shown in the figure for the recognition candidate "hashi" (bridge) at phoneme position /h/, initial-phoneme hypotheses exist for each recognition candidate. Each hypothesis is expanded at time t = 1. Suppose that, in the figure, the bold-framed hypothesis has the maximum likelihood at t = 1 and that this likelihood is A. The maximum likelihood hypothesis adding means 91 then adds hypotheses whose phoneme position is the initial one and whose likelihood is A. In the figure, at t = 1, a hypothesis for the recognition candidate "ashita" (tomorrow) at phoneme position /a/ with likelihood A and a hypothesis for the recognition candidate "gakkou" (学校, school) at phoneme position /g/ with likelihood A have been added. Further hypotheses are added for every recognition candidate, such as one for the recognition candidate "kiji" (記事, article) at phoneme position /k/ with likelihood A.

[0117] Similarly, if the maximum likelihood at time t = 2 is B, hypotheses such as one for the recognition candidate "ashita" (tomorrow) at phoneme position /a/ with likelihood B and one for the recognition candidate "gakkou" (school) at phoneme position /g/ with likelihood B are added.
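
The hypothesis addition described for FIG. 25 can be sketched as follows. The `Hypothesis` record, the list-based hypothesis set, the zero-based frame counting, and the function name `add_start_hypotheses` are assumptions for illustration. The description specifies only that, during the first N frames, initial-phoneme hypotheses carrying the current maximum likelihood are added for each recognition candidate.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical hypothesis record: word, phoneme position, likelihood."""

    word: str
    phoneme_index: int  # 0 means the initial phoneme of the word
    log_likelihood: float


def add_start_hypotheses(
    active: Sequence[Hypothesis],
    lexicon: Sequence[str],
    frame_index: int,
    n_frames: int,
) -> List[Hypothesis]:
    """Add initial-phoneme hypotheses carrying the current best likelihood."""
    if frame_index >= n_frames or not active:
        # Outside the first N frames nothing is added.
        return list(active)
    best = max(h.log_likelihood for h in active)
    # One new hypothesis per recognition candidate, starting the word now.
    added = [Hypothesis(word, 0, best) for word in lexicon]
    return list(active) + added


active_at_t1 = [Hypothesis("hashi", 1, -5.0), Hypothesis("ashita", 1, -20.0)]
expanded = add_start_hypotheses(
    active_at_t1, ["ashita", "hashi", "gakkou"], frame_index=1, n_frames=3
)
```

Because the added "ashita" hypothesis inherits the best likelihood rather than the low score accumulated over the breath noise, it is not removed by beam pruning in the following frames.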

[0118] The hypothesis for the recognition candidate "ashita" (tomorrow) added at time t = 2 is unaffected by the initial breath noise and carries a high likelihood, so it is not rejected by pruning. As a result, the correct hypothesis is expanded and recognition accuracy improves.

【０１１９】追加された仮説のうち、正解でない認識候
補のもの（（Ａ２）や（Ｂ２）など）は、時刻が進展す
るにつれて、音響モデルが合わなくなり、相対的に尤度
が低下するので、認識に悪影響を与えることはない。ま
た同様に、呼気の部分で最大尤度を持った認識候補
「橋」も、時刻が進展するにつれて、音響モデルが合わ
なくなり、相対的に尤度が低下し棄却される。Among the added hypotheses, those belonging to incorrect recognition candidates (such as (A2) and (B2)) stop matching the acoustic model as time advances and their likelihood falls in relative terms, so they do not harm recognition. Likewise, the recognition candidate "bridge", which had the maximum likelihood during the breath, stops matching the acoustic model as time advances, so its likelihood falls in relative terms and it is rejected.

【０１２０】以上のように、この実施の形態８によれ
ば、発声の最初のＮフレームに、常に最大尤度を持ち、
認識単語の最初の音素位置を持つ仮説を追加するので、
発声の最初の呼気等を単語として認識する仮説が存在し
ても、その呼気等の影響を受けずに、認識精度を高くす
ることができるという効果が得られる。As described above, according to the eighth embodiment, hypotheses that always carry the maximum likelihood and the first phoneme position of a recognition word are added during the first N frames of the utterance. Even if a hypothesis exists that recognizes the breath at the start of the utterance as a word, recognition accuracy can therefore be raised without being affected by that breath.
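The hypothesis addition of the eighth embodiment can be illustrated with a minimal sketch. The function name `add_initial_hypotheses`, the tuple layout, the romanized word labels, and the zero-based frame index are assumptions made for illustration and do not come from the specification.

```python
def add_initial_hypotheses(hypotheses, first_phonemes, frame, n_frames):
    """Add one word-initial hypothesis per word, carrying the current best score.

    hypotheses: list of (word, phoneme, log_likelihood) tuples (assumed layout).
    first_phonemes: mapping from each recognition word to its first phoneme.
    Hypotheses are added only during the first n_frames frames of the utterance.
    """
    if frame >= n_frames or not hypotheses:
        return list(hypotheses)
    best = max(score for _, _, score in hypotheses)
    added = [(word, phoneme, best) for word, phoneme in first_phonemes.items()]
    return list(hypotheses) + added


# At t = 1 a breath has pushed "hashi" (bridge) ahead of the correct word.
hyps_t1 = [("hashi", "a", -5.0), ("ashita", "sh", -9.0)]
first_phonemes = {"hashi": "h", "ashita": "a", "gakkou": "g"}
extended = add_initial_hypotheses(hyps_t1, first_phonemes, frame=1, n_frames=3)
```

In this sketch every word receives a fresh word-initial hypothesis with the best score seen so far, so a restart of the correct word cannot be pruned merely because the breath was scored first.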

【０１２１】上記の各実施の形態における音声認識処理
は、音声認識プログラムにより実現され、この音声認識
プログラムは、記録媒体に記録して提供される。The speech recognition processing in each of the above embodiments is realized by a speech recognition program, and the speech recognition program is provided by being recorded on a recording medium.

【０１２２】[0122]

【発明の効果】以上のように、この発明によれば、各音
素の簡易な音響モデルを記憶する簡易音響モデル記憶手
段と、音響分析手段から出力された音響特徴ベクトル
と、簡易音響モデル記憶手段に記憶されている各音素の
簡易な音響モデルにより、現在時刻をはさむ所定の時間
内における各時刻の各ＨＭＭ状態の簡易音響出力確率を
演算する簡易音響モデル確率演算手段と、簡易音響モデ
ル確率演算手段が求めた各時刻の各ＨＭＭ状態の簡易音
響出力確率の順位を求め、現在時刻をはさむ所定の時間
内における各ＨＭＭ状態の順位変動幅を計算し、ＨＭＭ
状態の順位変動幅の平均を計算する順位変動計算手段と
を備え、順位変動計算手段が計算した順位変動幅の平均
に基づき、音声認識に係るパラメータを調整することに
より、各音素の簡易な音響モデルを用いて、音声認識に
係るパラメータを調整するので、計算コストはそれほど
増加せずに、認識精度を高くすることができるという効
果がある。As described above, the present invention comprises: simple acoustic model storage means for storing a simple acoustic model of each phoneme; simple acoustic model probability calculating means for calculating, from the acoustic feature vectors output by the acoustic analysis means and the simple acoustic models of the phonemes stored in the simple acoustic model storage means, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; and rank variation calculating means for ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of the HMM states. A parameter relating to speech recognition is adjusted based on the average rank variation width calculated by the rank variation calculating means. Because the parameter is adjusted using simple acoustic models of the phonemes, recognition accuracy can be raised without much increase in computational cost.
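The rank variation calculation can be sketched as follows. This is only an illustration: it assumes that the rank variation width of a state is the difference between its highest and lowest rank within the window, which this passage does not spell out, and the names `rank_states` and `mean_rank_variation` are hypothetical.

```python
def rank_states(probs):
    """Return the rank (1 = highest probability) of each HMM state at one time."""
    order = sorted(range(len(probs)), key=lambda s: probs[s], reverse=True)
    ranks = [0] * len(probs)
    for rank, state in enumerate(order, start=1):
        ranks[state] = rank
    return ranks


def mean_rank_variation(window):
    """window[t][s]: simple acoustic output probability of state s at time t."""
    ranks_per_time = [rank_states(frame) for frame in window]
    n_states = len(window[0])
    widths = []
    for s in range(n_states):
        ranks = [r[s] for r in ranks_per_time]
        widths.append(max(ranks) - min(ranks))  # assumed definition of the width
    return sum(widths) / n_states


# Three HMM states observed over a three-frame window around the current time.
stable = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.15, 0.05]]
changing = [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
```

With these toy inputs the ranking never changes in `stable`, while every state moves across two rank positions in `changing`, so the second window would be treated as acoustically less settled.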

【０１２３】この発明によれば、音声認識に係るパラメ
ータとしてのビーム幅を設定するビーム幅変更手段を備
え、順位変動計算手段が計算した順位変動幅の平均が所
定値より小さい場合に、ビーム幅変更手段がビーム幅を
小さく設定し、順位変動幅の平均が所定値より大きい場
合に、ビーム幅変更手段がビーム幅を大きく設定し、枝
刈り手段は、ビーム幅変更手段が設定したビーム幅に基
づき、仮説を棄却することにより、各音素の簡易な音響
モデルを用いて、適切にビーム幅を設定しているので、
計算コストはそれほど増加せずに、正解が枝刈りされる
ことを抑えて、認識精度を高くすることができるという
効果がある。According to the present invention, beam width changing means is provided for setting a beam width as the parameter relating to speech recognition. When the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the beam width changing means sets the beam width small; when the average is larger than the predetermined value, it sets the beam width large. The pruning means rejects hypotheses based on the beam width set by the beam width changing means. Because the beam width is set appropriately using simple acoustic models of the phonemes, pruning of the correct hypothesis is suppressed and recognition accuracy can be raised without much increase in computational cost.
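A minimal sketch of this beam width rule and the resulting pruning is given below. It assumes log-likelihood scores and hypothetical names (`choose_beam_width`, `prune`); the text does not say what happens when the average equals the predetermined value, so the sketch arbitrarily treats that case as "large".

```python
def choose_beam_width(mean_variation, threshold, narrow, wide):
    """Narrow beam when state ranks are stable, wide beam when they fluctuate."""
    return narrow if mean_variation < threshold else wide


def prune(hypotheses, beam_width):
    """Reject hypotheses that fall beam_width or more below the best score."""
    best = max(score for _, score in hypotheses)
    return [(name, score) for name, score in hypotheses
            if score > best - beam_width]


# (word, log-likelihood) pairs; the values are made up for illustration.
hyps = [("hashi", -10.0), ("ashita", -14.0), ("gakkou", -30.0)]
narrow_survivors = prune(hyps, choose_beam_width(0.5, 1.0, 3.0, 10.0))
wide_survivors = prune(hyps, choose_beam_width(2.0, 1.0, 3.0, 10.0))
```

With the narrow beam only the best hypothesis survives, whereas the wide beam also keeps the second candidate, which is the behavior the paragraph relies on to avoid pruning the correct answer when the acoustics are unsettled.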

【０１２４】この発明によれば、音声認識に係るパラメ
ータとしての、尤度に追加される言語尤度の重みを設定
する言語重み変更手段を備え、順位変動計算手段が計算
した順位変動幅の平均が所定値より小さい場合に、言語
重み変更手段が言語尤度の重みを小さく設定し、順位変
動幅の平均が所定値より大きい場合に、言語重み変更手
段が上記言語尤度の重みを大きく設定し、尤度演算手段
は、言語重み変更手段が設定した言語尤度の重みに基づ
き、仮説の尤度を演算することにより、各音素の簡易な
音響モデルを用いて、適切に言語重みを設定するので、
計算コストはそれほど増加せずに、認識精度を高くする
ことができるという効果がある。According to the present invention, language weight changing means is provided for setting, as the parameter relating to speech recognition, the weight of the language likelihood that is added to the likelihood. When the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the language weight changing means sets the weight of the language likelihood small; when the average is larger than the predetermined value, it sets the weight large. The likelihood calculating means calculates the likelihood of each hypothesis based on the weight set by the language weight changing means. Because the language weight is set appropriately using simple acoustic models of the phonemes, recognition accuracy can be raised without much increase in computational cost.
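The effect of the language weight can be sketched as below, assuming the common log-domain form in which the weighted language likelihood is added to the acoustic likelihood. The exact combination rule is not stated in this passage, and the function names are hypothetical.

```python
def choose_language_weight(mean_variation, threshold, low_weight, high_weight):
    """Small weight when ranks are stable, large weight when they fluctuate."""
    return low_weight if mean_variation < threshold else high_weight


def hypothesis_likelihood(acoustic_loglik, language_loglik, language_weight):
    """Assumed combination: acoustic score plus weighted language score."""
    return acoustic_loglik + language_weight * language_loglik


weight = choose_language_weight(2.0, 1.0, 0.5, 1.5)
score = hypothesis_likelihood(-10.0, -2.0, weight)
```

The design intuition is that when the acoustic evidence is unsettled, the language model is given more influence over the hypothesis score.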

【０１２５】この発明によれば、複数の発声速度に対応
した発声の継続時間長データを記憶する継続時間長デー
タ記憶手段と、音声認識に係るパラメータとしての継続
時間長データを設定する継続時間長データ変更手段とを
備え、順位変動計算手段が計算した順位変動幅の平均が
所定値より小さい場合に、継続時間長データ変更手段
が、継続時間長データ記憶手段に記憶されている遅めに
発声した継続時間長データを選択し、順位変動幅の平均
が所定値より大きい場合に、継続時間長データ変更手段
が、継続時間長データ記憶手段に記憶されている速めに
発声した継続時間長データを選択し、尤度演算手段は、
継続時間長データ変更手段が選択した継続時間長データ
に基づき、仮説の尤度を演算することにより、各音素の
簡易な音響モデルを用いて、発声の速さに対応した継続
時間長データを設定するので、計算コストはそれほど増
加せずに、発声の速さが変化しても、認識精度を高くす
ることができるという効果がある。According to the present invention, there are provided duration data storage means for storing utterance duration data corresponding to a plurality of speaking rates, and duration data changing means for setting duration data as the parameter relating to speech recognition. When the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the duration data changing means selects the duration data for slower speech stored in the duration data storage means; when the average is larger than the predetermined value, it selects the duration data for faster speech. The likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means. Because duration data matching the speaking rate is set using simple acoustic models of the phonemes, recognition accuracy can be raised, even when the speaking rate changes, without much increase in computational cost.

【０１２６】この発明によれば、音声認識に係るパラメ
ータとしての、各時刻の認識処理であるフレーム処理の
間引きの有無を決定する間引き決定手段を備え、順位変
動計算手段が計算した順位変動幅の平均が所定値より小
さい場合に、間引き決定手段がフレーム処理の間引きを
有りに設定し、順位変動幅の平均が所定値より大きい場
合に、間引き決定手段がフレーム処理の間引きを無しに
設定することにより、各音素の簡易な音響モデルを用い
て、音声データの状況に応じた間引きの有無を決定する
ので、計算コストはそれほど増加せずに、認識精度を高
くすることができるという効果がある。According to the present invention, thinning determination means is provided for deciding, as the parameter relating to speech recognition, whether to thin out frame processing, that is, the recognition processing performed at each time. When the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the thinning determination means enables thinning of frame processing; when the average is larger than the predetermined value, it disables thinning. Because the decision to thin is made according to the state of the audio data using simple acoustic models of the phonemes, recognition accuracy can be raised without much increase in computational cost.

【０１２７】この発明によれば、音声認識に係るパラメ
ータとしての、各時刻の認識処理であるフレーム処理の
間引き率を決定する間引き率決定手段を備え、順位変動
計算手段が計算した順位変動幅の平均が所定値より小さ
い場合に、間引き率決定手段がフレーム処理の間引き率
大きく設定し、順位変動幅の平均が所定値より大きい場
合に、間引き率決定手段がフレーム処理の間引き率を小
さく設定することにより、簡易な音響モデルを用いて音
声データの状況に応じた間引き率を決定するので、計算
コストはそれほど増加せずに、認識精度を高くすること
ができるという効果がある。According to the present invention, thinning rate determination means is provided for determining, as the parameter relating to speech recognition, the thinning rate of frame processing, that is, the recognition processing performed at each time. When the average rank variation width calculated by the rank variation calculating means is smaller than a predetermined value, the thinning rate determination means sets the thinning rate of frame processing large; when the average is larger than the predetermined value, it sets the thinning rate small. Because the thinning rate is determined according to the state of the audio data using simple acoustic models, recognition accuracy can be raised without much increase in computational cost.
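One possible reading of the thinning rate is sketched below. It assumes the rate is expressed as an interval, so that after a frame is processed the next `interval - 1` frames are skipped. That representation, along with the names and default values, is a hypothetical choice and is not taken from the specification.

```python
def choose_skip_interval(mean_variation, threshold, sparse_interval, dense_interval):
    """Stable ranks -> thin heavily; fluctuating ranks -> thin lightly."""
    return sparse_interval if mean_variation < threshold else dense_interval


def frames_to_process(mean_variations, threshold, sparse_interval=3, dense_interval=1):
    """Return indices of frames that receive full recognition processing."""
    selected = []
    next_frame = 0
    for t, variation in enumerate(mean_variations):
        if t < next_frame:
            continue  # this frame is thinned out
        selected.append(t)
        next_frame = t + choose_skip_interval(
            variation, threshold, sparse_interval, dense_interval)
    return selected


# Average rank variation width per frame (made-up values).
variations = [0.2, 0.2, 0.2, 2.0, 2.0, 2.0, 0.2, 0.2, 0.2, 0.2]
processed = frames_to_process(variations, threshold=1.0)
```

In this toy run the stable stretches are processed only every third frame, while every frame in the fluctuating stretch is processed.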

【０１２８】この発明によれば、複数の発声速度に対応
した発声の継続時間長データを記憶する継続時間長デー
タ記憶手段と、尤度演算手段から現在認識中の各仮説の
単語及び単語内位置を取得し、単語辞書記憶手段から仮
説の単語の品詞を取得する単語種類取得手段と、単語種
類取得手段が取得した仮説の単語の品詞及び単語内位置
に対応した継続時間長データを、継続時間長データ記憶
手段から選択する継続時間長データ変更手段とを備え、
尤度演算手段が、継続時間長データ変更手段が選択した
継続時間長データに基づき、仮説の尤度を演算すること
により、品詞等の単語の種類によって最適な継続時間長
データを設定し、発声の状況に応じた継続時間長制御を
するので、発声の速さが変化しても、認識精度を高くす
ることができるという効果がある。According to the present invention, there are provided: duration data storage means for storing utterance duration data corresponding to a plurality of speaking rates; word type acquisition means for acquiring, from the likelihood calculating means, the word and the position within the word of each hypothesis currently being recognized and acquiring, from the word dictionary storage means, the part of speech of that word; and duration data changing means for selecting, from the duration data storage means, the duration data corresponding to the part of speech and the position within the word acquired by the word type acquisition means. The likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means. Because the optimal duration data is set according to the type of word, such as its part of speech, and duration is controlled according to the speaking situation, recognition accuracy can be raised even when the speaking rate changes.

【０１２９】この発明によれば、単語の連鎖確率を記憶
する言語モデル記憶手段と、発声速度に対応した発声の
継続時間長データを記憶する継続時間長データ記憶手段
と、尤度演算手段から現在認識中の各仮説の単語とそれ
以前に認識した単語を取得して、言語モデル記憶手段か
らそれらの単語の連鎖確率を取得する単語連鎖確率取得
手段と、単語連鎖確率取得手段が取得した単語の連鎖確
率に応じた継続時間長データを、継続時間長データ記憶
手段から選択する継続時間長データ変更手段とを備え、
尤度演算手段が、継続時間長データ変更手段が選択した
継続時間長データに基づき、仮説の尤度を演算すること
により、単語の連接のしやすさによって最適な継続時間
長データを設定し、発声の内容に応じた継続時間長制御
をするので、発声の内容が変化しても、認識精度を高く
することができるという効果がある。According to the present invention, there are provided: language model storage means for storing word chain probabilities; duration data storage means for storing utterance duration data corresponding to speaking rates; word chain probability acquisition means for acquiring, from the likelihood calculating means, the word of each hypothesis currently being recognized and the words recognized before it, and acquiring the chain probability of those words from the language model storage means; and duration data changing means for selecting, from the duration data storage means, the duration data corresponding to the word chain probability acquired by the word chain probability acquisition means. The likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means. Because the optimal duration data is set according to how readily the words connect, and duration is controlled according to the content of the utterance, recognition accuracy can be raised even when the content of the utterance changes.

【０１３０】この発明によれば、枝刈り手段から仮説の
最大尤度を取得し、取得した最大尤度を持ち、発声最初
とする仮説を追加して枝刈り手段に出力する最大尤度仮
説追加手段を備えたことにより、最大尤度を持ち、認識
単語の最初の音素位置を持つ仮説を追加するので、発声
の最初の呼気等を単語として認識する仮説が存在して
も、その呼気等の影響を受けずに、認識精度を高くする
ことができるという効果がある。According to the present invention, maximum likelihood hypothesis adding means is provided that acquires the maximum likelihood of the hypotheses from the pruning means, adds hypotheses that have the acquired maximum likelihood and represent the start of an utterance, and outputs them to the pruning means. Because hypotheses having the maximum likelihood and the first phoneme position of a recognition word are added, recognition accuracy can be raised, even if a hypothesis exists that recognizes the breath at the start of the utterance as a word, without being affected by that breath.

【０１３１】この発明によれば、音声データを所定時刻
ごとに取り込み音響分析して音響特徴ベクトルを出力す
る第１のステップと、第１のステップで出力された音響
特徴ベクトルと、予め記憶されている各音素の簡易な音
響モデルにより、現在時刻をはさむ所定の時間内におけ
る各時刻の各ＨＭＭ状態の簡易音響出力確率を演算する
第２のステップと、各時刻の各ＨＭＭ状態の簡易音響出
力確率の順位を求め、現在時刻をはさむ所定の時間内に
おける各ＨＭＭ状態の順位変動幅を計算し、ＨＭＭ状態
の順位変動幅の平均を計算する第３のステップと、順位
変動幅の平均に基づき所定のビーム幅を設定する第４の
ステップと、第１のステップで出力された音響特徴ベク
トルと、予め記憶されている各音素の音響モデルと単語
の音響モデルにより、認識候補である仮説の尤度を演算
する第５のステップと、演算した仮説の尤度から最大尤
度を求め、求めた最大尤度から第４のステップで設定さ
れた所定のビーム幅以下の仮説を棄却する第６のステッ
プと、第６のステップで残された仮説を認識候補として
出力する第７のステップとを備えて音声を認識すること
により、各音素の簡易な音響モデルを用いて、適切にビ
ーム幅を設定しているので、計算コストはそれほど増加
せずに、正解が枝刈りされることを抑えて、認識精度を
高くすることができるという効果がある。According to the present invention, speech is recognized by a method comprising: a first step of taking in audio data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and prestored simple acoustic models of the phonemes, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of the HMM states; a fourth step of setting a predetermined beam width based on the average rank variation width; a fifth step of calculating the likelihood of hypotheses that are recognition candidates from the acoustic feature vector output in the first step, prestored acoustic models of the phonemes, and acoustic models of words; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses that fall below the maximum likelihood by the beam width set in the fourth step or more; and a seventh step of outputting the hypotheses left by the sixth step as recognition candidates. Because the beam width is set appropriately using simple acoustic models of the phonemes, pruning of the correct hypothesis is suppressed and recognition accuracy can be raised without much increase in computational cost.

【０１３２】この発明によれば、音声データを所定時刻
ごとに取り込み音響分析して音響特徴ベクトルを出力す
る第１のステップと、第１のステップで出力された音響
特徴ベクトルと、予め記憶されている各音素の簡易な音
響モデルにより、現在時刻をはさむ所定の時間内におけ
る各時刻の各ＨＭＭ状態の簡易音響出力確率を演算する
第２のステップと、各時刻の各ＨＭＭ状態の簡易音響出
力確率の順位を求め、現在時刻をはさむ所定の時間内に
おける各ＨＭＭ状態の順位変動幅を計算し、全ＨＭＭ状
態の順位変動幅の平均を計算する第３のステップと、順
位変動幅の平均に基づき音声認識に係るパラメータを調
整する第４のステップと、第１のステップで出力された
音響特徴ベクトルと、予め記憶されている各音素の音響
モデルと単語の音響モデルと、第４のステップで調整し
た音声認識に係るパラメータにより、認識候補である仮
説の尤度を演算する第５のステップと、演算した仮説の
尤度から最大尤度を求め、求めた最大尤度から所定のビ
ーム幅以下の仮説を棄却する第６のステップと、第６の
ステップで残された仮説を認識候補として出力する第７
のステップとを備えて音声を認識することにより、各音
素の簡易な音響モデルを用いて、音声認識に係るパラメ
ータを調整するので、計算コストはそれほど増加せず
に、認識精度を高くすることができるという効果があ
る。According to the present invention, speech is recognized by a method comprising: a first step of taking in audio data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and prestored simple acoustic models of the phonemes, the simple acoustic output probability of each HMM state at each time within a predetermined period spanning the current time; a third step of ranking the simple acoustic output probabilities of the HMM states at each time, calculating the rank variation width of each HMM state within the predetermined period spanning the current time, and calculating the average of the rank variation widths of all HMM states; a fourth step of adjusting a parameter relating to speech recognition based on the average rank variation width; a fifth step of calculating the likelihood of hypotheses that are recognition candidates from the acoustic feature vector output in the first step, prestored acoustic models of the phonemes, acoustic models of words, and the parameter adjusted in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses that fall below the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left by the sixth step as recognition candidates. Because the parameter relating to speech recognition is adjusted using simple acoustic models of the phonemes, recognition accuracy can be raised without much increase in computational cost.

【０１３３】この発明によれば、音声データを所定時刻
ごとに取り込み音響分析して音響特徴ベクトルを出力す
る第１のステップと、第１のステップで出力された音響
特徴ベクトルと、予め記憶されている各音素の音響モデ
ルから各ＨＭＭ状態の出力確率を計算する第２のステッ
プと、現在認識中の各仮説の単語及び単語内位置を取得
し、予め記憶されている仮説の単語の品詞を取得する第
３のステップと、第３のステップで取得した仮説の単語
の品詞及び単語内位置に対応した継続時間長データを、
予め記憶されている発声速度に対応した発声の継続時間
長データの中から選択する第４のステップと、第２のス
テップで計算した各ＨＭＭ状態の出力確率と、予め記憶
されている単語の音響モデルと、第４のステップで選択
した継続時間長データにより、認識候補である仮説の尤
度を演算する第５のステップと、演算した仮説の尤度か
ら最大尤度を求め、求めた最大尤度から所定のビーム幅
以下の仮説を棄却する第６のステップと、第６のステッ
プで残された仮説を認識候補として出力する第７のステ
ップとを備えて音声を認識することにより、品詞等の単
語の種類によって最適な継続時間長データを設定し、発
声の状況に応じた継続時間長制御をするので、発声の速
さが変化しても、認識精度を高くすることができるとい
う効果がある。According to the present invention, speech is recognized by a method comprising: a first step of taking in audio data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and prestored acoustic models of the phonemes; a third step of acquiring the word and the position within the word of each hypothesis currently being recognized and acquiring the prestored part of speech of that word; a fourth step of selecting, from prestored utterance duration data corresponding to speaking rates, the duration data corresponding to the part of speech and the position within the word acquired in the third step; a fifth step of calculating the likelihood of hypotheses that are recognition candidates from the output probabilities calculated in the second step, prestored acoustic models of words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses that fall below the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left by the sixth step as recognition candidates. Because the optimal duration data is set according to the type of word, such as its part of speech, and duration is controlled according to the speaking situation, recognition accuracy can be raised even when the speaking rate changes.

【０１３４】この発明によれば、音声データを所定時刻
ごとに取り込み音響分析して音響特徴ベクトルを出力す
る第１のステップと、第１のステップで出力された音響
特徴ベクトルと予め記憶されている各音素の音響モデル
から各ＨＭＭ状態の出力確率を計算する第２のステップ
と、現在認識中の各仮説の単語とそれ以前に認識した単
語を取得し、予め記憶されている単語の連鎖確率から、
それらの単語の連鎖確率を取得する第３のステップと、
第３のステップで取得した単語の連鎖確率に応じた継続
時間長データを、予め記憶されている発声速度に対応し
た発声の継続時間長データから選択する第４のステップ
と、第２のステップで計算した各ＨＭＭ状態の出力確率
と、予め記憶されている単語の音響モデルと、第４のス
テップで選択した継続時間長データにより、認識候補で
ある仮説の尤度を演算する第５のステップと、演算した
仮説の尤度から最大尤度を求め、求めた最大尤度から所
定のビーム幅以下の仮説を棄却する第６のステップと、
第６のステップで残された仮説を認識候補として出力す
る第７のステップとを備えて音声を認識することによ
り、単語の連接のしやすさによって最適な継続時間長デ
ータを設定し、発声の内容に応じた継続時間長制御をす
るので、発声の内容が変化しても、認識精度を高くする
ことができるという効果がある。According to the present invention, speech is recognized by a method comprising: a first step of taking in audio data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and prestored acoustic models of the phonemes; a third step of acquiring the word of each hypothesis currently being recognized and the words recognized before it, and acquiring the chain probability of those words from prestored word chain probabilities; a fourth step of selecting, from prestored utterance duration data corresponding to speaking rates, the duration data corresponding to the word chain probability acquired in the third step; a fifth step of calculating the likelihood of hypotheses that are recognition candidates from the output probabilities calculated in the second step, prestored acoustic models of words, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods and rejecting hypotheses that fall below the maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left by the sixth step as recognition candidates. Because the optimal duration data is set according to how readily the words connect, and duration is controlled according to the content of the utterance, recognition accuracy can be raised even when the content of the utterance changes.

【０１３５】この発明によれば、音声データを所定時刻
ごとに取り込み音響分析して音響特徴ベクトルを出力す
る第１のステップと、第１のステップで出力された音響
特徴ベクトルと、予め記憶されている各音素の音響モデ
ルと単語の音響モデルにより、認識候補である仮説の尤
度を演算する第２のステップと、第２のステップで演算
した仮説の尤度から最大尤度を求める第３のステップ
と、第３のステップで求めた最大尤度を取得し、取得し
た最大尤度を持ち、発声最初とする仮説を追加する第４
のステップと、第３のステップで求めた最大尤度から所
定のビーム幅以下の仮説を棄却する第５のステップと、
第５のステップで残された仮説を認識候補として出力す
る第６のステップとを備えて音声を認識することによ
り、最大尤度を持ち、認識単語の最初の音素位置を持つ
仮説を追加するので、発声の最初の呼気等を単語として
認識する仮説が存在しても、その呼気等の影響を受けず
に、認識精度を高くすることができるという効果があ
る。According to the present invention, speech is recognized by a method comprising: a first step of taking in audio data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the likelihood of hypotheses that are recognition candidates from the acoustic feature vector output in the first step, prestored acoustic models of the phonemes, and acoustic models of words; a third step of obtaining the maximum likelihood from the likelihoods calculated in the second step; a fourth step of acquiring the maximum likelihood obtained in the third step and adding hypotheses that have the acquired maximum likelihood and represent the start of an utterance; a fifth step of rejecting hypotheses that fall below the maximum likelihood obtained in the third step by a predetermined beam width or more; and a sixth step of outputting the hypotheses left by the fifth step as recognition candidates. Because hypotheses having the maximum likelihood and the first phoneme position of a recognition word are added, recognition accuracy can be raised, even if a hypothesis exists that recognizes the breath at the start of the utterance as a word, without being affected by that breath.

[Brief description of the drawings]

【図１】この発明の実施の形態１による音声認識装置
の構成を示す図である。FIG. 1 is a diagram showing a configuration of a speech recognition device according to a first embodiment of the present invention.

【図２】この発明の実施の形態１による音声認識装置
の処理を示すフローチャートである。FIG. 2 is a flowchart showing processing of the voice recognition device according to the first embodiment of the present invention.

【図３】この発明の実施の形態１によるビーム幅を設
定する処理を示すフローチャートである。FIG. 3 is a flowchart showing a process for setting a beam width according to the first embodiment of the present invention.

【図４】この発明の実施の形態１によるＨＭＭ状態の
各時刻における簡易音響出力確率を示す図である。FIG. 4 is a diagram showing a simplified sound output probability at each time in the HMM state according to the first embodiment of the present invention.

【図５】この発明の実施の形態１によるＨＭＭ状態の
各時刻における簡易音響出力確率を示す図である。FIG. 5 is a diagram showing a simplified sound output probability at each time in the HMM state according to the first embodiment of the present invention.

【図６】この発明の実施の形態２による音声認識装置
の構成を示す図である。FIG. 6 is a diagram showing a configuration of a voice recognition device according to a second embodiment of the present invention.

【図７】この発明の実施の形態２による音声認識装置
の処理を示すフローチャートである。FIG. 7 is a flowchart showing processing of the voice recognition device according to the second embodiment of the present invention.

【図８】この発明の実施の形態２による言語尤度の重
みを設定する処理を示すフローチャートである。FIG. 8 is a flowchart showing a process for setting a weight of language likelihood according to the second embodiment of the present invention.

【図９】この発明の実施の形態３による音声認識装置
の構成を示す図である。FIG. 9 is a diagram showing a configuration of a voice recognition device according to a third embodiment of the present invention.

【図１０】この発明の実施の形態３による音声認識装
置の処理を示すフローチャートである。FIG. 10 is a flowchart showing processing of the voice recognition device according to the third embodiment of the present invention.

【図１１】この発明の実施の形態３による継続時間長
データを設定する処理を示すフローチャートである。FIG. 11 is a flowchart showing a process for setting duration time data according to the third embodiment of the present invention.

【図１２】この発明の実施の形態４による音声認識装
置の構成を示す図である。FIG. 12 is a diagram showing a configuration of a voice recognition device according to a fourth embodiment of the present invention.

【図１３】この発明の実施の形態４による音声認識装
置の処理を示すフローチャートである。FIG. 13 is a flowchart showing processing of the voice recognition device according to the fourth embodiment of the present invention.

【図１４】この発明の実施の形態４によるフレーム処
理の省略を行うか否かを決定する処理を示すフローチャ
ートである。FIG. 14 is a flowchart showing a process for determining whether or not to omit the frame process according to the fourth embodiment of the present invention.

【図１５】この発明の実施の形態５による音声認識装
置の構成を示す図である。FIG. 15 is a diagram showing a configuration of a voice recognition device according to a fifth embodiment of the present invention.

【図１６】この発明の実施の形態５による音声認識装
置の処理を示すフローチャートである。FIG. 16 is a flowchart showing processing of the speech recognition device according to the fifth embodiment of the present invention.

【図１７】この発明の実施の形態５によるフレーム処
理の間引き率を決定する処理を示すフローチャートであ
る。FIG. 17 is a flowchart showing a process for determining a thinning rate of frame processing according to the fifth embodiment of the present invention.

【図１８】この発明の実施の形態６による音声認識装
置の構成を示す図である。FIG. 18 is a diagram showing a configuration of a voice recognition device according to a sixth embodiment of the present invention.

【図１９】この発明の実施の形態６による音声認識装
置の処理を示すフローチャートである。FIG. 19 is a flowchart showing processing of the voice recognition device according to the sixth embodiment of the present invention.

【図２０】この発明の実施の形態７による音声認識装
置の構成を示す図である。FIG. 20 is a diagram showing a configuration of a speech recognition device according to a seventh embodiment of the present invention.

【図２１】この発明の実施の形態７による音声認識装
置の処理を示すフローチャートである。FIG. 21 is a flowchart showing processing of the voice recognition device according to the seventh embodiment of the present invention.

【図２２】この発明の実施の形態８による音声認識装
置の構成を示す図である。FIG. 22 is a diagram showing a configuration of a speech recognition device according to an eighth embodiment of the present invention.

【図２３】この発明の実施の形態８による音声認識装
置の処理を示すフローチャートである。FIG. 23 is a flowchart showing processing of the voice recognition device according to the eighth embodiment of the present invention.

【図２４】従来の呼気がかかった場合に展開される仮
説を示す図である。FIG. 24 is a diagram showing a conventional hypothesis developed when exhalation is applied.

【図２５】この発明の実施の形態８による呼気がかか
った場合に展開される仮説を示す図である。FIG. 25 is a diagram showing a hypothesis developed when exhalation is applied according to the eighth embodiment of the present invention.

【図２６】従来の音声認識装置の構成を示す図であ
る。FIG. 26 is a diagram showing a configuration of a conventional voice recognition device.

【図２７】従来の音声認識装置の処理を示すフローチ
ャートである。FIG. 27 is a flowchart showing processing of a conventional speech recognition device.

【図２８】従来の単語辞書の具体例を示す図である。FIG. 28 is a diagram showing a specific example of a conventional word dictionary.

【図２９】従来の単語の音響モデルの構造例を示す図
である。FIG. 29 is a diagram showing an example of the structure of a conventional word acoustic model.

【図３０】従来の時刻が進むにつれて仮説が展開され
る状況を説明する図である。FIG. 30 is a diagram illustrating a conventional situation in which a hypothesis is developed as time advances.

【図３１】従来の継続時間長の例を示す図である。FIG. 31 is a diagram illustrating an example of a conventional duration time.

[Explanation of symbols]

１１音声データ記憶手段、１２音響分析手段、１３
音響モデル記憶手段、１４単語辞書記憶手段、１５
尤度演算手段、１６枝刈り手段、１７認識結果出
力手段、２１簡易音響モデル記憶手段、２２簡易音
響モデル確率演算手段、２３順位変動計算手段、２４
ビーム幅変更手段、３１言語重み変更手段、４１
継続時間長データ記憶手段、４２継続時間長データ変
更手段、５１間引き決定手段、６１間引き率決定手
段、７１単語種類取得手段、７２継続時間長データ
変更手段、８１言語モデル記憶手段、８２単語連鎖
確率取得手段、８３継続時間長データ変更手段、９１
最大尤度仮説追加手段。11: audio data storage means, 12: acoustic analysis means, 13: acoustic model storage means, 14: word dictionary storage means, 15: likelihood calculating means, 16: pruning means, 17: recognition result output means, 21: simple acoustic model storage means, 22: simple acoustic model probability calculating means, 23: rank variation calculating means, 24: beam width changing means, 31: language weight changing means, 41: duration data storage means, 42: duration data changing means, 51: thinning determination means, 61: thinning rate determination means, 71: word type acquisition means, 72: duration data changing means, 81: language model storage means, 82: word chain probability acquisition means, 83: duration data changing means, 91: maximum likelihood hypothesis adding means.

Claims

[Claims]

1. A speech recognition device comprising: speech data storage means for digitizing input speech and storing it as speech data; acoustic analysis means for taking in the speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; acoustic model storage means for storing an acoustic model of each phoneme; word dictionary storage means for generating and storing an acoustic model of each word from the phoneme notation of that word in a word dictionary; likelihood calculating means for calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output from the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic model of the word stored in the word dictionary storage means; pruning means for obtaining the maximum likelihood from the likelihoods of the hypotheses calculated by the likelihood calculating means and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and recognition result output means for outputting the hypotheses left by the pruning means as recognition candidates, the speech recognition device further comprising: simple acoustic model storage means for storing a simple acoustic model of each phoneme; simple acoustic model probability calculating means for calculating, from the acoustic feature vector output from the acoustic analysis means and the simple acoustic model of each phoneme stored in the simple acoustic model storage means, a simple acoustic output probability of each HMM state at each time within a predetermined period including the current time; and rank fluctuation calculating means for obtaining the rank of the simple acoustic output probability of each HMM state at each time calculated by the simple acoustic model probability calculating means, calculating the rank fluctuation width of each HMM state within the predetermined period including the current time, and calculating the average of the rank fluctuation widths of the HMM states, wherein a parameter relating to speech recognition is adjusted based on the average of the rank fluctuation widths calculated by the rank fluctuation calculating means.
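
The rank fluctuation calculation recited in claim 1 can be illustrated with a short Python sketch. The claim does not define the "rank fluctuation width", so the sketch assumes it is the difference between the highest and lowest rank that a state takes within the window. The function names `ranks` and `mean_rank_fluctuation` are hypothetical and do not come from the specification.

```python
def ranks(probs):
    """Return the rank of each HMM state for one frame (0 = highest probability)."""
    order = sorted(range(len(probs)), key=lambda s: probs[s], reverse=True)
    rank = [0] * len(probs)
    for position, state in enumerate(order):
        rank[state] = position
    return rank


def mean_rank_fluctuation(window):
    """Average rank fluctuation width over all HMM states.

    window: list of frames; each frame is a list holding the simple acoustic
    output probability of every HMM state at that time.
    Assumption: the width of a state is (max rank - min rank) within the window.
    """
    per_frame = [ranks(frame) for frame in window]
    n_states = len(window[0])
    widths = [
        max(r[s] for r in per_frame) - min(r[s] for r in per_frame)
        for s in range(n_states)
    ]
    return sum(widths) / n_states
```

Under this assumed definition, a window in which the ordering of the states never changes yields an average of zero, and the value grows as the ordering changes more strongly from frame to frame.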

2. The speech recognition device according to claim 1, further comprising beam width changing means for setting the beam width as the parameter relating to speech recognition, wherein the beam width changing means sets the beam width small when the average of the rank fluctuation widths calculated by the rank fluctuation calculating means is smaller than a predetermined value and sets the beam width large when the average of the rank fluctuation widths is larger than the predetermined value, and the pruning means rejects hypotheses based on the beam width set by the beam width changing means.
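
The beam width change and pruning of claim 2 can be sketched as follows. This is a minimal illustration, assuming log-likelihood scores, two fixed candidate beam widths, and treating an average exactly equal to the threshold as the "large" case, which the claim leaves open. The names `choose_beam_width` and `prune` and all numeric values are hypothetical.

```python
def choose_beam_width(mean_fluctuation, threshold, narrow, wide):
    """Pick a narrow beam when rank fluctuation is small, a wide beam otherwise."""
    return narrow if mean_fluctuation < threshold else wide


def prune(hypotheses, beam_width):
    """Keep hypotheses whose score is within beam_width of the best score.

    hypotheses: dict mapping a hypothesis label to its log-likelihood.
    A hypothesis at or below (best - beam_width) is rejected.
    """
    best = max(hypotheses.values())
    return {h: ll for h, ll in hypotheses.items() if ll > best - beam_width}


# Example with made-up scores: a stable ranking selects the narrow beam.
scores = {"a": -10.0, "b": -12.0, "c": -20.0}
beam = choose_beam_width(mean_fluctuation=0.5, threshold=1.0, narrow=5.0, wide=15.0)
survivors = prune(scores, beam)
```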

3. The speech recognition device according to claim 1, further comprising language weight changing means for setting, as the parameter relating to speech recognition, the weight of a language likelihood that is added to the likelihood, wherein the language weight changing means sets the weight of the language likelihood small when the average of the rank fluctuation widths calculated by the rank fluctuation calculating means is smaller than a predetermined value and sets the weight of the language likelihood large when the average of the rank fluctuation widths is larger than the predetermined value, and the likelihood calculating means calculates the likelihood of each hypothesis based on the weight of the language likelihood set by the language weight changing means.

4. The speech recognition device according to claim 1, further comprising: duration data storage means for storing duration data of utterances corresponding to a plurality of utterance speeds; and duration data changing means for setting the duration data as the parameter relating to speech recognition, wherein the duration data changing means selects the duration data for slowly uttered speech stored in the duration data storage means when the average of the rank fluctuation widths calculated by the rank fluctuation calculating means is smaller than a predetermined value and selects the duration data for quickly uttered speech stored in the duration data storage means when the average of the rank fluctuation widths is larger than the predetermined value, and the likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means.

5. The speech recognition device according to claim 1, further comprising thinning-out determining means for determining, as the parameter relating to speech recognition, whether frame processing, which is the recognition processing at each time, is to be thinned out, wherein the thinning-out determining means enables thinning-out of the frame processing when the average of the rank fluctuation widths calculated by the rank fluctuation calculating means is smaller than a predetermined value and disables thinning-out of the frame processing when the average of the rank fluctuation widths is larger than the predetermined value.

6. The speech recognition device according to claim 1, further comprising thinning-out rate determining means for determining, as the parameter relating to speech recognition, the thinning-out rate of frame processing, which is the recognition processing at each time, wherein the thinning-out rate determining means sets the thinning-out rate of the frame processing large when the average of the rank fluctuation widths calculated by the rank fluctuation calculating means is smaller than a predetermined value and sets the thinning-out rate of the frame processing small when the average of the rank fluctuation widths is larger than the predetermined value.
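
The frame thinning of claims 5 and 6 can be sketched in Python as follows. The claims do not say how the thinning-out rate is expressed, so the sketch assumes an integer `rate` meaning "process one frame in every `rate` frames", with a rate of 1 meaning no thinning. The function names and default values are hypothetical.

```python
def choose_thinning_rate(mean_fluctuation, threshold, high_rate=2, low_rate=1):
    """Thin out more frames when rank fluctuation is small, fewer otherwise.

    Assumption: rate N means one frame in every N is processed (1 = no thinning).
    """
    return high_rate if mean_fluctuation < threshold else low_rate


def frames_to_process(n_frames, rate):
    """Indices of the frames on which recognition processing is performed."""
    return list(range(0, n_frames, rate))
```

With `high_rate=2` and `low_rate=1` the sketch reduces to the yes/no decision of claim 5; larger values of `high_rate` correspond to the graded rate of claim 6.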

7. A speech recognition device comprising: speech data storage means for digitizing input speech and storing it as speech data; acoustic analysis means for taking in the speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; acoustic model storage means for storing an acoustic model of each phoneme; word dictionary storage means for generating and storing an acoustic model of each word from the phoneme notation of that word in a word dictionary; likelihood calculating means for calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output from the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic model of the word stored in the word dictionary storage means; pruning means for obtaining the maximum likelihood from the likelihoods of the hypotheses calculated by the likelihood calculating means and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and recognition result output means for outputting the hypotheses left by the pruning means as recognition candidates, the speech recognition device further comprising: duration data storage means for storing duration data of utterances corresponding to a plurality of utterance speeds; word type obtaining means for obtaining the word of each hypothesis and the position within that word, and obtaining the part of speech of the word of the hypothesis from the word dictionary storage means; and duration data changing means for selecting, from the duration data storage means, the duration data corresponding to the part of speech of the word of the hypothesis and the position within the word obtained by the word type obtaining means, wherein the likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means.

8. A speech recognition device comprising: speech data storage means for digitizing input speech and storing it as speech data; acoustic analysis means for taking in the speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; acoustic model storage means for storing an acoustic model of each phoneme; word dictionary storage means for generating and storing an acoustic model of each word from the phoneme notation of that word in a word dictionary; likelihood calculating means for calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output from the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic model of the word stored in the word dictionary storage means; pruning means for obtaining the maximum likelihood from the likelihoods of the hypotheses calculated by the likelihood calculating means and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and recognition result output means for outputting the hypotheses left by the pruning means as recognition candidates, the speech recognition device further comprising: language model storage means for storing word chain probabilities; duration data storage means for storing duration data of utterances corresponding to utterance speed; word chain probability obtaining means for obtaining, from the likelihood calculating means, the word of each hypothesis currently being recognized and the words recognized before it, and obtaining the chain probability of those words from the language model storage means; and duration data changing means for selecting, from the duration data storage means, the duration data corresponding to the word chain probability obtained by the word chain probability obtaining means, wherein the likelihood calculating means calculates the likelihood of each hypothesis based on the duration data selected by the duration data changing means.

9. A speech recognition device comprising: speech data storage means for digitizing input speech and storing it as speech data; acoustic analysis means for taking in the speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; acoustic model storage means for storing an acoustic model of each phoneme; word dictionary storage means for generating and storing an acoustic model of each word from the phoneme notation of that word in a word dictionary; likelihood calculating means for calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output from the acoustic analysis means, the acoustic model of each phoneme stored in the acoustic model storage means, and the acoustic model of the word stored in the word dictionary storage means; pruning means for obtaining the maximum likelihood from the likelihoods of the hypotheses calculated by the likelihood calculating means and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and recognition result output means for outputting the hypotheses left by the pruning means as recognition candidates, the speech recognition device further comprising maximum likelihood hypothesis adding means for obtaining the maximum likelihood of the hypotheses from the pruning means, adding a hypothesis that corresponds to the start of an utterance and is given the obtained maximum likelihood, and outputting the added hypothesis to the pruning means.
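
One possible reading of the maximum likelihood hypothesis addition in claim 9 is sketched below in Python. It assumes that the added hypothesis is a fresh utterance-start hypothesis that simply receives the current maximum likelihood, so that it survives the next pruning pass. The function name `add_utterance_start_hypothesis` and the label `"<utterance-start>"` are hypothetical.

```python
def add_utterance_start_hypothesis(hypotheses, start_label="<utterance-start>"):
    """Return a copy of the hypotheses with an utterance-start hypothesis added.

    hypotheses: dict mapping a hypothesis label to its log-likelihood.
    Assumption: the new hypothesis is assigned the current maximum likelihood.
    """
    best = max(hypotheses.values())
    augmented = dict(hypotheses)  # leave the caller's dict untouched
    augmented[start_label] = best
    return augmented
```

Because the added hypothesis ties with the best existing one, beam pruning relative to the maximum likelihood cannot reject it in the frame where it is added.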

10. A speech recognition method comprising: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and a simple acoustic model of each phoneme stored in advance, a simple acoustic output probability of each HMM state at each time within a predetermined period including the current time; a third step of obtaining the rank of the simple acoustic output probability of each HMM state at each of those times, calculating the rank fluctuation width of each HMM state, and calculating the average of the rank fluctuation widths of the HMM states; a fourth step of setting a predetermined beam width based on the average of the rank fluctuation widths; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme stored in advance, and an acoustic model of each word stored in advance; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by the beam width set in the fourth step or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

11. A speech recognition method comprising: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and a simple acoustic model of each phoneme stored in advance, a simple acoustic output probability of each HMM state at each time within a predetermined period including the current time; a third step of obtaining the rank of the simple acoustic output probability of each HMM state at each of those times, calculating the rank fluctuation width of each HMM state, and calculating the average of the rank fluctuation widths of all the HMM states; a fourth step of adjusting a parameter relating to speech recognition based on the average of the rank fluctuation widths; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme and an acoustic model of each word stored in advance, and the parameter relating to speech recognition adjusted in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

12. A speech recognition method comprising: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and an acoustic model of each phoneme stored in advance; a third step of acquiring the word of each hypothesis currently being recognized and the position within that word, and acquiring the part of speech of the word of the hypothesis; a fourth step of selecting, from duration data of utterances corresponding to utterance speed stored in advance, the duration data corresponding to the part of speech of the word of the hypothesis and the position within the word acquired in the third step; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the output probability of each HMM state calculated in the second step, an acoustic model of each word stored in advance, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

13. A speech recognition method comprising: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and an acoustic model of each phoneme stored in advance; a third step of acquiring the word of each hypothesis currently being recognized and the words recognized before it, and acquiring the chain probability of those words from word chain probabilities stored in advance; a fourth step of selecting, from duration data of utterances corresponding to utterance speed stored in advance, the duration data corresponding to the word chain probability acquired in the third step; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the output probability of each HMM state calculated in the second step, an acoustic model of each word stored in advance, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

14. A speech recognition method comprising: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme stored in advance, and an acoustic model of each word stored in advance; a third step of obtaining the maximum likelihood from the likelihoods of the hypotheses calculated in the second step; a fourth step of adding a hypothesis that corresponds to the start of an utterance and is given the maximum likelihood obtained in the third step; a fifth step of rejecting any hypothesis whose likelihood is lower than the maximum likelihood obtained in the third step by a predetermined beam width or more; and a sixth step of outputting the hypotheses left in the fifth step as recognition candidates.

15. A recording medium on which is recorded a speech recognition program for causing a computer to execute: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and a simple acoustic model of each phoneme stored in advance, a simple acoustic output probability of each HMM state at each time within a predetermined period including the current time; a third step of obtaining the rank of the simple acoustic output probability of each HMM state at each of those times, calculating the rank fluctuation width of each HMM state, and calculating the average of the rank fluctuation widths of the HMM states; a fourth step of setting a predetermined beam width based on the average of the rank fluctuation widths; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme stored in advance, and an acoustic model of each word stored in advance; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by the beam width set in the fourth step or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

16. A recording medium on which is recorded a speech recognition program for causing a computer to execute: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating, from the acoustic feature vector output in the first step and a simple acoustic model of each phoneme stored in advance, a simple acoustic output probability of each HMM state at each time within a predetermined period including the current time; a third step of obtaining the rank of the simple acoustic output probability of each HMM state at each of those times, calculating the rank fluctuation width of each HMM state, and calculating the average of the rank fluctuation widths of all the HMM states; a fourth step of adjusting a parameter relating to speech recognition based on the average of the rank fluctuation widths; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme and an acoustic model of each word stored in advance, and the parameter relating to speech recognition adjusted in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

17. A recording medium on which is recorded a speech recognition program for causing a computer to execute: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and an acoustic model of each phoneme stored in advance; a third step of acquiring the word of each hypothesis currently being recognized and the position within that word, and acquiring the part of speech of the word of the hypothesis; a fourth step of selecting, from duration data of utterances corresponding to utterance speed stored in advance, the duration data corresponding to the part of speech of the word of the hypothesis and the position within the word acquired in the third step; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the output probability of each HMM state calculated in the second step, an acoustic model of each word stored in advance, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

18. A recording medium on which is recorded a speech recognition program for causing a computer to execute: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the output probability of each HMM state from the acoustic feature vector output in the first step and an acoustic model of each phoneme stored in advance; a third step of acquiring the word of each hypothesis currently being recognized and the words recognized before it, and acquiring the chain probability of those words from word chain probabilities stored in advance; a fourth step of selecting, from duration data of utterances corresponding to utterance speed stored in advance, the duration data corresponding to the word chain probability acquired in the third step; a fifth step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the output probability of each HMM state calculated in the second step, an acoustic model of each word stored in advance, and the duration data selected in the fourth step; a sixth step of obtaining the maximum likelihood from the calculated likelihoods of the hypotheses and rejecting any hypothesis whose likelihood is lower than the obtained maximum likelihood by a predetermined beam width or more; and a seventh step of outputting the hypotheses left in the sixth step as recognition candidates.

19. A recording medium on which is recorded a speech recognition program for causing a computer to execute: a first step of taking in speech data at predetermined time intervals, performing acoustic analysis, and outputting an acoustic feature vector; a second step of calculating the likelihood of each hypothesis that is a recognition candidate, based on the acoustic feature vector output in the first step, an acoustic model of each phoneme stored in advance, and an acoustic model of each word stored in advance; a third step of obtaining the maximum likelihood from the likelihoods of the hypotheses calculated in the second step; a fourth step of adding a hypothesis that corresponds to the start of an utterance and is given the maximum likelihood obtained in the third step; a fifth step of rejecting any hypothesis whose likelihood is lower than the maximum likelihood obtained in the third step by a predetermined beam width or more; and a sixth step of outputting the hypotheses left in the fifth step as recognition candidates.