JP2003255972A

JP2003255972A - Speech recognizing device

Info

Publication number: JP2003255972A
Application number: JP2002057793A
Authority: JP
Inventors: Michihiro Yamazaki; 道弘山崎
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-03-04
Filing date: 2002-03-04
Publication date: 2003-09-10
Anticipated expiration: 2022-03-04
Also published as: JP4219603B2

Abstract

PROBLEM TO BE SOLVED: To improve recognition accuracy even when speech is interrupted during the recognition of consecutive words.

SOLUTION: A matching means 5 performs matching using the acoustic analysis result from an acoustic analysis means 2, the recognition target vocabulary from a recognition target vocabulary dictionary storage means 3, and the acoustic models from an acoustic model storage means 4, and computes the likelihood of each partial hypothesis representing each state of each recognition target vocabulary item. A next speech section wait determination means 6 sets a wait time until the next speech section according to the partial hypothesis with the maximum likelihood computed by the matching means 5, receives a speech section confirmation notification from a speech section detection means 1, and, when the next speech section is detected before the set wait time elapses, instructs the matching means 5 to continue matching on the next speech section.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device that receives a speech signal and outputs a recognition result.

[0002]

[Prior Art] In speech recognition, a common approach is to detect the speech section of an input speech signal and match the detected speech section against the vocabulary to be recognized (hereinafter, the recognition target vocabulary), as disclosed in Japanese Patent Laid-Open No. 59-2109797. FIG. 5 is a block diagram showing the configuration of such a speech recognition device.

[0003] In FIG. 5, 11 is a speech section detection means that detects the speech section of an input speech signal; 12 is an acoustic analysis means that performs acoustic analysis on the speech signal of the speech section detected by the speech section detection means 11; 13 is a recognition target vocabulary dictionary storage means that stores the recognition target vocabulary and syntax information defining the connection relations among the recognition target vocabulary items; 14 is an acoustic model storage means that stores the acoustic models serving as the minimum units of recognition together with wait time information for each acoustic model; and 15 is a matching means that performs matching using the acoustic analysis result from the acoustic analysis means 12, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 13, and the acoustic models stored in the acoustic model storage means 14, computes likelihoods, and outputs a recognition result.

[0004] Next, the operation will be described. The speech section detection means 11 detects the speech section of the input speech signal. Here, the speech section is detected, for example, by comparing the power of the speech signal with a predetermined threshold. FIG. 6 illustrates the algorithm for detecting the start and end of a speech section. As shown in FIG. 6, the speech section detection means 11 detects each section in which the power of the input speech signal is at or above a predetermined threshold as a speech section candidate. If the pause between two speech section candidates is shorter than a predetermined threshold, for example 350 msec, the two candidates are detected as a single speech section. The speech signal of the detected speech section is then output to the acoustic analysis means 12.
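The start/end detection described for FIG. 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the signal has already been reduced to one power value per frame, and the function name, the 10 msec frame duration, and the return format are all hypothetical. Only the power threshold test and the 350 msec pause example come from the text.

```python
def detect_speech_sections(power, power_threshold, frame_ms=10, max_pause_ms=350):
    """Return a list of (start_frame, end_frame_exclusive) speech sections.

    power: per-frame power values (assumed input format).
    frame_ms: duration of one frame in msec (assumed value).
    max_pause_ms: pauses shorter than this join two candidates (350 msec in the text).
    """
    # Step 1: every run of frames at or above the threshold is a candidate.
    candidates = []
    start = None
    for i, p in enumerate(power):
        if p >= power_threshold and start is None:
            start = i
        elif p < power_threshold and start is not None:
            candidates.append((start, i))
            start = None
    if start is not None:
        candidates.append((start, len(power)))

    # Step 2: merge neighbouring candidates separated by a short pause.
    sections = []
    for seg_start, seg_end in candidates:
        if sections and (seg_start - sections[-1][1]) * frame_ms < max_pause_ms:
            sections[-1] = (sections[-1][0], seg_end)
        else:
            sections.append((seg_start, seg_end))
    return sections
```

With 10 msec frames, two candidates separated by 20 silent frames (200 msec) are merged, while a 40-frame gap (400 msec) leaves them as separate speech sections.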

[0005] The recognition target vocabulary dictionary storage means 13 stores words such as "Tokyo-to", "Kanagawa-ken", "Kamakura-shi", "Kesennuma", "Yukuhashi", and so on, together with syntax information defining the connection relations among the recognition target vocabulary items, such as the connection from "Kanagawa-ken" to "Kamakura-shi" and the connection from "Tokyo-to" to "Marunouchi". The recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 13 may be replaced for each recognition.
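One way to picture the dictionary contents is as a word list plus a table of permitted connections. The sketch below is purely illustrative: the patent does not describe a storage format, and the names `VOCABULARY`, `SYNTAX`, and `successors` are hypothetical. The words and the two connections are the examples given in the text.

```python
# Hypothetical in-memory form of the recognition target vocabulary dictionary.
VOCABULARY = [
    "tokyo-to", "kanagawa-ken", "kamakura-shi",
    "kesennuma", "yukuhashi", "marunouchi",
]

# Syntax information: which word may follow which (examples from the text).
SYNTAX = {
    "kanagawa-ken": ["kamakura-shi"],
    "tokyo-to": ["marunouchi"],
}


def successors(word):
    """Return the words that the syntax information allows after `word`."""
    return SYNTAX.get(word, [])
```

A word with no successors, such as "kamakura-shi" here, is one after which no further vocabulary is expected.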

[0006] FIG. 7 shows an example of an HMM (Hidden Markov Model), here the HMM for the case where "Kanagawa-ken" and "Kamakura-shi" are connected. In the figure, each circle represents a state of the HMM and is stored in the acoustic model storage means 14, and each arrow represents a transition and is stored in the recognition target vocabulary dictionary storage means 13. /L1/ denotes the acoustic model corresponding to the silent section at the beginning of the utterance (before speaking), /L2/ denotes the acoustic model corresponding to the silent section at the end of the utterance (after speaking), and /L3/ denotes the acoustic model corresponding to the silent section between words (during speaking).

[0007] The acoustic analysis means 12 cuts the speech signal of the speech section detected by the speech section detection means 11 into segments of fixed length (the frame length) at fixed intervals (the frame period), analyzes the resulting speech data for each frame, and outputs the time-series data that constitutes the acoustic analysis result to the matching means 15.
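The framing step can be sketched as below. The patent does not say which acoustic features are computed per frame, so `frame_power` (mean squared amplitude) is only a stand-in for the real analysis; both function names are hypothetical.

```python
def split_into_frames(samples, frame_length, frame_period):
    """Cut `samples` into frames of `frame_length` samples, one every `frame_period` samples."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_period
    return frames


def frame_power(frame):
    """Stand-in per-frame analysis: mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)
```

When the frame period is shorter than the frame length, consecutive frames overlap; the sequence of per-frame analysis values is the time-series data passed on for matching.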

[0008] The matching means 15 performs matching using the acoustic analysis result from the acoustic analysis means 12, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 13, and the acoustic models stored in the acoustic model storage means 14, obtains the likelihood at the final state of every recognition target vocabulary item, and outputs the item with the maximum likelihood at its final state as the recognition result.

[0009] Here, the matching means 15 obtains the likelihood by, for example, the following calculation. Let dic(l, n) be the acoustic model corresponding to the n-th state of recognition target vocabulary item l stored in the recognition target vocabulary dictionary storage means 13. Assuming that item l is in its n-th state at time (frame) t, let lklhd(l, t, n) be the likelihood of the analysis result for that one frame.

[0010] FIG. 8 shows an example of a recognition path for recognition target vocabulary item l, plotting the input time (frame) against the state of item l. Many such paths are possible; FIG. 8 shows one of them, where lklhd(l, t, n) is the likelihood when item l is in its n-th state at frame t.

[0011] For example, the likelihood for a path such as the one shown in FIG. 8 is computed as follows. If the speech section detected by the speech section detection means 11 is T frames long, the cumulative likelihood Lklhd'(l, n) for a given path up to the n-th state of recognition target vocabulary item l is given by the following equation.

[Equation 1]

Here, k(t) denotes the index of the state assigned to frame t.

[0012] The cumulative likelihood Lklhd(l, n), which is the maximum likelihood among the paths that reach the n-th state of recognition target vocabulary item l for this input speech, is given by the following equation.

[Equation 2]

Letting N(l) be the final state of recognition target vocabulary item l, the likelihood LK(l) of item l is

LK(l) = Lklhd(l, N(l))   (3)

The item L that attains the maximum likelihood LK(L) among the likelihoods of all recognition target vocabulary items is output as the recognition result.

[Equation 3]
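Equations 1 to 3 survive in this text only as placeholders. A plausible reconstruction from the surrounding definitions is given below. It is an assumption, not the published formulas: it supposes that the per-frame likelihoods are log-likelihoods that accumulate by summation (the text does not say whether a sum or a product is intended), that a path to the n-th state satisfies k(T) = n, and that the last formula carries the number (4) because (3) is already used for LK(l).

```latex
\begin{align}
\mathrm{Lklhd}'(l,n) &= \sum_{t=1}^{T} \mathrm{lklhd}\bigl(l,\,t,\,k(t)\bigr), \qquad k(T)=n \tag{1}\\
\mathrm{Lklhd}(l,n)  &= \max_{k(\cdot)} \mathrm{Lklhd}'(l,n) \tag{2}\\
\mathrm{LK}(l)       &= \mathrm{Lklhd}\bigl(l,\,N(l)\bigr) \tag{3}\\
L                    &= \operatorname*{arg\,max}_{l} \mathrm{LK}(l) \tag{4}
\end{align}
```

Read this way, (1) scores one path k(t), (2) takes the best path ending in state n, (3) evaluates each vocabulary item at its final state N(l), and (4) selects the item output as the recognition result.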

[0013] Thus, in the conventional speech recognition device shown in FIG. 5, the speech section detection means 11 detects the speech section of the input speech signal; the acoustic analysis means 12 cuts the speech signal of the detected speech section into frames, analyzes the speech data of each frame, and outputs the analysis result to the matching means 15; and the matching means 15 performs matching frame by frame using the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 13 and the acoustic models stored in the acoustic model storage means 14, obtains the likelihood at the final state of every recognition target vocabulary item, and outputs the item with the maximum likelihood at its final state as the recognition result.

[0014]

[Problems to Be Solved by the Invention] Because the conventional speech recognition device is configured as described above, when consecutive words are to be recognized and the speech is interrupted by a pause or the like, the speech that follows cannot be recognized, and recognition accuracy deteriorates. If, to address this, the device waits a fixed time for the next speech section, the response time from the utterance to the output of the recognition result becomes longer, and noise in the pause following the speech may be picked up and recognized, which also degrades recognition accuracy.

[0015] The present invention has been made to solve the above problems, and its object is to provide a speech recognition device that can improve recognition accuracy even when speech is interrupted by a pause or the like during the recognition of consecutive words.

[0016]

[Means for Solving the Problems] In the speech recognition device according to the present invention, the next speech section wait determination means sets a wait time until the next speech section according to the partial hypothesis with the maximum likelihood and, when the next speech section is detected before the set wait time elapses, instructs the matching means to continue matching on the next speech section.

[0017] In the speech recognition device according to the present invention, the next speech section wait determination means sets the wait time until the next speech section according to the syntactic position, or the position in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood.

[0018] In the speech recognition device according to the present invention, when the partial hypothesis with the maximum likelihood is a final result, the next speech section wait determination means sets the wait time to 0 and outputs the final result as the recognition result.

[0019] In the speech recognition device according to the present invention, when the partial hypothesis with the maximum likelihood is an intermediate result, the next speech section wait determination means sets a first wait time until the next speech section. If the next speech section is detected before the first wait time elapses, it instructs the matching means to continue matching on the next speech section; if the next speech section is not detected before the first wait time elapses, it outputs the intermediate result as the recognition result.

[0020] In the speech recognition device according to the present invention, when the partial hypothesis with the maximum likelihood is in an intermediate state, the next speech section wait determination means sets a second wait time until the next speech section that is shorter than the first wait time. If the next speech section is detected before the second wait time elapses, it instructs the matching means to continue matching on the next speech section; if the next speech section is not detected before the second wait time elapses, it either outputs the intermediate state as the recognition result or outputs that there is no recognition result.

[0021] In the speech recognition device according to the present invention, the next speech section wait determination means sets the wait time until the next speech section according to the last acoustic model of the partial hypothesis with the maximum likelihood.

[0022] In the speech recognition device according to the present invention, when the next speech section is not detected before the set wait time elapses, the next speech section wait determination means determines, from the syntax information defining the connection relations among the recognition target vocabulary items, whether the partial hypothesis with the maximum likelihood can be adopted as a recognition result, and outputs that partial hypothesis as the recognition result.

[0023] In the speech recognition device according to the present invention, when the partial hypothesis with the maximum likelihood cannot be adopted as a recognition result, the next speech section wait determination means sets the wait time until the next speech section according to the last acoustic model of the partial hypothesis with the next highest likelihood.

[0024] In the speech recognition device according to the present invention, the next speech section wait determination means receives from the matching means the likelihood of each partial hypothesis and the syntactic position, or the position in the recognition target vocabulary, of each partial hypothesis, and sets the wait time until the next speech section according to the syntactic position, or the position in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood. If the next speech section is detected before the set wait time elapses, it instructs the matching means to continue matching on the next speech section; if the next speech section is not detected before the set wait time elapses, it outputs the partial hypothesis with the maximum likelihood as the recognition result.

[0025] In the speech recognition device according to the present invention, the next speech section wait determination means receives the likelihood of each partial hypothesis from the matching means, the wait time information of each acoustic model stored in the acoustic model storage means, and the syntax information, stored in the recognition target vocabulary dictionary storage means, that defines the connection relations among the recognition target vocabulary items. It sets the wait time until the next speech section from the wait time information of the last acoustic model of the partial hypothesis with the maximum likelihood. If the next speech section is detected before the set wait time elapses, it instructs the matching means to continue matching on the next speech section; if the next speech section is not detected before the set wait time elapses, it determines from the syntax information whether the partial hypothesis with the maximum likelihood can be adopted as a recognition result and outputs that partial hypothesis as the recognition result.

[0026]

[Embodiments of the Invention] An embodiment of the present invention will be described below.

Embodiment 1. FIG. 1 is a block diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. In the figure, 1 is a speech section detection means that detects the speech section of an input speech signal, outputs the speech signal of the detected speech section, and outputs a speech section confirmation notification indicating that a speech section has been confirmed; 2 is an acoustic analysis means that performs acoustic analysis on the speech signal of the speech section detected by the speech section detection means 1; 3 is a recognition target vocabulary dictionary storage means that stores the recognition target vocabulary and syntax information defining the connection relations among the recognition target vocabulary items; and 4 is an acoustic model storage means that stores the acoustic models serving as the minimum units of recognition together with wait time information for each acoustic model.

[0027] Also in FIG. 1, 5 is a matching means that performs matching using the acoustic analysis result from the acoustic analysis means 2, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 3, and the acoustic models stored in the acoustic model storage means 4, computes the likelihood of each partial hypothesis representing each state of each recognition target vocabulary item, and obtains the syntactic position, or the position in the recognition target vocabulary, of each partial hypothesis from the syntax information, stored in the recognition target vocabulary dictionary storage means 3, that defines the connection relations among the recognition target vocabulary items.

[0028] Further, in FIG. 1, 6 is a next speech section wait determination means. It receives from the matching means 5 the likelihood of each partial hypothesis and the syntactic position, or the position in the recognition target vocabulary, of each partial hypothesis, and sets the wait time until the next speech section according to the syntactic position, or the position in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood. On receiving the speech section confirmation notification from the speech section detection means 1, if the next speech section has been detected before the set wait time elapses, it instructs the matching means 5 to continue matching on the next speech section; if the next speech section is not detected before the set wait time elapses, it outputs the partial hypothesis with the maximum likelihood as the recognition result.

[0029] The acoustic analysis means 2, the recognition target vocabulary dictionary storage means 3, and the acoustic model storage means 4 are thus equivalent to the acoustic analysis means 12, the recognition target vocabulary dictionary storage means 13, and the acoustic model storage means 14 of the conventional device shown in FIG. 5.

[0030] Next, the operation will be described. As shown in FIG. 6, the speech section detection means 1 detects each section in which the power of the input speech signal is at or above a predetermined threshold as a speech section candidate. If the pause between two speech section candidates is shorter than a predetermined threshold, for example 350 msec, the two candidates are detected as a single speech section, and the speech signal of the detected speech section is output to the acoustic analysis means 2. In addition, after detecting the start of a speech section, the speech section detection means 1 confirms that it is a speech section once the power of the speech signal has remained at or above the predetermined threshold for a predetermined time, for example 50 msec, and outputs a speech section confirmation notification to the next speech section wait determination means 6.
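The confirmation rule can be sketched as follows, again assuming one power value per frame. The function name and the 10 msec frame duration are hypothetical; the 50 msec duration is the example from the text.

```python
def confirmation_frame(power, power_threshold, frame_ms=10, confirm_ms=50):
    """Return the index of the frame at which a speech section is confirmed, or None.

    A section is confirmed once the power has stayed at or above the threshold
    for `confirm_ms` without interruption (50 msec in the text's example).
    """
    needed = -(-confirm_ms // frame_ms)  # ceiling division: frames required
    run = 0
    for i, p in enumerate(power):
        run = run + 1 if p >= power_threshold else 0
        if run >= needed:
            return i
    return None
```

A short burst above the threshold, such as a click, resets the counter when the power drops again and therefore does not trigger a confirmation notification.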

[0031] As in the conventional device, the acoustic analysis means 2 performs acoustic analysis on the speech signal of the speech section detected by the speech section detection means 1. That is, the acoustic analysis means 2 cuts the speech signal of the detected speech section into frames of the frame length at every frame period, analyzes the speech data of each frame, and outputs the time-series data that constitutes the acoustic analysis result to the matching means 5.

[0032] Whereas the conventional matching means 15 outputs, as the recognition result, the recognition target vocabulary item whose final state has the maximum likelihood, the matching means 5 uses the acoustic models stored in the acoustic model storage means 4 that correspond to each HMM state of every recognition target vocabulary item stored in the recognition target vocabulary dictionary storage means 3, together with the acoustic analysis result from the acoustic analysis means 2, to compute the likelihood of each partial hypothesis representing each state of each recognition target vocabulary item. It also obtains the syntactic position, or the position in the recognition target vocabulary, of each partial hypothesis from the syntax information, stored in the recognition target vocabulary dictionary storage means 3, that defines the connection relations among the recognition target vocabulary items. It then outputs the likelihood of each partial hypothesis and its syntactic position or position in the recognition target vocabulary to the next speech section wait determination means 6.

[0033] The likelihoods of the partial hypotheses are computed in sequence. In FIG. 7, for example, the likelihood of the partial hypothesis up to /ka/ is computed, then that of the partial hypothesis up to /ka/, /na/, then that of the partial hypothesis up to /ka/, /na/, /ga/, and so on, through to the likelihood of the partial hypothesis up to /ka/, /na/, /ga/, /wa/, /ke/, /N/, /L3/, /ka/, /ma/, /ku/, /ra/, /si/, /L2/.
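The per-state likelihoods can be obtained with a Viterbi-style recursion, sketched below. Several simplifications are assumptions and not part of the patent: one HMM state per acoustic model, a strictly left-to-right chain in which each frame either stays in the current state or advances by one, no transition probabilities, and log-likelihoods that add along a path. The function names are hypothetical.

```python
NEG_INF = float("-inf")


def partial_hypothesis_scores(frame_loglik):
    """Best cumulative log-likelihood of a path ending in each state at the last frame.

    frame_loglik[t][n] is the log-likelihood of frame t under state n.
    Entry n of the result scores the partial hypothesis covering states 0..n.
    """
    n_states = len(frame_loglik[0])
    score = [NEG_INF] * n_states
    score[0] = frame_loglik[0][0]  # every path starts in the first state
    for t in range(1, len(frame_loglik)):
        new_score = [NEG_INF] * n_states
        for n in range(n_states):
            stay = score[n]
            advance = score[n - 1] if n > 0 else NEG_INF
            best = max(stay, advance)
            if best > NEG_INF:
                new_score[n] = best + frame_loglik[t][n]
        score = new_score
    return score


def best_partial_hypothesis(models, scores):
    """Return (model sequence, score) of the partial hypothesis with the highest score."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return models[:idx + 1], scores[idx]
```

For a three-frame input that fits /ka/ and then /na/ well, `best_partial_hypothesis(["ka", "na", "ga"], scores)` selects the hypothesis ending at /na/, which is the kind of mid-word position the wait determination treats as an intermediate state.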

[0034] Among the partial hypotheses received from the matching means 5, the next speech section wait determination means 6 takes the one with the maximum likelihood and sets the wait time until the next speech section according to its syntactic position or its position in the recognition target vocabulary. On receiving the speech section confirmation notification from the speech section detection means 1, if the next speech section has been detected before the set wait time elapses, it instructs the matching means 5 to continue matching on the next speech section; if the next speech section is not detected before the set wait time elapses, it outputs the partial hypothesis with the maximum likelihood as the recognition result.

[0035] FIG. 2 is a flowchart showing the determination process of the next speech section wait determination means. Here, a partial hypothesis with the maximum likelihood is called a final result when it is at a position after which, according to the syntax or the recognition target vocabulary, no further vocabulary follows; an intermediate result when it is at a position designated in advance as a pause in the recognition target vocabulary dictionary storage means 3; and an intermediate state when it is at any other position. In the example of FIG. 7, the partial hypothesis with the maximum likelihood is a final result when it is at the position of /L2/, an intermediate result when it is at the position of /L3/, and an intermediate state when it is at a position other than /L1/, /L2/, and /L3/.
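For the FIG. 7 example, this three-way classification reduces to a lookup on the last acoustic model of the partial hypothesis. The sketch below uses hypothetical names and string labels. The text does not say how a hypothesis ending at /L1/ is classified in the FIG. 7 example; treating it as an intermediate state follows the general rule ("any other position") and is an assumption.

```python
FINAL_RESULT = "final_result"
INTERMEDIATE_RESULT = "intermediate_result"
INTERMEDIATE_STATE = "intermediate_state"


def classify_partial_hypothesis(last_model):
    """Classify a partial hypothesis by its last acoustic model (FIG. 7 example)."""
    if last_model == "L2":  # end-of-utterance silence: nothing can follow
        return FINAL_RESULT
    if last_model == "L3":  # between-word silence: a designated pause position
        return INTERMEDIATE_RESULT
    # Any other position, including /L1/ by assumption.
    return INTERMEDIATE_STATE
```

For instance, the hypothesis /ka/, /na/, /ga/, /wa/, /ke/, /N/, /L3/ ends at the pause after "Kanagawa-ken" and is classified as an intermediate result.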

[0036] The next speech section wait determination means 6 sets the wait time from the end of the speech section that has been matched to the start of the next speech section according to the syntactic position, or the position in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood. For example, when the partial hypothesis with the maximum likelihood is a final result, the wait time is set to 0 because there is no need to continue recognizing a subsequent speech section; when it is an intermediate result, the wait time ThTime1 is set to, for example, 3 seconds; and when it is an intermediate state, the wait time ThTime2 is set to, for example, 1 second.

[0037] In step ST11 of FIG. 2, the next speech section wait determination means 6 receives from the matching means 5 the likelihood of each partial hypothesis and its syntactic position or position in the recognition target vocabulary. In step ST12, the partial hypothesis with the maximum likelihood among those received is taken as the determination partial hypothesis. In step ST13, it is determined from the received syntactic position or position in the recognition target vocabulary whether the determination partial hypothesis is a final result. If it is a final result, the wait time is 0, so in step ST14 the determination partial hypothesis is output immediately as the recognition result.

【００３８】ステップＳＴ１３で、判定用部分仮説が最
終結果でなければ、ステップＳＴ１５にいて、判定用部
分仮説が中間結果であるかを判定し、中間結果であれば
ステップＳＴ１６に進み、中間結果でなければ、すなわ
ち、中間状態であれば、ステップＳＴ１９に進む。判定
用部分仮説が中間結果の場合には、ステップＳＴ１６に
おいて、音声区間検出手段１からの音声区間確定通知を
受けて、照合を行った音声区間の終端から次の音声区間
の始端までの待ち時間がＴｈＴｉｍｅ１未満、例えばＴ
ｈＴｉｍｅ１＝３秒未満であるかをチェックする。If the determination partial hypothesis is not a final result in step ST13, step ST15 determines whether it is an intermediate result. If it is an intermediate result, the process proceeds to step ST16; if it is not, that is, if it is an intermediate state, the process proceeds to step ST19. When the determination partial hypothesis is an intermediate result, step ST16 receives the voice section confirmation notification from the voice section detecting means 1 and checks whether the waiting time from the end of the matched voice section to the start of the next voice section is less than ThTime1, for example ThTime1 = 3 seconds.

【００３９】ステップＳＴ１６で、次の音声区間が検出
されないまま待ち時間ＴｈＴｉｍｅ１を経過した場合、
ステップＳＴ１７において、タイムアウト処理として中
間結果を認識結果として出力する。一方、ステップＳＴ
１６で、ＴｈＴｉｍｅ１未満で次音声区間が検出されて
いれば、ステップＳＴ１８において、照合手段５に以前
の照合状態から継続して認識を行うように指示する。In step ST16, if the waiting time ThTime1 has elapsed without the next voice section being detected, the intermediate result is output as the recognition result in step ST17 as timeout processing. On the other hand, if the next voice section is detected in less than ThTime1 in step ST16, the matching means 5 is instructed in step ST18 to continue recognition from the previous matching state.

【００４０】ここで、中間結果のときに、次の音声区間
の待ち時間（ＴｈＴｉｍｅ１）を、例えば３秒と長めに
設定しているのは、部分仮説が中間結果の場合、もとも
とポーズ（無音区間）が挿入されることが予想されてい
るため、無音区間が長い可能性が高いためである。Here, the waiting time for the next voice section in the case of an intermediate result (ThTime1) is set relatively long, for example 3 seconds, because a pause (silent section) is expected to be inserted at such a position in the first place, so the silent section is likely to be long.

【００４１】また、ステップＳＴ１５で、判定用部分仮
説が中間状態の場合には、ステップＳＴ１９において、
音声区間検出手段１からの音声区間確定通知を受けて、
照合を行った音声区間の終端から次の音声区間の始端ま
での待ち時間がＴｈＴｉｍｅ２未満、例えばＴｈＴｉｍ
ｅ２＝１秒未満であるかをチェックして、ＴｈＴｉｍｅ
１未満で次音声区間が検出されていれば、ステップＳＴ
１８において、照合手段５に以前の照合状態から継続し
て認識を行うように指示する。If the determination partial hypothesis is an intermediate state in step ST15, step ST19 receives the voice section confirmation notification from the voice section detecting means 1 and checks whether the waiting time from the end of the matched voice section to the start of the next voice section is less than ThTime2, for example ThTime2 = 1 second. If the next voice section is detected in less than ThTime2, the matching means 5 is instructed in step ST18 to continue recognition from the previous matching state.

【００４２】ステップＳＴ１９で、次の音声区間が検出
されないまま待ち時間が一定時間ＴｈＴｉｍｅ２を経過
した場合、ステップＳＴ２０においてタイムアウト処理
を行う。このタイムアウト処理では、前の音声区間での
部分仮説の中で認識結果として出力することができる最
大尤度のものを認識結果として出力する。また、このタ
イムアウト処理では、認識結果なしということでリジェ
クトとしても良い。In step ST19, if the waiting time ThTime2 has elapsed without the next voice section being detected, timeout processing is performed in step ST20. In this timeout processing, among the partial hypotheses of the preceding voice section that can be output as a recognition result, the one with the maximum likelihood is output as the recognition result. Alternatively, the timeout processing may reject the input as having no recognition result.
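Steps ST11 to ST20 can be summarized in a minimal sketch such as the one below. It is not the implementation of the specification: `PartialHypothesis`, `decide`, and the string constants are hypothetical names, `gap` stands for the measured time from the end of the matched voice section to the start of the next one (`None` when no next section was detected), and treating only final results and intermediate results as hypotheses that "can be output" in step ST20 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

FINAL_RESULT = "final_result"
INTERMEDIATE_RESULT = "intermediate_result"
INTERMEDIATE_STATE = "intermediate_state"
TH_TIME_1 = 3.0  # example value from the text
TH_TIME_2 = 1.0  # example value from the text


@dataclass
class PartialHypothesis:
    text: str
    likelihood: float
    position: str


def decide(hyps: List[PartialHypothesis],
           gap: Optional[float]) -> Tuple[str, Optional[str]]:
    """Return ("output", text), ("continue", None) or ("reject", None)."""
    best = max(hyps, key=lambda h: h.likelihood)      # ST12
    if best.position == FINAL_RESULT:                 # ST13
        return ("output", best.text)                  # ST14: waiting time is 0
    if best.position == INTERMEDIATE_RESULT:          # ST15
        if gap is not None and gap < TH_TIME_1:       # ST16
            return ("continue", None)                 # ST18
        return ("output", best.text)                  # ST17: timeout
    if gap is not None and gap < TH_TIME_2:           # ST19
        return ("continue", None)                     # ST18
    # ST20: output the most likely hypothesis that can be output, else reject.
    candidates = [h for h in hyps if h.position != INTERMEDIATE_STATE]
    if candidates:
        return ("output", max(candidates, key=lambda h: h.likelihood).text)
    return ("reject", None)
```

In a running recognizer the gap would be observed by a timer rather than passed in, but a pure function makes the branch structure of FIG. 2 easier to follow.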

【００４３】ここで、中間状態のときの次の音声区間の
待ち時間ＴｈＴｉｍｅ２を、中間結果のときの待ち時間
ＴｈＴｉｍｅ１より短めに設定しているのは、文章や単
語の区切り等、予めポーズが想定されている場所に比べ
て、それ程長いポーズが入らないと想定されるためであ
る。Here, the waiting time ThTime2 for the next voice section in an intermediate state is set shorter than the waiting time ThTime1 for an intermediate result because, compared with places where a pause is expected in advance, such as sentence or word boundaries, such a long pause is not expected to occur.

【００４４】なお、この実施の形態１では、説明の便宜
上、構文上の位置又は認識対象語彙上の位置を、最終結
果、中間結果、中間状態の３種類としたが、例えば、図
７に示す認識対象語彙“かながわけん”の“かながわ”
と“けん”の間に別の待ち時間を設定する等、さらに細
かい設定をしても良い。In this first embodiment, for convenience of explanation, the position in the syntax or in the recognition target vocabulary is classified into three types: final result, intermediate result, and intermediate state. Finer settings are also possible, such as setting a separate waiting time between "kanagawa" and "ken" of the recognition target vocabulary "kanagawaken" shown in FIG. 7.

【００４５】以上のように、この実施の形態１によれ
ば、最大尤度となる部分仮説の構文上の位置又は認識対
象語彙上の位置に対応して、次の音声区間までの待ちの
時間を設定することにより、連続した単語を認識させる
場合にポーズ等で音声が途切れても、認識精度を向上さ
せることができるという効果が得られる。As described above, according to this first embodiment, the waiting time until the next voice section is set according to the position, in the syntax or in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized.

【００４６】また、この実施の形態１によれば、構文の
最後まで発声が終了している場合には、発声から音声認
識結果を出力するまでの応答時間を早くすることができ
るという効果が得られる。Further, according to this first embodiment, when the utterance has been completed up to the end of the syntax, the response time from the utterance to the output of the voice recognition result can be shortened.

【００４７】実施の形態２．図３はこの発明の実施の形
態２による音声認識装置の構成を示すブロック図であ
る。図において、５ａは音響分析手段２による音響分析
結果と、認識対象語彙辞書記憶手段３に記憶されている
認識対象語彙と、音響モデル記憶手段４に記憶されてい
る音響モデルとを用いて照合を行い、各認識対象語彙の
各状態を示す各部分仮説における尤度を演算する照合手
段である。Embodiment 2. FIG. 3 is a block diagram showing the configuration of a voice recognition device according to a second embodiment of the present invention. In the figure, 5a denotes a matching means that performs matching using the acoustic analysis result from the acoustic analysis means 2, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means 3, and the acoustic models stored in the acoustic model storage means 4, and calculates the likelihood of each partial hypothesis indicating each state of each recognition target vocabulary.

【００４８】また、図３において、６ａは照合手段５ａ
からの部分仮説の尤度と、音響モデル記憶手段４からの
各音響モデルの待ち時間情報と、認識対象語彙辞書記憶
手段３からの各認識対象語彙の接続関係を定義する構文
情報とを入力し、最大尤度となる部分仮説の最後の音響
モデルの待ち時間情報により、次の音声区間までの待ち
時間を設定し、音声区間検出手段１からの音声区間検出
確定通知を受けて、設定した待ち時間未満に次の音声区
間が検出された場合には、照合手段５ａに次の音声区間
を継続して照合を行うよう指示し、設定した待ち時間未
満に次の音声区間が検出されない場合には、認識対象語
彙辞書記憶手段３からの構文情報により、最大尤度とな
る部分仮説が認識結果として採用可能かを判断して、認
識結果を出力する次音声区間待ち判定手段である。Further, in FIG. 3, 6a denotes a next voice section wait determination means. It receives the likelihood of each partial hypothesis from the matching means 5a, the waiting time information of each acoustic model from the acoustic model storage means 4, and the syntax information defining the connection relations among the recognition target vocabulary from the recognition target vocabulary dictionary storage means 3. It sets the waiting time until the next voice section from the waiting time information of the last acoustic model of the partial hypothesis with the maximum likelihood, and receives the voice section confirmation notification from the voice section detecting means 1. When the next voice section is detected within the set waiting time, it instructs the matching means 5a to continue matching with the next voice section. When the next voice section is not detected within the set waiting time, it uses the syntax information from the recognition target vocabulary dictionary storage means 3 to determine whether the partial hypothesis with the maximum likelihood can be adopted as a recognition result, and outputs the recognition result.

【００４９】さらに、図３において、音声区間検出手段
１、音響分析手段２、認識対象語彙辞書記憶手段３、音
響モデル記憶手段４は実施の形態１の図１に示す構成と
同等である。Further, in FIG. 3, the voice section detection means 1, the acoustic analysis means 2, the recognition target vocabulary dictionary storage means 3 and the acoustic model storage means 4 are equivalent to the configuration shown in FIG. 1 of the first embodiment.

【００５０】次に動作について説明する。上記実施の形
態１では、最大尤度となる部分仮説の構文上の位置又は
認識対象語彙上の位置に対応して、次の音声区間までの
待ち時間を設定していたが、この実施の形態２では、最
大尤度となる部分仮説の最後の音響モデルに対応して、
次の音声区間までの待ち時間を設定するものである。Next, the operation will be described. In the first embodiment, the waiting time until the next voice section is set according to the position, in the syntax or in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood. In this second embodiment, the waiting time until the next voice section is instead set according to the last acoustic model of the partial hypothesis with the maximum likelihood.

【００５１】例えば、図７において、語頭（発声前）の
無音区間に対応する音響モデル／Ｌ１／に対しては音声
区間の待ち時間を１秒とし、語尾（発声後）の無音区間
に対応する音響モデル／Ｌ２／に対しては音声区間の待
ち時間を０秒とし、単語間（発声中）の無音区間に対応
する音響モデル／Ｌ３／に対しては音声区間の待ち時間
を３秒とし、それ以外の／ｋａ／，／ｎａ／等の音響モ
デルに対しては１秒等とする。また、拗音に対応する音
響モデルに対しては例えば２秒とし、さらに、騒音下環
境で音声区間検出で誤って無音区間と判断されやすい音
響モデル、例えば無声化しやすい「し」、「ひ」、
「ふ」、「ち」等に対しては、例えば１．５秒とする。
これらの待ち時間は音響モデル記憶手段４に各音響モデ
ルの待ち時間情報として記憶されている。For example, in FIG. 7, the waiting time for a voice section is set to 1 second for the acoustic model /L1/ corresponding to the silent section at the beginning of a word (before the utterance), 0 seconds for the acoustic model /L2/ corresponding to the silent section at the end of a word (after the utterance), 3 seconds for the acoustic model /L3/ corresponding to the silent section between words (during the utterance), and, for example, 1 second for the other acoustic models such as /ka/ and /na/. The waiting time is set to, for example, 2 seconds for acoustic models corresponding to contracted sounds (yōon), and to, for example, 1.5 seconds for acoustic models that voice section detection tends to mistake for silent sections in noisy environments, such as the easily devoiced "shi", "hi", "fu", and "chi". These waiting times are stored in the acoustic model storage means 4 as the waiting time information of each acoustic model.
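The per-model waiting time information described in this paragraph could be held in a simple lookup structure, sketched below. The example values (1, 0, 3, 2, and 1.5 seconds) are those given in the text; the function name `wait_time_for_model`, the model labels without slashes, and the particular contracted-sound labels in `CONTRACTED_MODELS` are assumptions made for illustration.

```python
# Waiting times (seconds) for the silence models of FIG. 7.
SILENCE_WAIT = {
    "L1": 1.0,  # silence before the utterance
    "L2": 0.0,  # silence after the utterance
    "L3": 3.0,  # silence between words
}

# Hypothetical labels for contracted sounds (yoon); the text names no examples.
CONTRACTED_MODELS = {"kya", "sha", "cho"}

# Easily devoiced sounds named in the text.
EASILY_DEVOICED_MODELS = {"shi", "hi", "fu", "chi"}

DEFAULT_WAIT = 1.0  # other models such as /ka/ and /na/


def wait_time_for_model(model: str) -> float:
    """Return the waiting time until the next voice section for one acoustic model."""
    if model in SILENCE_WAIT:
        return SILENCE_WAIT[model]
    if model in CONTRACTED_MODELS:
        return 2.0
    if model in EASILY_DEVOICED_MODELS:
        return 1.5
    return DEFAULT_WAIT
```

Keeping the table with the acoustic models, rather than with the syntax, mirrors the point of this embodiment: the table does not have to change when the vocabulary changes.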

【００５２】図４は次音声区間待ち判定手段の判定処理
を示すフローチャートである。ステップＳＴ２１におい
て、次音声区間待ち判定手段６ａは照合手段５ａから、
各部分仮説における尤度を受け取る。ステップＳＴ２２
において、受け取った部分仮説の中で最大尤度を取る部
分仮説を判定用部分仮説とする。FIG. 4 is a flowchart showing the determination processing of the next voice section wait determination means. In step ST21, the next voice section wait determination means 6a receives the likelihood of each partial hypothesis from the matching means 5a. In step ST22, the partial hypothesis with the maximum likelihood among the received partial hypotheses is taken as the determination partial hypothesis.

【００５３】ステップＳＴ２３において、判定用部分仮
説の最後の音響モデルｐの待ち時間情報を音響モデル記
憶手段４から抽出し、抽出した待ち時間情報により、次
の音声区間の待ち時間ＴｈＴｉｍｅ（ｐ）を設定する。
例えば、図７において、最後の音響モデルが／Ｌ２／の
場合にはＴｈＴｉｍｅ（ｐ）＝０秒と設定し、／Ｌ３／
の場合にはＴｈＴｉｍｅ（ｐ）＝３秒と設定する。In step ST23, the waiting time information of the last acoustic model p of the determination partial hypothesis is extracted from the acoustic model storage means 4, and the waiting time ThTime(p) for the next voice section is set from the extracted information. For example, in FIG. 7, ThTime(p) = 0 seconds is set when the last acoustic model is /L2/, and ThTime(p) = 3 seconds is set when it is /L3/.

【００５４】ステップＳＴ２４において、音声区間検出
手段１からの音声区間確定通知を受けて、照合を行った
音声区間の終端からの次の音声区間の始端までの待ち時
間が、ＴｈＴｉｍｅ（ｐ）を超えていないかをチェック
する。ステップＳＴ２４で、待ち時間ＴｈＴｉｍｅ
（ｐ）未満で次の音声区間が検出されていれば、ステッ
プＳＴ２５において、照合手段５ａに以前の照合状態か
ら継続して認識を行うように指示する。In step ST24, the voice section confirmation notification from the voice section detecting means 1 is received, and it is checked whether the waiting time from the end of the matched voice section to the start of the next voice section exceeds ThTime(p). If the next voice section is detected in less than the waiting time ThTime(p) in step ST24, the matching means 5a is instructed in step ST25 to continue recognition from the previous matching state.

【００５５】一方、ステップＳＴ２４で、次の音声区間
が検出されないまま待ち時間ＴｈＴｉｍｅ（ｐ）を経過
した場合、次のステップＳＴ２６からステップＳＴ２９
までのタイムアウト処理を行う。このタイムアウト処理
として、例えば以下のような処理を行う。On the other hand, if the waiting time ThTime(p) has elapsed in step ST24 without the next voice section being detected, the timeout processing of steps ST26 to ST29 is performed. For example, the following processing is performed as this timeout processing.

【００５６】ステップＳＴ２６において、判定用部分仮
説が認識結果として採用できるものであるか判定する。
判定用部分仮説が認識結果として採用できるかは、認識
対象語彙辞書記憶手段３に各認識対象語彙の接続関係を
定義する構文情報として記憶されている。この構文情報
としては、例えば図７において、語尾（発声後）の無音
区間に対応する音響モデル／Ｌ２／に到達している部分
仮説だけを認識結果として採用するとか、語尾（発声
後）の無音区間に対応する音響モデル／Ｌ２／、又は単
語間（発声中）の無音区間に対応する音響モデル／Ｌ３
／に到達している部分仮説を認識結果として採用すると
いうものである。In step ST26, it is determined whether the determination partial hypothesis can be adopted as a recognition result. Whether it can be adopted is stored in the recognition target vocabulary dictionary storage means 3 as syntax information defining the connection relations among the recognition target vocabulary. For example, in FIG. 7, this syntax information may specify that only partial hypotheses that have reached the acoustic model /L2/, corresponding to the silent section at the end of a word (after the utterance), are adopted as recognition results, or that partial hypotheses that have reached either /L2/ or the acoustic model /L3/, corresponding to the silent section between words (during the utterance), are adopted.

【００５７】ステップＳＴ２６の判定結果で、認識結果
として採用可能であれば、ステップＳＴ２７において、
判定用部分仮説を認識結果として出力する。ステップＳ
Ｔ２６の判定結果で、認識結果として採用不可能であれ
ば、ステップＳＴ２８において、判定用部分仮説の次に
尤度が高い部分仮説が存在するかチェックする。If the determination in step ST26 is that the determination partial hypothesis can be adopted as a recognition result, it is output as the recognition result in step ST27. If it cannot be adopted, step ST28 checks whether a partial hypothesis with the next highest likelihood after the determination partial hypothesis exists.

【００５８】ステップＳＴ２８で、次に尤度が高い部分
仮説が存在していれば、ステップＳＴ２９において、次
に尤度が高い部分仮説を新たな判定部分用部分仮説と
し、ステップＳＴ２３に戻り、上記の処理を繰り返す。
一方、ステップＳＴ２８で、次に尤度が高い部分仮説が
存在しなければ、認識結果なしとしてリジェクトし終了
する。If a partial hypothesis with the next highest likelihood exists in step ST28, it is taken as the new determination partial hypothesis in step ST29, the process returns to step ST23, and the above processing is repeated. On the other hand, if no partial hypothesis with the next highest likelihood exists in step ST28, the input is rejected as having no recognition result and the process ends.
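The loop formed by steps ST22 to ST29 can be sketched as below. This is a simplified illustration, not the method of the specification: `PartialHypothesis`, `decide_by_last_model`, `WAIT_TIME`, and `adoptable_models` are hypothetical names, `gap` is the measured time from the end of the matched voice section to the start of the next one (`None` if none was detected), and the set of adoptable last models stands in for the syntax information.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Example per-model waiting times from the text; other models default to 1 second.
WAIT_TIME = {"L1": 1.0, "L2": 0.0, "L3": 3.0}
DEFAULT_WAIT = 1.0


@dataclass
class PartialHypothesis:
    text: str
    likelihood: float
    last_model: str  # last acoustic model p reached by this hypothesis


def decide_by_last_model(
    hyps: List[PartialHypothesis],
    gap: Optional[float],
    adoptable_models: FrozenSet[str] = frozenset({"L2", "L3"}),
) -> Tuple[str, Optional[str]]:
    """Return ("output", text), ("continue", None) or ("reject", None)."""
    # ST22, then ST28/ST29: visit hypotheses in descending order of likelihood.
    for hyp in sorted(hyps, key=lambda h: h.likelihood, reverse=True):
        th_time = WAIT_TIME.get(hyp.last_model, DEFAULT_WAIT)  # ST23
        if gap is not None and gap < th_time:                  # ST24
            return ("continue", None)                          # ST25
        if hyp.last_model in adoptable_models:                 # ST26
            return ("output", hyp.text)                        # ST27
    return ("reject", None)  # no further hypothesis exists
```

Because the waiting time is reset for each candidate, a lower-ranked hypothesis ending in /L3/ can still lead to continued matching after a higher-ranked one has timed out.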

【００５９】ここで、次に尤度が高い部分仮説が存在し
ない場合があるのは、全ての部分仮説の演算量は膨大に
なるため、ビームサーチと呼ばれる方法等により、フレ
ーム毎に、最大尤度から一定以上の尤度の差がある部分
仮説の演算をしなかったり、最大尤度となる部分仮説か
ら上位ｎ個までの部分仮説の演算しかしないことにより
演算量を削減しているからである。Here, a partial hypothesis with the next highest likelihood may not exist because computing every partial hypothesis would require an enormous amount of computation. The amount is therefore reduced by a method such as beam search, in which, for each frame, partial hypotheses whose likelihood differs from the maximum likelihood by a fixed amount or more are not computed, or only the top n partial hypotheses counted from the one with the maximum likelihood are computed.
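The two pruning rules mentioned here can be illustrated with a short sketch. The function `prune` and its parameters `beam_width` and `top_n` are hypothetical names, and representing the hypotheses of one frame as a dictionary from label to log-likelihood is an assumption made to keep the example small.

```python
from typing import Dict, Optional


def prune(scores: Dict[str, float],
          beam_width: Optional[float] = None,
          top_n: Optional[int] = None) -> Dict[str, float]:
    """Keep only the partial hypotheses that survive pruning in one frame.

    beam_width: drop hypotheses whose likelihood is this far or more below the best.
    top_n: keep at most this many hypotheses, best first.
    """
    if not scores:
        return {}
    best = max(scores.values())
    kept = list(scores.items())
    if beam_width is not None:
        kept = [(k, v) for k, v in kept if best - v < beam_width]
    kept.sort(key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    return dict(kept)
```

With either rule only a few hypotheses may survive, which is why the search for a next most likely hypothesis in step ST28 can come up empty.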

【００６０】以上のように、この実施の形態２によれ
ば、最大尤度となる部分仮説の最後の音響モデルに対応
して、次の音声区間までの待ち時間を設定することによ
り、連続した単語を認識させる場合にポーズ等で音声が
途切れても、認識精度を向上させることができると共
に、ポーズや、拗音等で想定される無音区間が異なるこ
とに対応でき、無声化しやすい音声を音声区間検出で誤
って無音区間とした場合にも対応でき、認識精度を向上
させることができるという効果が得られる。As described above, according to this second embodiment, the waiting time until the next voice section is set according to the last acoustic model of the partial hypothesis with the maximum likelihood. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized. In addition, the device can handle the different silent section lengths expected for pauses, contracted sounds, and the like, and can handle cases where voice section detection mistakes an easily devoiced sound for a silent section, which further improves recognition accuracy.

【００６１】また、この実施の形態２によれば、最大尤
度となる部分仮説の最後の音響モデルに対応して次の音
声区間までの待ち時間を設定することにより、最大尤度
となる部分仮説の構文上の位置又は認識対象語彙上の位
置に対応して、次の音声区間待ちの時間を設定するより
も、構文や認識対象語彙を変更する際に、細かく待ち時
間を設定する必要がなくなるという効果が得られる。Further, according to this second embodiment, the waiting time until the next voice section is set according to the last acoustic model of the partial hypothesis with the maximum likelihood. Compared with setting it according to the position in the syntax or in the recognition target vocabulary, this removes the need to set waiting times in detail whenever the syntax or the recognition target vocabulary is changed.

【００６２】上記実施の形態１及び実施の形態２の音声
認識装置の各手段については、ハードウェア、ソフトウ
ェアのいずれでも構成できることはいうまでもない。ま
た、ソフトウェアによって構成する場合には、そのソフ
トウェアを記録した媒体が必要となる。It goes without saying that each means of the speech recognition apparatus of the above-mentioned first and second embodiments can be constituted by either hardware or software. Further, when the software is used, a medium in which the software is recorded is required.

【００６３】[0063]

【発明の効果】以上のように、この発明によれば、次音
声区間待ち判定手段が、最大尤度となる部分仮説に対応
して次の音声区間までの待ち時間を設定し、設定した待
ち時間未満に次の音声区間が検出された場合には、照合
手段に次の音声区間を継続して照合を行うよう指示する
ことにより、連続した単語を認識させる場合にポーズ等
で音声が途切れても、認識精度を向上させることができ
るという効果がある。As described above, according to the present invention, the next voice section wait determination means sets the waiting time until the next voice section in accordance with the partial hypothesis having the maximum likelihood, and the set wait time is set. If the next voice segment is detected within the time, the collating means is instructed to continue the collation of the next voice segment so that the voice may be interrupted by a pause when recognizing consecutive words. Also, there is an effect that the recognition accuracy can be improved.

【００６４】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説の構文上の位置又は認識
対象語彙上の位置に対応して次の音声区間までの待ち時
間を設定することにより、連続した単語を認識させる場
合にポーズ等で音声が途切れても、認識精度を向上させ
ることができるという効果がある。According to the present invention, the next speech segment waiting determination means sets the waiting time until the next speech segment in correspondence with the syntactic position of the partial hypothesis having the maximum likelihood or the position on the recognition target vocabulary. By doing so, when recognizing continuous words, even if the voice is interrupted due to a pause or the like, there is an effect that the recognition accuracy can be improved.

【００６５】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説が最終結果の場合には、
待ち時間を０として最終結果を認識結果として出力する
ことにより、構文の最後まで発声が終了している場合に
は、発声から音声認識結果を出力するまでの応答時間を
早くすることができるという効果がある。According to the present invention, when the partial hypothesis with the maximum likelihood is a final result, the next voice section wait determination means sets the waiting time to 0 and outputs the final result as the recognition result. As a result, when the utterance has been completed up to the end of the syntax, the response time from the utterance to the output of the voice recognition result can be shortened.

【００６６】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説が中間結果の場合には、
次の音声区間までの第１の待ち時間を設定し、設定した
上記第１の待ち時間未満に次の音声区間が検出された場
合には、照合手段に次の音声区間を継続して照合を行う
よう指示し、設定した第１の待ち時間未満に次の音声区
間が検出されない場合には、中間結果を認識結果として
出力することにより、連続した単語を認識させる場合に
ポーズ等で音声が途切れても、認識精度を向上させるこ
とができるという効果がある。According to the present invention, when the partial hypothesis with the maximum likelihood is an intermediate result, the next voice section wait determination means sets a first waiting time until the next voice section. When the next voice section is detected within the set first waiting time, it instructs the matching means to continue matching with the next voice section; when it is not, it outputs the intermediate result as the recognition result. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized.

【００６７】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説が、中間状態の場合に
は、第１の待ち時間より短い次の音声区間までの第２の
待ち時間を設定し、設定した第２の待ち時間未満に次の
音声区間が検出された場合には、照合手段に次の音声区
間を継続して照合を行うよう指示し、設定した第２の待
ち時間未満に次の音声区間が検出されない場合には、中
間状態を認識結果として出力するか、又は認識結果なし
を出力することにより、連続した単語を認識させる場合
に中間状態等で音声が途切れても、認識精度を向上させ
ることができるという効果がある。According to the present invention, when the partial hypothesis with the maximum likelihood is an intermediate state, the next voice section wait determination means sets a second waiting time until the next voice section that is shorter than the first waiting time. When the next voice section is detected within the set second waiting time, it instructs the matching means to continue matching with the next voice section; when it is not, it outputs the intermediate state as the recognition result or outputs that there is no recognition result. As a result, recognition accuracy can be improved even if the voice is interrupted in an intermediate state or the like while consecutive words are being recognized.

【００６８】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説の最後の音響モデルに対
応して次の音声区間までの待ち時間を設定することによ
り、連続した単語を認識させる場合にポーズ等で音声が
途切れても、認識精度を向上させることができると共
に、ポーズや、拗音等で想定される無音区間が異なるこ
とに対応でき、無声化しやすい音声を音声区間検出で誤
って無音区間とした場合にも対応でき、認識精度を向上
させることができるという効果がある。According to the present invention, the next voice section wait determination means sets the waiting time until the next voice section according to the last acoustic model of the partial hypothesis with the maximum likelihood. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized. In addition, the device can handle the different silent section lengths expected for pauses, contracted sounds, and the like, and can handle cases where voice section detection mistakes an easily devoiced sound for a silent section, which further improves recognition accuracy.

【００６９】この発明によれば、次音声区間待ち判定手
段が、設定した待ち時間未満に次の音声区間が検出され
ない場合には、最大尤度となる部分仮説が各認識対象語
彙の接続関係を定義する構文情報により認識結果として
採用可能かを判断して、最大尤度となる部分仮説を認識
結果として出力することにより、認識精度を向上させる
ことができるという効果がある。According to the present invention, when the next voice section is not detected within the set waiting time, the next voice section wait determination means determines, from the syntax information defining the connection relations among the recognition target vocabulary, whether the partial hypothesis with the maximum likelihood can be adopted as a recognition result, and outputs the partial hypothesis with the maximum likelihood as the recognition result. As a result, recognition accuracy can be improved.

【００７０】この発明によれば、次音声区間待ち判定手
段が、最大尤度となる部分仮説が認識結果として採用不
可能な場合に、次に尤度が高い部分仮説の最後の音響モ
デルに対応して次の音声区間までの待ち時間を設定する
ことにより、連続した単語を認識させる場合にポーズ等
で音声が途切れても、認識精度を向上させることができ
ると共に、ポーズや、拗音等で想定される無音区間が異
なることに対応でき、無声化しやすい音声を音声区間検
出で誤って無音区間とした場合にも対応でき、認識精度
を向上させることができるという効果がある。According to the present invention, when the partial hypothesis with the maximum likelihood cannot be adopted as a recognition result, the next voice section wait determination means sets the waiting time until the next voice section according to the last acoustic model of the partial hypothesis with the next highest likelihood. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized. In addition, the device can handle the different silent section lengths expected for pauses, contracted sounds, and the like, and can handle cases where voice section detection mistakes an easily devoiced sound for a silent section, which further improves recognition accuracy.

【００７１】この発明によれば、次音声区間待ち判定手
段が、照合手段からの各部分仮説の尤度と、各部分仮説
の構文上の位置又は認識対象語彙上の位置とを入力し、
最大尤度となる部分仮説の構文上の位置又は認識対象語
彙上の位置に対応して次の音声区間までの待ち時間を設
定し、設定した待ち時間未満に次の音声区間が検出され
た場合には、照合手段に次の音声区間を継続して照合を
行うよう指示し、設定した待ち時間未満に次の音声区間
が検出されない場合には、最大尤度となる部分仮説を認
識結果として出力することにより、連続した単語を認識
させる場合にポーズ等で音声が途切れても、認識精度を
向上させることができるという効果がある。According to the present invention, the next voice section wait determination means receives the likelihood of each partial hypothesis from the matching means and each partial hypothesis's position in the syntax or in the recognition target vocabulary, and sets the waiting time until the next voice section according to the position of the partial hypothesis with the maximum likelihood. When the next voice section is detected within the set waiting time, it instructs the matching means to continue matching with the next voice section; when it is not, it outputs the partial hypothesis with the maximum likelihood as the recognition result. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized.

【００７２】この発明によれば、次音声区間待ち判定手
段が、照合手段からの各部分仮説の尤度と、音響モデル
記憶手段に記憶されている各音響モデルの待ち時間情報
と、認識対象語彙辞書記憶手段に記憶されている各認識
対象語彙の接続関係を定義する構文情報とを入力し、最
大尤度となる部分仮説の最後の音響モデルの待ち時間情
報により、次の音声区間までの待ち時間を設定し、設定
した待ち時間未満に次の音声区間が検出された場合に
は、照合手段に次の音声区間を継続して照合を行うよう
指示し、設定した待ち時間未満に次の音声区間が検出さ
れない場合には、最大尤度となる部分仮説が構文情報に
より認識結果として採用可能かを判断し、最大尤度とな
る部分仮説を認識結果として出力することにより、連続
した単語を認識させる場合にポーズ等で音声が途切れて
も、認識精度を向上させることができると共に、ポーズ
や、拗音等で想定される無音区間が異なることに対応で
き、無声化しやすい音声を音声区間検出で誤って無音区
間とした場合にも対応でき、認識精度を向上させること
ができるという効果がある。According to the present invention, the next voice section wait determination means receives the likelihood of each partial hypothesis from the matching means, the waiting time information of each acoustic model stored in the acoustic model storage means, and the syntax information, stored in the recognition target vocabulary dictionary storage means, that defines the connection relations among the recognition target vocabulary. It sets the waiting time until the next voice section from the waiting time information of the last acoustic model of the partial hypothesis with the maximum likelihood. When the next voice section is detected within the set waiting time, it instructs the matching means to continue matching with the next voice section; when it is not, it determines from the syntax information whether the partial hypothesis with the maximum likelihood can be adopted as a recognition result, and outputs that partial hypothesis as the recognition result. As a result, recognition accuracy can be improved even if the voice is interrupted by a pause or the like while consecutive words are being recognized. In addition, the device can handle the different silent section lengths expected for pauses, contracted sounds, and the like, and can handle cases where voice section detection mistakes an easily devoiced sound for a silent section, which further improves recognition accuracy.

[Brief description of drawings]

【図１】この発明の実施の形態１による音声認識装置
の構成を示すブロック図である。FIG. 1 is a block diagram showing a configuration of a voice recognition device according to a first embodiment of the present invention.

【図２】この発明の実施の形態１による音声認識装置
の次音声区間待ち判定手段の判定処理を示すフローチャ
ートである。FIG. 2 is a flowchart showing a determination process of a next voice section wait determination means of the voice recognition device according to the first embodiment of the present invention.

【図３】この発明の実施の形態２による音声認識装置
の構成を示すブロック図である。FIG. 3 is a block diagram showing a configuration of a voice recognition device according to a second embodiment of the present invention.

【図４】この発明の実施の形態２による音声認識装置
の次音声区間待ち判定手段の判定処理を示すフローチャ
ートである。FIG. 4 is a flowchart showing a determination process of a next voice section wait determination means of the voice recognition device according to the second embodiment of the present invention.

【図５】従来の音声認識装置の構成を示すブロック図
である。FIG. 5 is a block diagram showing a configuration of a conventional voice recognition device.

【図６】音声区間の始終端検出アルゴリズムを説明す
る図である。FIG. 6 is a diagram illustrating a start / end detection algorithm of a voice section.

【図７】ＨＭＭの例を示す図である。FIG. 7 is a diagram showing an example of an HMM.

【図８】認識対象語彙に対する認識パスの例を示す図
である。FIG. 8 is a diagram showing an example of a recognition path for a recognition target vocabulary.

[Explanation of symbols]

１音声区間検出手段、２音響分析手段、３認識対
象語彙辞書記憶手段、４音響モデル記憶手段、５，５
ａ照合手段、６，６ａ次音声区間待ち判定手段。1: voice section detecting means; 2: acoustic analysis means; 3: recognition target vocabulary dictionary storage means; 4: acoustic model storage means; 5, 5a: matching means; 6, 6a: next voice section wait determination means.

Claims

[Claims]

1. A voice recognition device comprising: voice section detecting means for detecting a voice section of an input voice signal and outputting a voice section confirmation notification indicating that the voice section has been confirmed; acoustic analysis means for performing acoustic analysis on the voice signal of the voice section detected by the voice section detecting means; recognition target vocabulary dictionary storage means for storing the recognition target vocabulary to be recognized; acoustic model storage means for storing acoustic models, each being a minimum unit of recognition; matching means for performing matching using the acoustic analysis result from the acoustic analysis means, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means, and the acoustic models stored in the acoustic model storage means, and for calculating the likelihood of each partial hypothesis indicating each state of each recognition target vocabulary; and next voice section wait determination means for setting a waiting time until the next voice section in correspondence with the partial hypothesis with the maximum likelihood calculated by the matching means, receiving the voice section confirmation notification from the voice section detecting means, and, when the next voice section is detected within the set waiting time, instructing the matching means to continue matching with the next voice section.

2. The voice recognition device according to claim 1, wherein the next voice section wait determination means sets the waiting time until the next voice section in correspondence with the position, in the syntax or in the recognition target vocabulary, of the partial hypothesis with the maximum likelihood.

3. The voice recognition device according to claim 2, wherein, when the partial hypothesis with the maximum likelihood is a final result, located at a position in the syntax or in the recognition target vocabulary after which no further vocabulary follows, the next voice section wait determination means sets the waiting time to 0 and outputs the final result as a recognition result.

4. The voice recognition device according to claim 2, wherein, when the partial hypothesis with the maximum likelihood is an intermediate result, located at a position designated in advance as a pause, the next voice section wait determination means sets a first waiting time until the next voice section, instructs the matching means to continue matching with the next voice section when the next voice section is detected within the set first waiting time, and outputs the intermediate result as a recognition result when the next voice section is not detected within the set first waiting time.

5. The voice recognition device according to claim 4, wherein, when the partial hypothesis with the maximum likelihood is an intermediate state, being neither a final result located at a position in the syntax or in the recognition target vocabulary after which no further vocabulary follows nor the intermediate result, the next voice section wait determination means sets a second waiting time until the next voice section that is shorter than the first waiting time, instructs the matching means to continue matching with the next voice section when the next voice section is detected within the set second waiting time, and, when the next voice section is not detected within the set second waiting time, outputs the intermediate state as a recognition result or outputs that there is no recognition result.

6. The voice recognition device described, wherein the next voice section wait determination means sets the waiting time until the next voice section in accordance with the last acoustic model of the partial hypothesis having the maximum likelihood.

7. The voice recognition device according to claim 6, wherein, when the next voice section is not detected within the set waiting time, the next voice section wait determination means determines, from information defining the connection relationships of each recognition target vocabulary, whether the partial hypothesis having the maximum likelihood can be adopted as a recognition result, and outputs the partial hypothesis having the maximum likelihood as the recognition result.

8. The voice recognition device according to claim 7, wherein, when the partial hypothesis having the maximum likelihood cannot be adopted as a recognition result, the next voice section wait determination means sets the waiting time until the next voice section in accordance with the last acoustic model of the partial hypothesis having the next highest likelihood.
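The acoustic-model-dependent waiting time of claims 6 to 8 can be sketched in the same way. The sketch below is only one possible reading: the model labels, the waiting times, the `ACCEPTED` set standing in for the syntax information, and the function names are all hypothetical, and the silence before the next voice section is again supplied as a known value instead of being timed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hypothesis:
    words: Tuple[str, ...]   # recognized vocabulary sequence so far
    models: Tuple[str, ...]  # acoustic model labels; the last one sets the wait
    likelihood: float


# Hypothetical waiting time information per acoustic model, in seconds.
WAIT_BY_MODEL = {"q": 0.5}
DEFAULT_WAIT_S = 0.2

# Hypothetical stand-in for the syntax information: the word sequences
# that may be adopted as a complete recognition result.
ACCEPTED = {("A", "B")}


def wait_for(h: Hypothesis) -> float:
    """Waiting time associated with the last acoustic model of h."""
    return WAIT_BY_MODEL.get(h.models[-1], DEFAULT_WAIT_S)


def adoptable(h: Hypothesis) -> bool:
    """Whether the syntax information allows h as a recognition result."""
    return h.words in ACCEPTED


def decide(
    hyps: List[Hypothesis], gap_s: Optional[float]
) -> Tuple[str, Optional[Hypothesis]]:
    """Return ("continue", None) or ("output", hypothesis or None)."""
    # Try hypotheses in descending order of likelihood (claim 8 fallback).
    for h in sorted(hyps, key=lambda x: x.likelihood, reverse=True):
        if gap_s is not None and gap_s < wait_for(h):
            return ("continue", None)  # next section arrived in time
        if adoptable(h):
            return ("output", h)       # claim 7: adopt as the result
    return ("output", None)            # nothing adoptable
```

In this sketch a silence of 0.3 s exceeds the waiting time of the best hypothesis, but because that hypothesis cannot be adopted, the longer waiting time of the next hypothesis applies and collation continues.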

9. A voice recognition device comprising: voice section detection means for detecting a voice section of an input voice signal and outputting a voice section confirmation notification indicating that the voice section has been confirmed; acoustic analysis means for performing acoustic analysis on the voice signal of the voice section detected by the voice section detection means; recognition target vocabulary dictionary storage means for storing the recognition target vocabulary to be recognized and syntax information defining the connection relationships between the recognition target vocabulary; acoustic model storage means for storing acoustic models, each being a minimum unit of recognition; collating means for performing collation using the acoustic analysis result from the acoustic analysis means, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means, and the acoustic models stored in the acoustic model storage means, calculating the likelihood of each partial hypothesis indicating each state of each recognition target vocabulary, and obtaining the syntactic position of the partial hypothesis, or its position within the recognition target vocabulary, from the syntax information stored in the recognition target vocabulary dictionary storage means; and next voice section wait determination means for receiving, from the collating means, the likelihood of each partial hypothesis and the syntactic position of the partial hypothesis or its position within the recognition target vocabulary, setting a waiting time until the next voice section in accordance with the syntactic position of the partial hypothesis or its position within the recognition target vocabulary, instructing the collating means, after receiving the voice section confirmation notification from the voice section detection means, to continue collation with the next voice section when the next voice section is detected within the set waiting time, and outputting the partial hypothesis having the maximum likelihood as the recognition result when the next voice section is not detected within the set waiting time.

10. A voice recognition device comprising: voice section detection means for detecting a voice section of an input voice signal and outputting a voice section confirmation notification indicating that the voice section has been confirmed; acoustic analysis means for performing acoustic analysis on the voice signal of the voice section detected by the voice section detection means; recognition target vocabulary dictionary storage means for storing the recognition target vocabulary to be recognized and syntax information defining the connection relationships between the recognition target vocabulary; acoustic model storage means for storing acoustic models, each being a minimum unit of recognition, and waiting time information for each acoustic model; collating means for performing collation using the acoustic analysis result from the acoustic analysis means, the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means, and the acoustic models stored in the acoustic model storage means, and calculating the likelihood of each partial hypothesis indicating each state of each recognition target vocabulary; and next voice section wait determination means for receiving the likelihood of each partial hypothesis from the collating means, the waiting time information of each acoustic model stored in the acoustic model storage means, and the syntax information defining the connection relationships of the recognition target vocabulary stored in the recognition target vocabulary dictionary storage means, setting the waiting time until the next voice section based on the waiting time information of the last acoustic model of the partial hypothesis having the maximum likelihood, instructing the collating means, after receiving the voice section confirmation notification from the voice section detection means, to continue collation with the next voice section when the next voice section is detected within the set waiting time, and, when the next voice section is not detected within the set waiting time, determining from the syntax information whether the partial hypothesis having the maximum likelihood can be adopted as the recognition result and outputting the partial hypothesis having the maximum likelihood as the recognition result.