JP3277579B2

JP3277579B2 - Voice recognition method and apparatus

Info

Publication number: JP3277579B2
Application number: JP36141492A
Authority: JP
Inventors: 雅文南
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-12-28
Filing date: 1992-12-28
Publication date: 2002-04-22
Anticipated expiration: 2017-04-22
Also published as: JPH06202689A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method and apparatus.

[0002]

[Prior Art] Conventional speech recognition techniques are classified into two types according to the form of utterance they accept: continuous utterance recognition methods and discrete utterance recognition methods.

[0003]

[Problems to Be Solved by the Invention] Both of these recognition methods have the drawback of imposing unnatural restrictions on how the user speaks. Specifically, with discrete utterance recognition the user must utter pre-registered words one at a time, separated by breaks, whereas with continuous utterance recognition the user is forced to utter a whole sentence without interruption.

[0004] Ordinary, natural human speech contains filler words unrelated to the meaning of the utterance, pauses, and rephrasings, and speakers sometimes abandon an utterance partway through. Because the conventional methods described above excessively restrict this freedom of spoken language, they put psychological pressure on the user and make speech recognition apparatus difficult to use.

[0005] Even with conventional techniques, it is possible to accept an utterance that covers only part of a given grammar. This is achieved by connecting a jump arc from each state of the grammar at which the utterance may stop to the end state (see FIG. 6).
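
The conventional jump-arc construction can be illustrated with a minimal Python sketch. The arc representation (source state, word, destination state), the function name `add_jump_arcs`, and the example grammar are assumptions made for illustration; they are not taken from FIG. 6.

```python
from typing import Set, Tuple

Arc = Tuple[int, str, int]  # (source state, word label, destination state)


def add_jump_arcs(arcs: Set[Arc], stop_states: Set[int], end_state: int) -> Set[Arc]:
    """Return the arcs plus an empty-label jump arc from each state at which
    the utterance may stop to the end state."""
    return arcs | {(s, "", end_state) for s in stop_states}


# Hypothetical three-word grammar: 0 -bideo-> 1 -ni-> 2 -rokuga-> 3 (end).
base = {(0, "bideo", 1), (1, "ni", 2), (2, "rokuga", 3)}
with_jumps = add_jump_arcs(base, stop_states={1, 2}, end_state=3)
print(len(with_jumps))  # 5
```

With the jump arcs in place, the grammar accepts "bideo" or "bideo ni" as complete utterances, which is why a pause in a longer utterance ends recognition early.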

[0006] However, although this technique can recognize a partial utterance, a user who wants to make a longer utterance must do so without any break at all (for example, in the Japanese utterance "bideo ni rokuga" ("record on the video"), no pause may fall between "bideo" (video), the particle "ni" (on), and "rokuga" (record)). Otherwise, the analysis result up to the pause is output as the recognition result even though the user intended to speak to the end, so the technique has the drawback of putting considerable psychological pressure on the user while speaking.

[0007] The present invention has been made in view of these circumstances, and its object is to provide a speech recognition method and apparatus that allow the user to pause an utterance for an arbitrary length of time.

[0008]

[Means for Solving the Problems] The speech recognition method of the present invention is characterized in that a duration is set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway; in that, during speech recognition, completion of the utterance is detected when a pause in the utterance has continued for at least the set duration (for example, YES in step S6 of FIG. 4), and the utterance analysis result up to that point is output; and in that a response is generated using speech synthesis on the basis of the utterance analysis result.

[0009] The speech recognition apparatus of the present invention is characterized by comprising storage means for storing a duration set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway, and by outputting the utterance analysis result up to that point when a pause in the utterance has continued for at least the set duration.

[0010]

[Operation] In the speech recognition method and apparatus of the present invention, a duration is set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway, and when a pause in the utterance continues for at least the set duration, the utterance analysis result up to that point is output. The user can therefore pause an utterance for an arbitrary length of time.

[0011]

[Embodiment] FIG. 1 shows the configuration of an embodiment of the speech recognition apparatus of the present invention. The speech signal processing unit 1 analyzes the input speech signal in the time domain and the frequency domain and outputs a VQ code sequence. The speech recognition processing unit 2 is a module that receives the VQ code sequence from the speech signal processing unit 1 and, referring to the phoneme model storage unit 3, the word dictionary storage unit 4, and the utterance grammar storage unit 5, performs a search process that recognizes continuously uttered speech. That is, within the constraint information (A), (B), and (C) below, which is prepared in advance as data, the speech recognition processing unit 2 uses a trellis computation to find the word sequence that most plausibly explains the VQ code sequence, and outputs the recognized word sequence together with the semantic structure that the word sequence expresses.

[0012]
(A) The phoneme models, expressed as HMMs, stored in the phoneme model storage unit 3.
(B) The contents of the word dictionary storage unit 4: the pronunciations of the words in the system's vocabulary, expressed as phoneme sequences or as networks.
(C) The contents of the utterance grammar storage unit 5: the constraints on the user's uttered word sequences, expressed in the framework of a context-free grammar or a network method.

[0013] The dialogue management unit 6 interprets the user's request from the recognition result supplied by the recognition processing unit 2 and generates commands for the device 7. Where necessary, it speaks to the user through the sentence generation unit 8 and the speech synthesis unit 9 about missing information and prompts the user to supply it.

[0014] The sentence generation unit 8 takes the response semantic representation (that is, the utterance semantic information) output from the dialogue management unit 6, converts the corresponding utterance into a phonetic symbol string, and outputs it. The speech synthesis unit 9 receives the phonetic symbol string from the sentence generation unit 8 and actually produces the speech.

[0015] In the configuration of FIG. 1, the parts to which the present invention relates are the speech recognition processing unit 2 and the dialogue management unit 6. The key points are (i) setting the durations, (ii) detecting utterance completion using the durations, and (iii) generating a response to the user from the dialogue management unit 6 on the basis of the utterance analysis result. The following describes how the grammar is specified and what the silence model is, then how the duration settings are incorporated into them, and then the utterance-end detection scheme that uses these durations.

[0016] FIG. 2 shows the relationship among the utterance grammar, the word dictionary, the phoneme models, and the silence model in the embodiment of FIG. 1. The utterance grammar defines the word orders (of sentences) that are accepted as utterances. This embodiment uses a network grammar. As noted later, however, this notation is not essential, and a context-free grammar (see reference Shieber 86 at the end) may be used instead. As shown in FIG. 2, a network grammar is expressed by states and arcs; each arc is labeled with a word or a lexical category, and the arcs together represent the variations of utterance sentences that are accepted. In the present invention, each state is given the following fields, as shown in FIG. 2: type, destination, and duration.

[0017]
Type: the kind of state. It determines whether the silence model described below is inserted at this state or whether the state is merely a point on a transition.
Destination: a table of (word, destination state) pairs describing the state to move to when the input at this state is the given word.
Duration: holds the timeout value that applies when the type is the silence model. This value can be set independently for each state; it can be fixed in advance as a constant in the grammar, or changed dynamically while an utterance is being analyzed.
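
The three fields can be pictured as a small record attached to each grammar state. The following is a minimal Python sketch; the names `StateType` and `GrammarState`, the state numbers, and the frame counts are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class StateType(Enum):
    PLAIN = "plain"      # merely a point on a transition
    SILENCE = "silence"  # the silence model is inserted at this state


@dataclass
class GrammarState:
    type: StateType
    # Destination table: word -> id of the state reached on that word.
    destination: Dict[str, int] = field(default_factory=dict)
    # Timeout value in frames; only meaningful when type is SILENCE.
    duration: Optional[int] = None


# Hypothetical network grammar fragment: "bideo" -> "ni" -> "rokuga".
grammar = {
    0: GrammarState(StateType.PLAIN, {"bideo": 1}),
    1: GrammarState(StateType.PLAIN, {"ni": 2}),
    2: GrammarState(StateType.SILENCE, {"rokuga": 3}, duration=150),
    3: GrammarState(StateType.SILENCE, {}, duration=50),
}

# Each state's timeout is independent and may be changed during analysis.
grammar[2].duration = 200
```

Keeping the timeout on the state rather than in a global setting is what lets the pause allowed after "ni" differ from the pause allowed at the end of the sentence.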

[0018] Normally, a duration is set on each state at which the utterance may stop partway.

[0019] The silence model models the sound of the portions in which the user is not speaking. The prior art also includes a technique that inserts this silence model into the states of the network described above to cope with hesitations between words (see reference Kai-Fu Lee 88 at the end). The embodiment of the present invention likewise inserts the silence model into states, but, as in the algorithm described below, it introduces duration control at this point.

[0020] The recognition search is performed on a data structure called a trellis, a two-dimensional array in which one axis is time (usually counted in frames) and the other axis is the states of the phoneme models (see FIG. 3).

[0021] The time axis corresponds to the time axis of the input speech: if a given time is t, the next time is t+1. The unit time step is usually 10 msec, and the group of states at each time is called a frame. The transitions of each state on the trellis along the time axis are determined by the phoneme models, the word dictionary, and the grammar.
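
Because the trellis advances one frame per unit time step, a pause limit expressed in milliseconds corresponds to a frame count. The sketch below shows the arithmetic in Python; the function name `timeout_frames` and the choice to round up are assumptions made for illustration.

```python
FRAME_MS = 10  # unit time step given in the text (usually 10 msec)


def timeout_frames(timeout_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Number of whole frames needed to cover timeout_ms, rounded up."""
    return -(-timeout_ms // frame_ms)


print(timeout_frames(1500))  # 150
print(timeout_frames(25))    # 3
```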

[0022] Each state on the trellis consists of a recognition score and a back pointer (which indicates the state in the previous frame from which the transition was made). A state that belongs to the silence model additionally has an area for storing the duration.
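
A trellis entry of this kind can be sketched as a small Python record. The class name `TrellisState`, the field names, and the use of string state identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrellisState:
    score: float                     # recognition score
    back_pointer: Optional[str]      # state in the previous frame; None at t = 0
    remaining: Optional[int] = None  # duration area, silence-model states only


# An ordinary state carries no duration; a silence-model state does.
start = TrellisState(score=1.0, back_pointer=None)
silence = TrellisState(score=0.4, back_pointer="ni_end", remaining=150)
```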

[0023] The beam search method, which the embodiment of the present invention also uses, does not expand every state; it searches only the states whose probability score falls within a fixed range of the best-scoring state (in other words, only states that show some promise are considered).
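
The pruning rule can be sketched in a few lines of Python. This sketch assumes that scores are log probabilities, so that "within a fixed range" becomes a subtraction; the function name `prune_to_beam` and the parameter `beam_width` are illustrative and not taken from the patent.

```python
from typing import Dict


def prune_to_beam(frame: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep only the states whose log score is within beam_width of the best."""
    if not frame:
        return {}
    best = max(frame.values())
    return {state: s for state, s in frame.items() if s >= best - beam_width}


frame = {"a": -1.0, "b": -2.5, "c": -9.0}
kept = prune_to_beam(frame, beam_width=3.0)
print(sorted(kept))  # ['a', 'b']
```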

[0024] The method of generating the trellis from the phoneme models, word dictionary, and grammar described above is explained in detail in the reference (Bridle 82) at the end, and the beam search technique is explained in detail in the reference (Kai-Fu Lee 91) at the end.

[0025] FIG. 4 shows the speech recognition algorithm of the embodiment of FIG. 1. The recognition algorithm of the embodiment of the present invention is described below with reference to FIG. 4.

[0026] In outline, the processing proceeds as follows. The outline of the processing is explained in detail in the reference (Kai-Fu Lee 91) at the end.

[0027] Step 1. Initialization (steps S1 and S2 in FIG. 4)
A trellis B is prepared to hold the states of each frame. B has room for N frames, from t = 0 to the frame at which the utterance ends. In what follows, B[t] denotes the heap for the t-th frame; it normally holds the states that fall within the beam (so not every state at time t is stored). In step S2, for time t = 0, the probability score of the grammar's initial utterance state is set to 1.0 and the state is registered in B[0].

[0028] Step 2. Processing of each frame (steps S3 to S8 in FIG. 4)
All states are processed frame-synchronously (that is, the processing for time t+1 begins only after all processing for time t has finished). Each frame is processed with the ordinary syntax-controlled Viterbi beam search method (see reference Kai-Fu Lee 91 at the end).

[0029] Step 3. End of recognition and backtrace processing (step S9 in FIG. 4)
The per-frame processing is repeated up to the utterance frame length N. Then, starting from the state with the best probability score in the last frame (frame N-1), the back pointers are followed (backtrace processing) to obtain the word sequence of the recognition result.
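
The backtrace of Step 3 can be sketched as follows in Python. The trellis layout (one dictionary per frame), the idea of storing the completed word in each cell, and the names `Cell` and `backtrace` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# B[t][state] = (score, predecessor state in frame t-1, word completed on entry)
Cell = Tuple[float, Optional[str], Optional[str]]


def backtrace(B: List[Dict[str, Cell]], start: Optional[str] = None) -> List[str]:
    """Follow back pointers from `start` (default: best state of the last
    frame) to frame 0 and return the recognized words in spoken order."""
    t = len(B) - 1
    state = start if start is not None else max(B[t], key=lambda s: B[t][s][0])
    words: List[str] = []
    while state is not None:
        _, prev, word = B[t][state]
        if word is not None:
            words.append(word)
        state, t = prev, t - 1
    return words[::-1]


B = [
    {"s0": (1.0, None, None)},
    {"s1": (0.8, "s0", "bideo")},
    {"s2": (0.6, "s1", "ni"), "s1": (0.3, "s1", None)},
]
print(backtrace(B))  # ['bideo', 'ni']
```

The optional `start` argument lets the same routine be run from a specific state instead of the best state of the final frame.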

[0030] The embodiment of the present invention differs from the ordinary search method above in two respects:
I. The duration of the transitions within the silence model is measured.
II. When this duration exceeds a certain limit, recognition is stopped and the recognition result up to that point is output (timeout processing).

[0031] I and II are described below with reference to FIG. 4.

[0032] I. The duration of the silence model is measured in step S5 of FIG. 4, the step that updates and registers destination states. There are three cases, depending on the states involved.
(1) Case 1: the ordinary processing of this step (a transition between states that are both outside the silence model).
(i) If the destination state (at time t+1) is not in B[t+1], the state is registered in B[t+1] together with its probability score and a back pointer to the source state.
(ii) Otherwise (the destination state is already in B[t+1]), whichever has the better probability score is registered in B[t+1].
(2) Case 2: a transition from another model into the silence model.
When the destination state is registered in B[t+1] in steps (i) and (ii) above, an area for recording the duration is provided and an initial value is registered. As in FIG. 2, this initial value can be set according to the position of the silence model in the grammar, or set dynamically according to the context of the conversation.
(3) Case 3: a transition within the silence model.
When the destination state is registered in B[t+1], the duration information is counted down by 1.
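
The three cases can be sketched as one registration routine in Python. The names `Cell` and `register`, the boolean flags, and the clamping of the countdown at 0 are assumptions made for illustration. The text does not describe a transition that leaves the silence model, so this sketch treats it like Case 1 and tracks no duration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cell:
    score: float
    back: Optional[str]              # source state in frame t
    remaining: Optional[int] = None  # duration area, silence-model states only


def register(next_frame: Dict[str, Cell], src: str, src_cell: Cell, dst: str,
             score: float, src_is_silence: bool, dst_is_silence: bool,
             initial_duration: int) -> None:
    """Register the transition src -> dst into B[t+1] (here, next_frame)."""
    if not dst_is_silence:
        remaining = None                                   # Case 1
    elif not src_is_silence:
        remaining = initial_duration                       # Case 2: entering silence
    else:
        remaining = max((src_cell.remaining or 0) - 1, 0)  # Case 3: count down
    old = next_frame.get(dst)
    if old is None or score > old.score:  # (i) new state, or (ii) better score
        next_frame[dst] = Cell(score, src, remaining)


b_next: Dict[str, Cell] = {}
word_end = Cell(0.9, "w_bideo")
register(b_next, "w_ni", word_end, "sil", 0.8, False, True, 3)      # Case 2
entered = b_next["sil"].remaining
b_after: Dict[str, Cell] = {}
register(b_after, "sil", b_next["sil"], "sil", 0.7, True, True, 3)  # Case 3
counted = b_after["sil"].remaining
register(b_after, "w_ni", word_end, "sil", 0.1, False, True, 3)     # (ii): worse, ignored
print(entered, counted, b_after["sil"].remaining)  # 3 2 2
```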

[0033] II. The duration timeout processing proceeds as follows:
(1) If, in any frame, the state with the highest probability score belongs to the silence model and its duration area has reached 0, the Viterbi search is stopped (YES in step S6).
(2) The backtrace processing (step S9), the operation that outputs an ordinary recognition result, is performed starting from that state.
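
The test made in step S6 can be sketched as a single Python function. The frame layout (score, silence flag, remaining count) and the name `timed_out_state` are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# frame[state] = (score, belongs to silence model, remaining duration)
Frame = Dict[str, Tuple[float, bool, Optional[int]]]


def timed_out_state(frame: Frame) -> Optional[str]:
    """Return the best-scoring state if it is a silence-model state whose
    countdown has reached 0; otherwise return None (the search continues)."""
    if not frame:
        return None
    best = max(frame, key=lambda s: frame[s][0])
    _, is_silence, remaining = frame[best]
    return best if is_silence and remaining == 0 else None


print(timed_out_state({"sil": (0.9, True, 0), "w": (0.2, False, None)}))  # sil
print(timed_out_state({"sil": (0.9, True, 4), "w": (0.2, False, None)}))  # None
print(timed_out_state({"sil": (0.1, True, 0), "w": (0.2, False, None)}))  # None
```

Only the best-scoring state is examined, so an expired silence state on a losing path does not end recognition.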

[0034] In all other respects the procedure follows the conventional Viterbi beam search method.

[0035] Next, the dialogue management technique of the embodiment of the present invention is described. The recognition result of the recognition processing unit 2 in FIG. 1 is passed to the dialogue management unit 6, which decides, from this result (and from the current situation), the control commands to send to the device 7 and what to say to the user.

[0036] Using the recognition technique described above makes it possible for the dialogue management unit 6 to provide a more user-friendly man-machine interface. This is explained below using a simple utterance grammar as an example.

[0037] FIG. 5 shows the utterance grammar. This grammar, in network form, represents utterances for controlling a video, an LD, and a cassette; the states marked ◎ are end states.

[0038] In the embodiment of the present invention, associating a timeout value with each state makes it possible to pass the recognition result up to that state to the dialogue management unit 6.

[0039] For example, when the utterance "bideo ni" ("on the video") is made:
1) The recognition processing unit 2 detects the timeout after the particle "ni" and outputs the utterance analysis result ("bideo ni") to the dialogue management unit 6.
2) The dialogue management unit 6 receives the recognition result and, taking into account other information such as the state of the conversation, can generate an utterance (such as a question) to the user.

[0040] For example, using the utterance syntax, and in particular the fact that only "rokuga" (record) can follow "bideo ni", it can generate a question based on the user's utterance, such as "What should I record on the video?"
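
The way a partial result can drive a follow-up question can be sketched in Python. The tables `NEXT_WORDS` and `QUESTIONS`, the fallback sentence, and the function `follow_up` are hypothetical stand-ins for the grammar of FIG. 5 and for the dialogue management unit 6; they are not taken from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical grammar fragment: words recognized so far -> words that may follow.
NEXT_WORDS: Dict[Tuple[str, ...], List[str]] = {
    ("bideo",): ["ni"],
    ("bideo", "ni"): ["rokuga"],
}
# Hypothetical question to ask when a given word is the only possible continuation.
QUESTIONS: Dict[str, str] = {"rokuga": "What should I record on the video?"}
FALLBACK = "Could you say a little more?"


def follow_up(partial: List[str]) -> str:
    """Pick a question from the single continuation the grammar allows."""
    options = NEXT_WORDS.get(tuple(partial), [])
    if len(options) == 1 and options[0] in QUESTIONS:
        return QUESTIONS[options[0]]
    return FALLBACK


print(follow_up(["bideo", "ni"]))  # What should I record on the video?
```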

[0041] The key point of the present invention is that durations are set on the states that can be break points in an utterance and are used to detect the end of the utterance.

[0042] Although the above example was described using a network grammar, a context-free grammar or other methods can also be used.

[0043] Also, although the above example was described with the silence model inserted between words, the silence model can also be inserted between phonemes.

[0044] Timer interrupts in a multiprocessor configuration can also be used.

[0045] The silence model is not limited to a single model; there may be several.

[0046] The references are as follows.

[0047] [Bridle 82] Bridle, J. S. et al., "An Algorithm for Connected Word Recognition", Proc. ICASSP 82, pp. 899-902, Paris, May 1982.

[0048] [Kai-Fu Lee 88] Lee, K. F., "Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System", CMU-CS-88-148, Computer Science Dept., Carnegie-Mellon Univ.

[0049] [Kai-Fu Lee 91] Lee, K. F. and Alleva, F., "Continuous Speech Recognition", in Advances in Speech Signal Processing, pp. 623-650, 1991, Marcel Dekker Inc.

[0050] [Shieber 86] Shieber, S., "An Introduction to Unification-based Approaches to Grammar", Lecture Notes of CSLI, Stanford University, 1986.

[0051]

[Effects of the Invention] According to the speech recognition method and apparatus of the present invention, a duration is set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway, and when a pause in the utterance continues for at least the set duration, the utterance analysis result up to that point is output, so the user can pause an utterance for an arbitrary length of time. This also reduces the psychological pressure on the user and provides an easier-to-use human interface.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus of the present invention.

FIG. 2 is a diagram showing the relationship among the utterance grammar, the word dictionary, the phoneme models, and the silence model in the embodiment of FIG. 1.

FIG. 3 is a diagram showing the trellis of the embodiment of FIG. 1.

FIG. 4 is a diagram showing the speech recognition algorithm of the embodiment of FIG. 1.

FIG. 5 is a diagram showing a sample grammar.

FIG. 6 is a diagram showing an example of a conventional grammar for accepting partial utterances.

[Explanation of symbols]

1 Speech signal processing unit
2 Speech recognition processing unit
3 Phoneme model storage unit
4 Word dictionary storage unit
5 Utterance grammar storage unit
6 Dialogue management unit
7 Device
8 Sentence generation unit
9 Speech synthesis unit

Continued from front page: (58) Fields searched (Int.Cl.7, DB name): G10L 15/00 - 15/28

Claims

(57) [Claims]

1. A speech recognition method for recognizing input speech in accordance with information defining the word order of utterances, characterized in that a duration is set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway; during speech recognition, completion of the utterance is detected when a pause in the utterance has continued for at least the set duration, and the utterance analysis result up to that point is output; and a response is generated using speech synthesis on the basis of the utterance analysis result.

2. The speech recognition method according to claim 1, wherein the position of the part at which the utterance may pause partway is stored.

3. The speech recognition method according to claim 1, wherein the set duration is stored.

4. A speech recognition apparatus for recognizing input speech in accordance with information defining the word order of utterances, comprising storage means for storing a duration set independently for each part, in the information defining the word order of utterances, at which an utterance may pause partway, wherein the utterance analysis result up to that point is output when a pause in the utterance has continued for at least the set duration.

5. The speech recognition apparatus according to claim 4, further comprising a dialogue management unit that presents missing information to the speaker in accordance with the utterance analysis result.

6. The speech recognition apparatus according to claim 4, further comprising a dialogue management unit that controls a device in accordance with the utterance analysis result and presents missing information to the speaker in accordance with a response from the device.