JPH08263092A

JPH08263092A - Response voice generating method and voice interactive system

Info

Publication number: JPH08263092A
Application number: JP7063974A
Authority: JP
Inventors: Otoya Shirotsuka; 音也城塚
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1995-03-23
Filing date: 1995-03-23
Publication date: 1996-10-11

Abstract

PURPOSE: To improve the degree of understanding and shorten the response time in a voice interactive system that understands the content of an input voice and outputs an appropriate response based on that content. CONSTITUTION: When a voice is input, a voice recognition unit 21 performs recognition processing in real time and sends the recognition result to a language analysis processing unit 4. The language analysis processing unit 4 outputs sentence formation information when the recognition result forms a complete sentence. After the start of the input voice has been detected and the power duration has exceeded an utterance time threshold, a start/end detection unit 1 takes the point at which the sentence formation information is input from the language analysis processing unit 4 as the end of the voice. A voice synthesis unit 2 ends recognition processing of the voice up to this point, analyzes the recognition result with respect to grammar, semantics, and dialogue context, and synthesizes and outputs a response voice to the input voice. In this way, the input voice is interrupted partway and the dialogue proceeds step by step while the content is confirmed with the speaker.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a response voice generation method for generating a response voice in reply to a user's utterance, and to a voice dialogue system adopting this method.

[0002]

[Prior Art] Systems that automatically recognize a user's utterance and generate and output a response voice corresponding to its content are known. In such a system, detection of the end of the user's voice triggers generation of a response voice corresponding to the utterance content, and the generated response voice is output from a voice output device. The conventional method of automatically detecting the end of the voice is as follows: when the power of the input voice falls below the minimum power value identifiable as voiced and remains in that state for a fixed time, the point at which the power fell is regarded as the end of the voice.
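The conventional silence-based end detection described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system: it assumes the input has already been reduced to one power value per frame, beginning at the detected start of the utterance, and the names `detect_end_by_silence`, `min_voiced_power`, and `silence_frames` are hypothetical.

```python
from typing import Optional, Sequence


def detect_end_by_silence(powers: Sequence[float],
                          min_voiced_power: float,
                          silence_frames: int) -> Optional[int]:
    """Return the index of the frame at which power dropped below
    min_voiced_power and then stayed below it for silence_frames
    consecutive frames, or None if no such frame exists."""
    run_start = None  # index at which the current low-power run began
    for i, power in enumerate(powers):
        if power < min_voiced_power:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= silence_frames:
                # The end is the point where the power first fell,
                # not the point where the waiting time ran out.
                return run_start
        else:
            run_start = None  # a voiced frame resets the silence run
    return None
```

A short dip (a single low-power frame between voiced frames) does not end the utterance; only a dip that lasts the full `silence_frames` does.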

[0003] Another proposed method recognizes the input voice in real time, continuously evaluates the likelihood that the recognition result so far forms a sentence, and regards the input as ended when a silent state continues for a fixed time immediately after the evaluation value exceeds a preset criterion. For this method, see Naito et al., "A study of a speech segment detection method using the likelihood of sentence hypotheses", p. 55 (Acoustical Society of Japan, 1994 Autumn Meeting; 2-8-9).

[0004]

[Problems to Be Solved by the Invention] As described above, a conventional voice dialogue system generates a response voice corresponding to the utterance content only when the end of the user's voice is detected, so when, for example, voice is input continuously for a very long time, recognition takes a long time. Moreover, because understanding of the recognition result and generation of the response voice begin only after the end of the voice is detected, a considerable time elapses before the response to the user's utterance is output, and accuracy also deteriorates. If the system then fails to understand the user's utterance, the user must again speak for a long time, which hinders the progress of the dialogue.

[0005] In view of these problems, an object of the present invention is to provide a technique that, in a voice dialogue system which automatically recognizes a speaker's voice and outputs an appropriate response based on the recognition result, improves the voice understanding rate and shortens the system response time so that the dialogue proceeds smoothly.

[0006]

[Means for Solving the Problems] The present invention provides a response voice generation method that solves the above problems. The method comprises at least: a step of sequentially recognizing an input voice, triggered by detection of its start; a step of analyzing, based on the recognition result, whether the content of the input voice forms a sentence; a step of, when the input voice forms a sentence at the point where a predetermined duration has elapsed from the detected start, regarding that point as the end of the input voice and decoding its content; and a step of generating a response voice corresponding to the decoded content.

[0007] In the step of decoding the content of the voice, if a silent state is detected before the predetermined duration elapses, the content of the voice ending at the point of silence detection may be decoded instead.

[0008] The present invention also provides a voice dialogue system suited to carrying out the above response voice generation method. This voice dialogue system improves on conventional methods, in particular the method of detecting the end of the input voice. Specifically, it comprises the following elements.
(1) Threshold storage means storing a first threshold that defines the minimum power value identifiable as voiced, a second threshold that defines the shortest time identifiable as continuous voiced sound, and a third threshold that defines the allowable duration of continuous voiced sound.
(2) Voice recognition means that sequentially recognizes the input voice, triggered by voice input, determines from the recognition result whether a sentence has been formed, and generates sentence formation information when it has.
(3) Start/end detection means that includes means for detecting the power of the input voice, the points at which the power changes, and the duration of the voice power before and after each change; that compares the detected power and voiced duration of the input voice with the first and second thresholds stored in the threshold storage means to detect the start of the input voice; and that, when the voiced duration exceeds the third threshold, or exceeds the third threshold and the sentence formation information is received, detects as the end of the voice the point at which the third threshold was exceeded, or the point at which the sentence formation information was received after the third threshold was exceeded.
(4) Decoding means that decodes the content of the sentence from the detected start to the detected end and synthesizes a response voice corresponding to the decoded content, for example by detecting key phrases in the meaning of the sentence and synthesizing a response voice that includes them.
Here, "voiced" refers to any voice state other than one in which a drop in voice power continues for a fixed time, and does not necessarily mean only a state in which the first threshold is exceeded. "Continuous voiced sound" refers to a voice state in which such voiced sound continues over time. The determination in element (2) of whether a sentence has been formed is made using, for example, predetermined grammatical information, semantic information, and word information.

[0009] The present invention can also be used together with the conventional method. In this case, a fourth threshold that defines the silence duration identifiable as a silent state is stored in the threshold storage means. The start/end detection means then detects the point of silence detection as the end of the voice when, before the voiced duration reaches the third threshold, it detects a drop in power, that is, silence, and the duration of that silence exceeds the fourth threshold.

[0010]

[Operation] The start/end detection means of the present invention detects the start of the voice when the power of the input voice detected by the voice power detection means exceeds the first threshold and the voiced duration exceeds the second threshold. The start in this case is the point at which the voice power exceeded the first threshold. The voice recognition means begins recognition upon this start detection and, when predetermined analysis and determination processing shows that a sentence has been formed, generates sentence formation information. The start/end detection means regards the point of reception as the end of the voice when the voiced duration has exceeded the third threshold and the sentence formation information is received. When the sentence formation information can be disregarded, the point at which the third threshold was exceeded is regarded as the end of the voice. Even before that, if silence is detected and its duration exceeds the fourth threshold, the point of silence detection is taken as the end of the voice. The response voice synthesis means decodes the content of the sentence from the detected start to the detected end, and synthesizes and outputs a response voice corresponding to the decoded content. The content of voice uttered immediately after end detection is not recognized, but the dialogue can proceed step by step by synthesizing, later in the dialogue, a response voice that prompts the speaker to input the unrecognized subsequent content again.

[0011]

[Embodiment] A preferred embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a configuration diagram of a voice dialogue system according to an embodiment of the present invention. In this voice dialogue system, voice uttered by the user is input from a voice input terminal IN and led to a start/end detection unit 1 and a voice synthesis unit 2.

[0012] The start/end detection unit 1 comprises a voice power detection unit 11 that detects the power of the input voice and changes in it, a duration detection unit 12 that detects (measures) the voiced duration or silence duration, a determination processing unit 13 that performs predetermined determination processing based on these detected values, the various thresholds stored in a threshold table 3, and the output of a language analysis processing unit 4, and a command signal generation unit 14. The language analysis processing unit 4 performs language analysis based on the recognition result of a voice recognition unit 21, described later, and outputs sentence formation information when the recognition result up to the present can be regarded as forming a sentence.

[0013] The voice synthesis unit 2 comprises a voice recognition unit 21 that recognizes, in real time, the voice input from the input unit IN, a voice understanding unit 22 that understands the recognition result and generates semantic representation information, and a response generation unit 23 that synthesizes a response voice by combining the required speech segments based on the generated semantic representation information; the synthesized voice is output to the output unit OUT. Input and output of each of these units is controlled by the command signals generated by the command signal generation unit 14.

[0014] In this embodiment, the voice recognition unit 21 and the language analysis processing unit 4 constitute the voice recognition means, and the decoding means includes the voice understanding unit 22 and the response generation unit 23.

[0015] FIG. 2 is a conceptual diagram showing examples of the thresholds stored in the threshold table 3. In FIG. 2, the voice power threshold 31 is the value (first threshold) that defines the minimum power value identifiable as voiced, and the voiced time threshold 32 is the value (second threshold) that defines the shortest time identifiable as continuous voiced sound. Thresholds 31 and 32 are used when detecting the start of the input voice. The silence time threshold 33 is the value (fourth threshold) that defines the duration identifiable as a silent state, and the utterance time threshold 34 is the value (third threshold) that defines the allowable duration of voiced sound. Thresholds 33 and 34 are used when detecting the end of the input voice. The maximum utterance time threshold 35 is the value that defines the maximum voiced duration the system can accept.
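As a rough illustration of how the five thresholds of FIG. 2 might be held together, the sketch below models the threshold table as an immutable record. The class name, field names, units (seconds), the consistency check, and all numeric values are assumptions made for this example; the source does not give concrete values or a storage format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThresholdTable:
    """Illustrative stand-in for threshold table 3 (times in seconds)."""
    voice_power: float         # threshold 31 (first threshold)
    voiced_time: float         # threshold 32 (second threshold)
    silence_time: float        # threshold 33 (fourth threshold)
    utterance_time: float      # threshold 34 (third threshold)
    max_utterance_time: float  # threshold 35

    def __post_init__(self) -> None:
        # Assumed sanity check, not stated in the source: waiting between
        # thresholds 34 and 35 only makes sense if 35 is not below 34.
        if self.utterance_time > self.max_utterance_time:
            raise ValueError("utterance_time must not exceed max_utterance_time")

    def for_start_detection(self) -> Tuple[float, float]:
        """Thresholds 31 and 32, used when detecting the start."""
        return (self.voice_power, self.voiced_time)

    def for_end_detection(self) -> Tuple[float, float]:
        """Thresholds 33 and 34, used when detecting the end."""
        return (self.silence_time, self.utterance_time)


# Placeholder values chosen only for illustration.
table = ThresholdTable(voice_power=0.1, voiced_time=0.2, silence_time=0.5,
                       utterance_time=3.0, max_utterance_time=10.0)
```

Grouping the values by the detection phase that consumes them mirrors the text: 31 and 32 serve start detection, 33 and 34 serve end detection, and 35 bounds the wait for sentence formation information.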

[0016] Next, the operation of each part of the voice dialogue system of this embodiment is described. FIG. 3 is a flowchart of processing centered on the start/end detection unit 1 according to this embodiment, where S denotes a processing step. Referring to FIG. 3, when a voiced state is confirmed by comparing the power value of the input voice with the voice power threshold 31 (S101, S102), the determination processing unit 13 determines whether the voiced state continues for at least the voiced time threshold 32 (S103) and, if so, recognizes the point of voiced-sound detection as the start of the input voice (S104). It then monitors the voiced duration (S105).

[0017] If, before the voiced duration reaches the utterance time threshold 34 (S105: No), the voice power detection unit 11 detects silence, that is, that the voice power has begun to fall below the voice power threshold 31 (S106), and the duration detection unit 12 detects that the silence duration has exceeded the silence time threshold 33, the point of silence detection is recognized as the end of the voice (S108).

[0018] When the voiced duration exceeds the utterance time threshold 34 (S105: Yes), the unit waits for sentence formation information to be input from the language analysis processing unit 4 (S109). When it is input (S109: Yes), or when formation of a sentence can be presumed even if its absence at this point is disregarded (S110: Yes), the point at which the sentence formation information was received, or the point at which the utterance time threshold 34 was exceeded, is recognized as the end of the voice (S112). If, in S110, the absence of sentence formation information cannot be disregarded (S110: No), the unit waits for the sentence formation information until the maximum utterance time threshold 35 is exceeded (S111) and, when it is input (S111: Yes), recognizes the point of reception as the end of the voice (S112).

[0019] When the end of the voice has been recognized in S108 or S112, the command signal generation unit 14 is requested to output commands to end voice recognition processing and to output the recognition result. If, on the other hand, the sentence formation information has not been input even after the maximum utterance time threshold 35 is exceeded in S111 (S111: Yes), the command signal generation unit 14 is requested to output a command for the voice to be input again (S113).
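The flow of FIG. 3 (S101 to S113) can be sketched as a single pass over framed input. This is only an illustration under stated assumptions: time is counted in frames, each frame carries a power value and a flag saying whether sentence formation information is currently available, and the names `Thresholds`, `detect_segment`, and `may_ignore_missing_sentence` are hypothetical. The "incomplete" result for input that runs out early is an addition of this sketch, and, following the text, silence is not checked once threshold 34 has been exceeded.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    voice_power: float       # threshold 31
    voiced_time: int         # threshold 32, in frames
    silence_time: int        # threshold 33, in frames
    utterance_time: int      # threshold 34, in frames
    max_utterance_time: int  # threshold 35, in frames


def detect_segment(frames: Iterable[Tuple[float, bool]], th: Thresholds,
                   may_ignore_missing_sentence: bool = False
                   ) -> Optional[Tuple[int, Optional[int], str]]:
    """Return (start, end, reason), or None if no start is found.
    Each frame is (power, sentence_formed)."""
    frames = list(frames)
    # S101-S104: the start is the first frame of a voiced run that
    # lasts at least voiced_time frames.
    start, run = None, 0
    for i, (power, _) in enumerate(frames):
        run = run + 1 if power >= th.voice_power else 0
        if run >= th.voiced_time:
            start = i - run + 1
            break
    if start is None:
        return None

    silence_start = None
    for i in range(start, len(frames)):
        power, sentence_formed = frames[i]
        elapsed = i - start + 1
        if elapsed <= th.utterance_time:       # S105: No
            if power < th.voice_power:         # S106: silence detected
                if silence_start is None:
                    silence_start = i
                if i - silence_start + 1 > th.silence_time:
                    return (start, silence_start, "silence")   # S108
            else:
                silence_start = None
            continue
        # S105: Yes, so wait for sentence formation information (S109).
        if sentence_formed:
            return (start, i, "sentence")      # S112
        if may_ignore_missing_sentence:        # S110: Yes
            return (start, i, "forced")        # S112
        if elapsed > th.max_utterance_time:    # S111: limit exceeded
            return (start, None, "retry")      # S113: ask for re-input
    return (start, None, "incomplete")         # input ended (sketch only)
```

The `reason` string stands in for the request sent to the command signal generation unit 14: "silence", "sentence", and "forced" correspond to ending recognition, while "retry" corresponds to asking the user to speak again.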

[0020] In accordance with commands such as the end-of-recognition command sent from the command signal generation unit 14, the voice recognition unit 21 ends recognition processing of the voice up to that point and passes the recognition result to the voice understanding unit 22. The voice understanding unit 22 analyzes the recognition result in terms of grammar, semantics, and dialogue context, and passes semantic representation information expressing the meaning contained in the input voice to the response generation unit 23. The response generation unit 23 generates synthesized voice corresponding to the semantic representation information of the input voice supplied by the voice understanding unit 22, or to the re-input command from the command signal generation unit 14, and outputs it from the voice output terminal OUT. In generating the synthesized voice, it uses dialogue knowledge and domain knowledge provided in advance to generate a response voice containing the key phrases that remain after unnecessary words are removed from the input voice, or asking whether those phrases are correct, or a voice stating that the content of the input voice could not be recognized. The above processing is repeated until the series of dialogue processing ends (S114).
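The response generation step can be illustrated with a deliberately small sketch that works on text rather than synthesized speech. It assumes the recognition result arrives as a list of words, treats a fixed filler list as the "unnecessary words", and uses fixed English prompts; the function name `respond`, the filler set, and the prompt wording are all hypothetical and stand in for the dialogue knowledge and domain knowledge mentioned in the text.

```python
from typing import FrozenSet, Sequence

# Hypothetical filler list standing in for "unnecessary words".
FILLERS: FrozenSet[str] = frozenset({"um", "uh", "er", "well"})


def respond(words: Sequence[str], retry: bool = False,
            fillers: FrozenSet[str] = FILLERS) -> str:
    """Build a text response for one recognized segment."""
    if retry:
        # Corresponds to the re-input command from unit 14.
        return "Sorry, could you say that again?"
    cleaned = [w.strip(",.") for w in words]
    key = [w for w in cleaned if w and w.lower() not in fillers]
    if not key:
        # Nothing usable was recognized.
        return "Sorry, I could not understand that."
    # Echo the key phrase and ask the user to confirm it.
    return " ".join(key) + ", is that right?"
```

For example, `respond(["Um,", "the", "third", "conference", "room"])` returns `"the third conference room, is that right?"`, which echoes only the key phrase back to the user for confirmation.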

[0021] Next, a concrete processing example in the voice dialogue system of this embodiment is described with reference to FIG. 4. FIG. 4(a) illustrates the content of the voice uttered by the user. With the conventional voice start/end detection method, the utterance shown would be treated as a single sentence, and the system's response voice would be generated from its recognition result.

[0022] In this embodiment, by contrast, voice recognition and content understanding are performed for each of the sections (1) to (6) shown, and response voices are generated and output in the dialogue form shown in FIG. 4(b). That is, at point (1), when the user says "Um, I'd like to reserve a conference room...", the system detects the voiced state and starts recognition. If it determines at point (1) that the utterance time threshold 34 has not yet been exceeded and that no sentence has yet been formed, it continues recognition until point (2), when the user says "...the conference room is the third conference room". When it then determines that the utterance time threshold 34 has been exceeded and a sentence has been formed, it regards the utterance of the first sentence as having ended at point (2), and generates and outputs a response voice corresponding to the meaning so far, for example "The third conference room, is that right?", which contains the key phrase of the input utterance. The user thereby learns that the utterance has been interrupted and begins uttering the subsequent content, that is, the content from point (2) onward.

[0023] The system processes the subsequent content by the same procedure. That is, it regards the utterance as having ended at point (4), where the utterance time threshold 34 has been exceeded and a sentence is first formed, and responds "Next Friday, from 3 o'clock, is that right?". If this response is correct, the user utters the content from point (4) onward. If the duration of the utterance from point (4) onward (points (4) to (6)) is no more than the utterance time threshold 34, the system determines that the utterance has ended even though the utterance time threshold 34 has not been reached, and outputs the response voice "You would like a large conference room, is that right?" corresponding to the final end-of-utterance point (6). By conducting the dialogue with the user step by step in this way, smooth system operation becomes possible.

[0024]

[Effects of the Invention] As is clear from the above description, in the present invention, even when voice is input continuously over a long period, the point at which the input voice exceeds a predetermined duration (the third threshold), or at which a sentence is formed at that time, is regarded as the end of the voice to be recognized; the required response voice is then generated, and the dialogue with the speaker proceeds step by step. This shortens the time until the system's response voice is output and markedly improves the recognition rate of each voice segment compared with the conventional method. Furthermore, when a silent state is detected even before the above duration elapses, the same processing is performed with the point of silence detection as the end, so compatibility with the conventional method is also maintained. The dialogue with the speaker can thereby proceed smoothly, and the conventional problems can be resolved.

[Brief description of drawings]

FIG. 1 is a block configuration diagram of a voice dialogue system according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of the contents stored in the threshold table according to this embodiment.

FIG. 3 is an explanatory diagram of the procedure for detecting the start and end of the voice to be recognized according to this embodiment.

FIG. 4 is an explanatory diagram of a concrete dialogue format in the voice dialogue system of this embodiment, in which (a) is an example of input voice content and (b) is a dialogue procedure diagram based on this example.

[Explanation of symbols]

1 start/end detection unit
11 voice power detection unit
12 duration detection unit
13 determination processing unit
14 command signal generation unit
2 voice synthesis unit
21 voice recognition unit
22 voice understanding unit
23 response generation unit
3 threshold table
4 language analysis processing unit

Continuation of front page
(51) Int.Cl.6: G06F 3/16, identification symbol 330, internal reference number 9172-5E, FI G06F 3/16 330A
(51) Int.Cl.6: G10L 5/02, FI G10L 5/02 J

Claims

1. A response voice generation method comprising: a step of sequentially recognizing an input voice, triggered by detection of the start of the input voice; a step of analyzing, based on the result of the recognition, whether the content of the input voice forms a sentence; a step of, when the input voice forms a sentence at the point where a predetermined duration has elapsed from the detected start, regarding that point as the end of the input voice and decoding its content; and a step of generating a response voice corresponding to the decoded content.

2. The response voice generation method according to claim 1, wherein, in the step of decoding the content of the input voice, when a silent state is detected before the predetermined duration elapses, the content of the voice ending at the point of silence detection is decoded.

3. A voice dialogue system comprising: threshold storage means storing a first threshold that defines the minimum power value identifiable as voiced, a second threshold that defines the shortest time identifiable as continuous voiced sound, and a third threshold that defines the allowable duration of voiced sound; voice recognition means that sequentially recognizes an input voice, triggered by voice input, determines based on the recognition result whether a sentence has been formed, and generates sentence formation information when it has; start/end detection means that compares the power and duration of the input voice with the first and second thresholds stored in the threshold storage means to detect the start of the input voice and that, when the voice duration from the detected start (hereinafter, voiced duration) exceeds the third threshold, or exceeds the third threshold and the sentence formation information is received, detects as the end of the voice the point at which the third threshold was exceeded or the point at which the sentence formation information was received; and response voice synthesizing means that decodes the content of the sentence from the detected start to the detected end and synthesizes a response voice corresponding to the decoded content.

4. The voice dialogue system according to claim 3, wherein a fourth threshold that defines the silence duration identifiable as a silent state is stored in the threshold storage means, and the start/end detection means detects the point of silence detection as the end of the voice when it detects silence before the voiced duration reaches the third threshold and the duration of the silence exceeds the fourth threshold.

5. The voice dialogue system according to claim 3 or 4, wherein the response voice synthesizing means detects key phrases in the meaning of the sentence from the detected start to the detected end, and synthesizes a response voice including the detected key phrases.