JP3199972B2 - Dialogue device with back-channel response

Info

Publication number: JP3199972B2
Application number: JP02034695A
Authority: JP
Inventors: 憲治坂本; 啓子綿貫; 文雄外川
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1995-02-08
Filing date: 1995-02-08
Publication date: 2001-08-20
Anticipated expiration: 2016-08-20
Also published as: JPH08211986A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a dialogue device with back-channel responses (aizuchi: short acknowledgments such as "yes" or a nod), and more particularly to a dialogue device in which a human and a computer interact through voice or gesture.

[0002]

[Prior Art] Dialogue devices that respond to voice input have long been studied so that humans and computers can converse naturally. Such a device recognizes the speech uttered by a human, changes the internal state of the system accordingly, and produces a predetermined output, thereby attempting to realize a dialogue with the human. In addition, to make dialogue with a computer smoother, voice-reactive systems have been proposed in which an animation or the like responds to the input voice at an appropriate timing. Such a system detects a quantity such as the utterance level of the voice and reacts accordingly.

[0003]

[Problem to Be Solved by the Invention] In dialogue devices of the kind described above, the processing that understands the speech is often started only after the utterance has ended, or at a breath-group boundary. Because that processing takes time, the user sees the system's response only some while after finishing the utterance. Such a dialogue is unlike a real human-to-human dialogue, feels unnatural, and does not proceed smoothly.

[0004]

[Means for Solving the Problem] To solve the problem described above, the present invention provides the following.

(1) A dialogue device comprising: a voice input unit that inputs voice; an acoustic analysis unit that obtains feature quantities of the input voice; a keyword storage unit in which predetermined keywords are set and which stores the feature quantities of those keywords; a time information acquisition unit for obtaining the current time; a matching unit that compares the feature quantities of the input voice obtained by the acoustic analysis unit with the keyword feature quantities in the keyword storage unit and detects a keyword in the input voice; an output unit that outputs a back-channel response to the voice input; and an output control unit that controls the back-channel response output time of the output unit on the basis of the start time or end time of the keyword in the input voice detected by the matching unit.

(2) The device of (1), wherein the output control unit suppresses back-channel responses for a fixed time after a back-channel response has been made.

(3) A dialogue device comprising: a voice input unit that inputs voice; an acoustic analysis unit that obtains feature quantities of the input voice; a keyword storage unit in which predetermined keywords are set and which stores the feature quantities of those keywords; a first time information acquisition unit for obtaining the current time; a first matching unit that compares the feature quantities of the input voice obtained by the acoustic analysis unit with the keyword feature quantities in the keyword storage unit and detects a keyword in the input voice; an image input unit that inputs images; an image analysis unit that obtains feature quantities of the input image; a key action storage unit in which predetermined key actions are set and which stores the feature quantities of those key actions; a second time information acquisition unit for obtaining the current time; a second matching unit that compares the feature quantities of the input image obtained by the image analysis unit with the key action feature quantities in the key action storage unit and detects a key action in the input image; an output unit that outputs a back-channel response to the voice input or the image input; and an integration unit that integrates the first matching unit and the second matching unit and controls the back-channel response output time of the output unit on the basis of the start time or end time of the keyword in the input voice detected by the first matching unit, or the start time or end time of the key action in the input image detected by the second matching unit.

[0005]

[Operation] In the invention of claim 1, the voice input through the voice input unit is converted into feature quantities by the acoustic analysis unit. The matching unit compares the feature quantities of the input voice with the feature quantities of the keywords registered in advance in the keyword storage unit, and detects keywords. At this point the end time of the keyword utterance is obtained by means of the time information acquisition unit and is sent to the output unit together with the keyword information. The output unit compares the current time obtained from the time information acquisition unit with the end time of the keyword, and outputs a back-channel response when the difference exceeds a certain threshold.

In the invention of claim 2, a keyword end time output from the keyword matching unit is suppressed, that is, not passed to the output unit, until a certain time has elapsed since the previous back-channel response. This prevents back-channel responses from occurring so frequently that the dialogue stops proceeding smoothly.

In the invention of claim 1 or 2, the internal state of the system may be changed according to the detected keyword, and the keywords to be recognized next may be generated and stored in the keyword storage unit. This limits the keywords to be recognized next and shortens the processing time.

In the invention of claim 3, a back-channel response is output on the basis of either the end time of a keyword obtained from the voice or the end time of a key action obtained from the user's motion. The back-channel responses thus react more closely to human utterances and motions, realizing a smoother dialogue.

[0006]

[Embodiments] FIG. 1 is a block diagram for explaining an embodiment of the invention of claim 1. In the figure, 1 is a voice input unit, 2 is an acoustic analysis unit, 3 is a matching unit, 4 is a keyword storage unit, 5 is a time information acquisition unit, and 6 is an output unit. Speech uttered by a human is captured by the voice input unit 1, such as a microphone. The captured speech signal is A/D converted, and the acoustic analysis unit 2 converts it into feature quantities (mel cepstrum) for each processing unit (frame). Here, one frame corresponds to 10 ms. The feature quantities of the keywords to be recognized are computed in advance and stored in the keyword storage unit 4. The matching unit 3 compares, frame by frame, the feature quantities of the keywords stored in the keyword storage unit 4 with those of the input voice, and detects keywords. For this processing, a method such as continuous DP (dynamic programming) matching is used, for example.
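The frame-by-frame matching described above can be illustrated with a small Python sketch of continuous DP matching. This is a minimal sketch under stated assumptions, not the implementation of the patent: the function names, the Euclidean local distance, the three-way path constraint, and the normalization by template length are all chosen for illustration, and one-dimensional toy features stand in for mel-cepstrum vectors.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def continuous_dp(template, frames):
    """For every input frame, return the accumulated distance of the best
    alignment of the whole keyword template that ends at that frame,
    divided by the template length."""
    n = len(template)
    prev = [math.inf] * n          # accumulated distances at the previous frame
    distances = []
    for frame in frames:
        cur = [math.inf] * n
        for j in range(n):
            d = frame_distance(frame, template[j])
            if j == 0:
                best = 0.0         # free start: a keyword may begin at any frame
            else:
                best = min(prev[j], prev[j - 1], cur[j - 1])
            cur[j] = d + best
        distances.append(cur[n - 1] / n)
        prev = cur
    return distances


# Toy data: the "keyword" 0, 1, 2 is embedded in frames 2..4 of the input.
keyword = [[0.0], [1.0], [2.0]]
speech = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0], [5.0]]
curve = continuous_dp(keyword, speech)
```

With this toy input the distance curve reaches its minimum (0.0) at frame index 4, the last frame of the embedded keyword, which is the kind of dip that a keyword produces in the distance plot.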

[0007] FIG. 5 plots, frame by frame, the distance between the keyword and the input voice when "Yuasa" is set as the keyword and the utterance "Watashi, Sharp no Yuasa to moushimasu" ("I am Yuasa of Sharp") is input. Here, $T_{min}$ is the time at which the distance reaches its minimum, $D_{min}$ is the distance at that time, and $T_e$ is the time at which the keyword is actually detected.

[0008] $T_e$ and $T_{min}$ are related by

$$T_e = T_{min} + T_d$$

where $T_d$ is the number of frames required to detect the minimum, here 3 frames (= 30 ms). $T_{min}$ is therefore obtained from

$$T_{min} = T_e - T_d$$

In what follows, $T_{min}$ is used as the keyword end time.
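The relation between the detection time T_e and the minimum time T_min can be sketched in Python as follows. The description gives only T_d = 3 frames and the 10 ms frame length; the detection threshold, the function name, and the rule "a minimum is confirmed once no lower distance has appeared for T_d frames" are assumptions made for this sketch.

```python
FRAME_MS = 10   # frame length stated in the description
T_D = 3         # frames needed to confirm a minimum (3 frames = 30 ms)


def detect_keyword_end(distances, threshold, t_d=T_D):
    """Scan a per-frame distance curve and return (t_min, t_e) as frame
    indices for the first confirmed minimum below `threshold`, or None.

    `threshold` is a hypothetical tuning parameter, not taken from the source.
    """
    best_t = None
    for t, d in enumerate(distances):
        if d < threshold and (best_t is None or d < distances[best_t]):
            best_t = t                       # new candidate minimum
        elif best_t is not None and t - best_t >= t_d:
            return best_t, t                 # confirmed: t_e = t_min + t_d
    return None


curve = [4.0, 4.0, 1.0, 0.3, 0.0, 1.0, 2.0, 3.0]
result = detect_keyword_end(curve, threshold=0.5)
```

For this curve the minimum lies at frame 4 and is confirmed at frame 7, so T_min = T_e - T_d = 7 - 3 = 4 frames, or 40 ms from the start of the input.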

[0009] The output unit 6 makes a back-channel response when the current time $t$, obtained continually from the time information acquisition unit 5, satisfies

$$t = T_{min} + T_M$$

Here, $T_M$ is a value obtained by analyzing the timing at which back-channel responses are inserted in human-to-human dialogue, and is 0.5 s here. The value of $T_M$ could also be varied according to the internal state of the system. Since the start time of the keyword is also detected when the keyword is detected, the response could alternatively be timed from the start time. As the back-channel response, a CG (computer graphics) model of a human figure outputs the spoken word "hai" ("yes") and performs a nodding motion, moving its head up and down. Having the model blink, for example, is also conceivable.
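The timing rule for the output unit can be sketched as a small polling timer. The class and method names are hypothetical, and the sketch fires once the current time has reached T_min + T_M (a greater-or-equal test) because a polled clock will rarely hit the exact instant; T_M = 0.5 s is the value given in the description.

```python
class BackchannelTimer:
    """Fires a back-channel response once, T_M seconds after a keyword ends."""

    def __init__(self, t_m=0.5):
        self.t_m = t_m          # delay T_M in seconds (0.5 s in the description)
        self.pending = None     # keyword end time T_min awaiting a response

    def keyword_ended(self, t_min):
        """Record the end time of a newly detected keyword."""
        self.pending = t_min

    def poll(self, now):
        """Return True exactly once when `now` has reached T_min + T_M."""
        if self.pending is not None and now >= self.pending + self.t_m:
            self.pending = None
            return True
        return False


timer = BackchannelTimer()
timer.keyword_ended(1.0)        # keyword ended at t = 1.0 s
early = timer.poll(1.2)         # too early, no response yet
due = timer.poll(1.5)           # 0.5 s after the keyword end: respond
repeat = timer.poll(1.6)        # already responded, nothing pending
```

In a complete system, a True result from `poll` would trigger the spoken "hai" and the nodding animation of the CG model.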

[0010] FIG. 2 is a block diagram for explaining an embodiment of the invention of claim 2. In the figure, 1 is a voice input unit, 2 is an acoustic analysis unit, 3 is a matching unit, 4 is a keyword storage unit, 5 is a time information acquisition unit, 6 is an output unit, and 7 is an output control unit. The output control unit 7 evaluates, with a probability function $f$, the keyword end time $T_{min}$ obtained from the matching unit 3 against the keyword end time $t_c$ associated with the previous back-channel response. If

$$f(T_{min} - t_c) > 0.5$$

is satisfied, the keyword end time information is sent to the output unit 6 and a back-channel response is made in the same manner as in FIG. 1. At that point, $t_c$ is updated to the value of $T_{min}$. If the condition is not satisfied, the keyword end time information is not sent to the output unit 6. The probability function $f$ generates uniformly distributed random numbers between 0 and 1 whose mean takes the values shown in FIG. 6. It is a simplified form of a function obtained by analyzing human-to-human dialogue, in which back-channel responses were found to be inserted most often at intervals of about 1 to 2 seconds. With this function, back-channel responses are suppressed for 1 second after a back-channel response has been made.
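The gating performed by the output control unit can be sketched as follows. FIG. 6 is not reproduced in this text, so the curve `mean_curve` below is purely a hypothetical stand-in, chosen only so that the mean is zero during the first second (full suppression) and highest for intervals of 1 to 2 seconds. Drawing f uniformly from [0, 2m] for a mean m, and letting the very first keyword pass, are likewise assumptions of this sketch.

```python
import random


def mean_curve(dt):
    """Hypothetical stand-in for FIG. 6: mean of f versus elapsed seconds."""
    if dt < 1.0:
        return 0.0      # within 1 s of the last response: always suppressed
    if dt <= 2.0:
        return 0.5      # most common back-channel interval
    return 0.4


def f(dt, rng=random):
    """Uniform random number in [0, 2m], so its mean equals m = mean_curve(dt)."""
    return rng.uniform(0.0, 2.0 * mean_curve(dt))


class OutputController:
    """Passes a keyword end time on only when f(T_min - t_c) > 0.5."""

    def __init__(self, rng=random):
        self.t_c = None     # keyword end time of the previous response
        self.rng = rng

    def accept(self, t_min):
        # Assumption: the first keyword is passed on, since no t_c exists yet.
        if self.t_c is None or f(t_min - self.t_c, self.rng) > 0.5:
            self.t_c = t_min    # update t_c with T_min, as in the description
            return True
        return False


controller = OutputController(random.Random(0))
first = controller.accept(10.0)     # no previous response
second = controller.accept(10.5)    # only 0.5 s later: f is 0, suppressed
```

With the assumed curve, a keyword ending 1 to 2 seconds after the previous response passes with probability 0.5, while one ending within the first second never passes.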

[0011] FIG. 3 is a block diagram for explaining another embodiment of the present invention. 1 is a voice input unit, 2 is an acoustic analysis unit, 3 is a matching unit, 4 is a keyword storage unit, 5 is a time information acquisition unit, 6 is an output unit, 7 is an output control unit, and 8 is a dialogue management unit. The dialogue management unit 8 causes the internal state of the system to transition according to the keyword detected by the matching unit 3. FIG. 7 shows an example of the state transition diagram. The table written below each state lists the keywords to be recognized in that state; the feature quantities of these keywords are stored in the keyword storage unit 4. Each arrow indicates the direction of a state transition: when the keyword written beside an arrow is detected, the state changes along that arrow. For example, if the utterance "konnichiwa" ("hello") is input while the system is initially in "state 1", the transition diagram of FIG. 7 moves the internal state of the system to "state 2". The keywords to be recognized in this state change to "hai" ("yes"), "iie" ("no"), and so on.
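The state-dependent keyword set can be sketched as a small table-driven state machine. Only the transition from state 1 to state 2 on "hello" and the keywords "yes" and "no" in state 2 come from the description; the destinations "state3" and "state1" for those two keywords, and all class and method names, are hypothetical.

```python
# Transition table: state -> {keyword: next state}.
TRANSITIONS = {
    "state1": {"hello": "state2"},
    # Destinations below are placeholders; FIG. 7 is not reproduced here.
    "state2": {"yes": "state3", "no": "state1"},
}


class DialogueManager:
    """Tracks the internal state and the keywords worth matching in it."""

    def __init__(self, transitions, start):
        self.transitions = transitions
        self.state = start

    def active_keywords(self):
        """Keywords whose features the matcher needs in the current state."""
        return sorted(self.transitions.get(self.state, {}))

    def on_keyword(self, keyword):
        """Follow the arrow labelled `keyword`, if the current state has one."""
        nxt = self.transitions.get(self.state, {}).get(keyword)
        if nxt is not None:
            self.state = nxt
        return self.state


dm = DialogueManager(TRANSITIONS, "state1")
before = dm.active_keywords()    # only "hello" is matched in state 1
dm.on_keyword("hello")           # moves to state 2
after = dm.active_keywords()     # now "no" and "yes" are matched
```

Restricting the matcher to `active_keywords()` is what limits the keywords to be recognized next and reduces the matching work per frame.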

[0012] FIG. 4 is a block diagram for explaining an embodiment of the invention of claim 3. 1 is a voice input unit, 2 is an acoustic analysis unit, 3 is a voice matching unit, and 4 is a keyword storage unit; together these constitute the speech recognition unit I. 11 is an image input unit, 12 is an image analysis unit, 13 is an image matching unit, and 14 is a key action storage unit; together these constitute the motion recognition unit II. 25 is a time information acquisition unit, 26 is an integration unit, and 27 is an output unit. In the speech recognition unit I, the end time of a keyword in the input voice is detected by the method described above, so image recognition is described below.

[0013] The image input unit 11 consists of a camera or the like. Images of the human's motion are captured through the image input unit 11, and the image analysis unit 12 obtains the feature quantities of the image for each frame. The feature quantities of predetermined motions (hereinafter called key actions) are stored in the key action storage unit 14. Here, a "nod", in which the head is moved up and down, is taken as an example of a key action. As in speech recognition, the image matching unit 13 detects the end time of the key action in the input images using continuous DP or a similar method.

[0014] FIG. 8 shows an example of detected keywords and key actions: keyword 1, keyword 2, and keyword 3 are detected in the input voice, and nod 1, nod 2, and nod 3 are detected in the input images. The integration unit 26 receives, in sequence, keyword end time information from the voice matching unit 3 and key action end time information from the image matching unit 13. The integration unit 26 applies the keyword end times and key action end times to the probability function $f$ described above, and thereby controls the information output to the output unit 27.
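The merging performed by the integration unit can be sketched as follows. To keep the example deterministic, the probability function f is replaced here by a plain minimum gap of 1 second between accepted events, which is a named simplification and not the method of the description; the function name, the event labels, and the sample times are likewise made up for illustration.

```python
import heapq


def integrate(keyword_ends, action_ends, min_gap=1.0):
    """Merge two time-ordered lists of end times (in seconds) and keep only
    events at least `min_gap` seconds after the previously accepted event.

    The fixed `min_gap` stands in for the probabilistic gate f.
    """
    merged = heapq.merge(
        ((t, "keyword") for t in keyword_ends),
        ((t, "action") for t in action_ends),
    )
    accepted = []
    last = None                     # end time of the last accepted event
    for t, source in merged:
        if last is None or t - last >= min_gap:
            accepted.append((t, source))
            last = t
    return accepted


# Three keywords and three nods, in the spirit of FIG. 8 (times invented).
events = integrate([1.0, 3.0, 6.0], [1.4, 4.5, 6.2])
```

Here the nods at 1.4 s and 6.2 s follow a keyword too closely and are dropped, so a response is triggered by whichever modality finishes first in each cluster.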

[0015]

[Effects of the Invention] Effect corresponding to the invention of claim 1: because back-channel responses are inserted in reaction to keywords, a natural and smooth dialogue with the computer can be realized. Effect corresponding to the invention of claim 2: back-channel responses are prevented from occurring too frequently and causing an unnatural impression. In the inventions of claims 1 and 2, the amount of processing can further be reduced by limiting the keywords to be recognized. Effect corresponding to the invention of claim 3: because back-channel responses are inserted in reaction to human motions and utterance content, a smoother dialogue can be realized.

[Brief description of the drawings]

FIG. 1 is a block diagram for explaining an embodiment of the invention of claim 1.

FIG. 2 is a block diagram for explaining an embodiment of the invention of claim 2.

FIG. 3 is a block diagram for explaining another embodiment of the present invention.

FIG. 4 is a block diagram for explaining an embodiment of the invention of claim 3.

FIG. 5 is a diagram showing the matching distance between an input voice and a keyword.

FIG. 6 is a diagram showing an example of the probability function that controls the output.

FIG. 7 is a diagram showing an example of state transitions and the keywords set for each state.

FIG. 8 is a diagram for explaining an example of detected keywords and key actions.

[Explanation of symbols]

1: voice input unit, 2: acoustic analysis unit, 3: voice matching unit, 4: keyword storage unit, 5: time information acquisition unit, 6: output unit, 7: output control unit, 8: dialogue management unit, 11: image input unit, 12: image analysis unit, 13: image matching unit, 14: key action storage unit, 25: time information acquisition unit, 26: integration unit, 27: output unit, I: speech recognition unit, II: image recognition unit.

Continuation of the front page. (56) References: JP-A-62-40577 (JP, A); JP-A-5-216618 (JP, A). (58) Fields searched (Int. Cl.⁷, DB name): G06F 3/02, G06F 3/16

Claims

(57) [Claims]

1. A dialogue device comprising: a voice input unit that inputs voice; an acoustic analysis unit that obtains feature quantities of the input voice; a keyword storage unit in which predetermined keywords are set and which stores the feature quantities of those keywords; a time information acquisition unit for obtaining the current time; a matching unit that compares the feature quantities of the input voice obtained by the acoustic analysis unit with the keyword feature quantities in the keyword storage unit and detects a keyword in the input voice; an output unit that outputs a back-channel response to the voice input; and an output control unit that controls the back-channel response output time of the output unit on the basis of the start time or end time of the keyword in the input voice detected by the matching unit.

2. The dialogue device according to claim 1, wherein the output control unit suppresses back-channel responses for a fixed time after a back-channel response has been made.

3. A dialogue device comprising: a voice input unit that inputs voice; an acoustic analysis unit that obtains feature quantities of the input voice; a keyword storage unit in which predetermined keywords are set and which stores the feature quantities of those keywords; a first time information acquisition unit for obtaining the current time; a first matching unit that compares the feature quantities of the input voice obtained by the acoustic analysis unit with the keyword feature quantities in the keyword storage unit and detects a keyword in the input voice; an image input unit that inputs images; an image analysis unit that obtains feature quantities of the input image; a key action storage unit in which predetermined key actions are set and which stores the feature quantities of those key actions; a second time information acquisition unit for obtaining the current time; a second matching unit that compares the feature quantities of the input image obtained by the image analysis unit with the key action feature quantities in the key action storage unit and detects a key action in the input image; an output unit that outputs a back-channel response to the voice input or the image input; and an integration unit that integrates the first matching unit and the second matching unit and controls the back-channel response output time of the output unit on the basis of the start time or end time of the keyword in the input voice detected by the first matching unit, or the start time or end time of the key action in the input image detected by the second matching unit.