JP6480351B2

JP6480351B2 - Speech control system, speech control device and speech control program

Info

Publication number: JP6480351B2
Application number: JP2016001177A
Authority: JP
Inventors: 石井　亮; 大塚　和弘; 熊野　史朗
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-01-06
Filing date: 2016-01-06
Publication date: 2019-03-06
Anticipated expiration: 2036-01-06
Also published as: JP2017121680A

Description

The present invention relates to a speech control system, a speech control device, and a speech control program.

Conventionally, some humanoid robots, and some agent systems that display an agent (a virtual person) rendered with computer graphics, have a conversation function for conversing with a plurality of nearby users.

As a dialog system that interacts with a plurality of users, a technique has been disclosed that estimates, from the content of a user's utterance, whether the dialog system should speak next, and then has the dialog system speak accordingly (see Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-6232 (JP 2013-6232 A)

However, the dialog system described in Patent Document 1 decides whether to speak based only on the content of the user's utterance and does not consider to whom the utterance is addressed. The dialog system may therefore speak on the basis of a wrong judgment. For example, suppose that, in a conversation among a plurality of users including user A and user B, user A asks user B, "What do you think?" In this case, the dialog system described in Patent Document 1 cannot determine to whom the question "What do you think?" was addressed, and may therefore speak on the basis of a wrong judgment. Thus, it has been difficult for conventional humanoid robots and agents to speak at an appropriate timing to a plurality of nearby users.

In view of the above circumstances, an object of the present invention is to provide a speech control system, a speech control device, and a speech control program that enable a robot or an agent conversing with a plurality of users to speak at an appropriate timing.

One aspect of the present invention is a speech control system including: a speech control unit that controls the speech of a robot that converses with a plurality of users, or the speech of a speaker displayed on a display device that converses with a plurality of users; and a next speaker estimation unit that acquires a first next speaker probability, which is the probability that a user becomes the next speaker at an arbitrary time, wherein the speech control unit controls the speech of the robot or the speaker based on the first next speaker probability of the user acquired by the next speaker estimation unit.

One aspect of the present invention is the above speech control system, wherein the speech control unit performs control so that the robot or the speaker speaks when the first next speaker probabilities of the plurality of users are equal to or less than a predetermined threshold.

One aspect of the present invention is the above speech control system, wherein the speech control unit performs control so that the robot or the speaker speaks when an integral value obtained by integrating the first next speaker probabilities of the plurality of users over a predetermined time is equal to or less than a predetermined threshold.

One aspect of the present invention is the above speech control system, wherein the next speaker estimation unit acquires the first next speaker probability based on the non-verbal behavior of the user.

One aspect of the present invention is the above speech control system, wherein the robot or the speaker can be controlled to perform a motion corresponding to the non-verbal behavior, the next speaker estimation unit acquires, based on information on the motion of the robot corresponding to the non-verbal behavior, a second next speaker probability, which is the probability that the robot or the speaker becomes the next speaker at the time, and the speech control unit controls the speech of the robot or the speaker based on a comparison between the first next speaker probabilities of the plurality of users and the second next speaker probability of the robot or the speaker.

One aspect of the present invention is the above speech control system, wherein the speech control unit acquires a speech necessity degree, which is an index representing how necessary it is for the robot or the speaker to speak, and changes the value of the second next speaker probability acquired by the next speaker estimation unit according to the acquired speech necessity degree.

One aspect of the present invention is a speech control device including: a speech control unit that controls the speech of a robot that converses with a plurality of users, or the speech of a speaker displayed on a display device that converses with a plurality of users; and a next speaker estimation unit that acquires a first next speaker probability, which is the probability that a user becomes the next speaker at an arbitrary time, wherein the speech control unit controls the speech of the robot or the speaker based on the first next speaker probability of the user acquired by the next speaker estimation unit.

One aspect of the present invention is a speech control program for controlling the speech of a robot that converses with a plurality of users, or the speech of a speaker displayed on a display device that converses with a plurality of users, the program causing a computer to execute: a next speaker estimation step of acquiring a first next speaker probability, which is the probability that a user becomes the next speaker at an arbitrary time; and a speech control step of controlling the speech of the robot or the speaker based on the first next speaker probability of the user acquired in the next speaker estimation step.

According to the present invention, a robot or an agent conversing with a plurality of users can be made to speak at an appropriate timing.

A diagram outlining the functional configuration of the robot 100 in the first embodiment.
A diagram showing a specific configuration example of the sensor 103 in the first embodiment.
A diagram showing an example of the next speaker probability $P^{ns}_i(t)$ output by the next speaker estimation unit 108 in the first embodiment.
A diagram showing a specific example of the detailed configuration of the sound control unit 110 in the first embodiment.
A diagram showing a specific example of the appearance and configuration of the robot 100 in the first embodiment.
A flow chart showing the operation of the robot 100 in the first embodiment.
A diagram outlining the functional configuration of the robot 100A in the second embodiment.
A diagram showing a specific example of the detailed configuration of the control unit 109A in the second embodiment.
A diagram showing a specific example of the detailed configuration of the sound control unit 110A in the second embodiment.
A diagram showing a specific example of the pre-conversation operation of the robot 100A in the second embodiment.
A flow chart showing the conversation operation of the robot 100A in the second embodiment.
A diagram showing an example of a breath inhalation interval.
A diagram showing a specific example of gaze target labels.
A diagram showing time structure information for the gaze target label L1 of user P1 (R = S), who is the speaker.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(First Embodiment)
FIG. 1 is a diagram outlining the functional configuration of a robot (speech control system) 100 in the first embodiment. The robot 100 in the first embodiment converses with a plurality of users. As shown in FIG. 1, the robot 100 includes a microphone 101, a camera 102, a sensor 103, a voice input unit 104, a video input unit 105, a sensor input unit 106, a speech segment detection unit 107, a next speaker estimation unit 108, a speech control unit 109, a sound control unit 110, and a speaker 115.

The microphone 101 collects sound around the robot 100, including the voices of the conversing users, and outputs a sound signal that includes a voice signal (hereinafter simply called a voice signal). The microphone 101 consists of a plurality of microphones, one worn by each of the users. A microphone 101 may also be mounted on the robot 100 itself. The microphone 101 is not limited to being worn by a user or mounted on the robot 100, and may be placed near the users or the robot 100. It is desirable to place the microphone 101 close to the user's mouth, but any number of microphones may be placed at any positions as long as the users' voices can be collected. In the robot 100, the plurality of microphones 101 and the voice input unit 104 are connected by wire or wirelessly so that voice signals can be transmitted and received.

The camera 102 captures video of the conversing users and outputs a video signal. The camera 102 is preferably an imaging device with an angle of view wide enough to capture all of the users. Alternatively, the camera 102 may be a plurality of cameras, one per user, each capturing one of the users. The camera 102 may consist of any number of cameras at any positions as long as all of the users can be captured. When a plurality of cameras is used, the video input unit 105 and the cameras in the robot 100 are connected by wire or wirelessly so that video signals can be transmitted and received.

The sensor 103 includes a plurality of sensors, such as a first sensor that measures the positions of the conversing users, a second sensor that measures the users' breathing motion, a third sensor that detects the users' gaze targets, and a fourth sensor that detects the users' head motion, and outputs the sensor signal from each of these sensors to the sensor input unit 106.

FIG. 2 is a diagram showing a specific configuration example of the sensor 103 in the first embodiment.
As shown in FIG. 2, the sensor 103 includes a position measurement device (first sensor) 201 that measures the positions of the conversing users, a breathing motion measurement device (second sensor) 202 that measures the users' breathing motion, a gaze target detection device (third sensor) 203 that detects the users' gaze targets, and a head motion detection device (fourth sensor) 204 that detects the users' head motion. The position measurement device 201 is installed, for example, in the robot 100, while the breathing motion measurement device 202, the gaze target detection device 203, and the head motion detection device 204 are worn on the user's head or elsewhere on the user. The position measurement device 201 is connected to the sensor input unit 106. The breathing motion measurement device 202, the gaze target detection device 203, and the head motion detection device 204 are connected to the sensor input unit 106 by wire or wirelessly so that sensor signals can be transmitted and received.

The voice input unit 104 receives the voice signal from the microphone 101 and outputs a voice signal to the speech segment detection unit 107, the next speaker estimation unit 108, and the sound control unit 110. The voice input unit 104 performs processing such as converting the voice signal from the microphone 101 into a signal format that can be processed inside the robot 100. The video input unit 105 receives the video signal from the camera 102 and outputs a video signal to the next speaker estimation unit 108. The video input unit 105 performs processing such as converting the video signal from the camera 102 into a signal format that can be processed inside the robot 100. The sensor input unit 106 receives the sensor signal from the sensor 103 and outputs a sensor signal to the next speaker estimation unit 108. The sensor input unit 106 performs processing such as converting the sensor signal from the sensor 103 into a signal format that can be processed inside the robot 100.

Based on the voice signal from the voice input unit 104, the speech segment detection unit 107 sets a window of arbitrary width and calculates, as voice feature values that characterize the voice, quantities such as the power, the number of zero crossings, and the frequency of the voice signal within that window. The speech segment detection unit 107 detects speech segments by comparing the calculated voice feature values with a predetermined threshold. The speech segment detection unit 107 outputs speech segment information, which is information on the detected speech segments, to the next speaker estimation unit 108 and the sound control unit 110. VAD (Voice Activity Detection), which automatically detects segments in which voice is present (speech segments) and segments in which voice is absent (non-speech segments) in the voice signal acquired from the microphone 101, is a known technique, as described in Reference 1 below. The speech segment detection unit 107 detects speech segments using a known VAD technique.
Reference 1: Hiroshi Sawada and four others, "Speech segment detection with many speakers and many microphones: a case with pin microphones," Spring Meeting of the Acoustical Society of Japan, pp. 679-680, March 2007
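The window-and-threshold idea described for the speech segment detection unit 107 can be illustrated with a minimal sketch. It is not the method of Reference 1 or of any particular VAD library; the function names, the non-overlapping windows, and the use of power alone for the decision are assumptions made for illustration (the zero-crossing count is computed but not used in the decision).

```python
def frame_features(samples, window):
    """Return (power, zero_crossings) for each non-overlapping window of samples."""
    features = []
    for start in range(0, len(samples) - window + 1, window):
        frame = samples[start:start + window]
        # Mean squared amplitude of the frame.
        power = sum(x * x for x in frame) / window
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((power, crossings))
    return features


def detect_speech_segments(samples, window, power_threshold):
    """Return (start, end) sample ranges whose frame power meets the threshold.

    Hypothetical decision rule: a frame counts as speech when its power is
    at least power_threshold; adjacent speech frames are merged.
    """
    features = frame_features(samples, window)
    segments = []
    current_start = None
    for k, (power, _crossings) in enumerate(features):
        is_speech = power >= power_threshold
        if is_speech and current_start is None:
            current_start = k * window
        elif not is_speech and current_start is not None:
            segments.append((current_start, k * window))
            current_start = None
    if current_start is not None:
        # Close a segment that runs to the last complete frame.
        segments.append((current_start, len(features) * window))
    return segments
```

A practical detector would likely combine several feature values and smooth the frame decisions over time; the sketch only shows how a threshold on a windowed feature yields speech segment information.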

The next speaker estimation unit 108 receives the voice signal from the voice input unit 104, the video signal from the video input unit 105, the sensor signal from the sensor input unit 106, and the speech segment information from the speech segment detection unit 107, and outputs the next speaker probability, which is the probability that each user becomes the next speaker at time t. Based on the voice signal, the video signal, the sensor signal, and the speech segment information, the next speaker estimation unit 108 acquires speaker information indicating who spoke in the speech segment specified by the speech segment information. Based on the voice signal, the video signal, the sensor signal, and the acquired speaker information, the next speaker estimation unit 108 calculates the next speaker probability $P^{ns}_i(t)$, which is the probability that each user i becomes the next speaker at time t, and outputs it to the speech control unit 109. The next speaker estimation unit 108 calculates the next speaker probability $P^{ns}_i(t)$ based on the users' non-verbal behavior. That is, calculating $P^{ns}_i(t)$ in the next speaker estimation unit 108 does not require obtaining information on the users' verbal behavior, for example by analyzing the content of their utterances.

FIG. 3 is a diagram showing an example of the next speaker probability $P^{ns}_i(t)$ output by the next speaker estimation unit 108 in the first embodiment. FIG. 3 shows, for four users A to D, an example of how the next speaker probability $P^{ns}_i(t)$ changes after time $t_{bue}$, at which user A's utterance ends. The rectangle labeled 31 indicates user A's speech segment. The speech segment 31 ends at the speech end time $t_{bue}$. The dotted line labeled as next speaker probability $P^{ns}_A(t)$ 32 shows the change in user A's next speaker probability at times t after the speech end time $t_{bue}$. The dotted line labeled as next speaker probability $P^{ns}_B(t)$ 33 shows the corresponding change for user B. The dotted line labeled as next speaker probability $P^{ns}_C(t)$ 34 shows the corresponding change for user C. The dotted line labeled as next speaker probability $P^{ns}_D(t)$ 35 shows the corresponding change for user D. In this way, the next speaker estimation unit 108 calculates the change in the next speaker probability $P^{ns}_i(t)$ of each user i at times t after the speech end time $t_{bue}$. The details of the next speaker estimation process in the next speaker estimation unit 108 will be described later.

The speech control unit 109 receives the next speaker probabilities from the next speaker estimation unit 108 and outputs a speech control signal to the sound control unit 110. Based on the next speaker probability of each user from the next speaker estimation unit 108, the speech control unit 109 outputs a speech control signal that controls whether the robot 100 speaks. For example, the speech control unit 109 outputs a speech control signal that keeps the robot 100 from speaking if the next speaker probability of any one user is equal to or higher than a threshold, and that makes the robot 100 speak if the next speaker probabilities of all users are equal to or lower than the threshold. The robot 100 can thereby start speaking at a timing when the users are unlikely to start speaking. In this way, the robot 100 can speak at a more appropriate timing when conversing with a plurality of users.

Specifically, the speech control unit 109 controls the speech of the robot 100 using one of the four control methods described below, the first to fourth control methods. The following description covers the case in which four users A, B, C, and D converse with the robot 100.

(First control method)
The speech control unit 109 compares the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D\}$, of users A to D at time t, acquired from the next speaker estimation unit 108, with an arbitrary probability $P_\alpha$ serving as a first threshold (hereinafter, the first comparison). If the first comparison shows that the next speaker probability of any one of users A to D is equal to or greater than $P_\alpha$ ($P^{ns}_i(t) \geq P_\alpha$), the speech control unit 109 outputs a speech control signal that keeps the robot 100 from speaking. If the first comparison shows that the next speaker probabilities of all of users A to D are less than $P_\alpha$ ($P^{ns}_i(t) < P_\alpha$), the speech control unit 109 outputs a speech control signal that makes the robot 100 speak at time t.
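The first comparison can be written as a short sketch. The function name, the dictionary keyed by user, and the numeric values are assumptions for illustration; only the comparison rule (speak only when every user's probability is below the first threshold) comes from the description.

```python
def should_speak_first_method(probabilities, p_alpha):
    """Return True when the robot may speak at time t under the first control method.

    probabilities: hypothetical mapping from user id to next speaker
                   probability at time t.
    p_alpha:       the first threshold.
    """
    # Speak only if every user's probability is strictly below the threshold;
    # a single user at or above it suppresses the robot's utterance.
    return all(p < p_alpha for p in probabilities.values())


# Illustrative values only.
probabilities_at_t = {"A": 0.10, "B": 0.20, "C": 0.05, "D": 0.30}
robot_speaks = should_speak_first_method(probabilities_at_t, p_alpha=0.5)
```

With these illustrative values every probability is below 0.5, so the sketch decides that the robot speaks.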

(Second control method)
The speech control unit 109 integrates the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D\}$, of users A to D, acquired from the next speaker estimation unit 108, with respect to time t over a predetermined time (for example, 3 to 4 seconds or longer) to obtain an integral value $P^{ns}_i$ for each user. The speech control unit 109 compares each integral value $P^{ns}_i$ with an arbitrary probability $P_\beta$ serving as a second threshold (hereinafter, the second comparison). If the second comparison shows that the integral value of any one of users A to D is equal to or greater than $P_\beta$ ($P^{ns}_i \geq P_\beta$), the speech control unit 109 outputs a speech control signal that keeps the robot 100 from speaking. If the second comparison shows that the integral values of all of users A to D are less than $P_\beta$ ($P^{ns}_i < P_\beta$), the speech control unit 109 outputs a speech control signal that makes the robot 100 speak at time t.
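A minimal sketch of the second comparison follows. The description does not say how the integration is carried out, so the sketch assumes the next speaker probability is sampled at a fixed interval `dt` and approximates the integral with the trapezoidal rule; the function names and data layout are likewise hypothetical.

```python
def integrate(samples, dt):
    """Approximate the time integral of equally spaced samples (trapezoidal rule)."""
    return sum((a + b) * dt / 2.0 for a, b in zip(samples, samples[1:]))


def should_speak_second_method(series_by_user, dt, p_beta):
    """Return True when every user's integrated probability is below p_beta.

    series_by_user: hypothetical mapping from user id to the list of next
                    speaker probabilities sampled over the integration window.
    dt:             assumed sampling interval in seconds.
    p_beta:         the second threshold.
    """
    return all(integrate(s, dt) < p_beta for s in series_by_user.values())
```

Because the integral grows with the length of the integration window, the second threshold would need to be chosen with that window length in mind.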

(Third control method)
Using the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D\}$, of users A to D at time t, acquired from the next speaker estimation unit 108, the speech control unit 109 obtains the sum of the next speaker probabilities of all users, $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t)$. The speech control unit 109 compares this sum with an arbitrary probability $P_\gamma$ serving as a third threshold (hereinafter, the third comparison). If the third comparison shows that the sum is equal to or greater than $P_\gamma$ ($P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t) \geq P_\gamma$), the speech control unit 109 outputs a speech control signal that keeps the robot 100 from speaking. If the third comparison shows that the sum is less than $P_\gamma$ ($P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t) < P_\gamma$), the speech control unit 109 outputs a speech control signal that makes the robot 100 speak at time t.
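The third comparison differs from the first only in that the users' probabilities are summed before thresholding. A minimal sketch, with a hypothetical function name and data layout:

```python
def should_speak_third_method(probabilities, p_gamma):
    """Return True when the sum of all users' next speaker probabilities
    at time t is below the third threshold p_gamma.

    probabilities: hypothetical mapping from user id to next speaker
                   probability at time t.
    """
    return sum(probabilities.values()) < p_gamma
```

Summing makes the decision sensitive to several users each being moderately likely to speak, a situation that a per-user threshold would not catch.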

(Fourth control method)
The speech control unit 109 integrates the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D\}$, of users A to D, acquired from the next speaker estimation unit 108, with respect to time t over a predetermined time (for example, 3 to 4 seconds or longer) to obtain an integral value $P^{ns}_i$ for each user. The speech control unit 109 obtains the sum of the integral values of all users, $P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D$. The speech control unit 109 compares this sum with an arbitrary probability $P_\theta$ serving as a fourth threshold (hereinafter, the fourth comparison). If the fourth comparison shows that the sum is equal to or greater than $P_\theta$ ($P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D \geq P_\theta$), the speech control unit 109 outputs a speech control signal that keeps the robot 100 from speaking. If the fourth comparison shows that the sum is less than $P_\theta$ ($P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D < P_\theta$), the speech control unit 109 outputs a speech control signal that makes the robot 100 speak at time t.
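The fourth comparison combines integration over time with summation over users. The sketch below assumes, for illustration only, that the probabilities are sampled at a fixed interval `dt` and integrated with the trapezoidal rule; the description does not specify the integration scheme, and the names are hypothetical.

```python
def integrate(samples, dt):
    """Approximate the time integral of equally spaced samples (trapezoidal rule)."""
    return sum((a + b) * dt / 2.0 for a, b in zip(samples, samples[1:]))


def should_speak_fourth_method(series_by_user, dt, p_theta):
    """Return True when the sum of all users' integrated probabilities
    is below the fourth threshold p_theta.

    series_by_user: hypothetical mapping from user id to the list of next
                    speaker probabilities sampled over the integration window.
    dt:             assumed sampling interval in seconds.
    """
    total = sum(integrate(s, dt) for s in series_by_user.values())
    return total < p_theta
```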

As shown in FIG. 3, the next speaker probability $P^{ns}_i(t)$, $i \in \{A, B, C, D\}$, often peaks a predetermined time after the end of an utterance. Therefore, in the first to fourth control methods, the speech control unit 109 may set a window containing the time t for which the next speaker probability $P^{ns}_i(t)$ is sought, and use the maximum next speaker probability within that window as the next speaker probability $P^{ns}_i(t)$ at time t. Alternatively, in the first to fourth control methods, the speech control unit 109 may set a window containing the time t for which the next speaker probability $P^{ns}_i(t)$ is sought and, when the next speaker probability has a plurality of peaks within that window, use the next speaker probability at the n-th peak (n is an integer of 1 or more) as the next speaker probability $P^{ns}_i(t)$ at time t.
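The two window-based variants can be sketched as follows. The description does not define how the window is placed around t or what counts as a peak, so the sketch assumes a window centered on a sample index and treats a peak as an interior sample that is greater than its left neighbour and not smaller than its right neighbour. All names are hypothetical.

```python
def window_slice(series, center, half_width):
    """Return the samples within half_width of index center, clipped to the series."""
    lo = max(0, center - half_width)
    hi = min(len(series), center + half_width + 1)
    return series[lo:hi]


def window_max(series, center, half_width):
    """Maximum next speaker probability inside the window around index center."""
    return max(window_slice(series, center, half_width))


def nth_peak(series, center, half_width, n):
    """Value of the n-th peak (1-based) inside the window, or None if absent.

    Assumed peak definition: an interior sample greater than its left
    neighbour and not smaller than its right neighbour.
    """
    window = window_slice(series, center, half_width)
    peaks = [
        window[k]
        for k in range(1, len(window) - 1)
        if window[k - 1] < window[k] >= window[k + 1]
    ]
    return peaks[n - 1] if len(peaks) >= n else None
```

Either value can then stand in for the probability at time t when one of the four comparisons is applied.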

The sound control unit 110 outputs a sound signal to the speaker 115 based on the voice signal from the voice input unit 104, the speech segment information from the speech segment detection unit 107, and the speech control signal from the speech control unit 109. The sound control unit 110 determines, based on the speech control signal, whether to cause the robot 100 to speak. When the sound control unit 110 determines, based on the speech control signal, that the robot 100 is to speak, it generates conversation information containing the conversation content (words) to be uttered by the robot 100 and outputs a sound signal based on the generated conversation information. For example, the sound control unit 110 analyzes the content of the users' conversation based on the voice signal and the speech segment information, and generates the conversation information for the robot 100 to utter based on the analysis result.

Here, the configuration of the sound control unit 110 in the first embodiment will be described in detail by way of an example.
FIG. 4 is a diagram showing a specific example of the detailed configuration of the sound control unit 110 in the first embodiment. The sound control unit 110 includes a voice analysis unit 401, a conversation information generation unit 402, a conversation information DB (database) 403, an utterance information generation unit 404, and a sound signal generation unit 405.

The conversation information DB 403 stores conversation sample information used to make the robot 100 converse. The conversation sample information is information that includes voice signals of nouns often used in everyday conversation, greetings such as "Hello", and phrases frequently used in everyday conversation such as "Thank you" and "Are you all right?".

The voice analysis unit 401 analyzes the voice signal based on the voice signal from the voice input unit 104 and the speech segment information from the speech segment detection unit 107, identifies its content (words), and outputs the analysis result.

The conversation information generation unit 402 generates conversation information, which is the content to be uttered by the robot 100, based on the analysis result of the voice analysis unit 401. Based on that analysis result, the conversation information generation unit 402 acquires from the conversation information DB 403 the conversation sample information that matches the content of the conversation. The conversation information generation unit 402 generates the conversation information based on the acquired conversation sample information. In response to a request for conversation information from the utterance information generation unit 404, the conversation information generation unit 402 generates the conversation information and outputs it to the utterance information generation unit 404.

The utterance information generation unit 404 receives the conversation information from the conversation information generation unit 402 and the speech control information from the speech control unit 109, and outputs an utterance signal. Based on the speech control information from the speech control unit 109, the utterance information generation unit 404 requests conversation information from the conversation information generation unit 402. The utterance information generation unit 404 generates the utterance signal for the robot 100 to vocalize, based on the conversation information acquired from the conversation information generation unit 402 in response to the request and on the speech control information from the speech control unit 109. The utterance information generation unit 404 outputs the generated utterance signal to the sound signal generation unit 405.

The sound signal generation unit 405 receives the utterance signal from the utterance information generation unit 404 and outputs a sound signal. Based on the utterance signal from the utterance information generation unit 404, the sound signal generation unit 405 generates a sound signal for producing the utterance from the speaker 115 and outputs it to the speaker 115.

FIG. 5 is a diagram showing a specific example of the appearance and configuration of the robot 100 in the first embodiment. The robot 100 in the first embodiment has, for example, the appearance shown in FIG. 5 and the functional configuration shown in FIG. 1.

As shown in FIG. 5, the robot 100 is, for example, a humanoid robot shaped after the upper half of a human body. The robot 100 has at least a speech function for making utterances, a voice recognition function for recognizing human voices, and a camera function for photographing users. The robot 100 includes a head 53 having a face on which a right eye 51a, a left eye 51b, and a mouth 52 are arranged.

The robot 100 includes a neck 54 that supports the head 53 and a torso 55 that supports the neck 54. The torso 55 has a right arm 55a and a left arm 55b on the upper part of its sides. A camera 102 is installed between the right eye 51a and the left eye 51b of the head 53. In the following description, the right eye 51a and the left eye 51b are referred to collectively as the eye part 51.

Of the components shown in FIG. 1, only the camera 102 appears in FIG. 5, so an example of the installation positions of the other components shown in FIG. 1 is described here. The microphone 101 and the sensor 103 are installed at any position inside the torso 55 of the robot 100 or at a position away from the torso 55 (for example, at the position of a user). The components shown in FIG. 1 other than the microphone 101, the camera 102, and the sensor 103 are installed inside the robot 100; for example, the speaker 115 is installed inside the mouth 52 shown in FIG. 5.

Next, the operation of the robot 100 in the first embodiment will be described.
FIG. 6 is a flowchart showing the operation of the robot 100 in the first embodiment. The processing shown in FIG. 6 is performed when the robot 100 starts an operation of conversing with a plurality of users.

The voice input unit 104 receives a voice signal from the microphone 101, the video input unit 105 receives a video signal from the camera 102, and the sensor input unit 106 receives a sensor signal from the sensor 103 (step S101). The speech segment detection unit 107 calculates a voice feature based on the voice signal from the voice input unit 104 and detects speech segments by comparing the calculated voice feature with a predetermined threshold (step S102).
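Step S102 can be illustrated with a minimal sketch. The description does not specify which voice feature is used, so mean frame energy is assumed here purely for illustration; the function name, `frame_len`, and `threshold` are likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def detect_speech_segments(samples: Sequence[float], frame_len: int,
                           threshold: float) -> List[Tuple[int, int]]:
    """Return speech segments as (start_frame, end_frame) pairs, end exclusive.

    Assumed feature: mean energy per frame. A frame counts as speech when
    its energy is at or above the threshold.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = f          # a speech segment begins
        elif start is not None:
            segments.append((start, f))  # the segment ended at the previous frame
            start = None
    if start is not None:          # signal ended while still in speech
        segments.append((start, n_frames))
    return segments
```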

Based on the voice signal, the video signal, the sensor signal, and the acquired speaker information, the next speaker estimation unit 108 calculates the next speaker probability $P^{ns}_i(t)$, which is the probability that each user i becomes the next speaker at time t (step S103). Based on the next speaker probability of each user from the next speaker estimation unit 108, the speech control unit 109 outputs a speech control signal that controls whether the robot 100 is to speak (step S104).

The sound control unit 110 determines, based on the speech control signal from the speech control unit 109, whether to cause the robot 100 to speak (step S105). If it determines that the robot 100 is not to speak (NO in step S105), the processing returns to step S101. If it determines that the robot 100 is to speak (YES in step S105), the sound control unit 110 generates conversation information for the robot 100 to utter and outputs a sound signal based on the generated conversation information to the speaker 115 (step S106). The robot 100 thereby produces, from the speaker 115, an utterance corresponding to the sound signal.

The sound control unit 110 determines, based on the speech control signal from the speech control unit 109, whether to end the utterance of the robot 100 (step S107). If the utterance of the robot 100 is not to be ended (NO in step S107), the sound control unit 110 returns to the processing of step S106. If the utterance of the robot 100 is to be ended (YES in step S107), the sound control unit 110 stops generating the conversation information and accordingly stops outputting the sound signal.

Next, the robot 100 determines whether to end the conversation operation of conversing with the plurality of users (step S108). If it determines that the conversation operation is not to be ended (NO in step S108), the processing returns to step S101. If it determines that the conversation operation is to be ended (YES in step S108), the robot 100 ends the conversation operation. For example, the robot 100 starts the conversation operation when a user turns on a power switch (not shown) or turns on a conversation mode switch (not shown), and ends the conversation operation when a user turns off the power switch or turns off the conversation mode switch.
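The flow of steps S101 to S108 can be summarized in a small sketch. Every callable passed in is a hypothetical stand-in for the corresponding unit (input units, speech segment detection unit 107, next speaker estimation unit 108, speech control unit 109, sound control unit 110); none of these names come from the description, and the sketch only mirrors the branching of FIG. 6 as described above.

```python
from typing import Any, Callable


def conversation_loop(read_inputs: Callable[[], Any],
                      detect_segments: Callable[[Any], Any],
                      estimate: Callable[[Any, Any], Any],
                      wants_to_speak: Callable[[Any], bool],
                      emit_sound: Callable[[Any, Any], None],
                      speech_finished: Callable[[], bool],
                      conversation_finished: Callable[[], bool]) -> None:
    """Run the conversation operation until it is ended in step S108."""
    while True:
        inputs = read_inputs()                  # S101: voice, video, sensor signals
        segments = detect_segments(inputs)      # S102: speech segment detection
        probs = estimate(inputs, segments)      # S103: next speaker probabilities
        if not wants_to_speak(probs):           # S104, S105: speech control decision
            continue                            # NO in S105: back to S101
        while True:
            emit_sound(inputs, segments)        # S106: generate and output sound
            if speech_finished():               # S107: end the utterance?
                break                           # YES: stop the sound output
        if conversation_finished():             # S108: end the conversation?
            return
```

As in the described flow, the end-of-conversation check is reached only after an utterance has ended; a "do not speak" decision returns directly to signal acquisition.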

As described above, when conversing with a plurality of users, the robot 100 in the first embodiment can speak at a more appropriate timing, one at which each user is unlikely to speak, based on the next speaker probability of each user. The robot 100 can thereby speak with fewer utterance collisions, in which the timing of its utterance overlaps with that of a user. In addition, while speaking, the robot 100 can end its utterance when, based on the next speaker probability of each user, one of the users appears likely to speak.

(Second Embodiment)
The robot 100A in the second embodiment will be described. The robot 100A in the second embodiment differs from the robot 100 in the first embodiment in that it obtains its own next speaker probability $P^{ns}_R(t)$ from its own movements (breathing motion, gaze motion, head motion), and in that it controls its utterances based on the obtained next speaker probability $P^{ns}_R(t)$ and the next speaker probabilities of the users.

The robot 100A in the second embodiment also differs in that, during a conversation, it moves in the same way as a person who is conversing. During a conversation, the robot 100A performs breathing motions such as emitting breathing sounds and changing the swell of its chest, gaze motions such as directing its gaze toward the current speaker, and head motions such as nodding in response to the conversation.

FIG. 7 is a diagram schematically showing the functional configuration of the robot 100A in the second embodiment. The robot 100A in the second embodiment shown in FIG. 7 includes the same components as the robot 100 in the first embodiment. In the description of the robot 100A, therefore, components identical to those of the robot 100 in the first embodiment are given the same reference numerals and their description is omitted.

As shown in FIG. 7, the robot 100A includes a microphone 101, a camera 102, a sensor 103, a voice input unit 104, a video input unit 105, a sensor input unit 106, a speech segment detection unit 107, a next speaker estimation unit 108A, a control unit 109A, a sound control unit 110A, a mouth control unit 111, a gaze control unit 112, a head control unit 113, a torso control unit 114, a speaker 115, a mouth drive unit 116, an eye drive unit 117, a head drive unit 118, and a torso drive unit 119.

The next speaker estimation unit 108A receives the voice signal from the voice input unit 104, the video signal from the video input unit 105, the sensor signal from the sensor input unit 106, the speech segment information from the speech segment detection unit 107, and a pseudo sensor signal from the control unit 109A, and outputs the next speaker probability, which is the probability that each user and the robot 100A become the next speaker at time t. The pseudo sensor signal is the sensor signal that the sensor 103 would output if the robot 100A were operated according to the motion control signal generated by the control unit 109A and that motion of the robot 100A were detected by the sensor 103.

Based on the voice signal, the video signal, the sensor signal, and the speech segment information, the next speaker estimation unit 108A acquires speaker information indicating the speaker of the speech segment identified by the speech segment information. Based on the voice signal, the video signal, the sensor signal, the pseudo sensor signal, and the acquired speaker information, the next speaker estimation unit 108A calculates the next speaker probability $P^{ns}_i(t)$, which is the probability that the robot 100A and each user i become the next speaker at time t, and outputs it to the control unit 109A. In addition to the next speaker probability $P^{ns}_i(t)$, the next speaker estimation unit 108A outputs the speaker information and the position information of the users to the control unit 109A.

The next speaker estimation unit 108A may acquire the position information of the users based on, for example, a sensor signal from the sensor 103 that measures the users' positions, based on the video signal, or based on both the sensor signal from the sensor 103 that measures the users' positions and the video signal.

The control unit 109A receives the next speaker probability $P^{ns}_i(t)$, the speaker information, and the position information of the users from the next speaker estimation unit 108A, and outputs a speech control signal and a motion control signal. Based on the next speaker probabilities of the robot 100A and of each user from the next speaker estimation unit 108A, the control unit 109A generates a speech control signal that controls whether the robot 100A is to speak, and outputs the generated speech control signal to the sound control unit 110A. Based on the generated speech control signal, the speaker information, and the position information of the users, the control unit 109A generates a motion control signal and outputs it to the mouth control unit 111, the gaze control unit 112, the head control unit 113, and the torso control unit 114.

The control unit 109A generates the pseudo sensor signal based on the motion control signal and outputs the generated pseudo sensor signal to the next speaker estimation unit 108A. This allows the next speaker estimation unit 108A to estimate the next speaker probability $P^{ns}_R(t)$, which is the probability that the robot 100A becomes the next speaker at time t. Although the control unit 109A in the second embodiment is configured to generate the pseudo sensor signal, the configuration is not limited to this. For example, the control unit 109A may lack the function of generating the pseudo sensor signal, and the next speaker estimation unit 108A may instead have the function of generating the pseudo sensor signal from the motion control signal of the robot 100A. Alternatively, the next speaker estimation unit 108A may be configured to estimate the next speaker probability of the robot 100A directly from the motion control signal of the robot 100A.

The control unit 109A outputs a speech control signal that controls the utterances of the robot 100A according to the relationship between the next speaker probability $P^{ns}_R(t)$ of the robot 100A and the next speaker probabilities of all users. For example, if the next speaker probability $P^{ns}_R(t)$ of the robot 100A is equal to or less than the next speaker probabilities of all users, the control unit 109A outputs a speech control signal that controls the robot 100A not to speak. If the next speaker probability $P^{ns}_R(t)$ of the robot 100A is greater than the next speaker probabilities of all users, the control unit 109A outputs a speech control signal that controls the robot 100A to speak. The robot 100A can thereby start speaking at a timing when its own next speaker probability is high and the next speaker probabilities of the users are low. In this way, the robot 100A can speak at a more appropriate timing when conversing with a plurality of users.

Specifically, the control unit 109A controls the utterances of the robot 100A using any one of the four control methods described below, namely the fifth to eighth control methods. The following description covers the case in which four users A, B, C, and D and the robot 100A, denoted R, hold a conversation.

(Fifth control method)
The control unit 109A acquires, from the next speaker estimation unit 108A, the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D, R\}$, of users A to D and of the robot 100A (R) at time t. The control unit 109A compares the next speaker probability $P^{ns}_R(t)$ of the robot 100A with the next speaker probabilities $P^{ns}_A(t)$, $P^{ns}_B(t)$, $P^{ns}_C(t)$, and $P^{ns}_D(t)$ of users A to D. If it determines that $P^{ns}_R(t)$ is the largest, it outputs a speech control signal that controls the robot 100A to speak at time t. If it determines in this comparison that $P^{ns}_R(t)$ is not the largest, the control unit 109A outputs a speech control signal that controls the robot 100A not to speak.
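The fifth control method reduces to a single comparison, sketched below. The function name and the dictionary representation are assumptions. The description does not say how ties are handled; the sketch treats a tie as "not the largest", which matches the rule stated earlier that the robot does not speak when its probability is equal to or less than that of the users.

```python
from typing import Mapping


def fifth_control(probs: Mapping[str, float], robot: str = "R") -> bool:
    """Return True if the robot should speak at time t.

    probs maps each participant ("A".."D" and the robot "R") to its next
    speaker probability at time t.
    """
    p_robot = probs[robot]
    # Speak only if the robot's probability strictly exceeds every user's.
    return all(p_robot > p for who, p in probs.items() if who != robot)
```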

(Sixth control method)
The control unit 109A integrates the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D, R\}$, of the robot 100A and users A to D at time t, acquired from the next speaker estimation unit 108A, over a predetermined time (for example, 3 to 4 seconds or more) with respect to time t, and obtains the integral values $P^{ns}_i$. The control unit 109A compares the integral value $P^{ns}_R$ of the robot 100A with the integral values $P^{ns}_A$, $P^{ns}_B$, $P^{ns}_C$, and $P^{ns}_D$ of users A to D. If it determines that $P^{ns}_R$ is the largest, it outputs a speech control signal that controls the robot 100A to speak at the time t at which $P^{ns}_R(t)$ is at its maximum. If it determines in this comparison that $P^{ns}_R$ is not the largest, the control unit 109A outputs a speech control signal that controls the robot 100A not to speak.
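A minimal sketch of the sixth control method follows. The rectangle-rule integral, the sampling interval `dt`, the strict handling of ties, and the convention of returning a sample index (or `None` for "do not speak") are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence


def sixth_control(series: Mapping[str, Sequence[float]], dt: float,
                  robot: str = "R") -> Optional[int]:
    """Return the sample index at which the robot should speak, or None.

    series maps each participant to samples of its next speaker probability
    over the integration window.
    """
    integrals = {who: sum(vals) * dt for who, vals in series.items()}
    r = integrals[robot]
    if all(r > v for who, v in integrals.items() if who != robot):
        robot_series = series[robot]
        # Speak at the time where the robot's own probability peaks.
        return max(range(len(robot_series)), key=robot_series.__getitem__)
    return None
```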

(Seventh control method)
The control unit 109A acquires, from the next speaker estimation unit 108A, the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D, R\}$, of users A to D and of the robot 100A (R) at time t. The control unit 109A obtains the sum of the next speaker probabilities of all users, $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t)$. The control unit 109A compares this sum with $P^{ns}_R(t) \cdot \iota$, the next speaker probability $P^{ns}_R(t)$ of the robot 100A multiplied by a constant $\iota$ ($\iota$ is an arbitrary positive constant). If the control unit 109A determines that $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t) \geq P^{ns}_R(t) \cdot \iota$, it outputs a speech control signal that controls the robot 100A not to speak. If the control unit 109A determines that $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t) < P^{ns}_R(t) \cdot \iota$, it outputs a speech control signal that controls the robot 100A to speak at time t. The constant $\iota$ may be set, for example, to the number of users; it may be set to a larger value to give the robot 100A more opportunities to speak, or to a smaller value to make the robot 100A speak more sparingly.
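The seventh control method can be sketched as a weighted comparison. The function and parameter names are assumptions introduced for illustration; `iota` corresponds to the constant $\iota$ above.

```python
from typing import Mapping


def seventh_control(probs: Mapping[str, float], iota: float,
                    robot: str = "R") -> bool:
    """Return True if the robot should speak at time t.

    The sum of the users' next speaker probabilities is compared with the
    robot's probability scaled by the positive constant iota.
    """
    users_sum = sum(p for who, p in probs.items() if who != robot)
    # Sum >= P_R * iota: do not speak. Sum < P_R * iota: speak.
    return users_sum < probs[robot] * iota
```

With four users at 0.125 each and the robot at 0.25, setting `iota` to the number of users (4) lets the robot speak, whereas `iota = 2` makes both sides equal and suppresses speech, illustrating how a larger constant gives the robot more opportunities.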

(Eighth control method)
The control unit 109A integrates the next speaker probabilities $P^{ns}_i(t)$, $i \in \{A, B, C, D, R\}$, of the robot 100A and users A to D at time t, acquired from the next speaker estimation unit 108A, over a predetermined time (for example, 3 to 4 seconds or more) with respect to time t, and obtains the integral values $P^{ns}_i$. The control unit 109A obtains the sum of the integral values of all users, $P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D$. The control unit 109A compares this sum with $P^{ns}_R \cdot \zeta$, the integral value $P^{ns}_R$ of the robot 100A multiplied by a constant $\zeta$ ($\zeta$ is an arbitrary positive constant). If the control unit 109A determines that $P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D \geq P^{ns}_R \cdot \zeta$, it outputs a speech control signal that controls the robot 100A not to speak. If the control unit 109A determines that $P^{ns}_A + P^{ns}_B + P^{ns}_C + P^{ns}_D < P^{ns}_R \cdot \zeta$, it outputs a speech control signal that controls the robot 100A to speak at the time t at which $P^{ns}_R(t)$ is at its maximum.
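The eighth control method combines integration with the weighted comparison, as in the sketch below. The rectangle-rule integral, the sampling interval `dt`, and the convention of returning a sample index (or `None` for "do not speak") are assumptions; `zeta` corresponds to the constant $\zeta$ above.

```python
from typing import Mapping, Optional, Sequence


def eighth_control(series: Mapping[str, Sequence[float]], dt: float,
                   zeta: float, robot: str = "R") -> Optional[int]:
    """Return the sample index at which the robot should speak, or None."""
    integrals = {who: sum(vals) * dt for who, vals in series.items()}
    users_sum = sum(v for who, v in integrals.items() if who != robot)
    if users_sum < integrals[robot] * zeta:
        robot_series = series[robot]
        # Speak at the time where the robot's own probability peaks.
        return max(range(len(robot_series)), key=robot_series.__getitem__)
    return None
```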

As shown in FIG. 3, the next speaker probability $P^{ns}_i(t)$, $i \in \{A, B, C, D, R\}$, often has a peak a predetermined time after the end of an utterance. Therefore, in the fifth to eighth control methods, the control unit 109A may set a window that contains the time t at which the next speaker probability $P^{ns}_i(t)$ is evaluated, and may output the maximum value of the next speaker probability within that window as the next speaker probability $P^{ns}_i(t)$ at time t. Alternatively, in the fifth to eighth control methods, the control unit 109A may set a window that contains the time t and, when the next speaker probability has a plurality of peaks within that window, output the next speaker probability of the n-th peak (n is an integer of 1 or more) as the next speaker probability $P^{ns}_i(t)$ at time t.

The sound control unit 110A outputs a sound signal to the speaker 115 based on the voice signal from the voice input unit 104, the speech segment information from the speech segment detection unit 107, and the speech control signal from the control unit 109A. The sound control unit 110A determines, based on the speech control signal, whether to cause the robot 100A to speak. When the sound control unit 110A determines, based on the speech control signal, that the robot 100A is to speak, it generates conversation information containing the conversation content (words) to be uttered by the robot 100A and outputs a sound signal based on the generated conversation information. For example, the sound control unit 110A analyzes the content of the users' conversation based on the voice signal and the speech segment information, and generates the conversation information for the robot 100A to utter based on the analysis result.

The mouth control unit 111 outputs a mouth drive signal to the mouth drive unit 116 based on the motion control signal from the control unit 109A. The gaze control unit 112 outputs an eye drive signal to the eye drive unit 117 based on the motion control signal from the control unit 109A. The head control unit 113 outputs a head drive signal to the head drive unit 118 based on the motion control signal from the control unit 109A. The torso control unit 114 outputs a torso drive signal to the torso drive unit 119 based on the motion control signal from the control unit 109A.

The appearance of the robot 100A in the second embodiment is the same as that of the robot 100 shown in FIG. 2. The arrangement of the mouth drive unit 116, the eye drive unit 117, the head drive unit 118, and the torso drive unit 119 of the robot 100A, and what each of them drives, are described here with reference to FIG. 2. The head 23 includes the eye drive unit 117, which moves the pupils (line of sight) of the right eye 21a and the left eye 21b, and the mouth drive unit 116, which opens and closes the mouth 22.

The neck 24 supports the head 23 and includes the head drive unit 118, which makes the head 23 perform predetermined movements (for example, nodding or turning the face). The torso 25 includes the torso drive unit 119, which moves the shoulders and expands the chest as if the robot were breathing. The mouth drive unit 116 opens and closes the mouth 22 of the robot 100A based on the mouth drive signal from the mouth control unit 111. The eye drive unit 117 controls the direction of the pupils in the eyes 21 of the robot 100A (that is, the direction of the robot 100A's line of sight) based on the eye drive signal from the gaze control unit 112.

The head drive unit 118 controls the movement of the head 23 of the robot 100A based on the head drive signal from the head control unit 113. The torso drive unit 119 controls the shape of the torso 25 of the robot 100A based on the torso drive signal from the torso control unit 114. The torso drive unit 119 also controls the movement of the right arm 25a and the left arm 25b of the robot 100A based on the same torso drive signal.

The configuration of the control unit 109A in the second embodiment is described in detail by way of an example.
FIG. 8 is a diagram showing a specific example of the detailed configuration of the control unit 109A in the second embodiment. The control unit 109A includes a speech control unit 301, a motion pattern information storage unit 302, a motion control unit 303, and a sensor signal conversion unit 304. The speech control unit 301 outputs a speech control signal that controls the speech of the robot 100A.

The speech control unit 301 takes the next-speaker probabilities from the next speaker estimation unit 108A as input and outputs a speech control signal to the sound control unit 110A. Based on the next-speaker probabilities of the robot 100A and of each user, the speech control unit 301 outputs a speech control signal that controls the speech of the robot 100A using one of the four control methods described above, namely the fifth to eighth control methods. The speech control unit 301 also outputs the speech control signal to the motion control unit 303. When it receives from the motion control unit 303 a sound generation instruction signal instructing that a breathing sound or a filler be produced, it forwards that signal to the sound control unit 110A. Here, a filler is a vocalization that fills a gap, for instance while hesitating, such as "ano", "sono", or "etto" (Japanese counterparts of "um", "uh", and "well").

The motion pattern information storage unit 302 stores motion pattern information, which describes the patterns of motion that the robot 100A performs during a conversation. For example, it stores motion pattern information that makes the robot 100A perform, before it starts speaking, the same kind of action that people perform to let those around them sense that they are about to speak.

An analysis of what a non-speaking person does immediately before speaking as the next speaker in a multi-person conversation suggests that the following actions (1) to (3) signal to those nearby that "I am going to speak next".
(1) Producing an inhalation sound or a filler
(2) Directing the gaze at the current speaker
(3) Nodding along with what the current speaker says

Drawing on this analysis, the control unit 109A has the robot 100A perform actions (1) to (3) before it speaks, so that users can anticipate that the robot 100A is about to start speaking. When the robot 100A performs actions (1) to (3), the next-speaker probability $P^{ns}_R(t)$ that the next speaker estimation unit 108A estimates for the robot 100A rises. In other words, the actions that let surrounding people sense an upcoming utterance include, for example, shifting the gaze to the current speaker, nodding the head, and inhaling with an audible inhalation sound.

The control unit 109A may use the techniques described in the following references to have the robot 100A perform actions (1) to (3).
For action (1), producing an inhalation sound, there is a known technique described in Reference 2.
Reference 2: Naoto Yoshida and 3 others, "Design of biological bodily-emotion interaction based on factor analysis of breathing expression with breath and abdominal movement", HAI Symposium 2014, 2014
For action (2), directing the gaze at the current speaker, there is a known technique described in Reference 3.
Reference 3: Ryo Ishii and 2 others, "Gaze control for promoting conversation in an avatar voice chat system", Transactions of the Human Interface Society, 2008
For action (3), nodding along with what the current speaker says, there is a known technique described in Reference 4.
Reference 4: Tomio Watanabe and 3 others, "Embodied interaction system based on speech using InterActor", Transactions of the Human Interface Society, Vol. 2, No. 2, pp. 21-29, 2000

The motion control unit 303 acquires motion pattern information from the motion pattern information storage unit 302 based on the speech control signal from the speech control unit 301 and on the speaker information and user position information from the next speaker estimation unit 108A. It then generates a motion control signal and outputs it to the mouth control unit 111, the gaze control unit 112, the head control unit 113, and the torso control unit 114. Depending on the motion pattern information, the motion control unit 303 also outputs to the speech control unit 301 a sound generation instruction signal instructing that a breathing sound or a filler be produced. The position measurement device 201 may have a function of acquiring, as the user position information, position information that identifies the position of each user's face. With such face position information, the robot 100A can, for example, turn its face and line of sight toward the face of the current speaker.

The configuration of the sound control unit 110A in the second embodiment is described in detail by way of an example.
FIG. 9 is a diagram showing a specific example of the detailed configuration of the sound control unit 110A in the second embodiment. The sound control unit 110A shown in FIG. 9 includes the same components as the sound control unit 110 in the first embodiment. In the description of the sound control unit 110A, components identical to those of the sound control unit 110 are therefore given the same reference signs and their description is omitted.

The sound control unit 110A includes a voice analysis unit 401, a conversation information generation unit 402, a conversation information DB 403A, an utterance information generation unit 404A, and a sound signal generation unit 405.

The conversation information DB 403A stores breathing sound information and filler information in addition to the conversation sample information stored in the conversation information DB 403 of the first embodiment. The breathing sound information describes the breathing sounds that the robot 100A is to produce. For example, it contains the audio signal of an inhalation sound such as "suu" or "shuu" that a person makes when breathing in. The filler information describes the fillers that the robot 100A is to produce. It contains the audio signals of fillers such as "ano", "sono", and "etto".

In addition to the functions of the utterance information generation unit 404 of the first embodiment, the utterance information generation unit 404A requests, from the conversation information generation unit 402, conversation information for producing a breathing sound or a filler whenever it receives from the control unit 109A a sound generation instruction signal instructing that one be produced. In response, the conversation information generation unit 402 looks up in the conversation information DB 403A the breathing sound information or filler information containing the corresponding audio signal, and outputs conversation information for producing the breathing sound or filler to the utterance information generation unit 404A. Based on that conversation information, the utterance information generation unit 404A generates an utterance signal that makes the robot 100A produce the breathing sound or filler, and outputs the generated utterance signal to the sound signal generation unit 405.

With the above configuration, when the robot 100A wants to speak, it can direct its gaze at a user and produce a breathing sound or a filler before speaking, based on the motion control signal. This raises the next-speaker probability $P^{ns}_R(t)$ of the robot 100A, which makes it more likely that the control unit 109A will output a speech control signal that has the robot 100A speak. Users can anticipate, before the robot 100A starts speaking, that it is about to speak. This anticipation prevents speech collisions between users and the robot 100A and enables a smooth conversation.

FIG. 10 is a diagram showing a specific example of the pre-utterance action of the robot 100A in the second embodiment. As shown in FIG. 10, the example involves the robot 100A and a current speaker 60, a user who is speaking. The left side of FIG. 10 shows the robot 100A listening to the current speaker 60. The right side of FIG. 10 shows the action of the robot 100A immediately before it starts speaking. Immediately before starting to speak, the robot 100A rotates its head 53 in the direction of arrow 61 and thereby directs its gaze at the current speaker 60. At the same time as the head rotation, or shortly before or after it, the robot produces an inhalation sound 62 ("suu") from the speaker 115 located in the mouth 52. This raises the next-speaker probability $P^{ns}_R(t)$ of the robot 100A and therefore the likelihood that the robot 100A will speak. The current speaker 60 can also anticipate that the robot 100A is about to speak. In other words, the robot 100A can speak at a more appropriate time.

Next, the conversation operation of the robot 100A in the second embodiment is described.
FIG. 11 is a flowchart showing the conversation operation of the robot 100A in the second embodiment. Like the process shown in FIG. 6, the process shown in FIG. 11 is performed when the robot 100A starts conversing with a plurality of users.

The voice input unit 104 receives the voice signal from the microphone 101, the video input unit 105 receives the video signal from the camera 102, and the sensor input unit 106 receives the sensor signal from the sensor 103. In addition, the robot 100A performs its conversation actions under the control of the control unit 109A (step S201). These conversation actions include actions (1) to (3) described above. In accordance with the conversation actions of the robot 100A, the control unit 109A outputs a pseudo sensor signal to the next speaker estimation unit 108A.

The speech segment detection unit 107 calculates a speech feature from the voice signal supplied by the voice input unit 104 and detects speech segments by comparing the calculated feature with a predetermined threshold (step S202). Based on the voice signal, the video signal, the sensor signal, the pseudo sensor signal, and the speaker information, the next speaker estimation unit 108A calculates the next-speaker probability $P^{ns}_i(t)$, the probability that participant $i$ (the robot 100A or a user) becomes the next speaker at time $t$ (step S203).
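Step S202 can be sketched as follows. The text only states that a speech feature is compared with a predetermined threshold. The choice of short-time mean energy as the feature, the frame length, the threshold value, and the function name `detect_speech_frames` are all assumptions made for illustration.

```python
def detect_speech_frames(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where the frame counts as speech.

    Assumption: the "speech feature" is the mean energy of each
    non-overlapping frame; the text does not name the actual feature.
    """
    flags = []
    # Walk over complete frames only; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags
```

Runs of consecutive `True` frames would then be treated as candidate speech segments; any smoothing or hangover logic is left out of this sketch.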

The control unit 109A outputs a speech control signal based on the next-speaker probabilities of the robot 100A and of each user supplied by the next speaker estimation unit 108A (step S204). Based on the speech control signal from the control unit 109A, the sound control unit 110A determines whether to have the robot 100A speak (step S205). If it determines that the robot 100A should not speak (NO in step S205), the process returns to step S201. If it determines that the robot 100A should speak (YES in step S205), the sound control unit 110A generates conversation information for the robot 100A to utter and outputs a sound signal based on that conversation information to the speaker 115 (step S206). The robot 100A thereby produces the utterance corresponding to the sound signal from the speaker 115.

Based on the speech control signal from the control unit 109A, the sound control unit 110A determines whether to end the robot 100A's utterance (step S207). If the utterance is not to end (NO in step S207), the sound control unit 110A returns to step S206. If the utterance is to end (YES in step S207), the sound control unit 110A stops generating conversation information and accordingly stops outputting the sound signal.

Next, the robot 100A determines whether to end the conversation operation with the plurality of users (step S208). If it determines that the conversation operation should not end (NO in step S208), the process returns to step S201. If it determines that the conversation operation should end (YES in step S208), the robot 100A ends the conversation operation.
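The overall flow of FIG. 11 can be summarized in a short sketch. It is not the control logic of the embodiment: the fifth to eighth control methods are defined elsewhere in the description, so the decision in steps S204 and S205 is passed in as a hypothetical callback `decide_to_speak`, and the per-cycle next-speaker probabilities (the output of step S203) are supplied as ready-made dictionaries.

```python
def run_conversation(cycles, decide_to_speak):
    """Walk through the S201-S208 loop for a sequence of estimation cycles.

    cycles: iterable of dicts mapping a participant name to its
            next-speaker probability (stand-in for steps S201-S203).
    decide_to_speak: hypothetical callback standing in for S204/S205.
    Returns a log with "speak" or "listen" for each cycle.
    """
    log = []
    for probs in cycles:
        if decide_to_speak(probs):   # S204, S205: YES
            log.append("speak")      # S206, S207: utter until finished
        else:                        # S205: NO
            log.append("listen")     # return to S201
    return log                       # S208: conversation ends with the input


def robot_has_highest_probability(probs):
    """Illustrative rule only: speak when the robot's probability is the maximum."""
    return probs["robot"] == max(probs.values())
```

For example, `run_conversation([{"robot": 0.2, "user1": 0.7}, {"robot": 0.6, "user1": 0.3}], robot_has_highest_probability)` stays silent in the first cycle and speaks in the second.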

As described above, when conversing with a plurality of users, the robot 100A in the second embodiment can speak at a more appropriate time by comparing its own next-speaker probability with that of each user. The robot 100A can thereby speak with fewer speech collisions, in which its utterance overlaps in time with that of a user. Moreover, by performing, before it speaks, actions that let surrounding people sense that it is about to speak, the robot 100A can reduce speech collisions more reliably.

(Modification of the second embodiment)
In a modification of the second embodiment, a speech necessity degree is calculated and used to change the next-speaker probability of the robot 100A. The speech necessity degree is an index of how necessary it is for the robot 100A to speak, and takes a larger value the greater the need. The robot 100A obtains the content of the users' utterances using known speech recognition technology and calculates the speech necessity degree from that content.
As a concrete calculation method, Reference 5 below describes a method that calculates an utterance level $U_L$, indicating the speech necessity degree, from the "utterance content", the "relationship with the other party", and the "personality of the speaker".
Reference 5: Maiko Kawazoe and Yasuhiko Kitamura, "Multi-agent dialogue system considering utterance timing", IEICE Technical Report, AI (Artificial Intelligence and Knowledge-Based Processing), 106(617), pp. 53-56, March 21, 2007

Using the utterance level $U_L$, the control unit 109A obtains, for example by the following equation, a value $P^{ns}_R(t)'$ that replaces the next-speaker probability $P^{ns}_R(t)$ used in the fifth or seventh control method described above.
$$P^{ns}_R(t)' = P^{ns}_R(t) \cdot U_L \cdot \zeta \quad (\zeta \text{ is an arbitrary constant})$$
As a variation of the above equation, the following equation may be used.
$$P^{ns}_R(t)' = P^{ns}_R(t) + U_L \cdot \zeta \quad (\zeta \text{ is an arbitrary constant})$$

Using the utterance level $U_L$, the control unit 109A obtains, for example by the following equation, a value $P^{ns}_R{}'$ that replaces the integrated value $P^{ns}_R$ used in the sixth or eighth control method described above.
$$P^{ns}_R{}' = P^{ns}_R \cdot U_L \cdot \zeta \quad (\zeta \text{ is an arbitrary constant})$$
As a variation of the above equation, the following equation may be used.
$$P^{ns}_R{}' = P^{ns}_R + U_L \cdot \zeta \quad (\zeta \text{ is an arbitrary constant})$$
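The adjustment has the same form for the instantaneous probability $P^{ns}_R(t)$ and for the integrated value $P^{ns}_R$, so a single helper covers both. The following is a minimal sketch; the function name, the default value of $\zeta$, and the `additive` switch are assumptions for illustration.

```python
def adjust_next_speaker_score(p_robot, utterance_level, zeta=1.0, additive=False):
    """Apply the utterance level U_L to the robot's next-speaker score.

    p_robot: P^ns_R(t) (instantaneous) or P^ns_R (integrated value).
    utterance_level: U_L, larger when speaking is more necessary.
    zeta: the arbitrary constant from the equations (default is an assumption).
    additive: False gives P * U_L * zeta, True gives P + U_L * zeta.
    """
    if additive:
        return p_robot + utterance_level * zeta
    return p_robot * utterance_level * zeta
```

Neither form is guaranteed to stay within [0, 1], and the text does not describe any renormalization, so the result is best read as a score to be compared against the users' values rather than as a probability.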

By controlling speech based on a next-speaker probability that varies with the speech necessity degree in this way, the robot 100A becomes more likely to speak when speech is needed and less likely to speak when it is not. In other words, the robot 100A can speak at a more appropriate time.

(Specific example of the next-speaker estimation process common to the first and second embodiments)
Next, a specific example of the next-speaker estimation process common to the robot 100 in the first embodiment and the robot 100A in the second embodiment is described. The techniques of References 6 and 7 below, for example, can be applied to next-speaker estimation in the robot 100 and the robot 100A, but any existing technique may be used. When the techniques of References 6 and 7 are used, the next speaker estimation unit 108 or the next speaker estimation unit 108A predicts the next speaker and the utterance timing from the transition patterns of the gaze behavior of the speaker and the non-speakers, which are derived from the gaze target information output by the gaze target detection device 203.

Reference 6: Japanese Unexamined Patent Application Publication No. 2014-238525
Reference 7: Ryo Ishii and 4 others, "Prediction of next speaker and utterance timing based on gaze transition patterns in multi-party conversation", JSAI SIG Technical Report, SIG-SLUD-B301-06, pp. 27-34, 2013

The following is an example of a next-speaker estimation technique, other than those of References 6 and 7, that can be applied to this embodiment.
The breathing of the users in a conversation is closely related to who speaks next and when. This technique exploits that relationship: it measures the users' breathing in real time, detects from the measurements the characteristic breathing that occurs immediately before an utterance starts, and from it calculates the next speaker and the utterance timing with high accuracy. Specifically, breathing immediately before an utterance has the following characteristics. A speaker who is going to keep speaking (speaker continuation) inhales sharply immediately after finishing an utterance. Conversely, a speaker who is not going to speak next (speaker change) waits longer after the end of the utterance than in the continuation case and then inhales slowly. At a speaker change, the next speaker takes a larger breath than the non-speakers who will not speak. The breath taken before an utterance occurs at a roughly fixed timing relative to the start of that utterance. Because the next speaker takes this characteristic breath immediately before speaking, information about inhalation is useful for predicting the next speaker and the utterance timing. This technique therefore focuses on a person's inhalation and predicts the next speaker and the utterance timing using information such as the amount of air inhaled, the length of the inhalation section, and its timing.

The following assumes a situation in which $A$ users $P_1, \ldots, P_A$ communicate face to face. Each user $P_a$ (where $a = 1, \ldots, A$ and $A \geq 2$) wears a respiratory motion measurement device 202 and a microphone 101. The respiratory motion measurement device 202 measures the breathing of user $P_a$, obtains respiration information $B_{a,t}$ representing the measurement result at each discrete time $t$, and outputs it to the next speaker estimation unit 108 or the next speaker estimation unit 108A. A configuration in which the respiratory motion measurement device 202 has a band-type respiration sensor is described here. The band-type sensor outputs a value indicating the depth of breathing according to how strongly the band is stretched. The deeper the inhalation, the more the band stretches; the deeper the exhalation, the more the band contracts (the less it is stretched). This value is hereafter called the RSP value. The RSP value takes a different magnitude for each user $P_a$ depending on the strength of the band's stretch. To remove this per-user difference, the RSP values are normalized for each user $P_a$ using the mean $\mu_a$ and the standard deviation $\delta_a$ of that user's RSP values, so that $\mu_a + \delta_a$ maps to 1 and $\mu_a - \delta_a$ maps to $-1$. This makes it possible to analyze the breathing data of all users $P_a$ in the same way. Each respiratory motion measurement device 202 sends the normalized RSP value to the next speaker estimation unit 108 or the next speaker estimation unit 108A as the respiration information $B_{a,t}$.
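Mapping $\mu_a + \delta_a$ to 1 and $\mu_a - \delta_a$ to $-1$ with a linear transform amounts to the usual standardization $(R - \mu_a)/\delta_a$. The sketch below shows this for one user's RSP series. The function name, the use of the population standard deviation, and the handling of a constant signal are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def normalize_rsp(values):
    """Normalize one user's raw RSP values so that mu+delta -> 1 and mu-delta -> -1."""
    mu = mean(values)
    delta = pstdev(values)  # assumption: population standard deviation
    if delta == 0:
        # Assumption: a flat signal carries no breathing information.
        return [0.0 for _ in values]
    return [(v - mu) / delta for v in values]
```

In a real-time setting the mean and standard deviation would have to come from a calibration period or a running estimate; the text does not say which, so this sketch simply uses the whole series.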

In addition, the microphone 101 acquires the voice of user $P_a$, obtains a voice signal $V_{a,t}$ representing the voice of user $P_a$ at each discrete time $t$, and outputs it to the next speaker estimation unit 108 or the next speaker estimation unit 108A. The next speaker estimation unit 108 or 108A removes noise from the input voice signals $V_{a,t}$ ($a = 1, \ldots, A$) and extracts the speech segments $U_k$ (where $k$ is the identifier of the speech segment) and their speakers $P_{u_k}$, where the subscript $u_k$ takes a value in $1, \ldots, A$. One speech segment $U_k$ is defined as a segment bounded by silent intervals that last Td [ms], and this segment is treated as one unit of speech. The next speaker estimation unit 108 or 108A thereby obtains speech segment information representing each speech segment $U_k$ and speaker information indicating which of the users $P_1, \ldots, P_A$ is the speaker $P_{u_k}$ of that segment.
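The definition of a speech segment as a stretch bounded by Td [ms] of continuous silence can be sketched as a grouping of per-frame voice activity flags. The frame duration, the value of Td, and the function name are assumptions; the text gives no numeric values.

```python
def speech_segments(voiced, frame_ms=10, td_ms=200):
    """Group per-frame voice activity into speech segments.

    voiced: list of booleans, one per frame, for a single user.
    A segment is closed once at least td_ms of continuous silence follows it.
    Returns a list of (first_frame, last_frame) index pairs, both inclusive.
    """
    min_gap = td_ms // frame_ms  # silent frames required to end a segment
    segments = []
    start = None
    last_voiced = None
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_gap:
            segments.append((start, last_voiced))
            start = None
    if start is not None:
        # Simplification: a segment still open at the end of the data is closed there.
        segments.append((start, last_voiced))
    return segments
```

Pauses shorter than Td stay inside a segment, so a brief hesitation does not split one utterance into two units.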

The next speaker estimation unit 108 or the next speaker estimation unit 108A uses the respiration information B_{a,t} of each user P_a to extract the inhalation section I_{a,k} of each user P_a, and further obtains a parameter λ_{a,k} relating to the inhalation. An inhalation section is the section between the start position, at which the user switches from exhaling to inhaling, and the end position, at which the user finishes inhaling.

FIG. 12 is a diagram showing an example of an inhalation section. A method of calculating the inhalation section I_{a,k} is illustrated using FIG. 12. Here, the RSP value of the user P_a at discrete time t is written R_{a,t}. The RSP value R_{a,t} corresponds to the respiration information B_{a,t}. As illustrated in FIG. 12, for example, when the following (Expression 1) holds,

the RSP value R_{a,t} decreases continuously over the two frames before the discrete time t = t_s(k) and then rises continuously over the following two frames, so the discrete time t_s(k) is taken as the start position of the inhalation. Further, when the following (Expression 2) holds,

the RSP value R_{a,t} rises continuously over the two frames before the discrete time t = t_e(k) and then decreases continuously over the following two frames, so the discrete time t_e(k) is taken as the end position of the inhalation. In this case, the inhalation section I_{a,k} of the user P_a is the section from t_s(k) to t_e(k), and the length of the inhalation section is t_e(k) − t_s(k).
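The start and end conditions described in words above can be sketched as a scan for local minima and maxima over a two-frame neighbourhood. This is an illustration based only on the verbal description (two falling frames then two rising frames for the start, and the reverse for the end); the exact form of (Expression 1) and (Expression 2) is not reproduced here, and the function name and sample data are hypothetical.

```python
def find_inhalation_sections(r):
    """r: list of normalized RSP values R_{a,t} of one user.

    Returns a list of (t_s, t_e) frame-index pairs, one per inhalation section.
    """
    sections = []
    t_s = None  # start position of the inhalation currently being tracked
    for t in range(2, len(r) - 2):
        falling_before = r[t - 2] > r[t - 1] > r[t]
        rising_after = r[t] < r[t + 1] < r[t + 2]
        rising_before = r[t - 2] < r[t - 1] < r[t]
        falling_after = r[t] > r[t + 1] > r[t + 2]
        if falling_before and rising_after:
            t_s = t                    # local minimum: inhalation starts
        elif rising_before and falling_after and t_s is not None:
            sections.append((t_s, t))  # local maximum: inhalation ends
            t_s = None
    return sections

# Hypothetical RSP trace: exhale, inhale from frame 3 to frame 7, exhale again.
rsp = [0.5, 0.2, -0.1, -0.4, -0.2, 0.1, 0.5, 0.9, 0.6, 0.3, 0.0]
inhalations = find_inhalation_sections(rsp)
```

For this trace the scan finds a single inhalation section from frame 3 to frame 7, whose length t_e(k) − t_s(k) is 4 frames.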

Once the inhalation section I_{a,k} has been extracted, the next speaker estimation unit 108 or the next speaker estimation unit 108A extracts a parameter λ'_{a,k} relating to the inhalation, using at least some of the inhalation section I_{a,k}, the respiration information B_{a,t}, and the utterance section U_k. The parameter λ'_{a,k} represents at least some of the following: the amount of breath inhaled by the user P_a in the inhalation section I_{a,k}, the length of the inhalation section I_{a,k}, the temporal change of the amount of breath inhaled in the inhalation section I_{a,k}, and the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}. The parameter λ'_{a,k} may represent only one of these, several of them, or all of them. The parameter λ'_{a,k} includes, for example, at least some of the following parameters MIN_{a,k}, MAX_{a,k}, AMP_{a,k}, DUR_{a,k}, SLO_{a,k}, and INT1_{a,k}. The parameter λ'_{a,k} may include only one of these, several of them, or all of them.
・MIN_{a,k}: the RSP value R_{a,t} at the start of the inhalation of the user P_a, that is, the minimum of the RSP values R_{a,t} in the inhalation section I_{a,k}.
・MAX_{a,k}: the RSP value R_{a,t} at the end of the inhalation of the user P_a, that is, the maximum of the RSP values R_{a,t} in the inhalation section I_{a,k}.
・AMP_{a,k}: the amplitude of the RSP values R_{a,t} in the inhalation section I_{a,k} of the user P_a, that is, the value calculated as MAX_{a,k} − MIN_{a,k}. It represents the amount of breath inhaled in the inhalation section I_{a,k}.
・DUR_{a,k}: the length of the inhalation section I_{a,k} of the user P_a, that is, the value t_e(k) − t_s(k) obtained by subtracting the discrete time t_s(k) of the start position from the discrete time t_e(k) of the end position of the inhalation section I_{a,k}.
・SLO_{a,k}: the average slope per unit time of the RSP value R_{a,t} in the inhalation section I_{a,k} of the user P_a, that is, the value calculated as AMP_{a,k} / DUR_{a,k}. It represents the temporal change of the amount of breath inhaled in the inhalation section I_{a,k}.
・INT1_{a,k}: the interval from the end time t_{ue(k)} of the preceding utterance section U_k (the end of the utterance section) until the inhalation of the user P_a starts, that is, the value t_s(k) − t_{ue(k)} obtained by subtracting the end time t_{ue(k)} of the utterance section U_k from the discrete time t_s(k) of the start position of the inhalation section I_{a,k}. It represents the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}.
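The parameters listed above follow directly from the endpoints of an inhalation section, so they can be sketched in a few lines. The function name, the dictionary layout, and the sample values are hypothetical, and all times are expressed in frame indices for simplicity.

```python
def inhalation_parameters(r, t_s, t_e, t_ue):
    """Compute MIN, MAX, AMP, DUR, SLO and INT1 for one inhalation section.

    r    : normalized RSP values R_{a,t} of user P_a
    t_s  : start position t_s(k) of the inhalation section
    t_e  : end position t_e(k) of the inhalation section
    t_ue : end time t_{ue(k)} of the preceding utterance section U_k
    """
    r_min = r[t_s]        # RSP value when the inhalation starts
    r_max = r[t_e]        # RSP value when the inhalation ends
    amp = r_max - r_min   # amount of breath inhaled
    dur = t_e - t_s       # length of the inhalation section
    return {
        "MIN": r_min,
        "MAX": r_max,
        "AMP": amp,
        "DUR": dur,
        "SLO": amp / dur,      # average slope per unit time
        "INT1": t_s - t_ue,    # gap between utterance end and inhalation start
    }

# Hypothetical trace with an inhalation from frame 3 to frame 7,
# following an utterance that ended at frame 1.
rsp = [0.5, 0.2, -0.1, -0.4, -0.2, 0.1, 0.5, 0.9, 0.6, 0.3, 0.0]
params = inhalation_parameters(rsp, t_s=3, t_e=7, t_ue=1)
```

INT2_{a,k} would be obtained the same way once the start time of the next utterance section is known.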

The next speaker estimation unit 108 or the next speaker estimation unit 108A may further generate the following parameter INT2_{a,k}.
・INT2_{a,k}: the interval from the end of the inhalation of the user P_a until the utterance section U_{k+1} of the next speaker starts, that is, the value t_{us(k+1)} − t_e(k) obtained by subtracting the discrete time t_e(k) of the end position of the inhalation section I_{a,k} from the start time t_{us(k+1)} of the utterance section U_{k+1} of the next speaker. It represents the temporal relationship between the utterance section U_{k+1} and the inhalation section I_{a,k}.
The parameter λ'_{a,k} with INT2_{a,k} added is written as the parameter λ_{a,k}.

For example, after information representing the utterance section U_{k+1} has been obtained and the parameter λ_{a,k} has also been obtained (that is, after the utterance section U_{k+1} has started), the next speaker estimation unit 108 or the next speaker estimation unit 108A records the parameter λ_{a,k} in a database together with information representing the utterance section U_k and its speaker P_{uk}, the utterance section U_{k+1} and its speaker P_{uk+1}, and the utterance start timing T_{uk+1}. The utterance timing of the next speaker P_{uk+1} may be any time point in the utterance section U_{k+1} or a time point corresponding to it. The utterance timing T_{uk+1} may be the start time t_{us(k+1)} of the utterance section U_{k+1}, the time t_{us(k+1)} + γ (where γ is a positive or negative constant), the end time t_{ue(k+1)} of the utterance section U_{k+1}, the time t_{ue(k+1)} + γ, or the center time t_{us(k+1)} + (t_{ue(k+1)} − t_{us(k+1)})/2 of the utterance section U_{k+1}. Some or all of the information representing λ_{a,k}, U_k, P_{uk}, P_{uk+1}, and T_{uk+1} is held in the database and is used by the next speaker estimation unit 108 or the next speaker estimation unit 108A to predict the next speaker after the utterance section U_{k+1} and that speaker's utterance timing.
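As an illustration of the alternatives for the utterance timing T_{uk+1} and of what one database entry might hold, the following sketch uses a hypothetical record type. The class name, field names, and mode strings are assumptions; the description only lists which pieces of information are stored and which time points may serve as the timing.

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    """One database entry for utterance section U_k (field names are hypothetical)."""
    params: dict        # lambda_{a,k}, keyed by user index a
    section: tuple      # (t_us(k), t_ue(k)) of U_k
    speaker: int        # u_k
    next_speaker: int   # u_{k+1}
    next_timing: float  # T_{uk+1}

def utterance_timing(t_us, t_ue, mode="start", gamma=0.0):
    """Pick T_{uk+1} from the start/end times of U_{k+1}.

    mode "start"  -> t_us + gamma
    mode "end"    -> t_ue + gamma
    mode "center" -> t_us + (t_ue - t_us) / 2
    """
    if mode == "start":
        return t_us + gamma
    if mode == "end":
        return t_ue + gamma
    if mode == "center":
        return t_us + (t_ue - t_us) / 2
    raise ValueError(f"unknown mode: {mode}")

record = TurnRecord(
    params={1: {"AMP": 1.3, "DUR": 4}},
    section=(2.0, 9.0),
    speaker=1,
    next_speaker=2,
    next_timing=utterance_timing(10.0, 14.0, mode="center"),
)
```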

Based on at least some of the speaker information P_{uk}, the utterance section U_k, the amount of breath inhaled by the user P_a in the inhalation section I_{a,k}, the length of the inhalation section I_{a,k}, the temporal change of the amount of breath inhaled in the inhalation section I_{a,k}, and the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}, the next speaker estimation unit 108 or the next speaker estimation unit 108A obtains estimation information representing at least one of which of the users P_1, ..., P_A is the next speaker P_{uk+1} and the utterance timing of the next speaker P_{uk+1}. Here, the subscript "uk+1" of "P_{uk+1}" denotes u_{k+1}. When the speaker P_{uk} of the utterance section U_k also speaks in the utterance section U_{k+1} (utterance continuation), the next speaker is the same as the speaker P_{uk} of the utterance section U_k. On the other hand, when a user other than the speaker P_{uk} of the utterance section U_k speaks in the utterance section U_{k+1} (that is, turn-taking), the next speaker is a user other than the speaker P_{uk} of the utterance section U_k.

The next speaker estimation unit 108 or the next speaker estimation unit 108A machine-learns a model for obtaining estimation information for a feature quantity f_{a,k} corresponding to at least some of the speaker information P_{uk}, the utterance section U_k, the amount of breath inhaled by the user P_a in the inhalation section I_{a,k}, the length of the inhalation section I_{a,k}, the temporal change of the amount of breath inhaled in the inhalation section I_{a,k}, and the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}, and uses this model to obtain the estimation information for the feature quantity. The feature quantity f_{a,k} may correspond to only one of these items, to several of them, or to all of them. For the machine learning of the model, for example, feature quantities f_{a,k} corresponding to at least some of the amount of breath inhaled in a past inhalation section I_{a,i} (where i < k), the length of the inhalation section I_{a,i}, the temporal change of the amount of breath inhaled in the inhalation section I_{a,i}, and the temporal relationship between the utterance section U_i and the inhalation section I_{a,i}, together with information on the utterance sections U_i, U_{i+1} and their speakers P_{uk}, P_{uk+1}, are used as learning data.

The next speaker / utterance timing estimation processing performed by the next speaker estimation unit 108 or the next speaker estimation unit 108A is illustrated below. In this example, a next speaker estimation model, which is a model for estimating the next speaker P_{uk+1}, and an utterance timing estimation model, which is a model for estimating the utterance timing of the next speaker P_{uk+1}, are generated, and the next speaker P_{uk+1} and the utterance timing are estimated using the respective models.

When learning the next speaker estimation model, the next speaker estimation unit 108 or the next speaker estimation unit 108A reads from the database, as learning data, at least some of the past parameters λ_{a,i} (where a = 1, ..., A and i < k) and information representing the utterance sections U_i, U_{i+1} and their speakers P_{ui}, P_{ui+1}. The next speaker estimation unit 108 or the next speaker estimation unit 108A machine-learns the next speaker estimation model using, as learning data, the feature quantities F1_{a,i} corresponding to at least some of the parameters λ_{a,i}, together with U_i, U_{i+1}, P_{ui}, and P_{ui+1}. For the next speaker estimation model, for example, an SVM (Support Vector Machine), a GMM (Gaussian Mixture Model), or an HMM (Hidden Markov Model) can be used.
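One plausible way to arrange the learning data for the next speaker estimation model is sketched below: each past utterance section contributes one example per user, with a feature vector F1_{a,i} taken from λ_{a,i} and a binary label indicating whether that user became the next speaker. The record layout, the choice of features, and the binary labeling are assumptions made for illustration; the description does not fix them. The resulting pairs could then be passed to any classifier of the kinds mentioned above.

```python
def build_training_examples(records, feature_names=("AMP", "DUR", "SLO", "INT1")):
    """Turn past database records into (feature vector, label) pairs.

    records: list of dicts with
        "params"       -> {user index a: {parameter name: value}}  (lambda_{a,i})
        "next_speaker" -> user index u_{i+1}
    Label is 1 if user a spoke in U_{i+1}, else 0 (a hypothetical labeling scheme).
    """
    features, labels = [], []
    for rec in records:
        for user, params in rec["params"].items():
            features.append([params[name] for name in feature_names])
            labels.append(1 if user == rec["next_speaker"] else 0)
    return features, labels

# Hypothetical record for one past utterance section with two listeners.
past_records = [
    {
        "params": {
            1: {"AMP": 1.3, "DUR": 4, "SLO": 0.325, "INT1": 2},
            2: {"AMP": 0.4, "DUR": 2, "SLO": 0.2, "INT1": 5},
        },
        "next_speaker": 2,
    }
]
X, y = build_training_examples(past_records)
```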

The next speaker estimation unit 108 or the next speaker estimation unit 108A applies the feature quantity F1_{a,k} corresponding to at least some of the parameter λ'_{a,k} to the next speaker estimation model, and treats the information representing the next speaker P_{uk+1} estimated in this way as part of the "estimation information". The information representing the next speaker P_{uk+1} may identify one of the users P_a deterministically or may represent the next speaker probabilistically. The probability that the user P_a becomes the next speaker is denoted P1_a.

When learning the utterance timing estimation model, the next speaker estimation unit 108 or the next speaker estimation unit 108A reads from the database, as learning data, at least some of the past parameters λ_{a,i} (where a = 1, ..., A and i < k), the utterance sections U_i, U_{i+1} and their speakers P_{ui}, P_{ui+1}, and information representing the utterance start timing T_{ui+1} of the utterance section U_{i+1}. The next speaker estimation unit 108 or the next speaker estimation unit 108A machine-learns the utterance timing estimation model using, as learning data, the feature quantities F2_{a,i} corresponding to at least some of the parameters λ_{a,i}, together with U_i, U_{i+1}, P_{ui}, P_{ui+1}, and T_{ui+1}. For the next speaker estimation model, for example, an SVM, a GMM, or an HMM can be used.

When the speaker P_{uk}, at least some of the parameter λ'_{a,k}, and the next speaker P_{uk+1} estimated by the next speaker estimation model have been obtained, the next speaker estimation unit 108 or the next speaker estimation unit 108A applies the feature quantity F2_{a,k} corresponding to at least some of the parameter λ'_{a,k} to the utterance timing estimation model. The next speaker estimation unit 108 or the next speaker estimation unit 108A outputs, as part of the "estimation information", information representing the utterance timing T_{uk+1} of the next utterance section U_{k+1} (for example, the start time of the utterance section U_{k+1}) estimated by applying the feature quantity F2_{a,k} to the utterance timing estimation model. The information representing the utterance timing may represent a particular utterance timing deterministically or may represent it probabilistically. The probability that the user P_a starts speaking at time t (the probability that time t is the utterance timing of the user P_a) is denoted P2_a(t).
The next speaker probability P^{ns}_i(t) of the user i at time t, estimated by the next speaker estimation unit 108 or the next speaker estimation unit 108A of the embodiment described above, is calculated as probability P1_a × probability P2_a(t) when the user i is the user P_a in this next speaker estimation technique.
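The product P1_a × P2_a(t) can be written out as a short sketch. The function name and the dictionary representation of the two model outputs are hypothetical, and the probability values are made-up inputs rather than results.

```python
def next_speaker_probability(p1, p2_at_t):
    """Combine the two model outputs into P^{ns}_a(t) = P1_a * P2_a(t).

    p1      : {user index a: probability that P_a is the next speaker}
    p2_at_t : {user index a: probability that P_a starts speaking at time t}
    """
    return {a: p1[a] * p2_at_t[a] for a in p1}

# Hypothetical model outputs for two users at one time t.
p1 = {1: 0.5, 2: 0.25}
p2_at_t = {1: 0.5, 2: 0.5}
p_ns = next_speaker_probability(p1, p2_at_t)
```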

The next speaker estimation unit 108 or the next speaker estimation unit 108A described above estimates the user who will start speaking next and the timing based on observed values of respiratory motion, but observed values of gaze may additionally be used.
When gaze behavior is additionally used, a gaze target detection device 203 is further attached to each user P_a (where a = 1, ..., A). The gaze target detection device 203 detects whom the user P_a is gazing at (the gaze target), and sends information representing the user P_a and the gaze target G_{a,t} at each discrete time t to the next speaker estimation unit 108 or the next speaker estimation unit 108A. The next speaker estimation unit 108 or the next speaker estimation unit 108A takes as input the gaze target information G_{1,t}, ..., G_{A,t}, the utterance section U_k, and the speaker information P_{uk}, and generates gaze target label information θ_{v,k} (where v = 1, ..., V and V is the total number of gaze target labels) for the period around the end of the utterance section. The gaze target label information is information representing the users' gaze targets in a time interval corresponding to the end time T_se of the utterance section U_k. Here, gaze target label information θ_{v,k} obtained by labeling the gaze targets of the users P_a in a finite time interval containing the end time T_se is illustrated. In this case, for example, the gaze behavior that appears in the interval from the time T_se − T_b, before the end time T_se of the utterance section U_k, to the time T_se + T_a, after the end time T_se, is handled. T_b and T_a may be any values of 0 or more; as a guide, it is appropriate to set T_b to about 0 to 2.0 seconds and T_a to about 0 to 3.0 seconds.

The next speaker estimation unit 108 or the next speaker estimation unit 108A classifies the users who are gaze targets into the following types and labels the gaze targets. The label symbols themselves have no meaning, and any notation may be used as long as the labels can be distinguished.
・Label S: the current speaker (that is, the user P_{uk} who is the current speaker)
・Label L_ξ: a non-speaker (where ξ identifies mutually different non-speaker users, and ξ = 1, ..., A−1. For example, when a certain user gazes at the non-speaker P_2 and then at the non-speaker P_3, in that order, the label L_1 is assigned to the non-speaker P_2 and the label L_2 is assigned to the non-speaker P_3.)
・Label X: looking at no one

When the label is S or L_ξ, information on whether mutual gaze (eye contact) has occurred is added. In this embodiment, when mutual gaze has occurred, an M label is appended to the end of the label S or L_ξ, as in S_M and L_{ξM} (the subscript "ξM" denotes ξ_M).

FIG. 13 is a diagram showing a specific example of gaze target labels. FIG. 13 is an example with A = 4, in which the utterance sections U_k and U_{k+1} and the gaze targets of each user are shown in time series. The example of FIG. 13 shows the situation in which, after the user P_1 has spoken, turn-taking occurs and the user P_2 newly speaks. Here, the user P_1, who is the current speaker, gazes at the user P_4 and then gazes at the user P_2. In the interval from T_se − T_b to T_se + T_a, while the user P_1 is looking at the user P_2, the user P_2 is looking at the user P_1. This means that mutual gaze is occurring between the user P_1 and the user P_2. In this case, two gaze target labels, L_1 and L_{2M}, are generated from the gaze target information G_{1,t} of the user P_1. In the above interval, the user P_2 gazes at the user P_4 and then gazes at the user P_1, who is the current speaker. In this case, the user P_2 has two gaze target labels, L_1 and S_M. Also in the above interval, the user P_3 gazes at the user P_1, who is the current speaker. In this case, the gaze target label of the user P_3 is S. Also in the above interval, the user P_4 is looking at no one. In this case, the gaze target label of the user P_4 is X. Therefore, V = 6 in the example of FIG. 13.
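The labeling rules and the FIG. 13 example can be reproduced with a small sketch. It assumes the gaze target information is available as one target per frame for each user within the analysis window, and it marks a label with M if mutual gaze occurs in at least one frame of that gaze; both the data layout and this simplification are assumptions, as are the function name and the frame values chosen to mimic FIG. 13.

```python
def gaze_labels(gaze, speaker):
    """Label each user's gaze targets with S, L<xi>, X and an M suffix for mutual gaze.

    gaze    : {user: [gaze target per frame, or None when looking at no one]}
    speaker : index of the current speaker
    Returns {user: [label, ...]} in temporal order.
    """
    labels = {}
    for user, targets in gaze.items():
        runs = []  # [target, mutual_flag] for each run of identical consecutive targets
        for t, target in enumerate(targets):
            mutual = target is not None and gaze[target][t] == user
            if runs and runs[-1][0] == target:
                runs[-1][1] = runs[-1][1] or mutual
            else:
                runs.append([target, mutual])
        non_speaker_ids = {}  # target -> xi, numbered in order of first appearance
        out = []
        for target, mutual in runs:
            if target is None:
                out.append("X")
                continue
            if target == speaker:
                label = "S"
            else:
                xi = non_speaker_ids.setdefault(target, len(non_speaker_ids) + 1)
                label = f"L{xi}"
            out.append(label + ("M" if mutual else ""))
        labels[user] = out
    return labels

# Frames inside the window [T_se - T_b, T_se + T_a], arranged to mimic FIG. 13.
gaze = {
    1: [4, 4, 2, 2, 2, 2],     # current speaker: looks at P_4, then P_2
    2: [4, 4, 4, 1, 1, 1],     # looks at P_4, then the speaker P_1
    3: [1, 1, 1, 1, 1, 1],     # looks at the speaker throughout
    4: [None] * 6,             # looks at no one
}
fig13_labels = gaze_labels(gaze, speaker=1)
total_labels = sum(len(v) for v in fig13_labels.values())
```

With these frames the sketch yields L1 and L2M for P_1, L1 and SM for P_2, S for P_3, and X for P_4, six labels in total, matching V = 6.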

The next speaker estimation unit 108 or the next speaker estimation unit 108A also acquires the start time and the end time of each gaze target label. Here, R_GL is defined as the symbol indicating whose (R ∈ {S, L}) gaze target label (GL ∈ {S, S_M, L_1, L_{1M}, L_2, L_{2M}, ...}) it is, its start time is defined as ST_R_GL, and its end time is defined as ET_R_GL. R represents the speaking state of the user (current speaker or non-speaker); S is the current speaker and L is a non-speaker. For example, in the example of FIG. 13, the first gaze target label of the user P_1 is S_L1, its start time is ST_S_L1, and its end time is ET_S_L1. The gaze target label information θ_{v,k} is information including the gaze target label R_GL, the start time ST_R_GL, and the end time ET_R_GL.

The next speaker estimation unit 108 or the next speaker estimation unit 108A uses the gaze target label information θ_{v,k} to generate the gaze target transition pattern E_{a,k} of each user P_a. The gaze target transition pattern is generated as a transition n-gram that takes the gaze target labels R_GL as its elements and reflects their temporal order. Here, n is a positive integer. For example, in the example of FIG. 13, the gaze target transition pattern E_{1,k} generated from the gaze target labels of the user P_1 is L_1-L_{2M}. Similarly, the gaze target transition pattern E_{2,k} of the user P_2 is L_1-S_M, the gaze target transition pattern E_{3,k} of the user P_3 is S, and the gaze target transition pattern E_{4,k} of the user P_4 is X.
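A transition n-gram over a label sequence can be sketched as a sliding window of length n whose elements are joined in temporal order. The function name, the "-" separator, and the plain-string label spelling (for example "L2M" for L_{2M}) are assumptions made for illustration.

```python
def transition_ngrams(labels, n):
    """All length-n windows over the label sequence, each joined with '-'."""
    return ["-".join(labels[i:i + n]) for i in range(len(labels) - n + 1)]

# Label sequences of the FIG. 13 example, written as plain strings.
pattern_p1 = transition_ngrams(["L1", "L2M"], n=2)
pattern_p2 = transition_ngrams(["L1", "SM"], n=2)
pattern_p3 = transition_ngrams(["S"], n=1)
pattern_p4 = transition_ngrams(["X"], n=1)
```

For P_1 and P_2 the 2-gram covers the whole sequence, while P_3 and P_4 have a single label and therefore a 1-gram.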

The gaze target transition pattern E_{a,k} is sent to the database, for example after the utterance section U_{k+1} has started, together with information representing the utterance section U_k and its speaker P_{uk}, the next speaker P_{uk+1} who makes the utterance corresponding to the utterance section U_{k+1}, and the next utterance start timing T_{uk+1}. In the database, the gaze target transition pattern E_{a,k} is merged with the parameter λ_{a,k}, and some or all of the information representing E_{a,k}, λ_{a,k}, U_k, P_{uk}, P_{uk+1}, and T_{uk+1} is held in the database.

The next speaker estimation unit 108 or the next speaker estimation unit 108A takes the gaze target label information θ_{v,k} as input and generates temporal structure information Θ_{v,k} for each gaze target label. The temporal structure information is information representing the temporal relationships of the users' gaze behavior, and has as parameters (1) the time length of a gaze target label, (2) the interval between a gaze target label and the start time or end time of the utterance section, and (3) the interval between the start time or end time of a gaze target label and the start time or end time of another gaze target label.

Specific parameters of the temporal structure information are shown below. In the following, the start time of the utterance section is defined as ST_U, and the end time of the utterance section is defined as ET_U.
・INT1 (= ET_R_GL − ST_R_GL): the interval between the start time ST_R_GL and the end time ET_R_GL of the gaze target label R_GL
・INT2 (= ST_U − ST_R_GL): how long before the start time ST_U of the utterance section the start time ST_R_GL of the gaze target label R_GL was
・INT3 (= ET_U − ST_R_GL): how long before the end time ET_U of the utterance section the start time ST_R_GL of the gaze target label R_GL was
・INT4 (= ET_R_GL − ST_U): how long after the start time ST_U of the utterance section the end time ET_R_GL of the gaze target label R_GL was
・INT5 (= ET_U − ET_R_GL): how long before the end time ET_U of the utterance section the end time ET_R_GL of the gaze target label R_GL was
・INT6 (= ST_R_GL − ST_R_GL'): how long after the start time ST_R_GL' of another gaze target label R_GL' the start time ST_R_GL of the gaze target label R_GL was
・INT7 (= ET_R_GL' − ST_R_GL): how long before the end time ET_R_GL' of another gaze target label R_GL' the start time ST_R_GL of the gaze target label R_GL was
・INT8 (= ET_R_GL − ST_R_GL'): how long after the start time ST_R_GL' of the gaze target label R_GL' the end time ET_R_GL of the gaze target label R_GL was
・INT9 (= ET_R_GL − ET_R_GL'): how long after the end time ET_R_GL' of the gaze target label R_GL' the end time ET_R_GL of the gaze target label R_GL was

INT6 to INT9 are acquired for the combinations with the gaze target labels of all users. In the example of FIG. 13, there are six pieces of gaze target label information in total (L_1, L_{2M}, L_1, S_M, S, X), so 6 × 5 = 30 data items are generated for each of INT6 to INT9.

The time structure information Θ_{v,k} consists of the parameters INT1 to INT9 for the gaze target label information θ_{v,k}. Each of these parameters is shown concretely with reference to FIG. 14. FIG. 14 is a diagram showing the time structure information for the gaze target label L1 of user P1 (R = S), who is the current speaker. That is, it is the time structure information for R_GL = S_L1. For INT6 to INT9, to simplify the illustration, only the relationship with the gaze target label L1 of user P2, that is, R_GL = L_L1, is shown. In the example of FIG. 14, INT1 to INT9 are obtained as follows.
・INT1 = ET_S_L1 - ST_S_L1
・INT2 = ST_U - ST_S_L1
・INT3 = ET_U - ST_S_L1
・INT4 = ET_S_L1 - ST_U
・INT5 = ET_U - ET_S_L1
・INT6 = ST_S_L1 - ST_L_L1
・INT7 = ET_L_L1 - ST_S_L1
・INT8 = ET_S_L1 - ST_L_L1
・INT9 = ET_S_L1 - ET_L_L1
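The nine parameters above are simple differences of interval endpoints, so they can be sketched in a few lines. The following is an illustrative sketch only: the `Interval` class, the `time_structure` function, and the numeric times are assumptions introduced here and are not taken from FIG. 14.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Interval:
    start: float  # ST: start time in seconds
    end: float    # ET: end time in seconds


def time_structure(label: Interval, utterance: Interval, other: Interval) -> Dict[str, float]:
    """Compute INT1 to INT9 for one gaze target label (e.g. S_L1) against the
    utterance section U and one other gaze target label (e.g. L_L1)."""
    return {
        "INT1": label.end - label.start,
        "INT2": utterance.start - label.start,
        "INT3": utterance.end - label.start,
        "INT4": label.end - utterance.start,
        "INT5": utterance.end - label.end,
        "INT6": label.start - other.start,
        "INT7": other.end - label.start,
        "INT8": label.end - other.start,
        "INT9": label.end - other.end,
    }


# Hypothetical times (assumption): utterance 0.0-5.0 s, S_L1 1.0-3.0 s, L_L1 2.0-4.5 s.
params = time_structure(Interval(1.0, 3.0), Interval(0.0, 5.0), Interval(2.0, 4.5))
print(params)
```

A negative value simply means the ordering is the reverse of the one the parameter describes; for instance, a negative INT2 indicates that the gaze target label started after the utterance section began.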

For example, after the utterance section U_{k+1} has started, the time structure information Θ_{v,k} is sent to the database together with information representing the utterance section U_k and its speaker P_{uk}, the next speaker P_{uk+1} who makes the utterance corresponding to the utterance section U_{k+1}, and the next utterance start timing T_{uk+1}. In the database, the time structure information Θ_{v,k} is merged with the parameter λ_{a,k}, and part or all of the information representing Θ_{v,k}, λ_{a,k}, U_k, P_{uk}, U_{k+1}, P_{uk+1}, and T_{uk+1} is held in the database.

The next speaker estimation unit 108 or the next speaker estimation unit 108A machine-learns a model for obtaining estimation information from a feature quantity f_{a,k}. The feature quantity f_{a,k} corresponds to at least some of the following: the gaze target transition pattern E_{a,k}, the time structure information Θ_{v,k}, the speaker information P_{uk}, the utterance section U_k, the amount of breath inhaled by user P_a in the inhalation section I_{a,k}, the length of the inhalation section I_{a,k}, the temporal change in the amount of breath inhaled in the inhalation section I_{a,k}, and the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}. Using the model, the unit obtains and outputs the next speaker probability P^ns_i(t), which is the estimation information for the feature quantity.

The next speaker estimation unit 108 or the next speaker estimation unit 108A described above estimates the user who will start speaking next, and the timing, based on observed values of respiratory motion and of the line of sight. It may additionally use information on the movement of the user's head. This exploits the tendency of people to nod deeply just before speaking. The next speaker estimation unit 108 or the next speaker estimation unit 108A analyzes the image data of each user from the video input unit 102 and determines whether the user nodded according to whether the head moved up and down. When it determines that user i nodded several seconds before time t, it performs processing such as adding a predetermined value to the next speaker probability P^ns_i(t) of user i at time t. The next speaker estimation unit 108 or the next speaker estimation unit 108A may also calculate the next speaker probability P^ns_i(t) based on at least one of the observed value of respiratory motion, the observed value of the line of sight, and the information on the movement of the user's head.
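The nod-based adjustment described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function name, the 3-second window standing in for "several seconds", the bonus of 0.1 standing in for the "predetermined value", and the clipping of the result to 1.0 are all choices made here, not values given in the text.

```python
from typing import Iterable


def apply_nod_bonus(prob: float, nod_times: Iterable[float], t: float,
                    window: float = 3.0, bonus: float = 0.1) -> float:
    """Add a fixed bonus to the next speaker probability of a user at time t
    if that user nodded within `window` seconds before t.

    window and bonus are assumed values; clipping to 1.0 is also an assumption.
    """
    nodded_recently = any(0.0 < t - nod_time <= window for nod_time in nod_times)
    if not nodded_recently:
        return prob
    return min(1.0, prob + bonus)


# A user with probability 0.5 who nodded 1 second before t = 10.0 (hypothetical values).
adjusted = apply_nod_bonus(0.5, [9.0], 10.0, window=3.0, bonus=0.25)
print(adjusted)
```

A nod detected outside the window, or after time t, leaves the probability unchanged in this sketch.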

When the next speaker estimation unit 108 or the next speaker estimation unit 108A uses at least one of the observed value of respiratory motion, the observed value of the line of sight, and the information on the movement of the user's head, the sensor 103 may include any one or more of the position measurement device 201, the respiratory motion measurement device 202, the gaze target detection device 203, and the head motion detection device 204, depending on the information used by the next speaker estimation unit 108 or the next speaker estimation unit 108A.

The robot 100 in the first embodiment and the robot 100A in the second embodiment have been described as incorporating the microphone 101, the camera 102, the sensor 103, the voice input unit 104, the video input unit 105, the sensor input unit 106, the utterance section detection unit 107, the next speaker estimation unit 108 or the next speaker estimation unit 108A, and the speech control unit 109 or the control unit 109A, but the configuration is not limited to this. A speech control device including the microphone 101, the camera 102, the sensor 103, the voice input unit 104, the video input unit 105, the sensor input unit 106, the utterance section detection unit 107, the next speaker estimation unit 108 (or the next speaker estimation unit 108A), and the speech control unit 109 (or the control unit 109A) may be provided as a device separate from the robot 100 (or the robot 100A). The speech control device is configured to communicate with the robot 100 (or the robot 100A) and controls the speech of the robot 100 (or the robot 100A) by transmitting control signals from the speech control unit 109 (or the control unit 109A) to the robot 100 (or the robot 100A).

The robot 100 and the robot 100A may be configured so that a part of the body is displayed on a display unit such as a display, or the whole body may be displayed on the display unit as an agent that is a virtual person. Representing a part of the body of the robot 100 or the robot 100A on a display unit means, for example, a configuration in which the entire face is a display unit and an image of a face is displayed on it. Various expressions can be produced by changing the face image displayed on the display unit. A display device that displays the agent serving as the speaker on the display unit includes, like the robot 100, the microphone 101, the camera 102, the sensor 103, the voice input unit 104, the video input unit 105, the sensor input unit 106, the utterance section detection unit 107, the user information acquisition unit 108, the operation control unit 109, the sound control unit 110, the mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, the torso control unit 114, and the speaker 115. The agent is, for example, a person with a face that has a mouth part including a mouth and an eye part including eyes, and with a torso that has hands, arms, and legs below the head including the face. The display device further includes an image processing unit that moves the mouth, line of sight, head, and torso (including hands, arms, legs, and the like) of the agent being displayed on the display unit in accordance with control signals from the mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, and the torso control unit 114. Note that the robot 100 and the robot 100A need not themselves include the plurality of microphones 101 and sensors 103; for example, they may be configured to exchange signals, by wire or wirelessly, with a plurality of microphones 101 and sensors 103 installed outside the robot 100 and the robot 100A.

The robot 100 in the first embodiment and the robot 100A in the second embodiment may have functions of an ordinary robot other than those shown in FIG. 1 and FIG. 7, as long as these do not interfere with the speech control processing described above. For example, the robot 100 in the first embodiment may be configured to perform the same motions as a human in conversation, such as breathing motions, like the robot 100A in the second embodiment.

Each functional unit of the robot 100 or the robot 100A in the embodiments described above can be realized by, for example, a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period of time, such as volatile memory inside a computer system serving as a server or client in that case. The program may realize only some of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

Embodiments of the present invention have been described in detail above with reference to the drawings. The specific configuration is not limited to these embodiments, however, and includes designs and the like within a range that does not depart from the gist of the invention.

The present invention can be applied to the control of a robot that converses with users, and to the control of the movement of an agent (virtual person) displayed on a display device that converses with users.

51a: right eye, 51b: left eye, 52: mouth, 53: head, 54: neck, 55: torso, 100, 100A: robot, 101: microphone, 102: camera, 103: sensor, 104: voice input unit, 105: video input unit, 106: sensor input unit, 107: utterance section detection unit, 108, 108A: next speaker estimation unit, 109, 301: speech control unit, 109A: control unit, 110, 110A: sound control unit, 111: mouth control unit, 112: line-of-sight control unit, 113: head control unit, 114: torso control unit, 115: speaker (sound generation unit), 116: mouth drive unit, 117: eye drive unit, 118: head drive unit, 119: torso drive unit, 201: position measurement device, 202: respiratory motion measurement device, 203: gaze target detection device, 204: head motion detection device, 302: motion pattern information storage unit, 303: motion control unit, 304: sensor signal conversion unit, 401: speech analysis unit, 402: conversation information generation unit, 403, 403A: conversation information DB, 404, 404A: utterance information generation unit, 405: sound signal generation unit

Claims

A speech control system comprising:
a speech control unit that controls the speech of a robot that converses with a plurality of users, or the speech of a speaker displayed on a display device that converses with a plurality of users; and
a next speaker estimation unit that acquires a first next speaker probability, which is the probability that the user will be the next speaker at any given time,
wherein the speech control unit controls the speech of the robot or the speaker based on the first next speaker probability of the user acquired by the next speaker estimation unit,
the next speaker estimation unit acquires the first next speaker probability based on non-verbal behavior of the user,
the robot or the speaker is controllable so as to perform a motion corresponding to the non-verbal behavior,
the next speaker estimation unit acquires, based on information on the motion of the robot or the speaker corresponding to the non-verbal behavior, a second next speaker probability, which is the probability that the robot or the speaker will be the next speaker at the time, and
the speech control unit controls the speech of the robot or the speaker based on a comparison between the first next speaker probabilities of the plurality of users and the second next speaker probability of the robot or the speaker.

The speech control system according to claim 1, wherein the speech control unit acquires an utterance necessity degree, which is an index indicating the necessity of an utterance by the robot or the speaker according to the utterance content, and changes, according to the acquired utterance necessity degree, the value of the second next speaker probability acquired by the next speaker estimation unit.

A speech control device comprising:
a speech control unit that controls the speech of a robot that converses with a plurality of users, or the speech of a speaker displayed on a display device that converses with a plurality of users; and
a next speaker estimation unit that acquires a first next speaker probability, which is the probability that the user will be the next speaker at any given time,
wherein the speech control unit controls the speech of the robot or the speaker based on the first next speaker probability of the user acquired by the next speaker estimation unit,
the next speaker estimation unit acquires the first next speaker probability based on non-verbal behavior of the user,
the robot or the speaker is controllable so as to perform a motion corresponding to the non-verbal behavior,
the next speaker estimation unit acquires, based on information on the motion of the robot or the speaker corresponding to the non-verbal behavior, a second next speaker probability, which is the probability that the robot or the speaker will be the next speaker at the time, and
the speech control unit controls the speech of the robot or the speaker based on a comparison between the first next speaker probabilities of the plurality of users and the second next speaker probability of the robot or the speaker.

A speech control program for causing a computer to function as the speech control system according to claim 1 or 2.