JP2019217558A - Interactive system and control method for the same

Info

Publication number: JP2019217558A
Application number: JP2018114261A
Authority: JP
Inventors: Koichiro Ito (伊藤 光一郎); Takashi Matsubara (松原 孝志); Kenji Nagamatsu (永松 健司)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2019-12-26
Anticipated expiration: 2038-06-15
Also published as: JP7045938B2

Abstract

To provide an interactive robot system that, when a robot is used in an environment where a plurality of persons come and go, determines whether a person has an intention to interact or an interest in dialogue before the person gets close to the robot, and thereby narrows down the dialogue target persons in advance, and a control method for the interactive robot. SOLUTION: An interactive robot system includes: imaging equipment that takes images of its surroundings; and a calculator that detects a person from image information from the imaging equipment, tracks the detected person across a plurality of images taken by the imaging equipment, calculates the degree of interest of the tracked person on the basis of changes in the orientation of the person's face and body in the plurality of images, and treats the person as a dialogue candidate on the basis of the calculated degree of interest. SELECTED DRAWING: Figure 7

Description

The present invention relates to a dialogue system, and more particularly to a dialogue robot system that is used in an environment where a plurality of people come and go, identifies a person who may become a dialogue target, and prepares for dialogue in advance, and to a control method for the dialogue robot.

In recent years, robots that provide dialogue services to visitors on behalf of employees have been actively developed for retail stores and public facilities. In particular, when a robot is used in an environment with heavy foot traffic, it must select a dialogue target from among the many people passing around it and respond to that person.

Patent Literature 1 describes a technique in which a robot selects a dialogue target from among a plurality of persons. Patent Literature 1 defines a degree of interest for each person present in an area near the robot and selects a person with a high degree of interest as the dialogue target. The degree of interest is scored for each person according to face direction, gaze direction, and the presence or absence of gestures and utterances, and the dialogue targets are ranked according to the score.

Patent Literature 1: JP 2009-248193 A

The technique of Patent Literature 1 scores each person by face direction, gaze direction, and the presence or absence of gestures and utterances, and it works in situations where people have already gathered around the robot.

In practice, however, a dialogue robot is installed in an environment where people come and go. It needs to select a person who may become a dialogue target from among those people and actively call out to that person, or to prepare for dialogue before it is actually spoken to, for example by turning its body toward the person and capturing and recognizing the person with a camera.

A dialogue robot cannot judge whether a person may become a dialogue target until the person enters a predetermined area or actually speaks to the robot. Moreover, the robot is not able to determine, before the person actually speaks to it, whether a person who has entered the designated area will simply pass by or intends to interact.

Accordingly, an object of the present invention is to provide a dialogue robot system, and a control method for a dialogue robot, in which, when the robot is used in an environment where a plurality of people come and go, the robot determines each person's degree of interest, that is, whether the person has an intention to interact or an interest, and narrows down the persons to be dialogue targets in advance.

A representative aspect for solving the above problem includes an imaging device that images the surroundings, and a computer that detects a person from image information from the imaging device, tracks the detected person across a plurality of images from the imaging device, calculates the degree of interest of the tracked person based on changes in the person's face orientation and body orientation in the plurality of images, and treats the person as a dialogue candidate based on the calculated degree of interest.

The robot can narrow down the persons who intend to interact with it or are interested in it before they approach the robot.

FIG. 1 is a schematic diagram of the dialogue robot system.
FIG. 2 is a diagram illustrating an example hardware configuration of the dialogue robot system.
FIG. 3 is a block diagram illustrating an example functional configuration of the dialogue robot system.
FIG. 4 is a flowchart showing the specific procedure of the first estimation process 01.
FIG. 5 is a flowchart by which the robot selects how to reach out to a dialogue candidate determined to have a degree of interest.
FIG. 6 is a flowchart showing the second estimation process 02 performed when the robot reaches out to a dialogue candidate.
FIG. 7 is a flowchart by which the dialogue robot system identifies a person to be a dialogue target.
FIG. 8 is a diagram showing the relationship between changes in the positional relationship of the robot and a person and the degree of interest.
FIG. 9 is a diagram showing a person's movement relative to the robot over three frames.
FIG. 10 is a table showing an example calculation of the degree of interest for a person's movement relative to the robot over three frames.

Each embodiment is described below with reference to the drawings.

FIG. 1 is a schematic diagram of the dialogue robot system. It shows an example of how a dialogue robot system (hereinafter, dialogue system) 100 is used in an environment where people come and go. The dialogue system 100 consists of a dialogue robot 110 that interacts with people (hereinafter simply referred to as the robot) and a remote server 130 that controls the robot 110 based on signals from the robot 110.

The robot 110 includes a camera 120, a speaker 121, a microphone array 122, an internal server 123, a display device 124, a driving device 125, and a first communication interface (hereinafter, IF) 126. The remote server 130 is a computer that sends control signals for controlling the operation of the robot 110 and includes a second communication IF 131 that communicates with the first communication IF 126. The first communication IF 126 and the second communication IF 131 are wireless interfaces, for example those of a LAN system that transmits and receives data by wireless communication, such as one defined by IEEE 802.11. The first communication IF and the second communication IF may also be connected via a network such as the Internet.

The camera 120 is an imaging device that captures images, and the microphone array 122 captures environmental sounds and human speech. The display device 124 presents information to a person and is, for example, a display or a projected image. It may also render the face and facial expressions of the robot 110. The driving device 125 is located at joints of the robot 110 such as the arms and legs, and is, for example, a motor or a speed reducer that realizes movement and motions for expressing emotion.

The internal server 123 is a computer that transmits data obtained by the camera 120 and the microphone array 122 to the remote server 130 via the first communication IF 126. The remote server 130 includes the second communication IF 131, receives images, audio data, and signals from the first communication IF 126, and, according to the received signals, transmits signals for controlling the robot 110 to the robot 110 via the second communication IF 131 and the first communication IF 126.

The function of the remote server 130 that controls the robot 110 may instead be executed by the internal server 123. In that case, the remote server 130 and the first communication interface 126 are unnecessary, and the robot 110 can be configured to interact with people on its own. The dialogue system configured as above is used mainly as a dialogue robot that initiates dialogue with, or converses with, people around the robot.

<System configuration example>
FIG. 2 is a diagram illustrating an example hardware configuration of the robot 110 and the remote server 130 that constitute the dialogue system 100.

The robot 110 is equipped with the camera 120, the microphone array 122, and a first output device 140, which are connected to the internal server 123 by a bus 129. The first output device 140 includes the speaker 121, the display device 124, and the driving device 125. The internal server 123 has a first processor 127, a first storage device 128, the first communication interface 126, and the bus 129 connecting them. The camera 120 may be a depth sensor.

The first processor 127 controls the first output device 140 of the robot 110 and realizes the functions of the internal server 123. The first storage device 128 serves as a work area for the first processor 127 and is a non-transitory or transitory storage medium that stores the various programs and data that realize those functions. Examples of the first storage device 128 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), a GPU (Graphics Processing Unit), and a flash memory. Examples of the first output device 140 include the display device 124 and the speaker 121. The first communication IF 126 communicates wirelessly with the remote server 130 or connects to it via a network (not shown) to transmit and receive data.

The camera 120 is an imaging device that photographs the surroundings of the robot 110 and may have, for example, a three-dimensional measurement function capable of measuring the distance to a subject. The driving device 125 is a mechanism that drives the robot 110 and may be, for example, a motor. For example, the driving device may move the robot 110 by walking or on wheels, move the arms and fingers of the robot 110 to express its emotions, or turn the head of the robot 110 to change the direction of the camera 120.

The remote server 130 has the second communication IF 131, a second processor 132, a second storage device 133, and a bus 134 connecting them. The second storage device 133 serves as a work area for the second processor 132 and is a non-transitory or transitory storage medium that stores the various programs and data that realize the functions of the remote server. Examples of the second storage device 133 include a ROM, a RAM, an HDD, a GPU, and a flash memory. The second communication IF 131 communicates wirelessly with the robot 110 or connects to it via a network (not shown) to transmit and receive data.

<Example of functional configuration of control system>
FIG. 3 is a block diagram illustrating an example functional configuration of the dialogue system 100.

The internal server 123 has an image receiving unit 120A that receives image data from the camera 120, the first communication IF 126 for exchanging data with the remote server 130, an output device control unit 303 that controls the speaker 121 and the display device 124, and a drive control unit 304 that controls the driving device 125. The output device control unit 303 and the drive control unit 304 are realized by the first processor 127 executing programs stored in the first storage device 128. For example, the output device control unit 303 controls the first output device 140, and the drive control unit 304 controls the driving device 125, so that the robot converses with a person using the microphone array 122 and the speaker 121. They also realize motions for expressing the emotions of the robot 110 and movement of the robot 110.

The remote server 130 has the second communication IF 131, a person detection unit 312, a person feature extraction unit 313, a person tracking unit 314, a time-series feature extraction unit 315, an interest behavior identification unit 316, and a reaction confirmation unit 317. The functions of these units are realized by the second processor 132 executing programs stored in the second storage device 133. The person feature extraction unit 313 has a head detection unit 321, a head orientation unit 322, and a body orientation unit 323.

The environment around the robot 110 imaged by the camera 120 is transmitted as image information to the remote server 130 via the image receiving unit 120A, the first communication IF 126, and the second communication IF 131.

The person detection unit 312 estimates the region in which a person is present from the image information from the camera 120. The region may be expressed as the position of a rectangle enclosing the person. The region estimated by the person detection unit 312, or the image information within that region, is sent to the person feature extraction unit 313 and the person tracking unit 314. The person detection processing executed by the person detection unit 312 uses known techniques: for example, a method that characterizes a person's contour from the gradient between each pixel and its surrounding pixel values and estimates the person's presence, or a method that uses a convolutional neural network (hereinafter, CNN) built on convolution filters and expresses a person's presence as a rectangle. Using these techniques, the person detection unit 312 can obtain region information for each person even when multiple people are present.

The person feature extraction unit 313 extracts, for example, features of each person in the image from the per-person regions obtained by the person detection unit 312. The features extracted by the person feature extraction unit 313 are per-person features within the person region of a single image. Specifically, the person feature extraction unit 313 has the head detection unit 321, the head orientation unit 322, and the body orientation unit 323. The person features obtained by the person detection unit 312 may be used when the time-series feature extraction unit 315 extracts temporally continuous features for each person.

The head detection unit 321 detects the head region within the region extracted by the person detection unit 312. Specifically, the head region may be the position of a rectangle enclosing the head, and it is detected with a detector based on a CNN (convolutional neural network). When the person is facing forward, the face may be detected instead. Face detection uses Haar-like features; a CNN-based detector may also be used for face detection.

The head orientation unit 322 estimates the head direction using information within the head region identified by the head detection unit 321. If the face is at a usable angle, the head direction may be estimated with a known image-processing approach that extracts facial feature points and estimates the face direction from their arrangement in the image. A threshold may also be applied to the value obtained from this direction estimate to make a binary decision on whether or not the person is facing the camera 120. Alternatively, for example, a classifier such as a CNN that takes a head image as input and outputs the head direction may be trained and used; it may output the direction value directly, or a direction-estimating classifier may be trained in advance and the features of its intermediate layer used at run time.

The body orientation unit 323 estimates the direction of the person's body within the region obtained from the person detection unit 312. If the camera 120 is a depth sensor, the person's skeleton is estimated by machine learning that takes the depth image as input, and the body direction is determined from the estimated skeleton positions. Alternatively, a classifier that estimates body orientation may be built from many examples of person images labeled with the corresponding body orientation and used for the determination. This classifier may output the direction value directly or may output its intermediate-layer output.

In the description above, the person feature extraction unit 313 outputs the head direction and the body direction separately as one example, but a classifier may instead be built from images of people and corresponding labels. Specifically, images of people together with labels indicating whether each person actually engaged in dialogue may be collected as examples, and the determination may be made based on those examples.

The person tracking unit 314 associates the same person across consecutive image frames based on the person regions estimated by the person detection unit 312. To track a person across consecutive frames, the feature values within a region output by the person detection unit 312 are compared with the feature values in the adjacent frame; if the features are similar, the detections are treated as the same person and associated across the frames.

The time-series feature extraction unit 315 extracts time-series behavioral features for each person from the outputs of the person feature extraction unit 313 and the person tracking unit 314. Specifically, it extracts the person's direction of movement from the transition of the body direction output by the body orientation unit 323 over a plurality of time frames. It may also extract time-series features of a person approaching the robot 110 or passing it by. The specific processing is described later.
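As a rough illustration of extracting a movement direction from the body direction over several frames, the following sketch averages per-frame body angles with a circular mean and checks whether the result points toward the robot. The angle convention (degrees in the image plane), the function names, and the 30-degree tolerance are assumptions made for this example and do not come from the publication.

```python
import math


def mean_direction(angles_deg):
    """Circular mean of per-frame body orientations, in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def is_approaching(body_angles_deg, bearing_to_robot_deg, tolerance_deg=30.0):
    """True if the mean body orientation points at the robot within a tolerance.

    tolerance_deg is a hypothetical tuning parameter, not a value from the source.
    """
    diff = mean_direction(body_angles_deg) - bearing_to_robot_deg
    diff = (diff + 180.0) % 360.0 - 180.0  # wrap the difference to [-180, 180)
    return abs(diff) <= tolerance_deg
```

A circular mean is used instead of an arithmetic mean so that angles on either side of the wrap-around point (for example 350 and 10 degrees) average to 0 rather than 180.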

Based on the time-series features extracted by the time-series feature extraction unit 315, the interest behavior identification unit 316 identifies, for each person in the image, whether the person is interested in the robot 110, and determines whether the person is a dialogue candidate. When a person's degree of interest is determined to be high, the person is determined to be a dialogue candidate, and control signals are sent to the output device control unit 303 and the drive control unit 304 via the second communication IF 131 and the first communication IF 126. The control signals are described later.

On receiving the control signal, the output device control unit 303 controls the speaker 121 and the display device 124 of the first output device 140, and the drive control unit 304 controls the driving device 125. The output device control unit 303, for example, changes what the display device 124 shows or calls out to the person through the speaker 121. The drive control unit 304 controls the driving device 125 to reach out to the person with a motion such as beckoning.

The reaction confirmation unit 317 checks, for example, the reaction of a person whom the robot has reached out to through the speaker 121 and the display device 124 of the first output device 140 controlled by the output device control unit 303 and through the driving device 125 controlled by the drive control unit 304. It may detect whether, at a time close to when the output device control unit 303 and the drive control unit 304 reached out, the person shows a change in reaction that correlates with the outreach.

<Process for identifying dialogue target>
FIG. 7 shows a flowchart executed by the dialogue system 100 to identify a person to be a dialogue target.

First, in step S701, the camera 120 of the robot 110 captures images of the surroundings and transmits the image information to the person detection unit 312 of the remote server 130. The person detection unit 312 acquires the transmitted image information.

Next, in step S702, the person detection unit 312 executes person detection processing based on the acquired image information. In this processing, the person detection unit 312 determines whether any person is present and, if so, acquires the region of each person individually, for example as a rectangular region.

Next, in step S703, the person tracking unit 314 receives the output of the person detection unit 312, determines whether a person detected in the current frame was also detected in the most recent previous frame, and associates persons between frames. If no corresponding person exists in the most recent previous frame, the person tracking unit 314 treats the detection as a new person and registers that person in the second storage device 133. The features of the new person are stored, and the person tracking unit 314 performs the association in subsequent frames. The person tracking technique used here is realized, for example, by measuring the similarity of image feature values within the person regions. It is known that, with person tracking techniques, even if the tracked target is lost from the image because of an occluding object or the like, tracking can sometimes be resumed when the target reappears.
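The frame-to-frame association in step S703 can be sketched as follows. This is a minimal illustration, assuming that each detection is already summarized as a numeric feature vector and that cosine similarity with a fixed threshold decides whether two detections are the same person. The class name `PersonTracker`, the greedy matching, and the threshold of 0.9 are hypothetical choices for the example, not details given in the publication.

```python
import math
from itertools import count


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class PersonTracker:
    """Greedy tracker: match each detection to the most similar known person."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.known = {}             # person id -> last stored feature vector
        self._ids = count(1)

    def update(self, detections):
        """Return one person id per detection in the current frame."""
        assigned = []
        free = dict(self.known)     # ids not yet matched in this frame
        for feat in detections:
            best_id, best_sim = None, self.threshold
            for pid, prev in free.items():
                sim = cosine(feat, prev)
                if sim >= best_sim:
                    best_id, best_sim = pid, sim
            if best_id is None:
                best_id = next(self._ids)   # no match: register a new person
            else:
                del free[best_id]           # each id is used once per frame
            self.known[best_id] = list(feat)
            assigned.append(best_id)
        return assigned
```

Because unmatched persons stay in `known`, a person who is briefly hidden can be matched to the same id on reappearing, which mirrors the remark above that tracking can sometimes be resumed after an occlusion.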

Next, in step S704, the time-series feature extraction unit 315 and the interest behavior identification unit 316 perform a first estimation process 01, which determines whether a person has an intention to interact with, or an interest in, the robot 110 and treats such a person as a dialogue candidate. A person determined in the first estimation process 01 to have an intention to interact or an interest becomes a dialogue candidate (S705). The specific processing of step S704 is described later.

In step S706, the robot 110 reaches out to the person determined to be a dialogue candidate by the first estimation process 01. The specific processing for reaching out is described later.

In step S707, the reaction confirmation unit 317 observes the person's reaction to the action the robot 110 took in step S706 and, if a reaction to the outreach is confirmed, determines that the person is a dialogue target.

In step S708, the robot prepares for dialogue with the person determined to be a dialogue target in step S707. Specifically, for example, if the driving device 125 provides the robot 110 with a locomotion function, the robot may walk up to the dialogue target before the dialogue. If the driving device 125 provides the robot 110 with a turning function, the robot 110 may turn its body toward the dialogue target before the dialogue. At this time, the camera 120 may be pointed at the person to capture an image of the person from the front. If the first storage device 128 or the second storage device 133 has means for estimating a person's appearance features, the estimation may be performed on the captured image before the dialogue. The appearance features here are, for example, the person's age and gender estimated from a face image.
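Turning the robot toward the dialogue target in step S708 reduces to computing a signed rotation from the robot's current heading to the bearing of the person. The sketch below assumes a flat 2D floor coordinate system with headings measured counterclockwise from the x axis; the function name and coordinate convention are illustrative assumptions, since the publication does not specify how the turn is computed.

```python
import math


def turn_angle_deg(robot_heading_deg, robot_xy, person_xy):
    """Signed rotation in degrees, in [-180, 180), that points the robot at the person."""
    dx = person_xy[0] - robot_xy[0]
    dy = person_xy[1] - robot_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))  # direction from robot to person
    return (bearing - robot_heading_deg + 180.0) % 360.0 - 180.0
```

A positive result means a counterclockwise turn under this convention; wrapping to [-180, 180) makes the robot take the shorter rotation.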

In step S709, when a person who actually intends to interact has been mistakenly determined not to have that intention, and that person approaches and speaks to the robot 110, the robot 110 performs exception processing to respond to the person.

In step S710, a dialogue service is provided to the person determined to be a dialogue target by the second estimation process 02, or to the person who spoke to the robot 110 in step S709, through, for example, utterances of the robot 110 via the speaker 121, gestures of the robot 110 via the driving device 125, and information presented on the display device 124. During the dialogue, the robot 110 may, for example, change its tone of speech based on appearance characteristics such as age and gender determined from the image of the person captured in step S708.

In the first embodiment, a person's dialogue intention toward, or interest in, the robot 110 is judged in two stages, using the first estimation unit of step S704 and the second estimation unit of step S707, so that the person's dialogue intention or degree of interest can be calculated accurately.

<Specific processing of the first estimation process 01>
FIG. 4 is a flowchart showing the specific procedure of the first estimation process 01. The first estimation process 01 estimates the dialogue intention toward, or degree of interest in, the robot 110 of each person who has been detected by the person detection unit 312 and whom the person tracking unit 314 has been able to track across frames, and determines dialogue candidates.

First, in step S404, the head detection unit 321 and the head orientation determination unit 322 estimate the direction of the head from the person's head region and identify whether the person is facing the robot. This is determined by setting a threshold on the head orientation estimated by the head orientation determination unit 322 and comparing the estimated orientation against it. Alternatively, examples of faces or heads that are facing the robot and examples that are not may be collected to train a classifier used for the determination.

In step S405, the second processor 132 records in the second storage device 133 the time Tf at which the head was determined in step S404 to be facing the robot.

In step S406, the body orientation determination unit 323 and the time-series feature extraction unit 315 determine the person's direction of travel: moving toward the robot, moving away from it, or passing by. The determination is made by extracting the person's movement vector. Alternatively, examples of people's movements may be collected to train a classifier used for the determination.
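As a minimal sketch of the movement-vector test in step S406, the following Python function labels one frame-to-frame displacement. The ground-plane coordinates, the function name `classify_motion`, and the `min_step` and `cos_limit` thresholds are assumptions made for illustration; the description only states that a movement vector is extracted.

```python
import math


def classify_motion(prev_pos, curr_pos, robot_pos=(0.0, 0.0),
                    min_step=0.05, cos_limit=0.5):
    """Label one frame-to-frame move as 'approach', 'leave', or 'pass'.

    Positions are (x, y) ground-plane coordinates in metres (assumed).
    min_step and cos_limit are hypothetical tuning values.
    """
    # Movement vector of the person between the two frames.
    mx, my = curr_pos[0] - prev_pos[0], curr_pos[1] - prev_pos[1]
    # Vector from the person's previous position to the robot.
    rx, ry = robot_pos[0] - prev_pos[0], robot_pos[1] - prev_pos[1]
    step = math.hypot(mx, my)
    dist = math.hypot(rx, ry)
    if step < min_step or dist == 0.0:
        # Hardly any movement: not counted as approaching (assumption).
        return "pass"
    # Cosine of the angle between the movement and the direction to the robot.
    cos = (mx * rx + my * ry) / (step * dist)
    if cos >= cos_limit:
        return "approach"
    if cos <= -cos_limit:
        return "leave"
    return "pass"
```

A trained classifier, which the description also allows, could replace the cosine test without changing the three output labels.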

In step S407, the interest behavior identification unit 316 calculates, for each person, the degree of interest in the robot 110 in the current frame. For example, the score may be incremented when the head is facing the robot and the person's torso is approaching.

As other ways of calculating the score, when the person is approaching but the head is not facing the robot, a score attenuated according to the difference between the time Tf and the current time may be added. If the person is merely passing by and cannot be judged to be approaching, no score need be added. In addition, the interest behavior identification unit 316 may take the time average of the scores calculated in step S407 over a plurality of frames as the person's degree of interest, and, if a person shows the back of the head and keeps moving away for a predetermined time, may reset or reduce the score.

FIG. 8 shows the relationship between changes in the positional relationship of the robot and a person and the degree of interest. The interest behavior identification unit 316 calculates the person's degree of interest from changes in the person's behavior relative to the robot 110 within a predetermined time.

FIG. 8(a) is an example in which, within the predetermined time, a person moves from position 810 to position 811 along a path 812 toward the robot. The person's head is facing the robot 110. In this example, the interest behavior identification unit 316 judges that the person has dialogue intention and adds to the degree of interest (dialogue intention score).

FIG. 8(b) is an example in which, within the predetermined time, a person moves from position 820 to position 821 along a path 822 toward the robot. At position 820 the person's head was facing the robot 110, but at position 821 it is not. The interest behavior identification unit 316 attenuates the interest score before adding it.

FIG. 8(c) is an example in which a person moves from position 830 to position 831 along a path 832 without turning the head toward the robot. In this case, the interest behavior identification unit 316 does not add to the interest score.

FIG. 8(d) is a case in which a person moves from position 840 to position 841 along a path 842 leading away from the robot, with the head not facing the robot. In this case, the interest behavior identification unit 316 resets the person's dialogue intention or interest score. Alternatively, it may reduce the score.
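The four cases of FIG. 8 can be sketched as a per-frame scoring rule. This is an illustrative sketch only: the function name `frame_score`, the string labels, the use of `None` to signal a reset, and the 1/n attenuation (taken from the FIG. 9(c) example, where n is the number of frames since the head last faced the robot) are assumptions rather than the only scoring the description permits.

```python
RESET = None  # sentinel meaning "discard the accumulated score" (assumed encoding)


def frame_score(motion, facing, frames_since_facing=None):
    """Score one frame following the cases of FIG. 8.

    motion: 'approach', 'leave', or 'pass' (labels assumed for this sketch).
    facing: True if the head is facing the robot in this frame.
    frames_since_facing: frames elapsed since the head last faced the robot,
        or None if it never did.
    """
    if motion == "approach":
        if facing:
            return 1.0                        # FIG. 8(a): full score
        if frames_since_facing:
            return 1.0 / frames_since_facing  # FIG. 8(b): attenuated score
        return 0.0                            # never faced the robot (assumption)
    if motion == "leave" and not facing:
        return RESET                          # FIG. 8(d): reset
    return 0.0                                # FIG. 8(c): nothing added
```

The description also allows subtracting instead of resetting in case (d); that variant would return a negative value here.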

Next, in step S408, the interest behavior identification unit 316 uses the scores calculated over a plurality of frames (in step S407) to obtain the person's dialogue intention toward, or degree of interest in, the robot 110. The calculated degree of interest is stored in the second storage device 133 as shown in FIG. 10.

FIG. 9 shows a person's movement relative to the robot 110 over three frames. FIG. 10 is a table showing an example of calculating the degree of interest for that three-frame movement. Each case illustrates, based on the scoring of FIG. 8, one way in which the interest behavior identification unit 316 calculates dialogue intention or degree of interest from the person's behavior, using approach behavior and head orientation. FIG. 10 lists the per-frame scores, with C1 for the first frame, C2 for the second, and C3 for the third, together with a time-averaged score as one example of scoring the three frames. The degrees of interest in FIG. 10 are stored in the second storage device 133, and the same kind of interest table can be used when the robot 110 is actually in service. That is, in an environment where a plurality of people come and go, the people imaged by the camera 120 are identified by IDs (A) to (D), and each person's time-averaged degree of interest is calculated in the same way and managed as a table.

FIG. 9(a) shows three frames of a person approaching the robot 110. Paths 901, 902, and 903 are the person's movement paths in the respective frames. In each frame, the person's head is facing the robot 110 and the person is moving toward the robot 110. This corresponds to the movement in FIG. 8(a), and the interest behavior identification unit 316 scores it as "1". The per-frame scores are C1 = 1, C2 = 1, and C3 = 1, and evaluating dialogue intention from the three frames as a time average gives "1".

FIG. 9(b) shows three frames of a person passing by the robot 110. Paths 911, 912, and 913 are the person's movement paths in the respective frames. They correspond to the movement in FIG. 8(c), and the interest behavior identification unit 316 sets the score to, for example, 0. The per-frame scores are C1 = 0, C2 = 0, and C3 = 0, and evaluating dialogue intention from the three frames as a time average gives "0".

FIG. 9(c) shows three frames of a person approaching the robot 110. The person initially has the head turned toward the robot 110. Along the subsequent paths 921, 922, and 923 the person keeps approaching the robot 110, but the head is no longer facing it, which corresponds to the movement in FIG. 8(b). On path 921, one frame has elapsed since the head stopped facing the robot, so the interest behavior identification unit 316 sets C1 = 1/1. On path 922, two frames have elapsed, so C2 = 1/2. On path 923, three frames have elapsed, so C3 = 1/3. Evaluating dialogue intention from these three frames as a time average, the interest behavior identification unit 316 gives a score of, for example, (1 + 1/2 + 1/3) / 3 = 11/18.

FIG. 9(d) shows three frames of a person's behavior in front of the robot 110. On path 941 the person approaches with the head turned toward the robot 110, which is the movement of FIG. 8(a). On path 942 the person turns the head away from the robot 110, and on path 943 the person moves away from the robot 110, which is the movement of FIG. 8(d). The interest behavior identification unit 316 sets the score for path 941 to C1 = 1. Path 942 is a standing-still behavior, so C2 = 0, and path 943 is a movement away with the head turned away, so the score is reset. In FIG. 9(d), therefore, the person's dialogue intention is reset by the FIG. 8(d) movement, and the time-averaged interest score is "0".
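The four worked cases of FIG. 9 can be checked with a short sketch. The function name `interest_level` and the encoding of a reset as `None` are assumptions for illustration, as is the choice to discard all earlier scores when a reset occurs; exact fractions are used so that the 11/18 of case (c) is reproduced without rounding.

```python
from fractions import Fraction

RESET = None  # assumed encoding of the FIG. 8(d) reset


def interest_level(frame_scores):
    """Time-average per-frame scores; a RESET discards everything so far."""
    kept = []
    for score in frame_scores:
        if score is RESET:
            kept.clear()   # forget accumulated scores (assumed reset rule)
        else:
            kept.append(Fraction(score))
    if not kept:
        return Fraction(0)
    return sum(kept, Fraction(0)) / len(kept)


# Per-frame scores C1, C2, C3 for cases (a) to (d) of FIG. 9.
cases = {
    "A": [1, 1, 1],
    "B": [0, 0, 0],
    "C": [Fraction(1, 1), Fraction(1, 2), Fraction(1, 3)],
    "D": [1, 0, RESET],
}
table = {pid: interest_level(scores) for pid, scores in cases.items()}
```

Here `table` plays the role of the interest table of FIG. 10, keyed by person ID.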

The interest behavior identification unit 316 identifies, based on the features extracted by the time-series feature extraction unit 315, whether a person has dialogue intention toward or interest in the robot 110, and calculates a score. In an environment where a plurality of people come and go, the calculated scores can be used to rank people when the robot 110 selects dialogue candidates. A threshold may also be set so that people who do not exceed it are excluded from the ranking. In the example of FIGS. 9 and 10, a threshold of 0.5 identifies (a) and (c) as dialogue candidates to be ranked and excludes the passer-by (b) and the person moving away (d). Raising the threshold to, for example, 0.7 also excludes (c), who approaches while looking elsewhere. Because people who simply pass in front of the robot 110 need not be treated as dialogue candidates, the computation is simplified and the interest scores can be processed faster. When several people qualify, dialogue candidates may also be chosen by relative evaluation, for example the two people with the highest calculated degree of interest.
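The thresholding and ranking described here can be sketched as follows. The function name `select_candidates` and its parameters are hypothetical; the interest values are those worked out for cases (a) to (d) of FIG. 9.

```python
def select_candidates(interest_by_id, threshold=0.5, top_k=None):
    """Return person IDs whose interest exceeds the threshold, highest first.

    top_k optionally keeps only the highest-ranked people (relative evaluation).
    """
    ranked = sorted(
        ((pid, value) for pid, value in interest_by_id.items() if value > threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    ids = [pid for pid, _ in ranked]
    return ids if top_k is None else ids[:top_k]


# Time-averaged interest for the four people of FIG. 9.
interest = {"A": 1.0, "B": 0.0, "C": 11 / 18, "D": 0.0}
```

With these values, `select_candidates(interest, 0.5)` keeps A and C, and raising the threshold to 0.7 keeps only A, matching the text.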

Although FIG. 4 shows one example of the first estimation process 01, the interest behavior identification unit 316 is not limited to it and may estimate the likelihood of interest using deep learning.

Specifically, the time-series feature extraction unit 315 receives, as a moving image, the movements of people coming and going near the robot 110. A moving image is a sequence of consecutive images, and the person feature extraction unit 313 extracts a person's features from each image frame. The extracted features may be the head orientation features output by the head orientation determination unit 322, the body orientation features output by the body orientation determination unit 323, or a single feature combining the two. The interest behavior identification unit 316 may then take the features extracted from all frames of the moving image and output a likelihood of interest. For training data, moving images of people passing in front of the robot may be used as input, labeled with whether the person actually came to the robot, to train a classifier used for the determination. The method described above is one example of a means of identifying the dialogue intention of an approaching person, and the invention is not limited to it.

<Specific processing of the action>
FIG. 5 is a flowchart by which the robot selects how to act on a dialogue candidate judged to have interest. Specifically, when a dialogue candidate exists, a control signal transmitted to the robot 110 causes the output device control unit 303 and the drive control unit 304 to perform control. This processing is performed by the second processor 132 executing a program stored in the second storage device 133. The device controlled by the control signal may be changed according to the number of interested people.

First, in step S501, the interest behavior identification unit 316 determines whether a dialogue candidate exists. If none exists, no control signal is transmitted.
Next, in step S502, the interest behavior identification unit 316 determines whether there are a plurality of dialogue candidates. This is to select the output device 140 or the driving device 125 to be controlled depending on whether there is one candidate or more than one.

If there is not more than one dialogue candidate, the process proceeds to step S503, where the person detection unit 312 determines whether a plurality of people are present.

If only one person is present, the process proceeds to step S504, and the interest behavior identification unit 316 sends the output device control unit 303 a control signal to control the speaker 121. Specifically, the robot is controlled to call out to the person, for example with a greeting.

If it is determined in step S503 that a plurality of people are present, or in step S502 that there are a plurality of dialogue candidates, the process proceeds to step S505. In step S505, the interest behavior identification unit 316 sends the output device control unit 303 a control signal for the display device 124 of the first output device 140. The robot thereby acts on a dialogue candidate with a high degree of interest by, for example, drawing the robot's face on the display device 124 or changing its expression. If the drive control unit 304 controls the driving device 125, the robot 110 may express its willingness to talk by turning toward the person with a high degree of interest and waving or bowing.

The controlled device is chosen according to the number of people nearby because calling out may attract the attention of people with a low degree of interest, and this choice avoids affecting the determination results of the interest behavior identification unit 316. When only one dialogue candidate is present, the robot may call out from a distance.
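The branching of steps S501 to S505 can be summarized in a small sketch. The function name `choose_action` and the returned labels are hypothetical; "display" stands for the display-device control of step S505, which the description says may be accompanied by gestures via the driving device.

```python
def choose_action(num_candidates, num_people):
    """Pick the device used to act on a dialogue candidate (FIG. 5 flow)."""
    if num_candidates == 0:
        return None          # S501: no candidate, so no control signal is sent
    if num_candidates > 1:
        return "display"     # S502 -> S505: several candidates
    if num_people > 1:
        return "display"     # S503 -> S505: one candidate among several people
    return "speaker"         # S504: a single person, call out with a greeting
```

The speaker is used only when calling out cannot draw in bystanders, which is the reason given for splitting the control by head count.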

<Specific processing of the second estimation process 02>
The reaction confirmation unit 317 observes the reaction of the person acted on via the output device control unit 303 or the drive control unit 304. The reaction confirmation unit 317 thereby implements the second estimation process 02 and extracts, from the dialogue candidates, the people who can become dialogue targets.

FIG. 6 is a flowchart showing the second estimation process 02 performed when the output device control unit 303 and the drive control unit 304 act on a dialogue candidate.

First, in step S601, the output device control unit 303 and the drive control unit 304 act on the dialogue candidate by controlling the output device 140. In this step, the first processor 127 stores the time Ta of the action in the first storage device 128. The time Ta may instead be recorded in the second storage device 133.

Next, in step S602, the reaction confirmation unit 317 determines whether the dialogue candidate's behavior changes in reaction to the action of the robot 110 controlled in step S601. Specifically, when the head orientation detected over time by the head orientation determination unit 322 changes to face the robot 110 within a very short time of Ta, for example within 1 second, the reaction confirmation unit 317 regards this as the dialogue candidate's reaction to the robot's action and determines that the dialogue candidate is a dialogue target. Likewise, for example, if the body orientation determination unit 323 determines that, within a short time of Ta, for example within 5 seconds, the dialogue candidate's direction of travel changes toward the robot 110, or remains toward the robot 110 without changing, the reaction confirmation unit 317 determines that the candidate is a dialogue target.
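A minimal sketch of the timing check in step S602 follows. The function and parameter names are hypothetical, times are in seconds, and the 1-second and 5-second windows are the example values given in the text rather than fixed requirements.

```python
def is_dialogue_target(t_action, head_turn_time=None, heading_time=None,
                       head_window=1.0, body_window=5.0):
    """Decide whether a candidate reacted to the robot's action at time Ta.

    t_action: time Ta at which the robot acted.
    head_turn_time: time the head was observed turning toward the robot, if any.
    heading_time: time the travel direction was observed to be (or to stay)
        toward the robot, if any.
    """
    if head_turn_time is not None and 0.0 <= head_turn_time - t_action <= head_window:
        return True   # head turned toward the robot shortly after Ta
    if heading_time is not None and 0.0 <= heading_time - t_action <= body_window:
        return True   # travel direction toward the robot shortly after Ta
    return False
```

Observations made before Ta are ignored here, on the assumption that they cannot be reactions to the action.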

Next, in step S603, the reaction confirmation unit 317 sends a control signal to the drive control unit 304 so that the robot 110 turns its body in advance toward, or moves toward, the person determined to be a dialogue target in step S602. A control signal may also be sent to the output device control unit 303 to call out to the dialogue target using the speaker 121.

When, in step S602, the reaction confirmation unit 317 cannot confirm a reaction from the dialogue candidate and cannot determine the candidate to be a dialogue target, the time-series feature extraction unit 315 determines whether the dialogue candidate approaches the robot 110 (step S604). Whether the candidate counts as approaching is judged by whether the candidate has entered a region around the robot 110. The size of this region may be varied according to the speed at which the person approaches the robot. If the person does not approach even after a certain time has passed, or moves a certain distance away from the robot 110, processing for that person ends.

Finally, in step S605, preparations are made for a dialogue with the person determined to be a dialogue target in step S603. For example, at the start of a dialogue with a person, if the driving device 125 provides the robot 110 with a turning function, the robot 110 is made to face the dialogue target squarely. If the driving device 125 includes a means of movement, the drive control unit 304 may bring the robot 110 close to the dialogue target, after which the output device control unit 303 may call out to the dialogue target using, for example, the speaker 121.

Specifically, for example, if the robot 110 is to take a facing posture, the region may be a variable range determined from the time required to complete that motion and the approach speed of the dialogue candidate.
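One plausible reading of this variable region is a radius equal to the distance the person covers while the robot completes its motion. The sketch below assumes that reading; the function names, the units (metres, metres per second, seconds), and the optional minimum radius are not from the description.

```python
def region_radius(approach_speed, motion_time, min_radius=0.0):
    """Radius of the region around the robot, in metres (assumed model).

    approach_speed: speed at which the candidate approaches the robot (m/s).
    motion_time: time the robot needs to finish turning to face the person (s).
    min_radius: hypothetical lower bound for slow or stationary people.
    """
    return max(min_radius, approach_speed * motion_time)


def has_entered_region(distance, approach_speed, motion_time, min_radius=0.0):
    """True if the candidate is already inside the variable region (step S604)."""
    return distance <= region_radius(approach_speed, motion_time, min_radius)
```

A faster walker therefore triggers the preparation from farther away, which leaves the robot enough time to finish turning before the person arrives.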

As described above, the dialogue system 100 of the present embodiment performs a first estimation, which selects dialogue candidates from the characteristics of the approach movements of a plurality of people approaching from a distance, and a second estimation, which acts on the dialogue candidates and selects dialogue targets from their reactions. This provides a dialogue robot system and a dialogue robot control method that, when the robot is used in an environment where a plurality of people come and go, determine whether a person has dialogue intention or interest before the person reaches the robot and narrow down the dialogue targets in advance. Because the robot can actively select a person and speak to them, it can effectively make that person a dialogue target. Furthermore, by pointing the camera at the person and running person recognition before the person arrives, the person's appearance characteristics can be extracted before the dialogue starts and reflected in the dialogue content.

100: dialogue system, 110: robot, 120: camera, 121: speaker, 122: microphone array, 123: internal server, 124: display device, 125: drive unit, 126: first communication IF, 127: first processor, 128: first storage device, 130: remote server, 131: second communication IF, 132: second processor, 133: second storage device, 140: first output device, 303: output device control unit, 304: drive control unit, 312: person detection unit, 313: person feature extraction unit, 314: person tracking unit, 315: time-series feature extraction unit, 316: interest behavior identification unit, 317: reaction confirmation unit, 321: head detection unit, 322: head orientation determination unit, 323: body orientation determination unit.

Claims

A dialogue system comprising:
an imaging device that images its surroundings; and
a computer that detects a person from image information from the imaging device, tracks the detected person across a plurality of images from the imaging device, calculates a degree of interest of the tracked person based on changes in the orientation of the person's face and the orientation of the person's torso in the plurality of images, and sets the person as a dialogue candidate based on the calculated degree of interest.

The dialogue system according to claim 1, further comprising a robot having the imaging device and a speaker, wherein
the robot has one of a display device and a driving device, and
the computer transmits a control signal that causes the speaker, the display device, or the driving device to act on the dialogue candidate, and sets, as a dialogue target, a dialogue candidate determined to have reacted to the action within a predetermined time.

The dialogue system according to claim 2, wherein, when a plurality of persons are captured in the image information from the imaging device, the computer calculates the degree of interest for each person and identifies, among the plurality of persons, a person whose degree of interest is higher than a threshold as the dialogue candidate.

The dialogue system according to claim 1, further comprising a robot having the imaging device and a display device, wherein
the computer calculates the degree of interest for each person when a plurality of persons are captured in the image information from the imaging device, and performs display control of the display device of the robot when there are a plurality of persons whose degree of interest is higher than a threshold among the plurality of persons.

The dialogue system according to claim 2, wherein
the driving device turns the robot, and
the imaging device images the dialogue target from the front as a result of the turning by the driving device.

The dialogue system according to claim 5, wherein the computer identifies the age and gender of the dialogue target, before starting a dialogue with that person, based on an image of the dialogue target captured from the front by the imaging device.

A control method for a dialogue system having a robot and a computer, wherein
an imaging device mounted on the robot images the surroundings, and the computer detects a person from image information captured by the imaging device, tracks the detected person, calculates a degree of interest of the tracked person based on changes in the orientation of the person's face and torso in a plurality of images, and sets the person as a dialogue candidate based on the calculated degree of interest.

The control method for a dialogue system according to claim 7, wherein the computer transmits, to a speaker, a display device, or a driving device mounted on the robot, a control signal for acting on the dialogue candidate, and sets, as a dialogue target, a dialogue candidate determined to have reacted to the action within a predetermined time.

The control method for a dialogue system according to claim 8, wherein, when a plurality of persons are imaged in the image information from the imaging device, the computer calculates the degree of interest for each person and sets, among the plurality of persons, a person whose degree of interest is higher than a threshold as the dialogue candidate.

The control method for a dialogue system according to claim 8, wherein, when a plurality of persons are imaged in the image information from the imaging device, the computer calculates the degree of interest for each person and, when there are a plurality of persons whose degree of interest is higher than a threshold among the plurality of persons, performs control to switch a display image of the display device.

The control method for a dialogue system according to claim 8, wherein the computer transmits a control signal for turning the robot so that the person to be interacted with is imaged from the front.

The control method for a dialogue system according to claim 11, wherein the computer identifies the age and gender of the person to be interacted with, based on an image of that person taken from the front, before starting a dialogue with the person.

A dialogue system using a robot installed in an environment where multiple people come and go, the dialogue system comprising:
an imaging device that images the periphery of the robot;
a person detection unit that detects a person from image information from the imaging device;
a person tracking unit that tracks the same person across a plurality of images from the imaging device;
a time-series feature extraction unit that calculates a degree of interest, with respect to the robot, of the person tracked by the person tracking unit, based on changes in the orientation of the person's face and body in the plurality of images; and
an interest behavior identification unit that determines a dialogue candidate based on the degree of interest.

The dialogue system according to claim 13, wherein the robot has at least one of an output device and a driving device, and
the dialogue system further comprises a reaction confirmation unit that acts on the dialogue candidate by controlling either the output device or the driving device and confirms whether or not the person has responded to the action.