JP2007257088A - Robot device and its communication method

Info

Publication number: JP2007257088A
Application number: JP2006077693A
Authority: JP
Inventors: Masahide Kaneko; 正秀金子; Junichi Imai; 順一今井; Yuanjie Cui; 元杰崔
Original assignee: University of Electro Communications NUC
Current assignee: University of Electro Communications NUC
Priority date: 2006-03-20
Filing date: 2006-03-20
Publication date: 2007-10-04

Abstract

PROBLEM TO BE SOLVED: To have a robot carry out human-like communication when many people are around the robot.

SOLUTION: When the robot device communicates with people around it, the surroundings of the robot device are photographed, and the faces and the hands or arms of the people are detected from the photographed image. Evaluation values are obtained from the states of the detected faces and hands or arms, each evaluation value is multiplied by a weighting coefficient predetermined for its type, a person with high priority is identified from the sum of the weighted evaluation values, and communication is carried out with the identified person. The priority judgment also uses evaluation values for the distances to the detected people and for the detected sound source positions.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a robot apparatus and a communication method therefor, and more particularly to a technique for communicating with people around the robot apparatus.

Various studies have been conducted on communication between humans and robot apparatuses. However, most conventional research concerns one-to-one communication between one robot apparatus and one user. By contrast, research on communication between a robot apparatus and multiple users is still scarce, although it is known, for example, to measure the distance between the robot apparatus and a person with a sensor and to communicate based on the measured distance. Specifically, the robot apparatus selects a user using distance information and a touch sensor.

Patent Document 1 discloses a technique for one-to-one communication between one robot apparatus and one user.
[Patent Document 1] JP 2005-199403 A

However, it is difficult to achieve human-like communication merely by selecting a partner according to distance. For example, even when a nearby user is merely looking at the robot apparatus while a distant user is interested in the robot and is waving a hand, the nearby user is selected. In such a situation the distant user is more interested in the robot apparatus, so selecting the distant user is desirable, but conventional methods have difficulty handling such cases. In addition, a method has been proposed for stably recognizing and extracting only the persons (users) facing the camera from a moving image of a scene in which multiple people come and go. However, merely detecting persons who are facing the camera is not sufficient.

The present invention has been made in view of these points, and its object is to enable a robot to communicate in a human-like manner when many people are around the robot.

In the present invention, when a robot apparatus communicates with people around it, the surroundings of the robot apparatus are photographed; human faces and hands or arms are detected from the photographed image; an evaluation value of each state is obtained from the state of the detected face and the state of the detected hand or arm; each evaluation value is multiplied by a weighting coefficient predetermined for each type of evaluation value; a person with high priority is identified from the sum of the weighted evaluation values; and communication is carried out with the identified person.

According to the present invention, a person with high priority for communication is identified based on the detected state of the faces and the detected state of the hands or arms of the surrounding people. As a result, when several people are around the robot apparatus, the person who is most interested in the robot apparatus can be identified well from among them.

In this case, by calculating an evaluation value for the detected face orientation and an evaluation value for the detected hand or arm movement and determining the priority from these evaluation values, a person who is interested in the robot apparatus can be identified well from the face orientation and the hand (or arm) movement.

The hand or arm may also be detected by treating, as a hand or arm, a portion having a predetermined color state within the range where a hand or arm may exist, the range being estimated from the position at which the human face was detected. This allows the hand or arm to be detected reliably and well.

Further, by measuring the distance to a person whose face has been recognized and converting the measured distance into an evaluation value, the identification can also take distance into account.

Further, by identifying the position of a sound source from audio data obtained by collecting the surrounding sound, finding the correspondence between the identified sound source position and a person recognized by the image recognition means, and calculating an evaluation value for that person's audio data, the identification can also take speech into account.

An embodiment of the present invention is described below with reference to the accompanying drawings.

In this example, the robot apparatus is made to communicate in a human-like manner. FIG. 1 shows the configuration of the robot apparatus of this example, focusing on the configuration for determining a high-priority user from among the users near the robot apparatus.

The robot apparatus 10 includes a camera 11, with which the robot apparatus 10 captures moving images. The camera 11 is a video camera configured to capture stereo images with two or more lenses. Image data captured by the camera 11 is supplied to the image recognition unit 12, which extracts images of persons (in particular, face images and hand or arm images) from the image data. The details of how the image recognition unit 12 extracts face and hand images are described later.

Data recognized by the image recognition unit 12 is sent to the recognition data processing unit 13. The recognition data processing unit 13 functions as evaluation value calculation means that obtains evaluation values for the recognized image states, and sends the obtained evaluation values to the priority determination unit 14. Three evaluation values are obtained from the recognized image: one for the state of the detected user's face, one for the state of the user's hand (or arm), and one for the distance between the detected user and the robot apparatus. The face evaluation value is, for example, highest when the face is directed straight at the robot apparatus. The hand (or arm) evaluation value is, for example, highest when the hand is waving. The distance evaluation value is, for example, high when the user is close to the robot apparatus. Because the camera 11 is a video camera configured to capture stereo images, the distance between the user and the robot apparatus can be detected from the captured images themselves. However, other well-known distance measuring means, such as an infrared sensor, may be used in combination.

The robot apparatus 10 of this example is also fitted with multiple microphones 21a to 21n, each oriented in a different sound-collecting direction, which collect the sound around the robot apparatus 10. The audio signals collected by the microphones 21a to 21n are converted into digital audio data by individual analog/digital converters 22a to 22n, and the converted audio data is supplied to the digital audio processing unit 23. The digital audio processing unit 23 estimates, from the multiple audio data streams, the differences in arrival time from the sound source to each microphone and creates likelihood maps for sound source position estimation. The created sound source likelihood maps are sent to the sound source identification unit 24. The sound source identification unit 24 combines the sound source likelihood maps, estimates the direction of the sound source as seen from the robot apparatus, and obtains an evaluation value representing the degree to which the sound source coincides with a user's face region. The evaluation value obtained by the sound source identification unit 24 is sent to the priority determination unit 14.

The priority determination unit 14 uses the supplied face, hand (or arm), distance, and sound source evaluation values to identify, from among the users around the robot apparatus, the user who is most interested in the robot apparatus. To do so, each evaluation value is multiplied by a coefficient predetermined for its type, and the weighted evaluation values of each type are added to obtain a summed evaluation value for each user. The summed values are compared, and the user with the highest value is identified as the user most interested in the robot apparatus. A specific example of calculating the evaluation values is described later. For the sound source evaluation value, the position of the sound source is matched against the positions at which faces were detected in the captured image; the sound is judged to be the speech of the user whose face matched, and the value is assigned to that user.

When the priority determination unit 14 has identified the user most interested in the robot apparatus, it sends data on the direction and distance of that user to the robot operation control unit 31. The robot operation control unit 31 sends a drive command to the rotation drive unit 32 so that the robot faces the direction specified by the data, and communication with the identified user is carried out. Other operations of the robot apparatus that must be directed at a specific user are likewise performed with reference to the direction and distance data of the identified user.

Next, the overall flow of the user identification processing performed by the robot apparatus 10 of this example is described with reference to the flowchart of FIG. 2. First, color moving image data captured by the camera 11 is input to the image recognition unit 12 (step S11), and the regions of skin-colored pixels in the moving image data are extracted (step S12). Human faces are then detected from the extracted skin color regions (step S13).

Next, the user's upper body is detected based on the detected face region (step S14), and the user's hand is extracted based on the skin color region data detected in step S12 and the upper body region data detected in step S14 (step S15). The movement of the extracted hand (that is, the change over time in the detected hand position) is then detected, and gesture recognition is performed from the detected movement (step S16). In addition, the distance between the face detected in step S13 and the robot apparatus 10 is detected by image processing (step S17).

The outputs of the microphones 21a to 21n are also supplied (step S18), and the sound source position is estimated from the multichannel audio signals (step S19). Evaluation values are then obtained for each of the states detected so far, each evaluation value is multiplied by the coefficient for its type, and a priority indicating the degree of intimacy is calculated from the per-user sum of the weighted values (step S20).

Next, the priority calculation for each user and the extraction of each kind of information are described in detail.
By integrating multimodal information such as the user's face orientation, gestures, and voice, and by also taking the distance to the user into account, the priority representing the user's intimacy toward the robot is calculated by Equation (1). In Equation (1), U is the priority, P_i is the intimacy determined from the appearance frequency of each modality or from the distance, and w_i is the weight applied to the intimacy P_i. The intimacy P_i corresponds to the evaluation value of each characteristic.
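Equation (1) itself is not reproduced in this text. Given the definitions of U, P_i, and w_i and the description of multiplying each evaluation value by its coefficient and summing, a form consistent with the description would be the weighted sum below. The index set {f, g, v, d} (face, gesture, voice, distance) is taken from the weights named later in the description; the exact typeset form in the original publication is an assumption.

```latex
U = \sum_{i \in \{f,\, g,\, v,\, d\}} w_i \, P_i
```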

Next, an example of the processing for detecting human faces in the images captured by the camera 11 is described. To detect human faces, skin color regions are first extracted from the input image and a mask image is created, which limits the area processed for face extraction. Face extraction is then performed using known computer software for face image extraction. The software used to verify the present invention is OKAO Vision (trade name, manufactured by OMRON), but the invention is not limited to this software.

Next, the process of extracting skin color from the image subjected to face extraction is described. For skin color extraction, the likelihood of skin color in the input image is computed using a GMM (Gaussian mixture model) of the skin color distribution. A binary image is then produced by thresholding, and noise removal is applied to create the mask image. Face extraction is performed only within the region obtained by enlarging the mask region by a factor of 1.4. This improves the accuracy of face extraction and increases the processing speed.
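The skin color masking step can be sketched roughly as follows. This is a minimal illustration, not the method as implemented in the patent: the choice of a (Cr, Cb) chroma space, the two GMM components and their parameters, the threshold, and the interpretation of the 1.4 enlargement as scaling a bounding box about its center are all assumptions made for the example. Noise removal is omitted.

```python
import math

# Hypothetical 2-component GMM over (Cr, Cb) chroma with diagonal covariance.
# The weights, means, and variances are made-up illustration values.
SKIN_GMM = [
    # (weight, (mean_cr, mean_cb), (var_cr, var_cb))
    (0.6, (150.0, 110.0), (80.0, 60.0)),
    (0.4, (160.0, 100.0), (120.0, 90.0)),
]


def skin_likelihood(cr, cb, gmm=SKIN_GMM):
    """Return the GMM density at one chroma value."""
    total = 0.0
    for weight, (m1, m2), (v1, v2) in gmm:
        norm = 1.0 / (2.0 * math.pi * math.sqrt(v1 * v2))
        expo = -0.5 * ((cr - m1) ** 2 / v1 + (cb - m2) ** 2 / v2)
        total += weight * norm * math.exp(expo)
    return total


def binary_mask(chroma_image, threshold):
    """Threshold the per-pixel likelihood into a 0/1 mask."""
    return [
        [1 if skin_likelihood(cr, cb) >= threshold else 0 for (cr, cb) in row]
        for row in chroma_image
    ]


def enlarge_box(box, factor, width, height):
    """Scale a box (x0, y0, x1, y1) about its center and clip to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * factor / 2.0
    half_h = (y1 - y0) * factor / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(width), cx + half_w), min(float(height), cy + half_h))
```

Face detection would then be run only inside the enlarged boxes, which is where the reduction in processing area comes from.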

FIG. 3 shows an example of extracting a face from a captured color image. FIG. 3(b) is the skin color likelihood image computed for the color image shown in FIG. 3(a); only pixels whose color is skin color or close to skin color are extracted. Binarization, noise removal, and similar processing are applied to this likelihood image to obtain the mask image shown in FIG. 3(c). FIG. 3(d) shows the image in which the white regions (mask) of the mask image have been enlarged. The rectangular regions in FIG. 3(d) that contain the user's face are taken as face region candidates.

Face extraction is performed on the face region candidates obtained in this way using the face image extraction software. The software used in this example can detect a face when it is facing roughly forward. Therefore, when a face is detected it is judged to be facing forward, and otherwise it is judged not to be facing forward. In the image shown in FIG. 3(d), the rectangular frame at the center is the face extraction result.

Next, the processing for detecting the hand or arm and recognizing the gesture of the detected hand is described. Here, gestures indicating six directions (up, down, left, right, forward, backward) by hand movement are recognized. First, the hand is extracted by way of extracting the upper body. Gesture recognition is then performed by examining the movement of the hand.

To extract the upper body, the part below the face region obtained by the processing described above is cut out as shown in FIG. 4, using the fact that in the human body the torso lies below the face. That is, taking the width W and height H of the extracted face as references, the torso of the upper body is assumed to lie within a range of 2W × 1.5H below the face, and the hands or arms are assumed to be at the left and right sides of the torso.
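The torso range can be written as a small helper. This is a sketch only: the 2W × 1.5H size comes from the description, while the (x, y, w, h) box convention with y increasing downward and the horizontal centering of the torso box on the face are assumptions.

```python
def torso_region(face_x, face_y, face_w, face_h):
    """Return (x, y, w, h) of the assumed torso box below a face box.

    The box is 2W wide and 1.5H tall, starts at the bottom edge of the face,
    and is centered horizontally on the face (centering is an assumption).
    """
    torso_w = 2.0 * face_w
    torso_h = 1.5 * face_h
    torso_x = face_x + face_w / 2.0 - torso_w / 2.0
    torso_y = face_y + face_h
    return (torso_x, torso_y, torso_w, torso_h)
```

For a face box at (100, 50) with W = 40 and H = 60, this gives a torso box starting at (80, 110) with size 80 × 90.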

FIG. 5 shows an example of extracting the upper body in this way. FIG. 5(a) is an example in which the upper body range is set on an actual color image according to the range setting of FIG. 4; the upper frame indicates the face region and the lower frame indicates the cut-out portion. The HSV values of this portion are obtained, and histograms of hue (H) and saturation (S) are computed. FIG. 5(c) shows the hue (H) histogram and FIG. 5(d) shows the saturation (S) histogram. The peak of each histogram is found, and a range of ±10 centered on the peak is taken (the framed portions in FIG. 5(c) and (d)). By extracting the pixels having this color information from the whole image and applying noise removal and similar processing, the user's upper body can be extracted. FIG. 5(b) shows the upper body region extracted in this way from the color image of FIG. 5(a).
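The histogram-peak selection can be sketched as below. The sketch assumes hue and saturation are already quantized to integers, that the ±10 is expressed in those integer units, and that hue wraparound can be ignored; none of these details are stated in the description, and the function names are illustrative.

```python
from collections import Counter


def peak_range(values, half_width=10):
    """Return (low, high) centered on the most frequent value."""
    counts = Counter(values)
    # On ties, prefer the smallest value so the result is deterministic.
    peak = max(counts, key=lambda v: (counts[v], -v))
    return (peak - half_width, peak + half_width)


def torso_color_ranges(patch_hs, half_width=10):
    """Compute hue and saturation ranges from the (h, s) pixels of the torso patch."""
    hue_range = peak_range([h for h, _ in patch_hs], half_width)
    sat_range = peak_range([s for _, s in patch_hs], half_width)
    return hue_range, sat_range


def matches(pixel_hs, hue_range, sat_range):
    """True if an (h, s) pixel falls inside both ranges."""
    h, s = pixel_hs
    return hue_range[0] <= h <= hue_range[1] and sat_range[0] <= s <= sat_range[1]
```

Applying `matches` to every pixel of the full image yields the candidate upper body pixels, to which noise removal would then be applied.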

Next, the processing for extracting the user's hand is described.
Because the user's hand is connected to the arm, the user's upper body region should be present around the hand. Among the face candidate regions obtained by the processing described above, the regions other than the face region are taken as hand candidate regions. The user's hand is extracted according to whether the user's upper body region (chest, arm, and so on) falls within the hand candidate region. FIG. 6(b) shows the result of extracting the face and hands from the color image shown in FIG. 6(a). In FIG. 6(a), the white frames indicate the hand candidate regions. In FIG. 6(b), the topmost white region is the user's face, the lower left region is the user's right hand, and the lower right region is the user's left hand.

Next, the processing for recognizing the gesture of the detected hand is described. When gestures that indicate a direction by hand movement were actually captured as moving images and the hand movement was examined, the hand was found to move faster in the indicated direction than on the return stroke. In this example, gesture recognition is therefore performed using the movement vector of the hand. First, the three-dimensional coordinates of the hand obtained by hand extraction are determined using the range image. The difference between the hand coordinates in the previous frame and those in the current frame is defined as the movement vector. The pointing gesture is recognized by examining the movement vectors frame by frame while the gesture is being made.

FIG. 7 is a table of the movement vectors for two gestures (pointing right and pointing left), separated by the dotted line at the center. As FIG. 7 shows, for the gesture indicating "right" (the first gesture), the average of the rightward movement vectors (68) is larger than the average of the leftward movement vectors (53).
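The comparison of movement vectors can be illustrated for the left/right case as follows. This is a sketch under assumptions: positive x is taken to mean "right", the average is taken over the per-frame displacement magnitudes in each direction, and the function names are illustrative. The description does not specify how the averages in FIG. 7 were computed.

```python
def movement_vectors(positions):
    """Per-frame displacement between consecutive 3-D hand positions."""
    return [
        tuple(cur_c - prev_c for prev_c, cur_c in zip(prev, cur))
        for prev, cur in zip(positions, positions[1:])
    ]


def left_or_right(positions):
    """Return "right", "left", or None by comparing mean speeds along x.

    The indicated direction is assumed to be the one in which the hand moves
    faster, since the return stroke is slower than the pointing stroke.
    """
    vectors = movement_vectors(positions)
    right = [dx for dx, _, _ in vectors if dx > 0]
    left = [-dx for dx, _, _ in vectors if dx < 0]
    mean_right = sum(right) / len(right) if right else 0.0
    mean_left = sum(left) / len(left) if left else 0.0
    if mean_right == mean_left:
        return None
    return "right" if mean_right > mean_left else "left"
```

The same comparison could be applied along the other two axes to cover the up/down and forward/backward gestures.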

Next, an example of the processing for estimating the sound source position from the audio data captured by the microphones is described. Here, the sound source position is estimated using the CSP (Cross-power Spectrum Phase) method. A delay time map is created in advance for each pair of microphones, and the sound source likelihood map is obtained by computing the CSP coefficients.

Then the sound source estimation range is obtained. FIG. 8 shows an example of how the sound source estimation range is derived. FIG. 8(a) is a side view of the camera's imaging range. As shown in FIG. 8(b), the space obtained by rotating this imaging range through 360 degrees about the camera is taken as the sound source position estimation range. FIG. 8(c) shows this space unfolded, and the unfolded range is used to compute the likelihood map for sound source position estimation.

A delay time map is created using the data of this sound source position estimation range. The delay time map holds the maximum and minimum values of the delay sample count n(p) within the sound source position estimation range. The delay sample count n(p) is obtained using Equation (2).

Here, F is the sampling frequency of the sound signal, p is the three-dimensional coordinates of a point P in real space, Ma and Mb are the three-dimensional coordinates of the two microphones, and c is the speed of sound.
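Equation (2) is not reproduced in this text. With the symbols defined above, the usual time-difference-of-arrival expression would be n(p) = F (|p - Ma| - |p - Mb|) / c. The sketch below assumes that form; the sign convention, the absence of rounding, and the default speed of sound of 340 m/s are assumptions, not values taken from the description.

```python
import math


def delay_samples(p, mic_a, mic_b, sample_rate_hz, sound_speed=340.0):
    """Delay between two microphones, in samples, for a source at point p.

    Assumed form: n(p) = F * (|p - Ma| - |p - Mb|) / c.
    Coordinates are in meters; a positive result means the sound reaches
    mic_a later than mic_b under this sign convention.
    """
    path_difference = math.dist(p, mic_a) - math.dist(p, mic_b)
    return sample_rate_hz * path_difference / sound_speed
```

Evaluating this function over the points that map to one pixel of the unfolded estimation range would give the minimum and maximum delay stored in the delay time map.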

For each pixel (i, j) of the image obtained by unfolding the sound source position estimation range, the maximum value n_max(p) and the minimum value n_min(p) of the delay sample count n(p) are obtained. The CSP coefficients are then calculated, and the value of the CSP coefficient between the minimum n_min(p) and the maximum n_max(p) is taken as the sound source position likelihood of pixel (i, j).
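For reference, CSP coefficients are commonly computed as the inverse transform of the phase-normalized cross-power spectrum of the two microphone signals. The sketch below uses a naive DFT so that it needs only the standard library; the lag convention, the small `eps` regularizer, and the function names are assumptions for illustration, and a practical implementation would use an FFT.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def csp_coefficients(sig_a, sig_b, eps=1e-12):
    """CSP coefficients of two equal-length frames.

    Index k of the result corresponds to sig_b lagging sig_a by k samples
    (circularly), so the peak index is the estimated delay.
    """
    n = len(sig_a)
    spec_a, spec_b = dft(sig_a), dft(sig_b)
    # Cross-power spectrum, then keep only the phase (whitening).
    cross = [b * a.conjugate() for a, b in zip(spec_a, spec_b)]
    whitened = [c / (abs(c) + eps) for c in cross]
    # Inverse DFT; the result is real for real-valued inputs.
    return [
        sum(whitened[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]
```

The likelihood of a pixel would then be read from these coefficients at the lags lying between that pixel's n_min(p) and n_max(p).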

In this example, five microphones are used, giving ₅C₂ = 10 microphone pairs. The sound source position is estimated by synchronously adding the sound source likelihood maps of these 10 microphone pairs.
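The pair enumeration and map addition can be sketched as follows. The representation of a likelihood map as a list of rows and the choice of the peak cell as the estimated source position are assumptions made for the example.

```python
from itertools import combinations


def microphone_pairs(mic_ids):
    """All unordered pairs of microphones (5 microphones give 10 pairs)."""
    return list(combinations(mic_ids, 2))


def sum_likelihood_maps(maps):
    """Add equally sized likelihood maps cell by cell."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def peak_cell(summed):
    """Return the (row, col) of the largest summed likelihood."""
    cells = ((r, c) for r in range(len(summed)) for c in range(len(summed[0])))
    return max(cells, key=lambda rc: summed[rc[0]][rc[1]])
```

Summing across pairs reinforces the cell where the per-pair likelihoods agree, which is why the combined map is sharper than any single pair's map.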

Next, the calculation of the distance to a face detected in the image is described. In this example, a stereo camera is used to acquire distance information. Because a stereo camera allows the distance to each pixel in the image to be calculated, the average distance of the pixels in the face region is defined as the distance between the robot and the user. Distances are handled using the proxemics classification into intimate distance (up to about 50 cm), personal distance (about 50 to 120 cm), social distance (about 120 to 360 cm), and public distance (360 cm or more). Specifically, since a user can be said to be more intimate with the robot the closer the user is to it, the value of P_d, which represents the degree of intimacy due to distance, is made larger as the distance between the user and the robot becomes shorter. For example, the intimacy value (evaluation value) for the user is calculated from the distance using the relationship between distance and intimacy shown in FIG. 9.
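The distance handling can be sketched as below. The zone boundaries follow the classification in the description, with the handling of the exact boundary values being an assumption. The P_d values per zone are purely hypothetical placeholders that only preserve the "closer means larger" ordering, since the actual relationship is given in FIG. 9, which is not reproduced here.

```python
def mean_face_distance(face_pixel_distances_cm):
    """Average distance of the face-region pixels, used as the user distance."""
    return sum(face_pixel_distances_cm) / len(face_pixel_distances_cm)


def proxemic_zone(distance_cm):
    """Classify a distance into the four proxemic zones."""
    if distance_cm < 50:
        return "intimate"
    if distance_cm < 120:
        return "personal"
    if distance_cm < 360:
        return "social"
    return "public"


# Hypothetical values only; the real mapping is the curve in FIG. 9.
HYPOTHETICAL_PD = {"intimate": 1.0, "personal": 0.7, "social": 0.4, "public": 0.1}


def distance_intimacy(distance_cm):
    """Return a placeholder P_d that decreases as the user moves away."""
    return HYPOTHETICAL_PD[proxemic_zone(distance_cm)]
```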

Next, the priority of each user is calculated from the intimacy values calculated for each modality. Calculating the priority requires observing each user's behavior. Here, one observation interval consists of nFrame = 40 observation frames (at about 5 frames per second). The number of observation frames is not limited to 40, however.

First, the appearance frequency of a frontal face is determined. A face is extracted when the user's face is turned toward the robot. The number of frames in the observation interval in which the face was extracted (nFrontFace) is counted, and the appearance frequency P_f of the frontal face within the observation interval is calculated by Equation (3).
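Equation (3) is not reproduced in this text. A natural reading, by analogy with the ratio nGesture/nFrame used for gestures, is P_f = nFrontFace / nFrame; the sketch below assumes that form, and the function name is illustrative.

```python
def front_face_frequency(face_detected_per_frame):
    """Assumed form of P_f: fraction of frames in which a face was extracted.

    `face_detected_per_frame` holds one boolean per frame of the observation
    interval (40 frames in the example of the description).
    """
    n_frame = len(face_detected_per_frame)
    if n_frame == 0:
        raise ValueError("observation interval is empty")
    n_front_face = sum(1 for detected in face_detected_per_frame if detected)
    return n_front_face / n_frame
```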

Next, the appearance frequency of gestures is obtained using Equation (4). Within the observation interval, the type of gesture being made and the number of frames in which the gesture is made (nGesture) are determined, and the appearance frequency of the gesture is calculated from them.

Here, nGesture/nFrame is the proportion of frames in which a gesture is being made. In communication, hand movement in the front-back direction indicates the strongest interest in the robot, so the coefficient for the gesture type (Kind) is set to 0.7 for that movement. The value of Kind is, for example, 0.5 for reciprocating movements other than front-back and 0.0 for other movements.
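Equation (4) is not reproduced in this text. The sketch below assumes that the gesture frequency is the product of the type coefficient and the frame ratio, P_g = Kind × nGesture / nFrame. The Kind values are those given in the description; the product form and the dictionary keys are assumptions.

```python
# Kind coefficients from the description; the key names are illustrative.
KIND = {
    "front_back": 0.7,           # back-and-forth hand movement
    "other_reciprocating": 0.5,  # reciprocating movement in another direction
    "other": 0.0,                # any other movement
}


def gesture_frequency(gesture_type, n_gesture, n_frame=40):
    """Assumed form of P_g: Kind * nGesture / nFrame."""
    return KIND[gesture_type] * n_gesture / n_frame
```

Under this assumed form, a front-back gesture held for 20 of 40 frames would give P_g = 0.35.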

Next, the appearance frequency of conversation is described. After the sound source position has been estimated by the processing described above, the number of frames in the observation interval in which the sound source position falls within the user's face region (nVoice) is counted. Here, with a view to adding a speech recognition function and varying the appearance frequency of conversation (P_v) according to the content of the speech, the appearance frequency of conversation is calculated using Equation (5). nVoice/nFrame is the proportion of frames in which the user is speaking.

According to experiments by the inventors, when a user addresses the robot apparatus with a phrase such as "Come here," the speech spans only about 10 frames, so its proportion of the observation interval (nVoice/nFrame) is small. Ia is therefore set to 0.5 to increase Pv.
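The counting of nVoice described above can be sketched as follows. This is only an illustration under stated assumptions: the estimated sound source position is assumed to have already been mapped into image coordinates, the face region is assumed to be an axis-aligned box, and the type aliases and function name are hypothetical. Expression (5) is not reproduced in this text, so the sketch stops at nVoice and does not compute Pv.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representations: a face box as (x_min, y_min, x_max, y_max)
# and a sound source position as (x, y), both in image coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def count_voice_frames(
    sources: Sequence[Optional[Point]],
    faces: Sequence[Optional[Box]],
) -> int:
    """Count frames whose sound source position lies inside the user's face region.

    A None entry means no sound source (or no face) was found in that frame.
    """
    n_voice = 0
    for src, face in zip(sources, faces):
        if src is None or face is None:
            continue
        x, y = src
        x_min, y_min, x_max, y_max = face
        if x_min <= x <= x_max and y_min <= y <= y_max:
            n_voice += 1
    return n_voice


# Illustrative data: four frames, one user whose face stays in the same box.
face_box = (0.0, 0.0, 10.0, 10.0)
sources = [(5.0, 5.0), None, (20.0, 5.0), (6.0, 4.0)]
n_voice = count_voice_frames(sources, [face_box] * 4)
voice_ratio = n_voice / len(sources)  # nVoice / nFrame
```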

These values are used to calculate a priority for each user. That is, for each observation interval the priority is calculated from Pf, Pg, Pv, and Pd by Expression (1) given above. Considering how important face orientation is in communication, the weight Wf applied to Pf needs to be large. Because the information people emphasize differs from person to person, changing the values of the weights Wi lets the robot apparatus behave with individuality. Here, sound information and gesture information are treated as equally important, and the weights Wf, Wg, Wv, and Wd are set to 0.5, 0.2, 0.2, and 0.1, respectively. Calculating the priority in this way makes it possible to select the user who is most interested in the robot apparatus 10.
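A minimal sketch of this weighted priority calculation is given below. It assumes that Expression (1) is the weighted sum Wf·Pf + Wg·Pg + Wv·Pv + Wd·Pd, which is consistent with the abstract and the claims; the function names, dictionary keys, and the per-user evaluation values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Mapping

# Weights quoted in the description: Wf, Wg, Wv, Wd.
WEIGHTS = {"face": 0.5, "gesture": 0.2, "voice": 0.2, "distance": 0.1}


def priority(evaluation: Mapping[str, float],
             weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted sum of the evaluation values Pf, Pg, Pv, Pd (assumed form of Expression (1))."""
    return sum(weights[name] * evaluation[name] for name in weights)


def select_user(evaluations: Mapping[str, Mapping[str, float]],
                weights: Mapping[str, float] = WEIGHTS) -> str:
    """Return the user whose priority is highest in this observation interval."""
    return max(evaluations, key=lambda user: priority(evaluations[user], weights))


# Made-up evaluation values for two users in one observation interval.
evaluations: Dict[str, Dict[str, float]] = {
    "U1": {"face": 0.9, "gesture": 0.0, "voice": 0.0, "distance": 0.5},
    "U2": {"face": 0.8, "gesture": 0.35, "voice": 0.6, "distance": 0.5},
}
chosen = select_user(evaluations)
```

Passing a different `weights` mapping, for example one with a larger gesture weight, is one way to model the individuality of behavior that the description attributes to changing Wi.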

Next, FIG. 10 shows an example of how the priorities changed in an experiment conducted with three users actually present around the robot apparatus. The vertical axis is the priority and the horizontal axis is the image frame number; the figure shows how the states of the three users U1, U2, and U3 change. In this example, user U1 is merely looking at the robot, so U1's priority changes gradually and its value is low (section I in FIG. 10). User U2 is gesturing while talking to the robot, so U2's priority changes sharply and its value is high (section II in FIG. 10). User U3 is only talking while looking at the robot apparatus, so U3's priority changes less sharply and its value is not as high (section III in FIG. 10). Because U2's priority is the highest, the robot then turns toward U2 and looks at U2. If U2 is by then talking while looking at U1, that is, no longer facing the robot, U2's priority drops (section IV in FIG. 10). The robot apparatus then turns toward U3, who now has the highest priority, and calculates U3's priority. U3 is still talking to the robot with interest, so U3's priority remains high (section V in FIG. 10). At this point U3's priority is high, and the robot communicates with U3.

As described above, according to the present embodiment, a priority expressing the closeness between a user and the robot apparatus can be defined by integrating multimodal information such as the user's face orientation, gestures, voice, and distance. By calculating this priority for each user, the robot can communicate preferentially with the user who shows the greatest interest in it among multiple users.

Note that the information people emphasize when communicating differs from person to person. In the present embodiment, gesture information and voice information are given the same weight, but if a given user places more importance on gesture information, the behavior of the robot apparatus can be given individuality by, for example, increasing the weight of the gesture information.

A block diagram showing a configuration example of the robot apparatus according to an embodiment of the present invention.
A flowchart showing the overall processing flow according to an embodiment of the present invention.
An explanatory diagram showing an example of face image extraction processing according to an embodiment of the present invention.
An explanatory diagram showing an example of the hand or arm detection range according to an embodiment of the present invention.
An explanatory diagram showing an example of upper-body extraction processing according to an embodiment of the present invention.
An explanatory diagram showing an example of hand image extraction according to an embodiment of the present invention.
A waveform diagram showing an example of the temporal change of the movement vector according to an embodiment of the present invention.
An explanatory diagram showing an example of sound source position estimation according to an embodiment of the present invention.
A characteristic diagram showing an example of intimacy values as a function of distance according to an embodiment of the present invention.
An explanatory diagram showing an example of the temporal change of the priority according to an embodiment of the present invention.

Explanation of symbols

10 ... robot apparatus, 11 ... camera, 12 ... image recognition unit, 13 ... recognition data processing unit, 14 ... priority determination unit, 21a to 21n ... microphones, 22a to 22n ... analog/digital converters, 23 ... digital audio processing unit, 24 ... sound source identification unit, 31 ... robot operation control unit, 32 ... rotation drive unit

Claims

A robot apparatus comprising:
photographing means for photographing the surroundings;
image recognition means for detecting a human face and a hand or arm from the image photographed by the photographing means;
evaluation value calculation means for obtaining an evaluation value for each state from the state of the detected face and the state of the detected hand or arm; and
specifying means for multiplying the evaluation value of each state by a weighting coefficient determined in advance for each type of evaluation value, and specifying a person with high priority from the sum of the evaluation values multiplied by the respective coefficients,
wherein the robot apparatus communicates with the person specified by the specifying means.

The robot apparatus according to claim 1, wherein
the evaluation value calculation means calculates an evaluation value for the face orientation detected by the image recognition means and an evaluation value for the movement of the hand or arm detected by the image recognition means.

The robot apparatus according to claim 2, wherein
the image recognition means detects the hand or arm by detecting a portion in a predetermined color state within a range in which the hand or arm may exist, the range being estimated from the position at which the human face was detected.

The robot apparatus according to claim 1,
further comprising measurement means for measuring the distance to the person whose face has been recognized by the image recognition means, wherein
the evaluation value calculation means calculates an evaluation value from the distance measured by the measurement means.

The robot apparatus according to claim 1, further comprising:
sound collecting means for collecting surrounding sounds; and
sound source identifying means for identifying the position of a sound source from the sound data collected by the sound collecting means, and obtaining the correspondence between the identified sound source position and the person recognized by the image recognition means,
wherein the evaluation value calculation means calculates an evaluation value relating to the human voice data obtained by the sound source identifying means.

A communication method for a robot apparatus, for carrying out communication between the robot apparatus and people around the robot apparatus, the method comprising:
photographing the surroundings of the robot apparatus;
detecting a human face and a hand or arm from the photographed image;
obtaining an evaluation value for each state from the state of the detected face and the state of the detected hand or arm;
multiplying the evaluation value of each state by a weighting coefficient determined in advance for each type of evaluation value, and specifying a person with high priority from the sum of the evaluation values multiplied by the respective coefficients; and
communicating with the specified person.