JP2006263348A

JP2006263348A - Device, method, and program for identifying user

Info

Publication number: JP2006263348A
Application number: JP2005089419A
Authority: JP
Inventors: Kaoru Suzuki; 薫鈴木
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2005-03-25
Filing date: 2005-03-25
Publication date: 2006-10-05
Anticipated expiration: 2025-03-25
Also published as: JP4257308B2

Abstract

PROBLEM TO BE SOLVED: To identify a user with high accuracy using a plurality of types of biometric information, by verifying the user's identity based on whether a tracked direction and an identified direction coincide.

SOLUTION: A user identification device comprises: a microphone and an all-sky camera; a speaker identification unit 104 that identifies the user from biometric information output by the microphone and generates first biometric identification information associating user identification information with the direction in which the user is present; a face identification unit 105 that identifies the user from biometric information output by the all-sky camera and generates second biometric identification information associating the user identification information with the direction in which the user is present; a region tracking unit 106 that detects, from a newly input image, a region approximating an image feature region in the image in the direction indicated by the first biometric identification information, sets it as the image feature region, and obtains its detection direction; and an identity verification unit 110 that determines that the user of the first biometric identification information is the same as the user of the second biometric identification information when the difference between the detection direction of the image feature region and the direction of the second biometric identification information, acquired after the first biometric identification information, is not more than a prescribed threshold.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a user identification device, a user identification method, and a user identification program that identify a user based on the user's biometric information, obtained by detecting the user, and a biometric dictionary in which biometric information is registered for each user.

In recent years, various human-coexisting robot technologies, in which a robot shares an activity space with humans, have been researched and developed. For such a robot, accurately identifying and authenticating who the user is has become an important issue. For this kind of user identification, user authentication based on speaker identification and face identification is promising, because these techniques identify an individual from biometric information (biometrics), such as the user's voice or face, that can be acquired from a relatively distant position.

The speaker identification method works as follows. Voiceprint patterns of the actual users' voices are acquired in advance, and a voiceprint dictionary associating a voiceprint pattern with each user is generated. The voiceprint pattern of an input voice is then collated with the voiceprint patterns registered in the dictionary to obtain a similarity indicating the degree to which the patterns match. If a registered voiceprint pattern exists whose similarity is the maximum and is at or above a predetermined threshold, the input voice is judged to be the voice of the user corresponding to that pattern.

The face identification method is analogous. Face patterns of the actual users are acquired in advance, and a face dictionary associating a face pattern with each user is generated. The face pattern of an input face image is collated with the face patterns registered in the dictionary to obtain a similarity, and if a registered face pattern exists whose similarity is the maximum and is at or above a predetermined threshold, the input face image is judged to be the face image of the user corresponding to that pattern.
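The dictionary-matching rule shared by the two methods can be sketched as follows. This is only an illustration: the source does not say how patterns are represented or how similarity is computed, so the feature vectors, the cosine similarity measure, and the names `cosine_similarity` and `identify` are assumptions made for the example.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [-1, 1]; an assumed stand-in for the unspecified measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(pattern: Sequence[float],
             dictionary: Dict[str, Sequence[float]],
             threshold: float) -> Optional[str]:
    """Return the user whose registered pattern is most similar to `pattern`,
    provided that maximum similarity is at or above `threshold`; else None."""
    best_user, best_sim = None, float("-inf")
    for user, registered in dictionary.items():
        sim = cosine_similarity(pattern, registered)
        if sim > best_sim:
            best_user, best_sim = user, sim
    return best_user if best_sim >= threshold else None


# Hypothetical two-user dictionary with toy 2-D patterns.
voiceprints = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
print(identify([0.9, 0.1], voiceprints, threshold=0.8))
print(identify([1.0, 1.0], voiceprints, threshold=0.9))
```

The same function would serve for a face dictionary, since the text describes the two methods as differing only in the kind of pattern being collated.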

One example of user identification using such biometric information is the technique disclosed in Patent Document 1. In that technique, a personal authentication system receives multiple types of biometric information, such as a voiceprint pattern and a face pattern, and identifies each type separately to obtain a similarity. These similarities are collected continuously in time series, and from the series collected for each type of biometric information (voiceprint, face, and so on), a final similarity is determined for each type of input biometric information. The final similarity is determined either by selecting the maximum of the time-series similarities or by calculating their average. The personal authentication system then authenticates the user by comprehensively evaluating the final similarities determined for each type of biometric information.
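As a minimal sketch of the aggregation step attributed to Patent Document 1, the function below reduces one time series of similarities to a final similarity by either the maximum or the average. The function name, the `mode` parameter, and the sample values are hypothetical; how the per-type results are then combined is not described in this text and is not modelled here.

```python
from statistics import mean
from typing import Sequence


def final_similarity(series: Sequence[float], mode: str = "max") -> float:
    """Collapse a time series of similarities for one biometric type."""
    if not series:
        raise ValueError("no similarities were collected")
    if mode == "max":
        return max(series)
    if mode == "mean":
        return mean(series)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical similarity series for two biometric types.
voice_series = [0.2, 0.9, 0.5]
face_series = [0.6, 0.8]
print(final_similarity(voice_series, "max"), final_similarity(face_series, "mean"))
```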

Because the technique of Patent Document 1 uses time-series similarities for user identification, it appears to presuppose that multiple pieces of biometric information input substantially simultaneously, within the period over which the time-series similarities are output, belong to the same user.

[Patent Document 1] JP 2000-148985 A

However, treating such a short interval as substantially simultaneous, and judging that multiple pieces of biometric information input during it belong to the same user, raises the following problems.

First, each type of biometric information can be acquired from a user only at certain times. For example, a voiceprint pattern can be acquired only while the user is speaking, and a face pattern can be acquired only while the front of the user's face is turned toward the camera. Multiple types of biometric information therefore cannot always be acquired at substantially the same time: they may be obtainable only at different times, or only one of them may be obtainable at all.

Consequently, if a fixed short period is regarded as substantially simultaneous, as in Patent Document 1, and the multiple types of biometric information acquired within that period are judged to belong to the same user, errors can occur in verifying the user's identity. For example, even when a face image and a voice are acquired substantially simultaneously, the user who spoke may differ from the user in the face image, yet the technique of Patent Document 1 may judge them to be the same person. Conversely, when a face image and a voice are acquired further apart than the period regarded as substantially simultaneous, the user in the face image and the user who spoke may in fact be the same person, but because identity is judged solely on approximate simultaneity, they may be judged to be different users.

In user identification technology, the biometric dictionaries used for user authentication (such as a voiceprint dictionary and a face dictionary) are often generated at the time the user identification system starts operating for a given user. A biometric dictionary generated at one point in time can be regarded as having a limited period of validity, because of subsequent changes in the noise and lighting environments and changes over time in the user's own voice and face. Once the biometric information has expired, the similarity computed against the dictionary does not become sufficiently high even for the genuine user, and identification accuracy drops. For this reason, dictionary maintenance is commonly performed: when biometric information has expired, an administrator or other person with system management authority explicitly instructs the system to relearn that biometric information into the biometric dictionary.

In a user identification system with such a dictionary maintenance function, authenticating a person from biometric information acquired substantially simultaneously, as in Patent Document 1, causes a further problem. Suppose that, while the face pattern of user A is being identified as that of user A, user B, who happens to be nearby and is not registered in the biometric dictionary, speaks, and the system fails to identify the speaker. The system may wrongly conclude from this failure that user A's voiceprint dictionary has expired, and may update the biometric dictionary by learning the voiceprint pattern of the entirely different user B as the voice of user A, who was identified by face pattern.

The present invention has been made in view of the above. Its object is to provide a user identification device, a user identification method, and a user identification program that can identify a user with high accuracy using multiple types of biometric information, by verifying the user's identity based on whether the tracked direction and the identified direction coincide.

To solve the problems described above and achieve the object, the present invention is a user identification device comprising: detection means for detecting a user's voice and outputting the detected voice as biometric information of the user; imaging means for capturing an image of the user and outputting the captured image as biometric information of the user; biometric dictionary storage means for storing a plurality of biometric dictionaries, one for each type of biometric information output by the imaging means and the detection means, each associating user identification information with the user's biometric information; identification means for identifying the user from the biometric information output by the detection means, based on that biometric information and the corresponding biometric dictionary, and generating biometric identification information that associates the user's identification information with the direction in which the user is present, and for identifying the user from the biometric information output by the imaging means, based on that biometric information and the corresponding biometric dictionary, and generating biometric identification information that associates the user's identification information with the direction in which the user is present; region tracking means for detecting, from an image newly input by the imaging means, a region that approximates an image feature region, which is a characteristic region of the image in the direction indicated by first biometric identification information generated by the identification means and acquired earlier, setting the detected region as the new image feature region, and obtaining the detection direction of the newly set image feature region; and identity verification means for determining whether the difference between the detection direction of the image feature region obtained by the region tracking means and the direction of second biometric identification information, acquired after the first biometric identification information, is equal to or less than a predetermined threshold, and, when the difference is equal to or less than the threshold, determining that the user identified by the first biometric identification information and the user identified by the second biometric identification information are the same.
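The decision made by the identity verification means can be sketched as a threshold test on a direction difference. The text speaks only of "directions", so representing each direction as a single azimuth angle in degrees, and the names `angular_difference` and `same_user`, are assumptions for illustration.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0..180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def same_user(tracked_direction_deg: float,
              second_direction_deg: float,
              threshold_deg: float) -> bool:
    """True when the tracked image feature region and the later (second)
    biometric identification result lie in nearly the same direction."""
    return angular_difference(tracked_direction_deg, second_direction_deg) <= threshold_deg


# A region first located by one modality is tracked to 350 degrees; a later
# identification by the other modality reports 10 degrees (hypothetical values).
print(same_user(350.0, 10.0, threshold_deg=25.0))
print(same_user(90.0, 180.0, threshold_deg=25.0))
```

Taking the difference modulo 360 degrees matters here because the camera described in this document covers the full surroundings of the robot, so directions near 0 and 360 degrees are adjacent.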

The present invention is also a user identification method and a user identification program executed by the above user identification device.

According to the present invention, a user is identified based on the biometric information output by the detection means and the biometric dictionary corresponding to it, and biometric identification information associating the user's identification information with the direction in which the user is present is generated. Separately, a user is identified based on the biometric information output by the imaging means and the biometric dictionary corresponding to it, and biometric identification information associating the user's identification information with the direction in which the user is present is likewise generated. In other words, the direction in which the user is present is obtained independently from the biometric information of the detection means and from that of the imaging means. A region approximating the image feature region, which is a characteristic region of the image in the direction indicated by the first biometric identification information acquired earlier, is then tracked, and it is determined whether the difference between the detection direction of the image feature region and the direction of the second biometric identification information, acquired after the first, is equal to or less than a predetermined threshold. When the difference is equal to or less than the threshold, the user identified by the first biometric identification information and the user identified by the second biometric identification information are determined to be the same. Therefore, when user identification uses multiple types of biometric information, the user's identity can be verified from the agreement of the user directions obtained from the separate types of biometric information, and identity can be verified with high accuracy even when a different type of biometric information is acquired from the user after some time has elapsed.

Preferred embodiments of the user identification device, user identification method, and user identification program according to the present invention are described in detail below with reference to the accompanying drawings. In the present embodiment, the user identification device of the present invention is applied to an autonomous mobile robot.

(Configuration of the autonomous mobile robot)
FIG. 1 is a block diagram showing the functional configuration of the autonomous mobile robot according to the present embodiment.

As shown in FIG. 1, the autonomous mobile robot according to the present embodiment mainly comprises three microphones 206, 207, and 208, an all-sky camera 205, an acoustic directivity forming unit 101, a vocabulary identification unit 103, a speaker identification unit 104, a face identification unit 105, a region tracking unit 106, a user authentication unit 107, a dictionary update unit 108, a service providing unit 109, an identity verification unit 110, and a biometric dictionary storage unit 120. In addition to the units shown in FIG. 1, the robot includes parts that perform functions other than user identification, such as a control unit that performs movement control.

Here, the microphones 206, 207, and 208 correspond to the detection means of the present invention, and the all-sky camera 205 corresponds to the imaging means. The speaker identification unit 104 and the face identification unit 105 correspond to the identification means.

The microphones 206, 207, and 208 receive the user's voice. The all-sky camera 205 is a camera with a fisheye lens that has an upward-facing 180-degree field of view, and it can image the full 360 degrees around the robot body, including the user.

FIG. 2 is a schematic diagram showing the external appearance of the autonomous mobile robot of the present embodiment. As shown in FIG. 2, the robot comprises a main body 201, together with drive wheels 202 and 203 and an auxiliary wheel (not shown) mounted on the main body 201 for movement, and is configured to move freely over the floor of the user's residence.

Of the microphones described above, the two microphones 206 and 207 are arranged horizontally as a left-right pair on the front of the main body 201, and the microphone 208 is arranged on the back of the main body 201 so as to lie in substantially the same horizontal plane as the microphones 206 and 207. The directivity of the user's voice input through these three spatially separated microphones 206, 207, and 208 can be set freely, in multiple directions horizontally and vertically, by the plurality of acoustic directivity forming modules 102 of the acoustic directivity forming unit 101 described later.

The all-sky camera 205 described above is mounted on top of the main body 201 of the autonomous mobile robot. It is a fisheye-lens camera 315 with an upward-facing 180-degree field of view and can acquire an all-sky image 318 with a circular field of view. The all-sky camera 205 images its surroundings at fixed time intervals and outputs the captured image frames.

FIG. 3-1 is a schematic diagram showing the arrangement of the all-sky camera 205 imaging two users, and FIG. 3-2 is a schematic diagram showing the all-sky image obtained when the all-sky camera 205 images the two users arranged as in FIG. 3-1. When the users 316 and 317 imaged by the all-sky camera 205 are at the positions shown in FIG. 3-1, they are projected into the all-sky image 318 as shown in FIG. 3-2.

Returning to FIG. 1, the acoustic directivity forming unit 101 separates and extracts, direction by direction, the sound arriving from various directions around the autonomous mobile robot, and outputs directional voice information consisting of the voice data of a voice section and its direction of arrival. It comprises a plurality of acoustic directivity forming modules 102, each of which individually and selectively passes sound arriving from one of the surrounding directions, and a directivity setting value storage unit (not shown) that stores the directions of arrival. The sound separated and extracted by the acoustic directivity forming unit 101 includes users' voices and environmental noise.

The acoustic directivity forming unit 101 evaluates the intensity of each separated and extracted sound signal. It defines a "voice section" as running from a time a predetermined period before the intensity becomes equal to or greater than a first threshold until a time a predetermined period after the intensity falls below a second threshold, and it extracts only that section from the sound signal as the voice data of the voice section. It then associates direction information indicating the direction of arrival with the voice data of the voice section and outputs them as directional voice information. Detecting voice sections by intensity has the effect of excluding sound that contains only weak environmental noise arriving from far away, so that later stages do not process it needlessly.
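The two-threshold voice section rule can be sketched on a per-frame intensity sequence. Frame-based processing, the half-open `(start, end)` index convention, the clamping of a section start to the end of the previous section, and all parameter names are assumptions; the source gives no concrete threshold or period values.

```python
from typing import List, Sequence, Tuple


def detect_voice_sections(intensity: Sequence[float],
                          on_threshold: float,
                          off_threshold: float,
                          pre_frames: int,
                          post_frames: int) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) of detected voice sections.

    A section starts `pre_frames` before the intensity first reaches
    `on_threshold` and ends `post_frames` after it drops below `off_threshold`.
    """
    n = len(intensity)
    sections: List[Tuple[int, int]] = []
    prev_end = 0
    i = 0
    while i < n:
        if intensity[i] >= on_threshold:
            start = max(prev_end, i - pre_frames)   # look back, without overlap
            j = i
            while j < n and intensity[j] >= off_threshold:
                j += 1                              # stay inside while loud enough
            end = min(n, j + post_frames)           # hang-over after the drop
            sections.append((start, end))
            prev_end = end
            i = max(end, i + 1)                     # always make progress
        else:
            i += 1
    return sections


frames = [0, 0, 0, 5, 6, 5, 1, 0, 0, 0]  # hypothetical intensities
print(detect_voice_sections(frames, on_threshold=4, off_threshold=2,
                            pre_frames=1, post_frames=2))
```

Using a lower second threshold than the first gives hysteresis, so brief dips in intensity during an utterance do not split it into several sections.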

FIG. 4 is a block diagram showing the configuration of the acoustic directivity forming unit 101. The unit comprises first-stage beamformers 410, 420, and 430, which receive the sound from the microphones 206 and 207; second-stage beamformers 411, 412, 413, ..., 421, 422, 423, ..., 431, 432, 433, ..., which receive the sound from the microphone 208 and the outputs of the first-stage beamformers 410, 420, and 430; and a directivity setting value storage unit 440. As shown in FIG. 4, the second-stage beamformers 411, 412, 413, ... receive the output of the first-stage beamformer 410 and the sound from the microphone 208; the second-stage beamformers 421, 422, 423, ... receive the output of the first-stage beamformer 420 and the sound from the microphone 208; and the second-stage beamformers 431, 432, 433, ... receive the output of the first-stage beamformer 430 and the sound from the microphone 208.

The directivity setting value storage unit 440, the first-stage beamformer 410, and each of the second-stage beamformers 411, 412, 413, ... corresponding to the first-stage beamformer 410 together constitute as many acoustic directivity forming modules 102 as there are second-stage beamformers 411, 412, 413, .... The same applies to the second-stage beamformers 421, 422, 423, ... corresponding to the first-stage beamformer 420, and to the second-stage beamformers 431, 432, 433, ... corresponding to the first-stage beamformer 430.

That is, one acoustic directivity forming module 102, corresponding to one region, consists of the first-stage beamformer 410, the second-stage beamformer 411, and the directivity setting value storage unit 440, while the acoustic directivity forming module 102 corresponding to another region consists of the first-stage beamformer 410, the second-stage beamformer 412, and the directivity setting value storage unit 440.

FIG. 5 is a block diagram showing the configuration of an acoustic directivity forming module 102. It shows the example of a module composed of the first-stage beamformer 410 and the second-stage beamformer 411. The acoustic directivity forming module 102 has the function of freely setting, horizontally and vertically, the directivity of the sound input from the three microphones 206, 207, and 208.

The first-stage beamformer 410 is a directivity forming means that receives the sound from the two microphones 206 and 207 arranged horizontally on the front of the autonomous mobile robot and, taking the angle θ set relative to the line segment connecting the microphone 206 and the microphone 207 as the central directivity direction, suppresses the sensitivity in directions away from that central direction. As a result, the sensitivity within a predetermined range around the central directivity direction is kept at or above a predetermined value, while the sensitivity in other directions is suppressed below that value. Various beamformer configurations have been proposed. For example, in the method known as a phased array, which computes the delay-and-sum of multiple microphone inputs, the directivity can be pointed in any direction by controlling the delay amounts. A technique that uses two Griffiths-Jim generalized sidelobe cancellers to extract the sound within a set directivity range more clearly is also known.
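The delay-and-sum idea mentioned above can be sketched for two microphones. This is not the beamformer 410 of the embodiment, whose internal design is not specified here; it is a minimal illustration assuming a plane wave, integer sample delays, a speed of sound of 343 m/s, and an angle θ measured from the axis pointing from microphone A toward microphone B. All names are hypothetical.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def steering_delay_samples(mic_spacing_m: float, theta_deg: float,
                           sample_rate_hz: float) -> int:
    """Samples by which microphone A lags microphone B for a plane wave
    arriving at angle theta from the A-to-B axis."""
    tau = mic_spacing_m * math.cos(math.radians(theta_deg)) / SPEED_OF_SOUND_M_S
    return round(tau * sample_rate_hz)


def delay_and_sum(x_a: Sequence[float], x_b: Sequence[float], delay: int) -> List[float]:
    """Average x_a with x_b delayed by `delay` samples (zeros outside range).
    Signals from the steered direction add coherently; others partly cancel."""
    out = []
    for i in range(len(x_a)):
        j = i - delay
        b = x_b[j] if 0 <= j < len(x_b) else 0.0
        out.append(0.5 * (x_a[i] + b))
    return out


# Toy source signal reaching microphone B first and microphone A two samples later.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
x_b = source
x_a = [0.0, 0.0] + source[:-2]
print(delay_and_sum(x_a, x_b, delay=2))  # steered at the source
print(delay_and_sum(x_a, x_b, delay=0))  # steered elsewhere
```

Because the delay depends only on cos θ, every direction making the same angle θ with the microphone axis receives the same treatment, so a two-microphone array cannot distinguish directions lying on the same cone around that axis.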

As illustrated in FIG. 6, the central directivity direction formed by the two microphones 206 and 207 is a conical surface 621 opened by the angle θ from the line segment connecting the microphone 206 and the microphone 207. The output therefore consists mainly of sound from sources located within a predetermined range near the conical surface 621.

However, because the directivity range formed by the beamformer 410 extends over the whole vicinity of the conical surface 621, it is too wide to tell from which direction within that range a sound has arrived. A second beamformer 411 is therefore connected after the beamformer 410. The beamformer 411 is a sidelobe canceller that receives the output of the beamformer 410 and the sound from the microphone 208 arranged on the back of the autonomous mobile robot, and forms a predetermined directivity range.

By placing the beamformer 411 after the beamformer 410 in this way and feeding it the output of the beamformer 410 and the sound from the microphone 208, the beamformer 411 behaves as if it were receiving sound from a virtual microphone 501 located midway between the two microphones 206 and 207 and sound from the rear microphone 208. The center of its directivity is a conical surface opened by a predetermined angle from the line segment connecting the virtual microphone 501 and the rear microphone 208. Because the directivity range of the sound from the virtual microphone 501 has already been narrowed by the beamformer 410, the output of the beamformer 411 is strongest for sound arriving from the direction in which the conical surface 621 intersects the conical surface opened by the predetermined angle from the line segment connecting the virtual microphone 501 and the rear microphone 208.

To obtain this effect, the autonomous mobile robot of the present embodiment arranges the three microphones 206, 207, and 208 so that they form a horizontal isosceles triangle with the microphone 208 at its apex.

FIGS. 7-1 and 7-2 are schematic diagrams showing the directivity ranges set by the front-stage beamformer and the rear-stage beamformer of the acoustic directivity forming module. FIG. 7-1 is a view from above the autonomous mobile robot, and FIG. 7-2 is a view from the horizontal direction. Reference numeral 74 denotes the front direction of the main body 201 of the robot.

The conical surface 70 (621) formed by the beamformer 410 represents the horizontal azimuth of the directivity at the height of the microphones, and the conical surface 71 formed by the rear-stage beamformer 411 has the effect of further narrowing that directivity in the vertical direction. For this reason, the directivity range of the front-stage beamformer 410 is called the horizontal directivity range 70, and the directivity range of the rear-stage beamformer 411 is called the vertical directivity range 71.

As shown in FIGS. 7-1 and 7-2, each of the horizontal directivity range 70 and the vertical directivity range 71 is a region sandwiched between two cones, and the combined directivity range 72 of the beamformer 410 and the beamformer 411 appears as the region 72 where these two ranges overlap.

That is, the directivity range in the vertical plane by the beamformer 410 is likewise a region sandwiched between two cones originating at the position of the microphone height 73 shown in FIG. 7-2. The region 72 common to the horizontal directivity range 70 and the vertical directivity range 71 is the final directivity range of the acoustic directivity forming module, which passes sound arriving from within this directivity range 72 with high sensitivity. When the distance to the sound source is sufficiently large compared with the spacing of the microphones 206, 207, and 208, the origin of the formed directivity range, that is, the apex of the cones, may be taken to be the midpoint of the line segment connecting the virtual microphone 501 and the microphone 208.

The angular width of each of the horizontal directivity range 70 and the vertical directivity range 71 is set to, for example, 20 degrees. The beamformer 410 and the beamformer 411 can thus form a narrow directivity range of, for example, ±10 degrees (that is, a 20-degree width).

The directivity ranges formed by the beamformer 410 and the beamformer 411, that is, the opening angles of the horizontal directivity range 70 and the vertical directivity range 71 and the angular width through which sound can pass, are set in the directivity range setting value storage unit 440, which can be read and written from outside.

FIG. 8-1 is an explanatory diagram showing an example arrangement of directivity ranges that cover 180 degrees in 20-degree steps. In the autonomous mobile robot of the present embodiment, nine directivity ranges formed as described above are prepared and laid out like tiles as shown in FIG. 8-1, so that the range of ±90 degrees (that is, 180 degrees) is divided into nine directivity ranges.
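The tiling described above can be sketched as a mapping from an azimuth to the index of the 20-degree range that contains it. This is only an illustration: the function names `range_index` and `range_bounds`, the 0-based numbering from −90 degrees upward, and the treatment of the +90 degree edge are assumptions, not details given in the description.

```python
def range_index(azimuth_deg, width_deg=20.0, span_deg=180.0):
    """Return the 0-based index of the directivity range containing azimuth_deg.

    Azimuths run from -span/2 to +span/2 (here -90 to +90 degrees).
    Returns None when the azimuth lies outside the covered span.
    The numbering scheme is an assumption made for this sketch.
    """
    half = span_deg / 2.0
    if not -half <= azimuth_deg <= half:
        return None
    count = int(span_deg // width_deg)          # 9 ranges for 180 / 20
    idx = int((azimuth_deg + half) // width_deg)
    return min(idx, count - 1)                  # put the +90 edge in the last range


def range_bounds(index, width_deg=20.0, span_deg=180.0):
    """Return the (lower, upper) azimuth limits of a range, in degrees."""
    lower = -span_deg / 2.0 + index * width_deg
    return lower, lower + width_deg
```

With the default values, the range straight ahead of the robot is index 4, spanning −10 to +10 degrees, which matches the ±10 degree example above.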

In this way, a single acoustic directivity forming module 102 can set a narrow directivity range at an arbitrary position on a spherical dome centered on the autonomous mobile robot. The present embodiment is equipped with a plurality of these acoustic directivity forming modules 102, each with a different directivity range, arranged like tiles covering a predetermined range in the horizontal and vertical directions. As a result, whichever direction a sound arrives from, the input sound is separated and extracted as the output of the module whose directivity range corresponds to that direction. By running all the acoustic directivity forming modules 102 simultaneously, the acoustic directivity forming unit 101 can therefore separate and output sounds individually even when they arrive from different directions at the same time.

For example, as shown in FIG. 8-1, up to N = 9 × 9 = 81 acoustic directivity forming modules 102 are prepared as combinations of nine horizontal and nine vertical ranges in 20-degree steps, which makes it possible to cover 360 degrees horizontally and 180 degrees vertically around the autonomous mobile robot. In practice, however, combinations in which the horizontal directivity range and the vertical directivity range do not intersect are excluded from the 81, so the number of valid combinations is smaller than 81. In addition, because a plurality of vertical directivity ranges are set for each horizontal directivity range, the output of the front-stage beamformer 410 can be supplied in common to a plurality of rear-stage beamformers 411, 412, 413, and so on, so the number of beamformers can be kept below twice the number of valid combinations. The configuration of the acoustic directivity forming unit 101 shown in FIG. 4 implements the number of beamformers required for the valid combinations, with some of them shared.
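The counting argument above can be illustrated with a small geometric sketch. The description does not state how non-intersecting combinations are detected or how many remain, so everything here is an assumption: each range is reduced to a thin cone at its center angle (10, 30, ..., 170 degrees from its axis), and the two cone axes are taken to be perpendicular, which follows from the isosceles microphone layout. Under those assumptions a direction lying on both cones exists only when cos²α + cos²β ≤ 1.

```python
import math


def cones_intersect(alpha_deg, beta_deg):
    """True if a unit direction can lie on both cones.

    alpha_deg: angle from the microphone 206-207 axis (front-stage cone).
    beta_deg:  angle from the virtual-microphone 501-208 axis (rear-stage cone).
    Assumes the two axes are perpendicular and the cones are infinitely thin.
    """
    ca = math.cos(math.radians(alpha_deg))
    cb = math.cos(math.radians(beta_deg))
    return ca * ca + cb * cb <= 1.0


# Hypothetical center angles of the nine 20-degree ranges on each axis.
centers = [10 + 20 * i for i in range(9)]

# Valid (horizontal, vertical) combinations under the thin-cone assumption.
valid = [(a, b) for a in centers for b in centers if cones_intersect(a, b)]

# One shared front-stage beamformer per horizontal range,
# plus one rear-stage beamformer per valid combination.
front_count = len(centers)
rear_count = len(valid)
total_beamformers = front_count + rear_count
```

Because real directivity ranges have a finite width, an actual implementation would likely keep more combinations than this thin-cone count. The point of the sketch is only that sharing each front-stage output keeps `total_beamformers` below `2 * len(valid)`.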

The outputs of the microphones 206 and 207 are supplied to the front-stage beamformer 410, which realizes the horizontal directivity. To divide the horizontal 180 degrees into nine, a total of nine front-stage beamformers are prepared, each set to a different directivity range. The figure shows three of them, the beamformers 410, 420, and 430.

The output of each of the front-stage beamformers 410, 420, and 430, which realize different horizontal directivity ranges, is supplied to a plurality of rear-stage beamformers that realize different vertical directivities. For example, the output of the beamformer 410 is supplied to the rear-stage beamformers 411, 412, 413, and so on. To divide the vertical 180 degrees into nine, up to nine rear-stage beamformers are needed to receive in common the output of one front-stage beamformer.

The autonomous mobile robot according to the present embodiment can move and turn on the floor. The directivity range of each acoustic directivity forming module 102 is set as a direction relative to the microphones 206, 207, and 208, that is, relative to the robot main body 201. When the floor is horizontal, the vertical component of the directivity range always follows a fixed reference, but the reference for the horizontal component changes as the main body 201 turns.

FIGS. 8-2 and 8-3 are explanatory diagrams showing the correction from relative azimuth to absolute azimuth in the autonomous mobile robot.

For example, as shown in FIG. 8-1, suppose that the front of the main body 201 faces in the direction of the arrow 840 and that the voice of the user 842 is detected at a horizontal azimuth of +θa from that direction, taking rightward as positive, that is, in the direction indicated by the arrow 841. Next, as shown in FIG. 8-2, suppose that the main body 201 turns by +φ, again taking rightward as positive, so that its front faces in the direction of the arrow 843. The voice of the user 842 is now observed at an azimuth of −θb measured from the front of the main body 201 with rightward as positive. As this example shows, the voice of the same user 842 is detected at a different azimuth when the main body 201 turns.

Therefore, the attitude of the main body 201 in the state shown in FIG. 8-1 is taken as the reference, at which the horizontal azimuth offset angle θo is 0. In the state shown in FIG. 8-2, the main body 201 has turned to the right by the angle φ, so the horizontal azimuth offset angle θo at this time is +φ, taking rightward as positive. The horizontal azimuth θb of the user's voice in the state of FIG. 8-2 is corrected, using this horizontal azimuth offset angle θo, to an absolute azimuth referenced to the state of FIG. 8-1. Writing the corrected θb as θb', θb' is calculated by the following equation.

θb' = θo − θb

As long as the user 842 does not move, the corrected θb' has the same value as θa regardless of how the main body 201 turns. The horizontal component of the direction information included in the directional speech information output from the acoustic directivity forming unit 101 is corrected in this way and converted into an absolute azimuth. As a result, even when the robot turns, speaker identification information, face identification information, and region tracking information obtained in different attitudes can be compared against a common reference.
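The correction can be written out as a short function. This is a minimal sketch following the sign convention of the text, in which the voice is observed at −θb after the turn, so θb is passed as a positive magnitude when the user appears to the left. The function names and the wrapping of the result into (−180, 180] degrees are assumptions added for illustration.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees into the interval (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def to_absolute_azimuth(theta_o, theta_b):
    """Apply theta_b' = theta_o - theta_b.

    theta_o: horizontal azimuth offset of the main body (right turn positive).
    theta_b: magnitude used in the text, where the voice is observed at
             -theta_b relative to the robot's current front.
    """
    return wrap_deg(theta_o - theta_b)


# Worked example with assumed numbers: the user starts at theta_a = +30 degrees.
# The robot turns right by phi = 50 degrees, so the user now appears 20 degrees
# to the left of the front, i.e. at -theta_b with theta_b = 20.
theta_a = 30.0
corrected = to_absolute_azimuth(theta_o=50.0, theta_b=20.0)
```

In this example `corrected` comes back equal to `theta_a`, which is the invariance the paragraph above describes for a user who does not move.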

Each acoustic directivity forming module 102 is assigned a region that serves as its directivity range, and each region is assigned a region number that identifies it. The output of each acoustic directivity forming module 102 therefore indicates that the speech data arrived from the region assigned to that module. Each acoustic directivity forming module 102 converts its own module number into the assigned region number by means of the direction association table described later, uses that region number as direction information, and outputs directional speech information in which the direction information is added to the speech data of the input speech segment. FIG. 9-1 is a data structure diagram of the directional speech information output from the acoustic directivity forming modules 102 of the acoustic directivity forming unit 101. The vocabulary identification unit 103 and the speaker identification unit 104, which receive this directional speech information, can obtain the arrival direction of the input speech by reading the direction information (region number) set in it.

Returning to FIG. 1, the biometric dictionary storage unit 120 is a storage medium such as an HDD (hard disk drive) and stores a voiceprint dictionary 121, a face dictionary 122, and a vocabulary dictionary 123.

The voiceprint dictionary 121 is a dictionary data file in which, for each user ID used to identify a user, the user ID is registered in association with voiceprint patterns of basic sounds, such as the Japanese vowels "a, i, u, e, o", spoken by the corresponding user.

The face dictionary 122 is a dictionary data file in which, for each user ID, the user ID is registered in association with face image patterns obtained by actually imaging the corresponding user.

The vocabulary dictionary 123 is a dictionary data file in which, for each user ID, the user ID, speech data, and the vocabulary indicating the utterance content corresponding to that speech data are registered in association with one another.

The vocabulary identification unit 103 collates the speech data set in the directional speech information output by the acoustic directivity forming unit 101 against the vocabulary dictionary 123 to identify the vocabulary that constitutes the utterance content. Specifically, the vocabulary identification unit 103 collates the speech data set in the directional speech information with the speech data registered for each user ID in the vocabulary dictionary 123, calculates a similarity indicating the degree of match, and takes the vocabulary string with the maximum similarity as the vocabulary identification result. It then uses the arrival direction set in the directional speech information as direction information and outputs vocabulary identification information consisting of the vocabulary string and the direction information.

FIG. 9-2 is a data structure diagram of the vocabulary identification information. When the speaker is a user registered in the vocabulary dictionary 123, a vocabulary string that has meaning as a command to the autonomous mobile robot is set in the vocabulary identification information as the vocabulary identification result, as shown in FIG. 9-2. On the other hand, when the sound arriving at the autonomous mobile robot is, for example, not a person's voice or is the voice of a user not registered in the vocabulary dictionary 123, vocabulary identification cannot be performed. In that case the vocabulary identification unit 103 sets a vocabulary-unknown ID in the vocabulary identification information as the vocabulary identification result in place of a vocabulary string.

The speaker identification unit 104 collates the speech data of the directional speech information output by the acoustic directivity forming unit 101 against the voiceprint patterns registered in the voiceprint dictionary 121 to identify the speaker. Specifically, the speaker identification unit 104 collates the speech data set in the directional speech information with the voiceprint pattern of each user ID registered in the voiceprint dictionary 121, calculates for each user ID a similarity indicating how well the speech data matches the voiceprint pattern, selects the top N user IDs starting from the one with the maximum similarity, and takes the list of these top N user IDs and their similarities as the speaker identification result. It then uses the arrival direction set in the directional speech information as direction information and outputs speaker identification information consisting of the speaker identification result (the list of the top N user IDs and similarities) and the direction information.

FIG. 9-3 is a data structure diagram of the speaker identification information. When the speech data matches the voiceprint pattern of a user ID registered in the voiceprint dictionary 121, a list of user IDs and similarities is set as the speaker identification result, as shown in FIG. 9-3. When the speech data does not belong to a user ID registered in the voiceprint dictionary 121, or is not a person's speech data, speaker identification cannot be performed. In that case the speaker identification unit 104 sets a person-unknown ID in the speaker identification information as the speaker identification result in place of the list of user IDs and similarities.
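The top-N selection and the person-unknown fallback can be sketched as follows. The description does not say how similarity is computed or when a result counts as unknown, so the cosine similarity over feature vectors, the `min_similarity` cutoff, the `UNKNOWN_PERSON_ID` constant, and the dictionary layout are all assumptions made for this illustration.

```python
import math

UNKNOWN_PERSON_ID = "UNKNOWN"  # hypothetical stand-in for the person-unknown ID


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_speaker(features, voiceprints, direction, top_n=3, min_similarity=0.5):
    """Return speaker identification information for one speech segment.

    voiceprints maps user ID -> registered voiceprint feature vector.
    The result is a list of (user_id, similarity) pairs, best first, or
    UNKNOWN_PERSON_ID when even the best match falls below min_similarity.
    """
    scored = sorted(
        ((cosine_similarity(features, pattern), uid)
         for uid, pattern in voiceprints.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < min_similarity:
        result = UNKNOWN_PERSON_ID
    else:
        result = [(uid, sim) for sim, uid in scored[:top_n]]
    return {"result": result, "direction": direction}
```

The same shape would serve for face identification, with face image patterns in place of voiceprints and the direction taken from the face region's position rather than from the module that captured the sound.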

The face identification unit 105 receives image frames captured at fixed intervals by the omnidirectional camera 205, collates the image of each face region found in an input image frame against the face image patterns registered in the face dictionary 122, and identifies the user's face in the captured image frame. Specifically, the face identification unit 105 searches the input image frame for face regions, collates the image data of each face region found with the face image pattern of each user ID registered in the face dictionary 122, calculates for each user ID a similarity indicating how well the image data matches the face image pattern, selects the top N user IDs starting from the one with the maximum similarity, and takes the list of these top N user IDs and their similarities as the face identification result. It then obtains the direction information (region number) corresponding to the image center position of the face region by referring to the direction association table described later, and outputs face identification information consisting of the face identification result (the list of the top N user IDs and similarities) and the direction information.

FIG. 9-4 is a data structure diagram of the face identification information. When the image data of the face region matches the face image pattern of a user ID registered in the face dictionary 122, a list of user IDs and similarities is set as the face identification result, as shown in FIG. 9-4. When the image data of the face region does not match the face image pattern of any user ID registered in the face dictionary 122, or is not image data of a person's face, face identification cannot be performed. In that case the face identification unit 105 sets a person-unknown ID in the face identification information as the face identification result in place of the list of user IDs and similarities.

FIG. 9-5 is a data structure diagram of the direction association table. The direction association table is stored in a storage medium such as a memory or an HDD and associates each region number with an acoustic directivity forming module number and with the coordinate range of image center positions belonging to that region. The region number indicates the direction in which a user is present and is the value set as direction information in the directional speech information, vocabulary identification information, speaker identification information, and face identification information described above.

The acoustic directivity forming unit 101 estimates the arrival direction of the input speech, and the acoustic directivity forming module 102 that detected the speech itself indicates that arrival direction. The direction association table therefore associates each region number with an acoustic directivity forming module 102.

When the face identification unit 105 determines the direction of a face region, it obtains the image center position of the face region. The region number associated with the coordinate range that contains the coordinates of this image center position is taken as the direction of the face region and is set as the direction information of the face identification information.
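A direction association table of this kind can be sketched as a list of records with two lookups, one by module number and one by image coordinates. The field names, the half-open coordinate ranges, and the sample values below are hypothetical, since FIG. 9-5 is not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DirectionEntry:
    region_number: int
    module_number: int
    x_range: Tuple[int, int]  # [min, max) of image center x for this region
    y_range: Tuple[int, int]  # [min, max) of image center y for this region


# Hypothetical sample table with two adjacent regions.
TABLE = [
    DirectionEntry(region_number=1, module_number=101, x_range=(0, 100), y_range=(0, 100)),
    DirectionEntry(region_number=2, module_number=102, x_range=(100, 200), y_range=(0, 100)),
]


def region_for_module(table: List[DirectionEntry], module_number: int) -> Optional[int]:
    """Region number for speech output by the given directivity module."""
    for entry in table:
        if entry.module_number == module_number:
            return entry.region_number
    return None


def region_for_face_center(table: List[DirectionEntry], x: int, y: int) -> Optional[int]:
    """Region number whose coordinate range contains the face region's center."""
    for entry in table:
        if (entry.x_range[0] <= x < entry.x_range[1]
                and entry.y_range[0] <= y < entry.y_range[1]):
            return entry.region_number
    return None
```

Because both lookups return the same kind of region number, direction information derived from sound and from images can be compared directly.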

The region tracking unit 106 detects, in each image frame input from the omnidirectional camera 205, image feature regions, which are regions having configured image features, and tracks each image feature region from one input image frame to the next. An image feature region is configured by a region addition command output by the tracking same-direction verification unit 112 of the identity verification unit 110 described later. The region tracking unit 106 also outputs, as region tracking information, the detection and tracking status of each image feature region being tracked (the current direction of the region, or the fact that the region has been lost).

The identity verification unit 110 determines whether the user who is the subject of the speaker identification information and the user who is the subject of the face identification information are the same person, and generates same-set information containing the speaker identification information and face identification information judged to belong to the same person. It includes a same-direction verification unit 111 and a tracking same-direction verification unit 112.

The same-direction verification unit 111 registers the direction information set in whichever of the speaker identification information and the face identification information was acquired first, and determines whether the difference between that direction information and the direction information set in the biometric identification information acquired afterwards is no more than a predetermined threshold. When the difference is no more than the threshold, the same-direction verification unit 111 further determines that the user identified by the speaker identification information and the user identified by the face identification information are the same. It then generates same-set information that groups the speaker identification information with its speech data and the face identification information with its image data, and adds an expiration time defining the period for which both pieces of identification information are valid.
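The same-direction check can be sketched as below. The description sets region numbers as direction information and does not give the threshold or validity period, so representing direction as an azimuth in degrees, the 20-degree threshold, the 10-second validity, and all field and function names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BiometricId:
    kind: str             # "speaker" or "face"
    result: Any           # identification result (ID/similarity list or unknown ID)
    direction_deg: float  # absolute azimuth; an assumed stand-in for a region number
    timestamp: float      # acquisition time in seconds


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def verify_same_direction(first: BiometricId, later: BiometricId,
                          threshold_deg: float = 20.0,
                          validity_s: float = 10.0) -> Optional[dict]:
    """Return same-set information if both results point the same way, else None."""
    if first.kind == later.kind:
        return None  # need one speaker result and one face result
    if angular_difference(first.direction_deg, later.direction_deg) > threshold_deg:
        return None
    speaker, face = (first, later) if first.kind == "speaker" else (later, first)
    return {
        "speaker": speaker,
        "face": face,
        "expires_at": later.timestamp + validity_s,  # assumed expiry rule
    }
```

The wrap-around in `angular_difference` matters for users near the ±180 degree seam, where a naive subtraction would report a large difference for nearly identical directions.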

The tracking same-direction verification unit 112 receives the region tracking information output from the region tracking unit 106 and determines whether the difference between the detected direction of the image feature region contained in the region tracking information and the direction information set in whichever of the speaker identification information and the face identification information was acquired later is no more than a predetermined threshold. When the difference is no more than the threshold, the tracking same-direction verification unit 112 further determines that the user of the speaker identification information and the user of the face identification information are the same person, and generates same-set information that groups the speaker identification information with its speech data and the face identification information with its image data.

FIG. 9-6 is a data structure diagram of the same-set information. FIG. 9-6 shows an example of same-set information generated by the same-direction verification unit 111, in which an expiration time is set. Same-set information generated by the tracking same-direction verification unit 112 differs from that generated by the same-direction verification unit 111 only in that the expiration time item is not set; the other items are the same. As shown in FIG. 9-6, when vocabulary identification information is output by the vocabulary identification unit 103, it is also set in the same-set information. Although FIG. 9-6 shows speaker identification information and face identification information set as a pair, same-set information containing only one of the two may also be output.

The dictionary update unit 108 analyzes the same-set information generated by the identity verification unit 110, determines whether each of the biometric dictionaries stored in the biometric dictionary storage unit 120, namely the voiceprint dictionary 121 and the face dictionary 122, needs to be updated, and updates any dictionary that does. Specifically, when both speaker identification information and face identification information are present in the same-set information, the dictionary update unit 108 performs the following dictionary update.

When the user ID with the maximum similarity in the speaker identification information does not match the user ID with the maximum similarity in the face identification information, the user ID whose maximum similarity is larger is determined to be the user ID of this user, and the biometric dictionary on the side with the smaller similarity is updated. When a person-unknown ID is set in only one of the speaker identification information and the face identification information, the user ID in the identification information that does not carry the person-unknown ID is determined to be the user ID of this user, and the biometric dictionary corresponding to the identification information that carries the person-unknown ID is updated.
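The two update rules can be expressed as a small decision function. This is a sketch under stated assumptions: a person-unknown ID is represented by `None`, each result is a list of (user ID, similarity) pairs sorted best first, and ties in similarity are resolved in favor of the speaker result, which the description does not specify.

```python
def plan_dictionary_update(speaker_result, face_result):
    """Decide which biometric dictionary to update and for which user.

    Each argument is a list of (user_id, similarity) pairs sorted best first,
    or None to represent a person-unknown ID (an assumed encoding).
    Returns (user_id, dictionary_name) or None when no update is called for.
    """
    if speaker_result is None and face_result is None:
        return None  # nobody identified by either modality
    if speaker_result is None:
        return face_result[0][0], "voiceprint"  # face known, voice unknown
    if face_result is None:
        return speaker_result[0][0], "face"     # voice known, face unknown
    (s_uid, s_sim), (f_uid, f_sim) = speaker_result[0], face_result[0]
    if s_uid == f_uid:
        return None  # both modalities already agree
    # Trust the modality with the larger maximum similarity, update the other.
    if s_sim >= f_sim:
        return s_uid, "face"
    return f_uid, "voiceprint"
```

The returned dictionary name identifies where the speech data or image data carried in the same-set information would be registered under the chosen user ID.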

The user authentication unit 107 authenticates the actual user from the same-set information generated by the identity verification unit 110. Same-set information may contain only speaker identification information, only face identification information, or both.

Speaker identification information comes in two kinds, a list of user IDs and similarities or a person-unknown ID, and face identification information likewise comes in the same two kinds. For the purpose of authenticating a user and providing a service, there is no need to consider cases in which the user could not be identified by either voice or face. The user authentication unit 107 therefore considers only same-set information in which the user was identified by at least one of voice and face, and outputs as user authentication information the user ID with the maximum similarity among the speaker identification result and the face identification result.

When vocabulary identification information is set in the same-set information, the vocabulary string set as the vocabulary identification result is also added to the user authentication information and output.
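The authentication step described above can be sketched as picking the best-scoring user ID across both modalities. The dictionary keys (`"speaker"`, `"face"`, `"vocabulary"`), the use of `None` for a person-unknown ID or a missing entry, and the function name are assumptions for this illustration.

```python
def authenticate(same_set):
    """Build user authentication information from same-set information.

    same_set may hold "speaker" and "face" entries, each a list of
    (user_id, similarity) pairs or None for unknown/missing, plus an
    optional "vocabulary" string. Returns None when neither modality
    identified a user.
    """
    candidates = []
    for key in ("speaker", "face"):
        result = same_set.get(key)
        if result:  # skips None and empty lists
            candidates.extend(result)
    if not candidates:
        return None
    user_id, similarity = max(candidates, key=lambda pair: pair[1])
    info = {"user_id": user_id, "similarity": similarity}
    if same_set.get("vocabulary"):
        info["vocabulary"] = same_set["vocabulary"]  # pass the command along
    return info
```

A downstream service can then act on `info["user_id"]` and, when present, interpret `info["vocabulary"]` as the user's command.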

The service providing unit 109 provides a predetermined service to the user based on the user authentication information output by the user authentication unit 107. Any service may be provided; one example is a mail check service. Because the user authentication information contains the user ID of the user identified by the user authentication unit 107, the service can be configured to use this user ID. When the user authentication information also contains a vocabulary string, the content of the user's utterance expressed by that vocabulary string can be treated as a command, and a service using the user ID can be provided accordingly. For example, when configured as a mail check service, the unit can determine from the vocabulary string that a mail check command was issued and check the mailbox of the user corresponding to the user ID. When the user authentication information contains no vocabulary identification information, the service providing unit 109 can instead be configured to return some result to the user as part of a previously commanded service.

(Overall processing of user identification)
Next, the user identification processing performed by the autonomous mobile robot of the present embodiment, configured as described above, is described. FIG. 10 is a flowchart showing the overall procedure of user identification by the autonomous mobile robot according to the present embodiment.

First, sound is input through the microphones 206, 207, and 208, and the acoustic directivity forming unit 101 generates directional voice information consisting of voice data and direction information indicating its direction of arrival (step S1001). The vocabulary identification unit 103 performs vocabulary identification based on this directional voice information and the vocabulary dictionary 123, and outputs vocabulary identification information consisting of the vocabulary identification result and the direction information (step S1002). The speaker identification unit 104 performs speaker identification based on the directional voice information and the voiceprint dictionary 121, and outputs speaker identification information consisting of the speaker identification result and the direction information (step S1003).

In parallel, the omnidirectional camera 205 images its surroundings at regular intervals and outputs image frames (step S1004). The face identification unit 105 performs face identification based on each input image frame and the face dictionary 122, and outputs face identification information consisting of the face identification result and direction information (step S1005).

Next, for each input image frame, the region tracking unit 106 tracks the image feature region specified by a region addition command from the identity verification unit 110, and outputs region tracking information indicating the tracking status (step S1006).

When the user is not moving, the co-directionality verification unit 111 of the identity verification unit 110 performs co-directionality verification, and same-set information combining the speaker identification information and the face identification information verified as belonging to the same person is output (step S1007). When the user is moving, the tracking co-directionality verification unit 112 of the identity verification unit 110 sends a region addition command to the region tracking unit 106, receives the resulting region tracking information from the region tracking unit 106, and performs tracking co-directionality verification; same-set information combining the speaker identification information and the face identification information verified as belonging to the same person is then output (step S1008).

Next, the user authentication unit 107 performs user authentication to determine who the user in the same-set information is, and outputs user authentication information containing the identified user ID (step S1009). The service providing unit 109 then performs service provision, for example a mail check service, for the user identified by the user authentication information (step S1010). Meanwhile, the dictionary update unit 108 updates any biometric dictionary that the same-set information shows to need updating, thereby strengthening the biometric dictionary (step S1011).

(Directional voice information generation processing)
Each process mentioned in the overall processing above is now described in detail. First, the generation of directional voice information by the acoustic directivity forming unit 101 is described. FIG. 11 is a flowchart showing the procedure by which the acoustic directivity forming unit 101 generates directional voice information.

First, when the microphones 206, 207, and 208 receive sound (step S1101), the acoustic directivity forming unit 101 periodically A-D (Analog-Digital) converts the electrical audio signals output from the microphones 206, 207, and 208 (step S1102), and sequentially stores the instantaneous amplitude values of the resulting digital audio signals in a FIFO (First-In First-Out) buffer (step S1103).

Of the amplitude value data stored in the FIFO, the data from the microphones 206 and 207 are input to the front-stage beamformers 410, 420, and 430, and a first directivity forming process outputs them as amplitude value data given a predetermined directivity (step S1104). This first yields the horizontal directivity range.

Next, among the rear-stage beamformers 411, 412, 413, ..., 421, 422, 423, ..., 431, 432, 433, ..., the amplitude value data output from the front-stage beamformer 410 and the amplitude value data from the microphone 208 are input to the rear-stage beamformers 411, 412, 413, ..., and a second directivity forming process outputs them as amplitude value data given a predetermined directivity (step S1105). This yields the vertical directivity range, and the range common to the horizontal directivity range obtained in the first directivity forming process and this vertical directivity range becomes the directivity range of the input sound.

The acoustic directivity forming unit 101 then takes, as the voice data of the voice section, the amplitude value data with the largest amplitude among those output from the rear-stage beamformers in the second directivity forming process. It converts the module number of the acoustic directivity forming module containing the rear-stage beamformer that output this data into the area number of the directivity range by referring to the direction association table, and uses that area number as the direction information. The unit then generates and outputs directional voice information consisting of the voice data of this voice section and the direction information (step S1106).
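Step S1106 can be sketched as below. This is only an illustration under stated assumptions: the function name `directional_voice_info` and the dictionary-based data layout are hypothetical, each module is assumed to deliver a non-empty list of amplitude samples, and "largest amplitude" is interpreted here as the largest peak absolute amplitude, which the text does not spell out.

```python
def directional_voice_info(module_outputs, direction_table):
    """Pick the loudest module output and attach its direction.

    module_outputs:  {module_number: [amplitude samples]} from the
                     rear-stage beamformers (hypothetical layout).
    direction_table: {module_number: area_number}, standing in for the
                     direction association table.
    Returns (voice_data, area_number).
    """
    # Assumption: modules are compared by peak absolute amplitude.
    loudest = max(module_outputs,
                  key=lambda m: max(abs(x) for x in module_outputs[m]))
    return module_outputs[loudest], direction_table[loudest]
```

Downstream units such as the vocabulary identification unit 103 would then read the area number directly from the returned pair rather than estimating the direction themselves.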

In this way, the directional voice information output from each acoustic directivity forming module 102 of the acoustic directivity forming unit 101 contains, in addition to the voice data, direction information representing the directivity range of that module. The vocabulary identification unit 103 and the speaker identification unit 104, which receive and use the directional voice information, can therefore easily determine the direction of arrival of the voice data.

(Vocabulary identification processing)
Next, the vocabulary identification processing performed by the vocabulary identification unit 103 is described. FIG. 12 is a flowchart showing the procedure of vocabulary identification by the vocabulary identification unit 103.

First, the vocabulary identification unit 103 receives new directional voice information output by the acoustic directivity forming unit 101 (step S1201). It then compares the voice data in the directional voice information with the vocabulary strings registered in the vocabulary dictionary 123, which represent the utterance content for each user, computes the similarity between the voice data and each vocabulary string, and obtains the vocabulary string with the maximum similarity (step S1202).

Next, it determines whether the maximum similarity is at or above a predetermined threshold (step S1203). If it is (step S1203: Yes), this vocabulary string is accepted as the vocabulary identification result, and vocabulary identification information consisting of this vocabulary string and the direction information set in the directional voice information is generated and output (step S1204).

If, on the other hand, the maximum similarity is below the threshold in step S1203 (step S1203: No), a vocabulary-unknown ID is used as the vocabulary identification result, and vocabulary identification information consisting of this result and the direction information set in the directional voice information is generated and output (step S1205).

This vocabulary identification information can be used as a service command during service provision by the service providing unit 109.

(Speaker identification processing)
Next, the speaker identification processing performed by the speaker identification unit 104 is described. FIG. 13 is a flowchart showing the procedure of speaker identification by the speaker identification unit 104.

First, the speaker identification unit 104 receives new directional voice information output by the acoustic directivity forming unit 101 (step S1301). It then applies a DFT (Discrete Fourier Transform) to the voice data in the directional voice information with frame length F and frame interval D, converting the voice data into a time series of short-time power spectra, each consisting of F/2 elements (step S1302).
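The conversion in step S1302 can be illustrated with a direct DFT in plain Python. This is a sketch under assumptions: the function name and parameter names are hypothetical, no window function is applied because the text does not mention one, and bins 0 to F/2 - 1 are kept as the F/2 elements.

```python
import cmath
import math


def short_time_power_spectra(samples, frame_len, frame_interval):
    """Return one power spectrum (frame_len // 2 bins) per frame.

    Frames of length frame_len (F) start every frame_interval (D) samples.
    """
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, frame_interval):
        frame = samples[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2):
            # Direct DFT of bin k; an FFT would give the same values faster.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            spectrum.append(abs(acc) ** 2)
        spectra.append(spectrum)
    return spectra
```

With 16 samples, F = 8, and D = 4, the function yields three overlapping frames of four bins each.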

Next, the voice data converted into this power spectrum sequence is compared with the voiceprint pattern registered for each user ID in the voiceprint dictionary 121. For each user ID, a similarity indicating how well the voice data matches that user's voiceprint pattern is computed, and a list of the user IDs ranked in the top N by similarity (N is an arbitrary integer) is obtained (step S1303).

Next, it is determined whether the maximum similarity in the list of user IDs is at or above a predetermined threshold (step S1304). If it is (step S1304: Yes), the sequence of the top N user IDs and their corresponding similarities is accepted as the speaker identification result, and speaker identification information consisting of this sequence and the direction information set in the directional voice information is generated and output (step S1305).

If, on the other hand, the maximum similarity is below the predetermined threshold in step S1304 (step S1304: No), the person-unknown ID is used as the speaker identification result, and speaker identification information consisting of this result and the direction information set in the directional voice information is generated and output (step S1306).
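Steps S1303 to S1306 amount to a top-N ranking followed by a threshold test on the best score. The following sketch assumes the per-user similarities have already been computed; the function name, the dictionary output format, and the `UNKNOWN_PERSON_ID` placeholder are all hypothetical.

```python
UNKNOWN_PERSON_ID = "UNKNOWN_PERSON"  # hypothetical stand-in for the person-unknown ID


def speaker_identification_info(similarities, direction, n, threshold):
    """Build speaker identification information from per-user similarities.

    similarities: {user_id: similarity} for every registered user.
    direction:    direction information copied from the directional voice information.
    """
    # Rank users by similarity and keep the top n (step S1303).
    ranked = sorted(similarities.items(),
                    key=lambda item: item[1], reverse=True)[:n]
    # Accept the list only if the best similarity reaches the threshold (step S1304).
    if not ranked or ranked[0][1] < threshold:
        result = UNKNOWN_PERSON_ID
    else:
        result = ranked
    return {"result": result, "direction": direction}
```

The same shape of logic would apply to the face identification result, with the face dictionary supplying the similarities.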

The comparison between the voice data and the voiceprint dictionary 121 performed in step S1303 is now described. FIG. 14 is an explanatory diagram showing an example of voice data converted from input voice data into a power spectrum sequence. Reference numeral 1401 denotes the voiceprint pattern of male A uttering "Konnichiwa Apuri-kun" ("Hello, Apuri-kun"). As FIG. 14 shows, for most of its duration a person's voice has a harmonic structure consisting of the fundamental frequency and its harmonics. Most of the period showing this harmonic structure consists of vowel power spectra, which reflect the individuality of the speaker.

FIG. 15-1 is an explanatory diagram showing an example of the voiceprint pattern 1402 when male A utters "a-i-u-e-o", and FIG. 15-2 is an explanatory diagram showing an example of the voiceprint pattern 1403 when female B utters "a-i-u-e-o". As both figures show, the spacing of the harmonic structure is wider for the woman than for the man; that is, her fundamental frequency is higher.

Because most of the period showing harmonic structure consists of vowel power spectra that reflect individuality, the voiceprint dictionary 121 stored in the biometric dictionary storage unit 120 registers, for each user (each user ID), a voiceprint pattern of uttered vowels such as those shown in FIG. 15-1 and FIG. 15-2.

The voiceprint dictionary 121 is generated from the set of short-time power spectra obtained as a time series from teaching voice data in which each user utters all the vowels of the language. For Japanese, the teaching voice data is an utterance of "a-i-u-e-o", which contains the five vowels. The power spectrum S(t) at time t in the time series of short-time power spectra is norm-normalized as an F/2-dimensional vector. Here F is the number of voice samples (the frame length) used when computing the short-time power spectrum by FFT (Fast Fourier Transform). Treating each norm-normalized vector V(t) as a row vector, the autocorrelation matrix P(t) of the vector V(t) is given by equation (1).

The correlation matrix Q is then calculated as in equation (2) by summing the autocorrelation matrices P(t) over all times except those sections in which the total power of the spectrum is below a predetermined threshold.
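Equations (1) and (2) do not appear in this text. From the definitions given, with V(t) the norm-normalized row vector of S(t), they would presumably take the following form. The index set T, denoting the times at which the total spectral power is at or above the threshold, is notation introduced here for illustration, and any constant scaling factor in the original equation (2) is unknown.

```latex
\begin{align}
P(t) &= V(t)^{\mathsf{T}}\, V(t), \qquad V(t) = \frac{S(t)}{\lVert S(t) \rVert} \tag{1} \\
Q &= \sum_{t \in T} P(t) \tag{2}
\end{align}
```

Under this reading, P(t) and Q are symmetric matrices of size F/2 by F/2.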

Of the eigenvectors obtained by principal component analysis of this correlation matrix Q, the top M eigenvectors in descending order of their eigenvalues are extracted. The subspace spanned by these M eigenvectors (a dictionary subspace of dimension M) is the voiceprint dictionary 121 generated from the teaching voice data.
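The construction of a dictionary subspace can be sketched in plain Python as follows. This is an illustrative sketch, not the embodiment's implementation: the function names are hypothetical, the correlation matrix is taken as the plain sum of outer products of the norm-normalized spectra, and the leading eigenvectors are approximated by power iteration with Gram-Schmidt orthogonalization, a technique substituted here for whatever eigen-solver the embodiment uses. It assumes the matrix has at least M nonzero eigenvalues.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in v]


def correlation_matrix(spectra, power_threshold):
    """Sum the outer products of normalized spectra, skipping low-power frames."""
    dim = len(spectra[0])
    q = [[0.0] * dim for _ in range(dim)]
    for s in spectra:
        if sum(s) < power_threshold:   # total power below threshold: skip frame
            continue
        v = normalize(s)
        for i in range(dim):
            for j in range(dim):
                q[i][j] += v[i] * v[j]
    return q


def top_eigenvectors(q, m, iterations=500):
    """Approximate the m leading eigenvectors of the symmetric matrix q."""
    dim = len(q)
    basis = []
    for index in range(m):
        # Arbitrary non-degenerate starting vector (an illustrative choice).
        v = normalize([1.0 + 0.1 * ((i + index) % dim) for i in range(dim)])
        for _ in range(iterations):
            w = [sum(q[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            for b in basis:            # remove components along earlier eigenvectors
                proj = sum(x * y for x, y in zip(w, b))
                w = [x - proj * y for x, y in zip(w, b)]
            v = normalize(w)
        basis.append(v)
    return basis
```

The returned list of M orthonormal vectors plays the role of the dictionary subspace; the same routine applied to input voice data with K in place of M would give an input subspace.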

FIG. 16 is an explanatory diagram showing how the dictionary subspace and the input subspace are generated in speaker identification processing. As shown in FIG. 16, the time-series data 1402 of short-time power spectra for the utterance "a-i-u-e-o" is a sequence of the short-time power spectra at each time. Within it, the short-time power spectra remaining after sections whose power is below the predetermined threshold are excluded are the portions 1611 to 1615. From left to right, the portions 1611 to 1615 correspond to "a", "i", "u", "e", and "o". The dictionary subspace 1617 is generated using these portions as the teaching data 1616.

A subspace (an input subspace of dimension K) is generated from the input voice data in the same way. The time series 1401 of short-time power spectra of the input voice is likewise a sequence of the short-time power spectra at each time, as shown in FIG. 16. Within it, the short-time power spectra remaining after sections whose power is below the predetermined threshold are excluded are the portions 1621 to 1624. The input subspace 1627 is generated using these portions as the input data 1625.

FIG. 17 is an explanatory diagram showing the concept of the mutual subspace method in speaker identification processing. When comparing against the voiceprint dictionary 121 in speaker identification processing, the similarity is calculated by the "mutual subspace method", which, as shown in FIG. 17, uses the canonical angle 1731 between the input subspace 1627 and each user's dictionary subspace 1617 as the measure of similarity.
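As a rough illustration of the mutual subspace method, the sketch below computes a similarity from orthonormal bases of the two subspaces. It assumes the similarity is the squared cosine of the smallest canonical angle, a common choice for this method but not something the text states explicitly. The function names are hypothetical, and the largest eigenvalue is approximated by power iteration restarted from each coordinate axis so that the result does not depend on one unlucky starting vector.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mutual_subspace_similarity(dict_basis, input_basis, iterations=200):
    """Approximate cos^2 of the smallest canonical angle between two subspaces.

    Both arguments are lists of orthonormal basis vectors of equal length.
    """
    # c[i][j] is the inner product of dictionary vector i and input vector j.
    c = [[dot(u, w) for w in input_basis] for u in dict_basis]
    m = len(c)
    # g = c c^T; its largest eigenvalue equals cos^2 of the smallest canonical angle.
    g = [[dot(c[i], c[j]) for j in range(m)] for i in range(m)]
    best = 0.0
    for start in range(m):
        v = [1.0 if i == start else 0.0 for i in range(m)]
        for _ in range(iterations):
            w = [dot(row, v) for row in g]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break                  # v lies in the null space of g
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v; never exceeds the top eigenvalue.
        best = max(best, dot(v, [dot(row, v) for row in g]))
    return best
```

A value near 1 means the input subspace nearly shares a direction with the dictionary subspace, and a value of 0 means the two subspaces are orthogonal.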

In step S1303 described above, the input subspace 1627 is compared with the teaching data 1616 of the dictionary subspace 1617 of every user ID in the voiceprint dictionary 121 to calculate a similarity for each, and, among those whose similarity is at or above the predetermined threshold, the user IDs from the one with the maximum similarity down to the N-th rank are extracted. Because the unit of comparison with the voiceprint dictionary 121 is a subspace, and the voiceprint dictionary 121 is trained on and registers vowels, speaker identification becomes less dependent on the content of the utterance.

The mutual subspace method described above follows the technique detailed in the non-patent document: Kazuhiro Fukui and Osamu Yamaguchi, "Theoretical Extension of the Subspace Method and Its Application to Object Recognition" (in Japanese), IPSJ SIG Technical Report, 2004-CVIM-145, pp. 219-228, 2004.

(Face identification processing)
Next, the face identification processing performed by the face identification unit 105 is described. FIG. 18 is a flowchart showing the procedure of face identification by the face identification unit 105.

First, the face identification unit 105 receives an image frame captured by the omnidirectional camera 205 (step S1801) and compares the image frame with a predetermined face template, which represents a region containing a face image, to detect face regions, that is, regions of the image frame equal or close to the face template (step S1802). If a face region was already detected in the immediately preceding image frame, face tracking is performed first, matching the face template only in the vicinity of that face region. The remaining area, which is not being tracked, is then searched for face regions by scanning the face template across it.

It is then determined whether a face region was detected (step S1803). If not (step S1803: No), the unit waits for the next image frame. If a face region was detected (step S1803: Yes), the coordinates of the image center position of the face region are obtained, and, by referring to the direction association table, the area number whose "coordinate range of the image center position of the area" contains those coordinates is set as the direction information (step S1804).

Next, the face image of this face region is compared with the face image pattern registered for each user ID in the face dictionary 122. For each user ID, a similarity indicating how well the face image matches that user's face image pattern is computed, and a list of the user IDs ranked in the top N by similarity (N is an arbitrary integer) is obtained (step S1805).

Next, it is determined whether the maximum similarity in the list of user IDs is at or above a predetermined threshold (step S1806). If it is (step S1806: Yes), the sequence of the top N user IDs and their corresponding similarities is accepted as the face identification result, and face identification information consisting of this sequence and the direction information set in step S1804 is generated and output (step S1807).

If, on the other hand, the maximum similarity is below the predetermined threshold in step S1806 (step S1806: No), the person-unknown ID is used as the face identification result, and face identification information consisting of this result and the direction information set in step S1804 is generated and output (step S1808).

The comparison between the face image and the face dictionary 122 performed in step S1805 is now described. After a face region is first detected in an image frame, the face image patterns cut out one after another from each tracked face region are normalized in size to L × L pixels and then norm-normalized as G-dimensional vectors (G = L × L). Using the many norm-normalized vectors obtained by tracking the same face over time, a subspace (the input subspace) is obtained by principal component analysis of the correlation matrix, as in the speaker identification processing described above.

FIG. 19 is an explanatory diagram showing the concept of the mutual subspace method in face identification processing. As shown in FIG. 19, the face image patterns being tracked form a face image pattern sequence 1941 obtained over time, and the input subspace 1942 is generated using this sequence as input data. For the face dictionary 122, likewise, a dictionary subspace 1944 is generated using, as teaching data, a teaching face image pattern sequence 1943 obtained by actually imaging the user's face. In face identification processing, the similarity is calculated by the "mutual subspace method", which uses the canonical angle 1945 between the input subspace 1942 and the dictionary subspace 1944 of each user (user ID) as the measure of similarity.

In step S1805 described above, the input subspace 1942 is compared with the face image pattern sequence 1943 of the dictionary subspace 1944 of every user ID in the face dictionary 122 to calculate a similarity for each, and, among those whose similarity is at or above the predetermined threshold, the user IDs from the one with the maximum similarity down to the N-th rank are extracted.

(Region tracking processing)
Next, the region tracking processing performed by the region tracking unit 106 is described. FIG. 20 is a flowchart showing the procedure of region tracking by the region tracking unit 106. When the region tracking unit 106 receives a region addition command from the identity verification unit 110 (step S2001), it obtains an image template and an initial position from the command (step S2002). Here, a region addition command is a command issued by the identity verification unit 110 to request tracking; it consists of an image template, which indicates the image feature region to be tracked and its size, and the initial position of the image template. The image feature region to be tracked is an image region seen in the direction in which a face region was detected, or an image region seen in the direction in which a voice was detected, and is a region containing a characteristic image.

After receiving a region addition command, when the region tracking unit 106 receives an image frame input after the time the command was accepted (step S2003), it searches the vicinity of the initial position indicated by the command for the region that is most similar, and sufficiently similar, to the given image template (step S2004).

If such a region is found (step S2005: Yes), the position of the newly detected region is set as the initial position (step S2006), and the detection direction of the new region is output as region tracking information (step S2007). From then on, the image feature region is searched for in each input image frame around this updated initial position.

If, on the other hand, no region that is most and sufficiently similar to the given image template is found near the initial position in step S2005 (step S2005: No), the image feature region of the image template is removed from the tracking targets (step S2008), and the fact that the tracking target was lost is output as region tracking information (step S2009).
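One pass of steps S2004 to S2009 can be sketched as a local template search. This is an illustration only: the names `ssd` and `track_step`, the use of the sum of squared differences as the similarity measure, the square search window of radius `radius`, and the acceptance limit `max_ssd` are all assumptions, since the text does not specify how "most and sufficiently similar" is measured.

```python
def ssd(frame, template, top, left):
    """Sum of squared differences between the template and a frame patch."""
    return sum((frame[top + r][left + c] - template[r][c]) ** 2
               for r in range(len(template))
               for c in range(len(template[0])))


def track_step(frame, template, position, radius, max_ssd):
    """Search near `position` for the patch most similar to `template`.

    frame, template: 2-D lists of grey levels.
    position:        (top, left) of the current initial position.
    Returns the new (top, left), or None when the target is lost.
    """
    th, tw = len(template), len(template[0])
    top0, left0 = position
    best_pos, best_score = None, None
    for top in range(max(0, top0 - radius),
                     min(len(frame) - th, top0 + radius) + 1):
        for left in range(max(0, left0 - radius),
                          min(len(frame[0]) - tw, left0 + radius) + 1):
            score = ssd(frame, template, top, left)
            if best_score is None or score < best_score:
                best_pos, best_score = (top, left), score
    # "Sufficiently similar" is modeled as an upper bound on the SSD.
    if best_score is None or best_score > max_ssd:
        return None
    return best_pos
```

A caller would feed the returned position back in as the initial position for the next frame, and drop the template from its tracking list when `None` is returned.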

The region tracking processing is now described with a concrete example. FIG. 21-1 and FIG. 21-2 are schematic diagrams showing examples of image feature regions set by a region addition command. FIG. 21-1 schematically shows the vicinity of the detection direction of a face or a voice, cut out from the wide-field omnidirectional image, for the case in which the face region is visible. In the example of FIG. 21-1, a predetermined region 212 centered on the extracted face region 211 of a person 210 facing the front is set as the image feature region. FIG. 21-2 shows the case in which a voice was heard: the vicinity of the mouth of a person 213, who is facing sideways so that the face is not visible, is set as the directivity range 214, and a predetermined region 215 centered on this directivity range 214 is set as the image feature region.

FIGS. 22-1 and 22-2 are schematic diagrams showing an example of the region tracking process. FIG. 22-1 shows the state at the time the image template is generated. In this example, a predetermined region 222 centered on the extracted face region 221 of a person 220 facing the front is set as the image feature region. Next, as shown in FIG. 22-2, the person 220 turns sideways and starts moving, so the face can no longer be detected. However, by using the image template generated in FIG. 22-1, the region 223 whose properties are closest to the template is detected, so the movement of the person whose face is no longer visible can be tracked.

Note that the region tracking process described above is executed when the tracking same-direction verification unit 112 of the identity verification unit 110 performs the tracking same-direction verification process. When the same-direction verification unit 111 of the identity verification unit 110 performs the same-direction verification process, the region tracking unit 106 does not track the image feature region. Instead, it registers the direction information set in the previously acquired speaker identification information or face identification information with the same-direction verification unit 111 as the detection direction.

(Same-direction verification process)
Next, the same-direction verification process performed by the same-direction verification unit 111 of the identity verification unit 110 is described. FIG. 23 is a flowchart showing the procedure of this process.

First, the same-direction verification unit 111 determines whether the validity period of a detection direction already registered by the region tracking unit 106 has expired (step S2301). The registered detection direction is the direction information set in whichever piece of biometric identification information (speaker identification information or face identification information) was acquired earlier as of the current time; the region tracking unit 106 registers it together with its validity period. The validity period indicates how long the biometric identification information can be used for the identity verification process.

If the validity period has expired (step S2301: Yes), the registered detection direction can no longer be used for the identity verification process, so its registration is deleted (step S2302), and the unit waits for new speaker identification information or new face identification information (step S2303).

On the other hand, if the registered detection direction is still within its validity period in step S2301 (step S2301: No), the detection direction is not deleted, and the unit waits for new speaker identification information or new face identification information (step S2303).

When new speaker identification information or new face identification information is acquired (step S2303: Yes), the direction information set in it is compared with the registered detection direction (step S2304), and it is determined whether the difference in direction is at or below a predetermined threshold (step S2305).

If the difference in direction is at or below the predetermined threshold (step S2305: Yes), the direction information of the newly acquired biometric identification information is regarded as substantially the same as the registered detection direction (the direction information of the biometric identification information acquired earlier). In the same-set information, which consists of the speaker identification information and face identification information detected in that substantially identical direction, the corresponding entry is replaced with the newly acquired speaker identification information or face identification information, and its validity period is extended. This updates the same-set information (step S2306), which is then output (step S2307).

On the other hand, if the difference in direction is larger than the predetermined threshold in step S2305 (step S2305: No), it is determined that there is no detection direction that should be regarded as substantially the same. The direction information set in the newly acquired speaker identification information or face identification information is registered as a new detection direction (step S2308), and a region addition command specifying this detection direction is output to the region tracking unit 106 (step S2309). The same-set information is then updated with the newly acquired speaker identification information or face identification information and its validity period (step S2310), and the same-set information is output (step S2307).

The detection direction newly registered in step S2308 is thereafter used to judge directional identity against newly acquired speaker identification information or face identification information, until its validity period expires and its registration is deleted in steps S2301 and S2302.
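The flow of steps S2301 to S2310 can be sketched as follows. This is only an illustration under stated assumptions: directions are reduced to a single azimuth in degrees (the embodiment uses region numbers), the validity period is a fixed number of seconds, the expiry check is folded into the moment new identification information arrives, and a direction mismatch starts a fresh same-set record. The names `BioId`, `SameDirectionVerifier`, `threshold_deg`, and `ttl_s` are hypothetical, and the region addition command of step S2309 is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def angle_diff(a_deg: float, b_deg: float) -> float:
    """Absolute difference between two azimuths, with wrap-around at 360."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


@dataclass
class BioId:
    kind: str             # "speaker" or "face"
    user_id: str
    direction_deg: float  # assumption: direction as one azimuth angle


@dataclass
class SameDirectionVerifier:
    threshold_deg: float = 5.0   # hypothetical direction threshold
    ttl_s: float = 10.0          # hypothetical validity period
    registered_deg: Optional[float] = None
    expires_at: float = 0.0
    same_set: Dict[str, BioId] = field(default_factory=dict)

    def on_new_id(self, new: BioId, now: float) -> Dict[str, BioId]:
        # S2301/S2302: drop the registered direction once it has expired.
        if self.registered_deg is not None and now > self.expires_at:
            self.registered_deg = None
        # S2304/S2305: compare the new direction with the registered one.
        if (self.registered_deg is not None
                and angle_diff(new.direction_deg, self.registered_deg) <= self.threshold_deg):
            # S2306: replace the matching entry and extend the validity period.
            self.same_set[new.kind] = new
            self.expires_at = now + self.ttl_s
        else:
            # S2308: register the new direction; S2310: update the set
            # (assumption: the set restarts with only the new information).
            self.registered_deg = new.direction_deg
            self.expires_at = now + self.ttl_s
            self.same_set = {new.kind: new}
        return dict(self.same_set)  # S2307: output the same-set information
```

For instance, a face at 30 degrees followed two seconds later by a voice at 33 degrees yields a set containing both entries, whereas a voice at 120 degrees yields a set containing only that voice.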

When the identity verification process is performed for the first time, the same-set information is generated in the update steps above (steps S2306 and S2310) from the biometric identification information acquired earlier and the newly acquired biometric identification information. If the vocabulary identification unit 103 has output vocabulary identification information whose direction information is the same as that set in the speaker identification information, that vocabulary identification information is added to the same-set information. Furthermore, if the speaker identification information and the face identification information could not both be acquired, the same-set information is output with the entry for the one that could not be acquired left blank. The same-set information updated or generated in this way is used in the dictionary update process and the user authentication process described later.

The same-direction verification process described above is now explained with concrete examples. For the direction of an input voice, the region number corresponding to the module number of the acoustic directivity forming module 102 is attached to the speaker identification information as direction information. Likewise, for the direction of an input face image, the region number corresponding to the center position of the face image within the image is attached to the face identification information. The autonomous mobile robot according to this embodiment therefore identifies the user independently from the voice and from the face image, and also obtains the direction of the user's voice and the direction of the user's face independently. In particular, determining the directions of the voice and the face image independently solves the problem of conventional user identification devices that verify identity by simultaneity.

The autonomous mobile robot according to this embodiment identifies a user from the user's voice and can determine, from the input voice alone, in which surrounding direction that user is located. It also identifies a user from a captured face image and can determine, from the face image alone, in which surrounding direction that user is located. That is, the voice and the face image are each used separately to identify the user and to determine the direction in which the user is present.

Suppose that a user (or unknown person) identified in this way hardly moves. Then even if the user's face and voice are detected at different times, they are detected in almost the same direction, so both can be regarded as belonging to the same person. Likewise, if a voice is heard while the user's face image is detectable, or a face image is detected while the voice is audible, both are naturally detected in almost the same direction and can be regarded as belonging to the same person. Identity can thus be verified by same-directionality.

Even when the user identification result from the face image and the result from the voice disagree, this verification by same-directionality shows, because the two coincide in direction, that both are biometric information from the same person. It is thus an effective means of recognizing that the disagreement between the two identification results is itself the problem, and it can be used to update the biometric dictionaries.

For example, suppose that user A is facing the autonomous mobile robot according to this embodiment, the robot has detected user A's face image, and user A, who has a request for the robot, says "Has any mail arrived?" while facing it. In this case, the user's voice arrives from the direction in which the face is detected. FIG. 24-1 is a schematic diagram showing this case. As shown in FIG. 24-1, the center of user A's face lies in the direction of a point 2441 on a virtual dome 2490 set around the main body 201 of the autonomous mobile robot, and user A's voice is detected in the direction of the center point 2442 of a certain directivity range. The same-direction verification unit 111 of the identity verification unit 110 then determines that the point 2442 lies within a circular neighborhood region 2443 of predetermined radius centered on the point 2441, that is, that the direction information of the face identification information acquired earlier matches the direction information of the speaker identification information acquired later. By judging that the face image and the voice were detected in substantially the same direction, it verifies that the user of the face image and the user of the voice are the same person.

On the other hand, suppose that user A is facing the autonomous mobile robot according to this embodiment and the robot has detected user A's face image, and that another user B, who is facing sideways and whose face is not visible, says "Has any mail arrived?" FIG. 24-2 is a schematic diagram showing the case where another user's voice arrives from a direction different from the one in which the face is detected.

In this case, if user A and user B are located at separate bearings, the voice arrives from a direction different from the one in which the face image is detected. That is, as shown in FIG. 24-2, user A's face lies in the direction of a point 2444 on the virtual dome 2490 set around the main body 201 of the autonomous mobile robot, and user B's voice is detected from the direction of the center point 2445 of a directivity range located away from the point 2444. A conventional user identification device that verifies identity by simultaneity would, even in this case, wrongly judge a voice detected within the short period in which the face image is detected to be the voice of the same person as the face image.

In the autonomous mobile robot according to this embodiment, however, identity is verified by same-directionality. The same-direction verification unit 111 determines that the point 2445 is not included in the circular neighborhood region 2446 of predetermined radius centered on the point 2444, that is, that the direction information of the face identification information and that of the speaker identification information do not match. Because their detection directions are sufficiently far apart, the face image and the voice are judged not to belong to the same person.
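The circular-neighborhood test on the virtual dome can be sketched as an angular-separation check. The sketch assumes each direction is given as an (azimuth, elevation) pair in degrees, which is a representation chosen here for illustration (the embodiment attaches region numbers), and the function names are hypothetical.

```python
import math
from typing import Tuple

Direction = Tuple[float, float]  # (azimuth_deg, elevation_deg), an assumption


def unit_vector(direction: Direction) -> Tuple[float, float, float]:
    """Convert a direction on the dome to a 3-D unit vector."""
    az, el = math.radians(direction[0]), math.radians(direction[1])
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def angular_separation_deg(a: Direction, b: Direction) -> float:
    """Great-circle angle between two directions, in degrees."""
    ua, ub = unit_vector(a), unit_vector(b)
    dot = sum(x * y for x, y in zip(ua, ub))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def in_circular_neighborhood(center: Direction, point: Direction,
                             radius_deg: float = 5.0) -> bool:
    """True if `point` lies within `radius_deg` of `center` on the dome."""
    return angular_separation_deg(center, point) <= radius_deg


# Face at point 2441 and voice at point 2442 (FIG. 24-1 style case).
same_person = in_circular_neighborhood((40.0, 20.0), (42.0, 21.0))
# Face at point 2444 and voice at point 2445 (FIG. 24-2 style case).
different_person = not in_circular_neighborhood((40.0, 20.0), (100.0, 20.0))
```

The coordinates in the two example calls are made up for illustration; only the 5-degree default radius comes from the description.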

The radius of the circular neighborhood region is set to a predetermined fixed value, for example an angle of 5 degrees. When a face image has been detected, however, the radius may instead be determined by a predetermined calculation from the size of that face image. For example, the radius may be set larger than the distance between the center of the face (roughly the tip of the nose) and the center of the mouth (about 5 cm) and no larger than the size of one directivity range.
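One possible form of such a calculation is sketched below. The description does not give the actual formula, so everything beyond the 5 cm figure is an assumption: the face is taken to be about 16 cm wide, angles are assumed to scale linearly with physical size at a fixed distance (small-angle approximation), and a margin factor of 1.5 keeps the radius above the nose-to-mouth angle. The function name and all parameter names are hypothetical.

```python
def neighborhood_radius_deg(
    face_width_deg: float,          # angular width of the detected face
    beam_width_deg: float,          # angular size of one directivity range
    face_width_cm: float = 16.0,    # hypothetical typical face width
    nose_to_mouth_cm: float = 5.0,  # distance quoted in the description
    margin: float = 1.5,            # hypothetical safety factor (> 1)
) -> float:
    """Pick a radius above the nose-to-mouth angle, capped at one beam width."""
    # Small-angle approximation: angle is proportional to physical length.
    nose_to_mouth_deg = face_width_deg * nose_to_mouth_cm / face_width_cm
    # If the face is so close that the lower bound exceeds the beam width,
    # the cap wins; the description does not say how to resolve that case.
    return min(margin * nose_to_mouth_deg, beam_width_deg)
```

With a face spanning 8 degrees and a 10-degree directivity range, this yields 3.75 degrees; a very close face spanning 40 degrees is capped at 10 degrees.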

As another example, suppose that user A is facing the autonomous mobile robot of this embodiment and the robot has detected user A's face. After a while, something catches user A's attention and user A turns sideways, so the robot can no longer detect the face. Some time later, still facing sideways, user A says "It's a bit hot in this room." In this case, the voice now arrives from the direction in which the face was previously detected and identified, so the direction information of the face identification information acquired earlier is judged to match that of the speaker identification information acquired later. By judging that the face image and the voice were detected in substantially the same direction, the user of the face image and the user of the voice are verified to be the same person.

Thus, with identity verification by same-directionality performed by the same-direction verification unit 111 according to this embodiment, the same person can be verified even when the voice and the face are not detected at nearly the same time, as long as the user does not move. The same person can therefore be verified more accurately than with a conventional user identification device that verifies identity by simultaneity.

(Tracking same-direction verification process)
Next, the tracking same-direction verification process performed by the tracking same-direction verification unit 112 of the identity verification unit 110 is described.

When the user hardly moves, the same-direction verification process by the same-direction verification unit 111 can accurately determine the user's identity. In practice, however, users move, so same-directionality alone cannot verify the same person. That is, if one kind of biometric information, once detected, is lost and the user moves in the meantime, another kind of biometric information from the same person is detected from a different direction as a result of that movement, and the same-direction verification process cannot verify that the two belong to the same person.

For this reason, the autonomous mobile robot according to this embodiment performs a tracking same-direction verification process, which verifies identity by determining whether the direction information of multiple pieces of biometric identification information matches while tracking the image feature region.

A face can be detected only while it is facing the front, and a voice can be detected only while the user is speaking. Thus, the biometric information that the autonomous mobile robot uses to identify a user cannot always be detected, even when the user is near the robot.

Therefore, in this embodiment the user's movement is tracked using another cue that can be detected more stably than biometric information. This cue for tracking movement must actually be related to the voice or face that constitutes the biometric information, so this embodiment tracks the user's movement based on image features observed near the detection direction of the voice or face. Such image features correspond to parts such as the clothing, skin, and hair of the user whose voice or face was detected, and they can be detected stably over a longer period, whether the user turns sideways or stops speaking. In other words, the detection of biometric information serves as a trigger: the image features in that detection direction are recognized and used to track the user's movement.

Because the movement of the user who is the source of the biometric information is tracked based on image features that can be detected more stably than the biometric information itself, it is possible to determine in which direction the source of biometric information detected in a certain direction at a certain time has moved by another time.

FIG. 25 is a flowchart showing the procedure of the tracking same-direction verification process performed by the tracking same-direction verification unit 112.

First, the tracking same-direction verification unit 112 acquires tracking information from the region tracking unit 106 (step S2501) and determines whether the acquired tracking information includes the detection direction of an image feature region (step S2502). The image feature region referred to in the tracking information is the one generated from previously acquired biometric identification information in step S2510, described later.

If the tracking information is determined to include the detection direction of the image feature region (step S2502: Yes), the initial position of the image feature region being tracked is updated with the detection direction in the acquired tracking information (step S2503), and the unit waits for new speaker identification information or new face identification information (step S2505).

On the other hand, if the tracking information is determined in step S2502 not to include the detection direction of the image feature region (step S2502: No), that is, if it indicates that the tracked image feature region has been lost, the image feature region currently being tracked is deleted from the tracking targets (step S2504), and the unit waits for new speaker identification information or new face identification information (step S2505).

When new speaker identification information or new face identification information is acquired (step S2505: Yes), the initial position of the image feature region being tracked is compared with the direction information set in the newly acquired speaker identification information or face identification information (step S2506), and it is determined whether the difference in direction is at or below a predetermined threshold (step S2507). Here, the new speaker identification information or face identification information is the later-acquired biometric identification information, relative to the earlier-acquired biometric identification information from which the image feature region was generated.

If the difference in direction is at or below the predetermined threshold (step S2507: Yes), the direction information of the newly acquired biometric identification information is regarded as substantially the same as the detection direction of the image feature region. In the same-set information, which consists of the speaker identification information and face identification information detected in that substantially identical direction, the corresponding entry is replaced with the newly acquired speaker identification information or face identification information. This updates the same-set information (step S2508), which is then output (step S2509).

On the other hand, if the difference in direction is larger than the predetermined threshold in step S2507 (step S2507: No), a new image feature region is generated by extracting, based on the newly acquired speaker identification information or face identification information, a region that is a characteristic part of the image (step S2510). A region addition command containing an image template, which consists of the new image feature region and its size, and the initial position of the newly generated image feature region is then sent to the region tracking unit 106 (step S2511). The image-center coordinates of the image feature region are set as the initial position. The biometric identification information from which this new image feature region was generated becomes the earlier-acquired biometric identification information. The same-set information is then updated with the newly acquired speaker identification information or face identification information (step S2512), and the same-set information is output (step S2509).
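The flow of steps S2501 to S2512 can be sketched in the same style. As before, this is an illustration under assumptions: directions are single azimuths in degrees, a region addition command is represented only by the initial direction it carries, and a mismatch starts a fresh same-set record. The names `BioId`, `TrackingSameDirectionVerifier`, and `commands` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def angle_diff(a_deg: float, b_deg: float) -> float:
    """Absolute difference between two azimuths, with wrap-around at 360."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


@dataclass
class BioId:
    kind: str             # "speaker" or "face"
    user_id: str
    direction_deg: float  # assumption: direction as one azimuth angle


@dataclass
class TrackingSameDirectionVerifier:
    threshold_deg: float = 5.0            # hypothetical direction threshold
    tracked_deg: Optional[float] = None   # initial position of the tracked region
    same_set: Dict[str, BioId] = field(default_factory=dict)
    commands: List[float] = field(default_factory=list)  # issued region additions

    def on_tracking_info(self, detected_deg: Optional[float]) -> None:
        if detected_deg is not None:
            self.tracked_deg = detected_deg  # S2503: follow the moving region
        else:
            self.tracked_deg = None          # S2504: target lost, stop tracking

    def on_new_id(self, new: BioId) -> Dict[str, BioId]:
        # S2506/S2507: compare against the tracked position, not a fixed one.
        if (self.tracked_deg is not None
                and angle_diff(new.direction_deg, self.tracked_deg) <= self.threshold_deg):
            self.same_set[new.kind] = new    # S2508
        else:
            # S2510/S2511: start tracking a new region at the new direction.
            self.commands.append(new.direction_deg)
            self.tracked_deg = new.direction_deg
            self.same_set = {new.kind: new}  # S2512 (assumption: fresh set)
        return dict(self.same_set)           # S2509
```

Because the comparison uses the tracked position, a face first seen at 30 degrees and a voice later heard at 62 degrees are still grouped together if the tracker has followed the person to about 60 degrees in between.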

When the identity verification process is performed for the first time, the same-set information is generated in the update steps above (steps S2508 and S2512) from the biometric identification information acquired earlier and the newly acquired biometric identification information. If the vocabulary identification unit 103 has output vocabulary identification information whose direction information is the same as that set in the speaker identification information, that vocabulary identification information is added to the same-set information. Furthermore, if the speaker identification information and the face identification information could not both be acquired, the same-set information is output with the entry for the one that could not be acquired left blank. The same-set information updated or generated in this way is used in the dictionary update process and the user authentication process described later.

(Dictionary update process)
Next, the biometric dictionary update process performed by the dictionary update unit 108 is described. FIG. 26 is a flowchart showing the procedure of this process. First, the dictionary update unit 108 receives the same-set information output from the identity verification unit 110 (step S2601) and checks whether the same-set information contains speaker identification information (step S2602). If it does not (step S2602: No), the unit waits for the next same-set information. If it does (step S2602: Yes), the unit further checks whether the same-set information contains face identification information (step S2603).

If the same-set information does not contain face identification information (step S2603: No), the unit waits for the next same-set information. If it does (step S2603: Yes), the unit checks whether the unknown-person ID is set in the speaker identification result of the speaker identification information contained in the same-set information (step S2604).

If the unknown-person ID is set in the speaker identification result of the speaker identification information (step S2604: Yes), the unit checks whether the unknown-person ID is set in the face identification result of the face identification information contained in the same-set information (step S2605).

If the unknown-person ID is set in the face identification result of the face identification information (step S2605: Yes), the user cannot be identified from either the face identification information or the speaker identification information in the same-set information, so the unit waits for the next same-set information. If the unknown-person ID is not set in the face identification result (step S2605: No), the voiceprint dictionary 121 is updated and strengthened (step S2606). Specifically, user identification is judged to have succeeded by the face identification information alone, that is, by the face alone, and the voice data contained in the same-set information is registered in the voiceprint dictionary 121 under the voiceprint pattern corresponding to the user ID that has the maximum similarity in the face identification information.

If, in step S2604, the unknown person ID is not set in the speaker identification information of the same set information (step S2604: No), the unit checks whether the unknown person ID is set in the face identification information of the same set information (step S2607).

If the unknown person ID is set in the face identification information of the same set information (step S2607: Yes), the face dictionary 122 is updated and reinforced (step S2608). Specifically, user identification is judged to have succeeded by the speaker identification information alone, that is, by voice alone, and the image data included in the same set information is registered in the face dictionary 122 under the face image pattern of the user ID that has the maximum similarity in the face identification information.

If, in step S2607, the unknown person ID is not set in the face identification information of the same set information (step S2607: No), user identification is judged to have succeeded by both face and voice. The unit then checks whether the user ID having the maximum similarity in the speaker identification information differs from the user ID having the maximum similarity in the face identification information (step S2609).

If the user ID having the maximum similarity in the speaker identification information differs from the user ID having the maximum similarity in the face identification information (step S2609: Yes), the biometric dictionary with the smaller maximum similarity is updated and reinforced (step S2610). Specifically, the biometric dictionary that produced the smaller maximum similarity is judged to have deteriorated. For the user ID that produced the larger maximum similarity, the data matching the deteriorated dictionary, taken from the voice data and face image data included in the same set information, is registered in that dictionary to update and reinforce it.

If, in step S2609, the user ID having the maximum similarity in the speaker identification information matches the user ID having the maximum similarity in the face identification information (step S2609: No), neither biometric dictionary (voiceprint dictionary 121 or face dictionary 122) is judged to have deteriorated. Both are in good condition and output sufficient similarity, so the unit waits for the next same set information to be input.
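The branching in steps S2603 to S2610 can be summarized as a minimal sketch. The names `Identification`, `SameSetInfo`, `choose_dictionary_update`, and the `UNKNOWN_ID` value are hypothetical and introduced only for illustration. The sketch also assumes that, when only one modality succeeds, the user ID of the successful modality selects the pattern to reinforce, and that a tie in maximum similarity reinforces the face dictionary; the text does not specify the tie case.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

UNKNOWN_ID = "unknown"  # hypothetical stand-in for the unknown person ID


@dataclass
class Identification:
    """One identification result: best-matching user ID and its maximum similarity."""
    user_id: str
    max_similarity: float


@dataclass
class SameSetInfo:
    """Hypothetical same set information pairing a speaker result with a face result."""
    speaker: Identification
    face: Optional[Identification]  # None when no face identification information exists


def choose_dictionary_update(info: SameSetInfo) -> Optional[Tuple[str, str]]:
    """Return (dictionary name, user ID) to reinforce, or None to wait for the next input."""
    if info.face is None:  # S2603: No
        return None
    speaker_unknown = info.speaker.user_id == UNKNOWN_ID  # S2604
    face_unknown = info.face.user_id == UNKNOWN_ID        # S2605 / S2607
    if speaker_unknown and face_unknown:
        return None  # neither modality identified the user
    if speaker_unknown:
        # Face alone succeeded: reinforce the voiceprint dictionary (S2606).
        return ("voiceprint", info.face.user_id)
    if face_unknown:
        # Voice alone succeeded: reinforce the face dictionary (S2608).
        # Assumption: the speaker result's user ID selects the pattern.
        return ("face", info.speaker.user_id)
    if info.speaker.user_id == info.face.user_id:  # S2609: No
        return None  # both dictionaries are in good condition
    # S2610: the dictionary with the smaller maximum similarity is treated as
    # deteriorated and reinforced for the user ID with the larger one.
    if info.speaker.max_similarity < info.face.max_similarity:
        return ("voiceprint", info.face.user_id)
    return ("face", info.speaker.user_id)  # ties land here (assumption)
```

Returning a dictionary name and a user ID, rather than performing the registration, keeps the decision separate from the subspace update itself.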

Each biometric dictionary is updated as follows. As described above, the voiceprint dictionary 121 and the face dictionary 122, which are the biometric dictionaries, are subspaces generated from teaching data. Updating a biometric dictionary means adding newly learned data to a dictionary subspace that has already been generated, so the correlation matrix used when that subspace was generated is reused. The new teaching data undergoes the same processing as when the dictionary subspace was generated and is converted into an autocorrelation matrix. This autocorrelation matrix is added to the registered correlation matrix to form a new correlation matrix, which is subjected to principal component analysis. Of the resulting eigenvectors, the top M in descending order of eigenvalue are extracted. The subspace spanned by these M eigenvectors is a dictionary subspace that reflects the new teaching data in addition to the earlier teaching data.
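The update described above can be written compactly as follows. This is a sketch only: the symbols C, x_i, N, phi_k, and lambda_k are notation assumed here for illustration, while M is the number of eigenvectors named in the text. Any normalization or weighting of the correlation matrices is not specified in the text and is omitted.

```latex
% C_old: correlation matrix registered when the dictionary subspace was generated
% x_1, ..., x_N: feature vectors obtained from the new teaching data (assumed notation)
C_{\mathrm{new}} = C_{\mathrm{old}} + \sum_{i=1}^{N} x_i x_i^{\top}

% Principal component analysis of the updated correlation matrix
C_{\mathrm{new}} \, \phi_k = \lambda_k \, \phi_k, \qquad \lambda_1 \ge \lambda_2 \ge \cdots

% Updated dictionary subspace: span of the top M eigenvectors
\mathcal{D}_{\mathrm{new}} = \operatorname{span}\{\phi_1, \ldots, \phi_M\}
```

Because only the accumulated correlation matrix is needed, the earlier teaching data does not have to be kept to fold in new samples.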

In this way, the speaker identification information, voice data, face identification information, and image data for multiple kinds of biometric information (voice and face image) of the same user are included in one piece of same set information. Comparing the identification results for each kind of biometric information detects a biometric dictionary for that user that has deteriorated, that is, has expired. The biometric information currently being input from the user is then used as teaching data, so the deteriorated biometric dictionary can be reinforced automatically.

(User authentication processing)
Next, user authentication processing by the user authentication unit 107 is described. FIG. 27 is a flowchart showing the procedure of user authentication processing by the user authentication unit 107. First, the user authentication unit 107 receives the same set information output from the identity verification unit 110 (step S2701) and checks whether the unknown person ID is set in the speaker identification result of the speaker identification information included in the same set information (step S2702).

If the unknown person ID is set in the speaker identification result of the speaker identification information (step S2702: Yes), the unit checks whether the unknown person ID is also set in the face identification result of the face identification information (step S2703). If it is (step S2703: Yes), the user can be identified neither by voice nor by face, so the unit waits for the next same set information to be input.

If, in step S2703, the unknown person ID is not set in the face identification result of the face identification information (step S2703: No), user identification is judged to have succeeded by the face alone. The user ID having the maximum similarity set in the face identification information of the same set information is taken as the authentication result (step S2704), and this user ID is set in the user authentication information, which is then output (step S2708).

If, in step S2702, the unknown person ID is not set in the speaker identification result of the speaker identification information included in the same set information (step S2702: No), the unit checks whether the unknown person ID is set in the face identification result of the face identification information (step S2705).

If the unknown person ID is set in the face identification result of the face identification information (step S2705: Yes), user identification is judged to have succeeded by voice alone. The user ID having the maximum similarity set in the speaker identification information of the same set information is taken as the authentication result (step S2706), and this user ID is set in the user authentication information, which is then output (step S2708).

If, in step S2705, the unknown person ID is not set in the face identification result of the face identification information (step S2705: No), user identification is judged to have succeeded by both face and voice. Of the maximum similarity set in the speaker identification information and the maximum similarity set in the face identification information of the same set information, the user ID belonging to the larger one is taken as the authentication result (step S2707), and this user ID is set in the user authentication information, which is then output (step S2708).
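The authentication decision in steps S2702 to S2707 can be sketched as below. The names `Identification`, `authenticate`, and the `UNKNOWN_ID` value are hypothetical and used only for illustration. Returning the speaker result when the two maximum similarities are equal is an assumption; the text does not specify the tie case.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN_ID = "unknown"  # hypothetical stand-in for the unknown person ID


@dataclass
class Identification:
    """One identification result: best-matching user ID and its maximum similarity."""
    user_id: str
    max_similarity: float


def authenticate(speaker: Identification, face: Identification) -> Optional[str]:
    """Return the authenticated user ID, or None when neither modality identified a user."""
    speaker_unknown = speaker.user_id == UNKNOWN_ID  # S2702
    face_unknown = face.user_id == UNKNOWN_ID        # S2703 / S2705
    if speaker_unknown and face_unknown:
        return None              # wait for the next same set information
    if speaker_unknown:
        return face.user_id      # S2704: face alone succeeded
    if face_unknown:
        return speaker.user_id   # S2706: voice alone succeeded
    # S2707: both succeeded, so take the result with the larger maximum similarity.
    if face.max_similarity > speaker.max_similarity:
        return face.user_id
    return speaker.user_id       # ties fall back to the speaker result (assumption)
```

For example, `authenticate(Identification("A", 0.6), Identification("B", 0.9))` returns `"B"`, because the face result carries the larger maximum similarity.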

The service providing unit 109 receives this user authentication information and uses the user ID set in it to execute various service processes, such as a mail check service.

As described above, the autonomous mobile robot according to the present embodiment obtains the direction in which a user exists separately from the voice data obtained from the microphones 206, 207, and 208 and from the image data obtained from the all-sky camera 205. Of the speaker identification information, which is the biometric information for the voice data, and the face identification information, which is the biometric information for the image data, the robot takes whichever was obtained first and tracks the region that approximates the image feature region, that is, the characteristic image region in the direction that information indicates. It then determines whether the difference between the detection direction of the image feature region and the direction of the biometric identification information acquired later is equal to or less than a predetermined threshold. If the difference is equal to or less than the threshold, the robot determines that the user identified by the earlier biometric identification information and the user identified by the later biometric identification information are the same. Therefore, when multiple kinds of biometric information are used for user identification, the identity of the user can be verified from whether the user directions obtained from the separate kinds of biometric information coincide. The verification remains highly accurate even when a different kind of biometric information is acquired from the user after a certain period has elapsed.
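The direction comparison at the core of this identity check can be illustrated with a short sketch. It assumes that directions are expressed as azimuth angles in degrees and that the difference wraps around at 360 degrees; the function names and the threshold value used in the example are hypothetical and do not come from the text.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_same_user(tracked_direction_deg: float,
                 later_direction_deg: float,
                 threshold_deg: float) -> bool:
    """True when the tracked image feature region and the later biometric
    identification information point in directions within the threshold."""
    return angular_difference(tracked_direction_deg, later_direction_deg) <= threshold_deg
```

With an illustrative threshold of 20 degrees, `is_same_user(350.0, 10.0, 20.0)` is `True` because the wrapped difference is 20 degrees, while `is_same_user(0.0, 90.0, 20.0)` is `False`.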

In addition, the autonomous mobile robot according to the present embodiment generates same set information that pairs the speaker identification information and the face identification information whose identity has been verified with high accuracy in this way, and updates the biometric dictionaries according to the content of this same set information. The need to update a biometric dictionary can therefore be judged more accurately.

Furthermore, the autonomous mobile robot according to the present embodiment authenticates the user from same set information that pairs speaker identification information and face identification information whose identity has been verified with high accuracy, and provides services to the authenticated user. Services can therefore be provided to users more safely.

Although the present embodiment shows an example in which the user identification device of the present invention is applied to an autonomous mobile robot, the invention is not limited to this. The user identification device of the present invention can be applied to any device that has a user identification function, not only to autonomous mobile robots.

A block diagram showing the functional configuration of the autonomous mobile robot according to the present embodiment.
A schematic diagram showing the external appearance of the autonomous mobile robot of the present embodiment.
A schematic diagram showing the arrangement of the all-sky camera 205 imaging two users.
A schematic diagram showing an all-sky image in which the two users in the arrangement shown in FIG. 3-1 are captured by the all-sky camera 205.
A block diagram showing the configuration of the acoustic directivity forming unit 101.
A block diagram showing the configuration of the acoustic directivity forming module 102.
An explanatory diagram showing an example of the directivity center of the acoustic directivity forming module 102.
A schematic diagram showing, from above the robot, the directivity range set by the front-stage and rear-stage beamformers of the acoustic directivity module.
A schematic diagram showing, from the horizontal direction, the directivity range set by the front-stage and rear-stage beamformers of the acoustic directivity module.
An explanatory diagram showing an arrangement example of directivity ranges covering 180 degrees in 20-degree steps.
An explanatory diagram showing an example before correction in the correction from relative azimuth to absolute azimuth in the autonomous mobile robot.
An explanatory diagram showing an example after correction in the correction from relative azimuth to absolute azimuth in the autonomous mobile robot.
A data structure diagram of the directional voice information output from the acoustic directivity forming module 102 of the acoustic directivity forming unit 101.
A data structure diagram of the vocabulary identification information.
A data structure diagram of the speaker identification information.
A data structure diagram of the face identification information.
A data structure diagram of the direction association table.
A data structure diagram of the same set information.
A flowchart showing the overall procedure of user identification by the autonomous mobile robot according to the present embodiment.
A flowchart showing the procedure for generating directional voice information by the acoustic directivity forming unit 101.
A flowchart showing the procedure of vocabulary identification processing by the vocabulary identification unit 103.
A flowchart showing the procedure of speaker identification processing by the speaker identification unit 104.
An explanatory diagram showing an example of voice data converted from input voice data into a power spectrum sequence.
An explanatory diagram showing an example of the voiceprint pattern 1402 when man A utters "aiueo".
An explanatory diagram showing an example of the voiceprint pattern 1403 when woman B utters "aiueo".
An explanatory diagram showing how the dictionary subspace and the input subspace are generated in speaker identification processing.
An explanatory diagram showing the concept of the mutual subspace method in speaker identification processing.
A flowchart showing the procedure of face identification processing by the face identification unit 105.
An explanatory diagram showing the concept of the mutual subspace method in face identification processing.
A flowchart showing the procedure of region tracking processing by the region tracking unit 106.
A schematic diagram showing an example of setting an image feature region by a region addition command.
A schematic diagram showing another example of setting an image feature region by a region addition command.
A schematic diagram showing an example of region tracking processing.
A schematic diagram showing another example of region tracking processing.
A flowchart showing the procedure of same-directionality verification processing by the same-directionality verification unit 111.
A schematic diagram showing a case where the user's voice arrives from the direction in which a face is detected.
A schematic diagram showing a case where another user's voice arrives from a direction different from the direction in which a face is detected.
A flowchart showing the procedure of tracking same-directionality verification processing by the tracking same-directionality verification unit 112.
A flowchart showing the procedure of biometric dictionary update processing by the dictionary update unit 108.
A flowchart showing the procedure of user authentication processing by the user authentication unit 107.

Explanation of symbols

101 Acoustic directivity forming unit
102 Acoustic directivity forming module
103 Vocabulary identification unit
104 Speaker identification unit
105 Face identification unit
106 Region tracking unit
107 User authentication unit
108 Dictionary update unit
109 Service providing unit
110 Identity verification unit
111 Same-directionality verification unit
112 Tracking same-directionality verification unit
120 Biometric dictionary storage unit
121 Voiceprint dictionary
122 Face dictionary
123 Vocabulary dictionary
201 Main body
202, 203 Drive wheels
205 All-sky camera
206, 207, 208 Microphones

Claims

A user identification device comprising:
detecting means for detecting a user's voice and outputting the detected voice as biometric information of the user;
imaging means for capturing an image of the user and outputting the captured image as biometric information of the user;
biometric dictionary storage means for storing a plurality of biometric dictionaries, one for each kind of biometric information output by the detecting means and the imaging means, in which user identification information is associated with user biometric information;
identification means for identifying a user from the biometric information output by the detecting means, based on that biometric information and the biometric dictionary corresponding to it, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists, and for identifying a user from the biometric information output by the imaging means, based on that biometric information and the biometric dictionary corresponding to it, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists;
area tracking means for detecting, from an image newly input by the imaging means, an area that approximates an image feature area, which is a characteristic area of the image in the direction indicated by first biometric identification information generated by the identification means and acquired earlier, setting the detected area as a new image feature area, and obtaining the detection direction of the newly set image feature area; and
identity verification means for determining whether the difference between the detection direction of the image feature area obtained by the area tracking means and the direction of second biometric identification information acquired after the first biometric identification information is equal to or less than a predetermined threshold, and for determining, when the difference in direction is equal to or less than the threshold, that the user identified by the first biometric identification information is the same as the user identified by the second biometric identification information.

The user identification device according to claim 1, wherein the identity verification means determines whether the difference between the direction of the first biometric identification information and the direction of the second biometric identification information acquired after the first biometric identification information is equal to or less than a predetermined threshold and, when the difference in direction is equal to or less than the threshold, determines that the user identified by the first biometric identification information is the same as the user identified by the second biometric identification information.

The user identification device according to claim 1, wherein the identity verification means further generates same set information in which the first biometric identification information and the second biometric identification information, whose identified users have been determined to be the same, are associated with each other.

The user identification device according to claim 3, wherein the identification means further obtains, for each user, a similarity indicating the degree of coincidence between the biometric information for each user registered in the biometric dictionary and the biometric information output by the detection means, and generates biometric identification information in which the identification information of the user having the maximum similarity, the maximum similarity, and the direction in which the user exists are associated with one another.

The user identification device according to claim 4, further comprising dictionary updating means for determining whether the user having the maximum similarity indicated by the first biometric identification information associated in the same set information matches the user having the maximum similarity indicated by the second biometric identification information and, when they do not match, updating the biometric dictionary corresponding to the biometric identification information with the smaller maximum similarity based on the biometric identification information with the larger maximum similarity.

The user identification device according to claim 4, further comprising:
user authentication means for authenticating, as the user of the biometric information associated in the same set information, the user having the maximum similarity of whichever of the first biometric identification information and the second biometric identification information associated in the same set information has the larger maximum similarity; and
service providing means for executing predetermined service processing for the user authenticated by the user authentication means.

The user identification device according to claim 1, wherein the detection means includes first, second, and third voice input means that detect the user's voice from different directions and output voice information of the detected voice as the biometric information,
the device further comprising acoustic directivity forming means for outputting, from the first voice information output from the first voice input means and the second voice information output from the second voice input means, fourth voice information whose input sensitivity is limited to a first directivity range covering a predetermined range of directions, for outputting, from the output fourth voice information and the third voice information output from the third voice input means, fourth voice information whose input sensitivity is limited to a second directivity range that further limits the first directivity range, and for estimating the second directivity range as the arrival direction of the user's voice.

The user identification device according to claim 7, wherein:
the acoustic directivity forming means further outputs directional voice information in which the arrival direction and the fourth voice information are associated with each other;
the biometric dictionary storage means stores, as the biometric dictionary, a speaker dictionary in which user identification information is associated with the user's voice information; and
the identification means comprises speaker identification means that identifies the user based on the directional voice information output by the acoustic directivity forming means and the speaker dictionary, and generates, as the first biometric identification information, speaker identification information in which the identification information of the user is associated with the arrival direction of the directional voice information.

The user identification device according to claim 7, wherein:
the biometric dictionary storage means stores, as the biometric dictionary, a face dictionary in which user identification information is associated with an image of the user's face; and
the identification means comprises face identification means that searches the image output by the imaging means for the user's face area based on that image and the face dictionary, identifies the user, and generates, as the second biometric identification information, face identification information in which the identification information of the user is associated with the image center position of the face area as the direction.

A user identification method comprising:
an identification step of identifying a user from biometric information output by detection means, which detects the user's voice and outputs the detected voice as the user's biometric information, based on that biometric information and a biometric dictionary that corresponds to it and in which user identification information is associated with user biometric information, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists, and of identifying a user from biometric information output by imaging means, which captures an image of the user and outputs the captured image as the user's biometric information, based on that biometric information and the biometric dictionary corresponding to it, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists;
an area tracking step of detecting, from an image newly input by the imaging means, an area that approximates an image feature area, which is a characteristic area of the image in the direction indicated by first biometric identification information generated in the identification step and acquired earlier, setting the detected area as a new image feature area, and obtaining the detection direction of the newly set image feature area; and
an identity verification step of determining whether the difference between the detection direction of the image feature area obtained in the area tracking step and the direction of second biometric identification information acquired after the first biometric identification information is equal to or less than a predetermined threshold, and of determining, when the difference in direction is equal to or less than the threshold, that the user identified by the first biometric identification information is the same as the user identified by the second biometric identification information.

A user identification program that causes a computer to execute:
an identification step of identifying a user from biometric information output by detection means, which detects the user's voice and outputs the detected voice as the user's biometric information, based on that biometric information and a biometric dictionary that corresponds to it and in which user identification information is associated with user biometric information, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists, and of identifying a user from biometric information output by imaging means, which captures an image of the user and outputs the captured image as the user's biometric information, based on that biometric information and the biometric dictionary corresponding to it, and generating biometric identification information in which the user identification information is associated with the direction in which the user exists;
an area tracking step of detecting, from an image newly input by the imaging means, an area that approximates an image feature area, which is a characteristic area of the image in the direction indicated by first biometric identification information generated in the identification step and acquired earlier, setting the detected area as a new image feature area, and obtaining the detection direction of the newly set image feature area; and
an identity verification step of determining whether the difference between the detection direction of the image feature area obtained in the area tracking step and the direction of second biometric identification information acquired after the first biometric identification information is equal to or less than a predetermined threshold, and of determining, when the difference in direction is equal to or less than the threshold, that the user identified by the first biometric identification information is the same as the user identified by the second biometric identification information.