JP2022088146A

JP2022088146A - Learning data generation device, person identifying device, learning data generation method, and learning data generation program

Info

Publication number: JP2022088146A
Application number: JP2020200432A
Authority: JP
Inventors: 勇太萩尾; Yuta Hagio; 豊金子; Yutaka Kaneko; 晋矢阿部; Shinya Abe
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-12-02
Filing date: 2020-12-02
Publication date: 2022-06-14

Abstract

To provide a person identification device, a person identification method, and a person identification program that do not require users to register their faces in advance and can identify persons with high accuracy even when users with similar faces are present. SOLUTION: A person identification device 1 comprises: a face detection unit 11 that detects face image areas of persons from a group of images captured by a camera; a correspondence estimation unit 13 that determines an identical person based on the coordinates of the face images in two images captured within a time less than a threshold, assigns an identical person ID to the face images of the identical person, and assigns a different person ID to the other face images; a learning data recording unit 14 that records the face images in association with the person IDs and records pairs of person IDs obtained from a group of images captured at the same time; and a learning data set generation unit 15 that generates learning data in Triplet format, each item consisting of two face images having the same person ID and one face image of a person ID that forms a pair with that person ID. SELECTED DRAWING: Figure 1

Description

The present invention relates to a device, a method, and a program for identifying a person in an image.

Techniques for identifying a person in an image acquired from a camera or the like have been studied for many years and are used in a wide range of settings, including communication robots and IoT devices used in the home.

Identifying a person generally requires two processes: a face detection process and a face identification process.
The face detection process detects a rectangular area (face area) containing a person's face in an input image. Proposed approaches include a method using a cascade detector, as in Non-Patent Document 1, and object detection methods based on deep learning, as in Non-Patent Documents 2 and 3. When a deep-learning object detection method is used, Non-Patent Document 4 can be used as a data set specialized for human face detection.

The face identification process, on the other hand, estimates whether a given face image shows one of the known persons or an unknown person. The input face image is either an image showing only the face area of a single person, or an image in which the face area detected by the face detection process described above is specified. The face identification process consists of two steps: a learning step of registering the persons to be identified, and an estimation step of identifying the person in an input face image.

In the learning step, the face image of each person to be identified is converted into a feature vector and stored in a database. A method for converting a face image into a feature vector is proposed, for example, in Non-Patent Document 5.
In the estimation step, the input face image is converted into a feature vector in the same way as in the learning step and compared with the feature vectors of the persons registered in advance. The output is an estimate of which registered person, if any, the person in the input face image matches.

Various methods have also been proposed for collecting the face images needed for the learning step.
For example, Patent Document 1 proposes a device that collects images taken from various angles by having the system instruct the user on the orientation of the face when the user's face is registered in the system.
Patent Document 2 proposes a device in which a robot itself gives instructions to the user when the user's face is registered in the system, so that the registration process proceeds smoothly.
Patent Document 3 proposes a device that acquires high-quality face images without burdening the user by performing the registration process while the robot is being held in the user's arms.

Patent Document 1: Japanese Unexamined Patent Publication No. 2000-259834
Patent Document 2: Japanese Unexamined Patent Publication No. 2004-302645
Patent Document 3: International Publication No. 2018/084170

Non-Patent Document 1: T. Mita et al., “Joint Haar-like features for face detection,” Tenth IEEE International Conference on Computer Vision (ICCV'05).
Non-Patent Document 2: S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” Twenty-ninth Conference on Neural Information Processing Systems (NIPS 2015).
Non-Patent Document 3: J. Redmon et al., “You Only Look Once: Unified Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).
Non-Patent Document 4: Y. Shuo et al., “WIDER FACE: A Face Detection Benchmark,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).
Non-Patent Document 5: Yoshihiko Kawai et al., “Study of Face Recognition Using Deep Neural Networks,” 18th Information Science and Technology Forum (FIT 2019).
Non-Patent Document 6: G. B. Huang et al., “Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments,” University of Massachusetts, Technical Report, 2007.

In a communication robot, IoT device, or the like that performs person identification, users' faces are generally registered at initial setup. As proposed in Patent Document 1 or Patent Document 2, the robot or IoT device gives instructions to the user by voice, text display, or the like, and the user registers his or her face by turning it in various directions according to those instructions. If the robot or IoT device is shared by a family, this face registration must be performed for each member. In addition, because human faces change with age, persons may no longer be identified correctly unless the registered information is updated regularly. Face registration on a robot or IoT device therefore places a burden on the user, and the total burden grows with the number of users.

One way to address this problem is the method of Patent Document 3. That method reduces the user's burden in face registration, but a robot that adopts it must be designed so that it can be held in the arms, and a sensor device that detects that the robot is being held must be built into the robot. The method therefore constrains the robot's design and cannot be implemented on IoT devices, so it cannot be called a highly versatile method.

In addition, because communication robots and IoT devices are often used in the home, the set of users is likely to include persons whose faces are similar for genetic reasons. Many methods that convert a face image into a feature vector, such as that of Non-Patent Document 5, are trained on a large-scale face image data set such as that of Non-Patent Document 6. Such a data set contains persons with a wide variety of attributes (age, gender, race, and so on), and training on it is expected to yield general-purpose face feature extraction. However, in a system intended for home use, such as a communication robot or IoT device, the attributes of the persons who actually need to be identified are limited. With a conventional general-purpose face feature extraction method, persons with similar faces are placed close together in the feature space, so the feature vectors of different persons can be close to each other, which causes misidentification.

An object of the present invention is to provide a person identification device, a person identification method, and a person identification program that require no prior face registration by users and can identify persons with high accuracy even when users with similar faces are present.

The learning data generation device according to the present invention comprises: a face detection unit that detects face image areas of persons from a group of images captured by a camera; a correspondence estimation unit that determines an identical person, based on the coordinates of the face images, from two images captured within a time less than a threshold, assigns an identical person ID to the face images of the identical person, and assigns a different person ID to the other face images; a learning data recording unit that records the face images in association with the person IDs and records pairs of person IDs obtained from a group of images captured at the same time; and a learning data set generation unit that generates learning data in Triplet format, each item consisting of two face images having the same person ID and one face image of a person ID that forms a pair with that person ID. The device outputs the learning data for person identification that takes face images as input.
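The Triplet construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `generate_triplets`, the dictionary layout of `faces_by_id`, and the random choice of the negative image are all hypothetical.

```python
import random
from itertools import combinations


def generate_triplets(faces_by_id, id_pairs, rng=None):
    """Build (anchor, positive, negative) triplets.

    faces_by_id: dict mapping person ID -> list of face images (any objects).
    id_pairs:    iterable of (id_a, id_b) recorded from images captured at the
                 same time, i.e. IDs assumed to belong to different persons.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    triplets = []
    for id_a, id_b in id_pairs:
        # Use each member of the pair in turn as the anchor identity.
        for same_id, other_id in ((id_a, id_b), (id_b, id_a)):
            same_faces = faces_by_id.get(same_id, [])
            other_faces = faces_by_id.get(other_id, [])
            if len(same_faces) < 2 or not other_faces:
                continue  # need two images of one ID and one of the paired ID
            for anchor, positive in combinations(same_faces, 2):
                # Assumption: the negative is drawn at random from the paired ID.
                negative = rng.choice(other_faces)
                triplets.append((anchor, positive, negative))
    return triplets


example = generate_triplets({1: ["a1", "a2"], 2: ["b1"]}, [(1, 2)])
```

Only person IDs that were recorded as a pair are used as negatives, which reflects the premise that persons captured at the same time are different persons.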

The learning data generation device may include a person direction estimation unit that derives a person direction based on the coordinates of the face image, and when person directions whose difference is less than a threshold are derived from two images captured at the same time, the correspondence estimation unit may assign the same person ID to the two face images corresponding to those person directions.

When person directions whose difference is less than a threshold are derived from two images captured at the same time, or from two images captured within a time less than a threshold, the correspondence estimation unit may assign the same person ID to the two face images corresponding to those person directions and assign a different person ID to the other face images.

A plurality of cameras may be arranged at positions where the same person is not captured by more than one camera at the same time, and when two face images whose coordinates are separated by a distance less than a threshold are detected in two images captured within a time less than a threshold, the correspondence estimation unit may assign the same person ID to the two face images and assign a different person ID to the other face images.

The person identification device according to the present invention comprises: a model learning unit that trains a feature extraction model using the learning data output from the learning data generation device; a feature recording unit that converts the face images into feature vectors using the feature extraction model and records a centroid vector of the feature vectors for each user ID, the user IDs being obtained by integrating the person IDs based on the distances between the feature vectors; a face feature extraction unit that converts a face image contained in a newly captured image into a feature vector using the feature extraction model; and a person estimation unit that compares the feature vector obtained by the face feature extraction unit with the centroid vector of each user ID and outputs, as the person estimation result, the user ID whose centroid vector is at a distance less than a threshold.
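The comparison against per-user centroid vectors can be illustrated with a short sketch. The Euclidean distance and the rule of returning the nearest user ID among those below the threshold are assumptions made for this example; the text only requires a distance below a threshold. The names `centroid` and `estimate_user` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def estimate_user(feature, centroids, threshold):
    """Return the user ID whose centroid is nearest to `feature` and closer
    than `threshold`, or None if no centroid is close enough (unknown person).
    """
    best_id, best_dist = None, threshold
    for user_id, c in centroids.items():
        d = math.dist(feature, c)  # Euclidean distance (assumed metric)
        if d < best_dist:
            best_id, best_dist = user_id, d
    return best_id


centroids = {"u1": centroid([[0.0, 0.0], [2.0, 0.0]]), "u2": [10.0, 10.0]}
known = estimate_user([1.0, 0.5], centroids, threshold=1.0)
unknown = estimate_user([5.0, 5.0], centroids, threshold=1.0)
```

Returning `None` when no centroid lies within the threshold corresponds to treating the face as belonging to an unknown person.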

The feature recording unit may accept input of the number of users to be identified, cluster the centroid vectors of the person IDs into that number of clusters, and assign a user ID to each cluster.
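As one possible illustration of clustering person-ID centroids into a given number of users, the sketch below repeatedly merges the two closest clusters until the requested number remains. The text does not specify a clustering algorithm, so this agglomerative approach, the function name `cluster_person_ids`, and the size-weighted merge are assumptions.

```python
import math
from itertools import combinations


def cluster_person_ids(centroids, num_users):
    """Group person IDs into `num_users` clusters.

    centroids: dict mapping person ID -> centroid vector (list of floats).
    Returns a list of sets of person IDs, one set per user.
    """
    # Each cluster is (set of person IDs, mean vector of the cluster).
    clusters = [({pid}, list(vec)) for pid, vec in centroids.items()]
    while len(clusters) > max(num_users, 1):
        # Find the pair of clusters whose mean vectors are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: math.dist(clusters[p[0]][1], clusters[p[1]][1]),
        )
        ids_i, vec_i = clusters[i]
        ids_j, vec_j = clusters[j]
        n_i, n_j = len(ids_i), len(ids_j)
        # Merge them, weighting each mean by the number of person IDs it holds.
        merged_vec = [(a * n_i + b * n_j) / (n_i + n_j) for a, b in zip(vec_i, vec_j)]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [(ids_i | ids_j, merged_vec)]
    return [ids for ids, _ in clusters]


groups = cluster_person_ids(
    {1: [0.0, 0.0], 2: [0.1, 0.0], 3: [5.0, 5.0], 4: [5.0, 5.2]}, num_users=2
)
```

Each resulting set of person IDs would then receive one user ID.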

The feature recording unit may present the face images corresponding to a user ID, accept input of a user name, and record the user name together with the centroid vector.

When one of the presented face images is indicated as not being an identification target, the feature recording unit may exclude the person ID that contains the indicated face image and perform the clustering again.

In the learning data generation method according to the present invention, a computer executes: a face detection step of detecting face image areas of persons from a group of images captured by a camera; a correspondence estimation step of determining an identical person, based on the coordinates of the face images, from two images captured within a time less than a threshold, assigning an identical person ID to the face images of the identical person, and assigning a different person ID to the other face images; a learning data recording step of recording the face images in association with the person IDs and recording pairs of person IDs obtained from a group of images captured at the same time; and a learning data set generation step of generating learning data in Triplet format, each item consisting of two face images having the same person ID and one face image of a person ID that forms a pair with that person ID. The method outputs the learning data for person identification that takes face images as input.

The learning data generation program according to the present invention causes a computer to function as the learning data generation device.

According to the present invention, nearby persons can be identified with high accuracy without prior face registration.

FIG. 1 is a diagram showing the functional configuration of the person identification device in the first embodiment.
FIG. 2 is a diagram showing a processing example of the person direction estimation unit in the first embodiment.
FIG. 3 is a diagram showing a processing example of the correspondence estimation unit in the first embodiment.
FIG. 4 is a diagram showing a processing example of the learning data recording unit in the first embodiment.
FIG. 5 is a diagram showing a processing example of the learning data set generation unit in the first embodiment.
FIG. 6 is a diagram showing a processing example of the model learning unit in the first embodiment.
FIG. 7 is a diagram showing a processing example of the feature vector generation unit in the first embodiment.
FIG. 8 is a diagram showing a processing example of the person estimation unit in the first embodiment.
FIG. 9 is a flowchart illustrating the flow of the person identification method in the first embodiment.
FIG. 10 is a diagram showing the functional configuration of the person identification device in the second embodiment.
FIG. 11 is a diagram showing a processing example of the correspondence estimation unit in the second embodiment.
FIG. 12 is a diagram showing the functional configuration of the person identification device in the third embodiment.
FIG. 13 is a diagram showing a processing example of the person registration unit in the third embodiment.

The person identification device according to the embodiments of the present invention collects face images of nearby users while a communication robot, IoT device, or the like is in operation, automatically builds a data set for training, and trains a deep learning model at specific times using that data set.
In doing so, the person identification device detects the face areas of persons captured by the camera and accumulates labeled data in a database. Labeling relies on two properties: persons at nearby positions in consecutive frames are the same person, and persons captured at the same time are different persons.

[First Embodiment]
The first embodiment of the present invention is described below by way of example. The person identification device 1 of this embodiment automatically identifies nearby persons, using as its input device a camera array with a plurality of cameras mounted on a communication robot or IoT device.

FIG. 1 is a diagram showing the functional configuration of the person identification device 1.
The person identification device 1 is an information processing device that includes a control unit 10, a storage unit 20, and various interfaces. The functions of this embodiment are realized by the control unit 10 executing software (a person identification program) stored in the storage unit 20.

The control unit 10 includes a face detection unit 11, a person direction estimation unit 12, a correspondence estimation unit 13, a learning data recording unit 14, a learning data set generation unit 15, a model learning unit 16, a feature vector generation unit 17 (feature recording unit), a face feature extraction unit 18, and a person estimation unit 19.
In addition to the person identification program, the storage unit 20 holds a person memory 21, a face image database (DB) 22, a person pair database (DB) 23, a face feature extraction model 24, and a face feature database (DB) 25.

The face detection unit 11 acquires images from a camera array connected to a communication robot, IoT device, or the like, and detects the faces of the persons in the acquired images.
The camera array is, for example, a device in which a plurality of cameras are embedded in a cylindrical housing; installing eight cameras at 45-degree intervals in the horizontal direction makes it possible to collect video in all directions. The camera array does not necessarily have to collect video in all directions; it is sufficient that it captures the positions where users are present when the communication robot, IoT device, or the like is in use.
The configuration is also not limited to simultaneous capture by the cameras of the camera array; one or several cameras may instead capture the surroundings continuously at intervals sufficiently short relative to the movement of a person.

Face images can be detected using, for example, a method based on a cascade detector as proposed in Non-Patent Document 1, or an object detection method based on deep learning as proposed in Non-Patent Document 2 or Non-Patent Document 3. The face detection unit 11 performs face detection on each image acquired from each camera of the camera array and outputs coordinate information indicating the position of each face area within the image to the person direction estimation unit 12.
When a series of processes is complete and the person estimation result has been output, the person identification device 1 acquires new images from the camera array and executes the processing of the face detection unit 11 again.

The person direction estimation unit 12 estimates the direction in which a person is located, based on the position within the image of the face detected by the face detection unit 11.
The person direction is estimated using, for example, the method proposed in Reference A below. In this case, an angle conversion table that associates pixel positions in the camera image with angles as seen from the robot or other device carrying the person identification device 1 is prepared in advance for each camera of the camera array, and the person direction estimation unit 12 uses these tables to estimate the person direction.
Reference A: Yuta Hagio et al., "Prototype and verification of a communication robot for watching TV with humans," IEICE Technical Report, vol. 119, no. 446, CNR2019-46, pp. 7-12, 2020.

FIG. 2 is a diagram showing a processing example of the person direction estimation unit 12.
In this example, the camera array has two cameras to simplify the explanation.
Here, the direction of each person is estimated from the face detection result output by the face detection unit 11. Specifically, the person direction estimation unit 12 converts the center coordinates of each face area obtained as the face detection result into an angle, using the angle conversion table prepared in advance for each camera.

In this example, one person appears in the image from camera A and two persons appear in the image from camera B. The person direction estimation unit 12 looks up the value corresponding to the center coordinates of each face area in the angle conversion table of the relevant camera, and obtains 75° from camera A and 75° and 115° from camera B as the person direction estimates.
The angle conversion table actually has the same dimensions as the camera's pixel array, but it is simplified here for explanation. The values stored in the table are set in units of 1° in this example, but finer angles may be used.
Only the horizontal angle is estimated in this example, but the table may be extended to store two-dimensional values (angles) so that the vertical angle can also be estimated.
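The table lookup can be sketched as below. A real angle conversion table would come from calibrating each camera; the linear table built by `make_linear_table`, its dimensions, and the `(x, y, w, h)` box format are hypothetical stand-ins used only to make the example self-contained.

```python
def make_linear_table(width, height, left_deg, right_deg):
    """Hypothetical stand-in for a calibrated table: one angle per pixel,
    varying linearly from left_deg at column 0 to right_deg at the last column.
    """
    row = [
        round(left_deg + (right_deg - left_deg) * x / (width - 1))
        for x in range(width)
    ]
    return [list(row) for _ in range(height)]


def person_direction(face_box, angle_table):
    """face_box is (x, y, w, h) in pixels; returns the table value at the
    centre of the face area.
    """
    x, y, w, h = face_box
    cx, cy = x + w // 2, y + h // 2
    return angle_table[cy][cx]


# A tiny 5x2 "image" whose columns map to 60, 70, 80, 90, 100 degrees.
table_a = make_linear_table(width=5, height=2, left_deg=60, right_deg=100)
direction = person_direction((1, 0, 2, 2), table_a)
```

Storing one table per camera keeps the lookup independent of lens and mounting details, since those are absorbed into the table values.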

The correspondence estimation unit 13 estimates the correspondence between persons from the person directions obtained by the person direction estimation unit 12 and the past person direction data stored in the person memory 21, outputs the result to the learning data recording unit 14 and the face feature extraction unit 18, and updates the data in the person memory 21.

The person memory 21 records a person ID, a face image, a person direction, and an update time, and its contents are updated each time the correspondence estimation unit 13 runs.
Here, the person ID is an identifier for managing a user who is presumed to be the same person based on a comparison of consecutive frames (two images close together in time). A newly issued person ID must not duplicate any person ID used in the past.

The correspondence estimation unit 13 first estimates which face images show the same person, based on the person directions obtained from the person direction estimation unit 12. Because the angles of view of the cameras in the camera array partly overlap, this step identifies cases in which several cameras captured the same person simultaneously from different angles. The correspondence estimation unit 13 assumes that persons located in the same direction at substantially the same time, as seen from the communication robot or IoT device, are the same person. When person directions estimated from images acquired by different cameras lie within a preset same-person direction threshold of each other, the correspondence estimation unit 13 estimates that they belong to the same person. If several persons fall within the same-person direction threshold and the match cannot be determined uniquely, the control unit 10 aborts the series of processes, acquires new images, and starts again from the processing of the face detection unit 11.
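The same-person estimation across cameras can be sketched as follows. The data layout (a list of `(camera, angle)` tuples), the function name `merge_same_direction`, and the way ambiguity is detected are assumptions for illustration; angle wrap-around at 0°/360° is ignored for brevity.

```python
def merge_same_direction(detections, threshold_deg):
    """Group detections from different cameras that point in the same direction.

    detections: list of (camera_name, angle_deg).
    Returns a list of groups (lists of detection indices), one per person,
    or None when a match is ambiguous and the frame should be discarded.
    """
    groups = []    # each group holds the detection indices of one person
    assigned = {}  # detection index -> its group
    for i, (cam_i, ang_i) in enumerate(detections):
        # Candidates: detections from other cameras within the threshold.
        matches = [
            j for j, (cam_j, ang_j) in enumerate(detections)
            if cam_j != cam_i and abs(ang_i - ang_j) <= threshold_deg
        ]
        cams = [detections[j][0] for j in matches]
        if len(cams) != len(set(cams)):
            return None  # two candidates in one camera: not uniquely determined
        members = [i, *matches]
        group = next((assigned[k] for k in members if k in assigned), None)
        if group is None:
            group = []
            groups.append(group)
        for k in members:
            if k not in assigned:
                assigned[k] = group
                group.append(k)
    return groups


merged = merge_same_direction([("A", 75), ("B", 75), ("B", 115)], threshold_deg=5)
ambiguous = merge_same_direction([("A", 75), ("B", 74), ("B", 77)], threshold_deg=5)
```

Returning `None` in the ambiguous case mirrors the behaviour of aborting the series of processes and acquiring new images.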

The correspondence estimation unit 13 then compares each estimated person direction with the past person directions stored in the person memory 21 and estimates whether they belong to the same person. This estimation assumes that a person found in a given direction will be found in a nearby direction in the next acquired image.
Specifically, when the direction of a person obtained from the person direction estimation unit 12 is within a preset person movement threshold of the direction of a person recorded in the person memory 21, the correspondence estimation unit 13 regards them as the same person and assigns the same person ID. If a person obtained from the person direction estimation unit 12 cannot be regarded as the same person as any person recorded in the person memory 21, the correspondence estimation unit 13 assigns a new person ID that has not been used before and records it in the person memory 21.
The correspondence estimation unit 13 updates the data in the person memory 21 and outputs the updated information to the learning data recording unit 14 and the face feature extraction unit 18.

A timeout threshold is also set in advance. If the person memory 21 contains data whose elapsed time since its update time exceeds the timeout threshold, the correspondence estimation unit 13 deletes that data from the person memory 21 before estimating same persons and assigning person IDs.
For example, the timeout threshold is set to a time sufficiently shorter than the time needed to move the distance corresponding to the person movement threshold. Consequently, even when a detection is missed, for example because face detection fails, a person ID can still be assigned by matching against a person detected a few frames earlier who remains in the person memory 21.

FIG. 3 is a diagram showing a processing example of the correspondence estimation unit 13.
In this example, the same-person direction threshold is set to 5°, the person movement threshold to 10°, and the timeout threshold to 5 seconds. The person direction estimation unit 12 has estimated three person directions, 75°, 75°, and 115°, and the person memory 21 stores data indicating that the person with person ID 3 is in the 72° direction.

The correspondence estimation unit 13 first estimates that, among the persons obtained from the person direction estimation unit 12, the person with data ID 01 and the person with data ID 02 are the same person, because their directions are within the same-person direction threshold of 5° of each other. The data ID is an identifier assigned to each output of the person direction estimation unit 12.

Next, the data for person ID 3 recorded in the person memory 21 is in the 72° direction, and the persons with data IDs 01 and 02 are within the person movement threshold of 10° of it, so the correspondence estimation unit 13 assigns person ID 3 to the persons with data IDs 01 and 02. The person memory 21 contains no data within the person movement threshold of the estimated angle of data ID 03, so the correspondence estimation unit 13 assigns a new person ID, 4, to the person with data ID 03.
The correspondence estimation unit 13 updates the person memory 21 with these data. In this example no data has an elapsed time since its update time exceeding the timeout threshold of 5 seconds, so nothing is deleted from the person memory 21.
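The FIG. 3 example can be reproduced with a short sketch. It is an illustration under assumptions rather than the patented implementation: the function name `assign_person_ids`, the dictionary layout of the person memory, the timestamps, and the choice of the nearest stored direction when several candidates qualify are hypothetical.

```python
def assign_person_ids(directions, memory, now, next_id,
                      move_threshold=10.0, timeout=5.0):
    """Assign person IDs to the directions estimated for the current frame.

    memory: dict person_id -> (direction_deg, updated_at); updated in place.
    Returns (list of assigned IDs, next unused ID).
    """
    # Delete entries whose age exceeds the timeout threshold before matching.
    for pid in [p for p, (_, t) in memory.items() if now - t > timeout]:
        del memory[pid]

    previous = dict(memory)  # match only against earlier frames
    ids = []
    for d in directions:
        candidates = [p for p, (md, _) in previous.items()
                      if abs(d - md) <= move_threshold]
        if candidates:
            # Assumption: prefer the stored person whose direction is nearest.
            pid = min(candidates, key=lambda p: abs(d - previous[p][0]))
        else:
            pid = next_id  # new, previously unused person ID
            next_id += 1
        memory[pid] = (d, now)
        ids.append(pid)
    return ids, next_id


memory = {3: (72.0, 100.0)}  # person ID 3 last seen at 72 degrees
ids, next_id = assign_person_ids([75.0, 75.0, 115.0], memory, now=102.0, next_id=4)
print(ids)  # [3, 3, 4]
```

Data IDs 01 and 02 both receive person ID 3, and data ID 03 receives the new person ID 4, matching the assignment described for FIG. 3.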

The learning data recording unit 14 records data in the face image DB 22 and the person pair DB 23 based on the face images and person IDs output from the correspondence estimation unit 13.
The face image DB 22 stores sets of an image ID, a face image, and a person ID, and the person pair DB 23 stores pairs of person IDs that were present at the same time. The person identification device 1 performs person identification by exploiting the fact that face images with different person IDs present at the same time belong to different persons.

FIG. 4 is a diagram showing a processing example of the learning data recording unit 14.
In this example, two face images with person ID 3 and one face image with person ID 4 are input from the correspondence estimation unit 13.
The learning data recording unit 14 first adds these data to the face image DB 22 and then adds data indicating that person IDs 3 and 4 were present at the same time to the person pair DB 23.
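The recording step of FIG. 4 can be sketched with plain Python containers standing in for the two databases. The function name `record_learning_data`, the record fields, and the placeholder image names are hypothetical; the actual storage format is not specified in the text.

```python
from itertools import combinations


def record_learning_data(face_db, pair_db, observations):
    """Record one capture time's observations.

    face_db: list of records (stand-in for the face image DB 22).
    pair_db: set of (person_id, person_id) tuples (stand-in for the person pair DB 23).
    observations: list of (face_image, person_id) seen at the same time.
    """
    for image, person_id in observations:
        face_db.append({
            "image_id": len(face_db) + 1,  # sequential image ID (assumption)
            "image": image,
            "person_id": person_id,
        })
    # Every two distinct person IDs present at the same time are different people.
    present = sorted({person_id for _, person_id in observations})
    pair_db.update(combinations(present, 2))


face_db, pair_db = [], set()
record_learning_data(face_db, pair_db, [("face_a", 3), ("face_b", 3), ("face_c", 4)])
print(pair_db)  # {(3, 4)}
```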

The learning data set generation unit 15 uses the data in the face image DB 22 and the person pair DB 23 to generate a data set for training the weights of the face feature extraction model 24 (a deep learning model). This processing is executed at a predetermined timing for training the face feature extraction model 24, for example periodically.
Through the processing of the learning data recording unit 14, the face image DB 22 holds sets of a face image and a person ID, and face images with the same person ID are face images of the same person. The pairs of person IDs recorded in the person pair DB 23 are persons who were in different directions at the same time and are therefore different persons. The learning data set generation unit 15 uses these properties to generate learning data.

The learning data is in Triplet format, a set of three images: a first face image, a second face image of the same person as the first (positive example), and a third face image of a different person (negative example). Combining the data in the face image DB 22 and the person pair DB 23 makes it possible to generate a wide variety of Triplet combinations.

FIG. 5 is a diagram showing a processing example of the learning data set generation unit 15.
In this example, the person ID pairs "1-2" and "3-4" are recorded in the person pair DB 23. This indicates that the person with person ID 1 and the person with person ID 2 are different persons, and that the person with person ID 3 and the person with person ID 4 are different persons.

Based on this information, the learning data set generation unit 15 retrieves face images from the face image DB 22 and generates learning data in Triplet format. In the learning data, face image 1 (the reference, or anchor) and face image 2 (positive example) have the same person ID, while face image 1 (anchor) and face image 3 (negative example) have different person IDs that form a pair recorded in the person pair DB 23. For example, when the face image with image ID 1 is the anchor, the face image with image ID 3 is the positive example and the face image with image ID 2 is the negative example.
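A minimal sketch of the Triplet enumeration follows. The function name `generate_triplets` and the tuple layouts are hypothetical, and the small face image DB in the usage lines is invented so that it is consistent with the example in the text (image IDs 1 and 3 belong to person ID 1, image ID 2 to person ID 2); the full contents of FIG. 5 are not reproduced here.

```python
def generate_triplets(face_db, pair_db):
    """Enumerate (anchor, positive, negative) image-ID triplets.

    face_db: list of (image_id, person_id).
    pair_db: set of (person_id, person_id) pairs known to be different people.
    """
    by_person = {}
    for image_id, person_id in face_db:
        by_person.setdefault(person_id, []).append(image_id)

    # Make the "different person" relation symmetric for lookup.
    different = {}
    for a, b in pair_db:
        different.setdefault(a, set()).add(b)
        different.setdefault(b, set()).add(a)

    triplets = []
    for person_id, images in by_person.items():
        for anchor in images:
            for positive in images:
                if positive == anchor:
                    continue
                # Negatives come only from person IDs paired with the anchor's ID.
                for other in sorted(different.get(person_id, ())):
                    for negative in by_person.get(other, []):
                        triplets.append((anchor, positive, negative))
    return triplets


face_db = [(1, 1), (2, 2), (3, 1)]  # hypothetical (image_id, person_id) rows
pair_db = {(1, 2), (3, 4)}
print(generate_triplets(face_db, pair_db))  # [(1, 3, 2), (3, 1, 2)]
```

Restricting negatives to recorded pairs is what keeps a person who was re-assigned a new person ID from being used as a negative example for their own earlier images.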

The model learning unit 16 trains the face feature extraction model 24, a deep learning model, using the Triplet-format learning data generated by the learning data set generation unit 15.
The deep learning model used here can adopt, for example, the Triplet-Based Variational Autoencoder (TVAE) proposed in Document B below as its architecture.
Document B: H. Ishfaq et al., "TVAE: Triplet-Based Variational Autoencoder using Metric Learning," ICLR 2018 Workshop.

TVAE is a type of Variational Autoencoder (VAE) that takes Triplet-format data as input. As in a VAE, its loss function includes a Reconstruction Loss, which evaluates the difference between the input data and the reconstructed data obtained by feeding the latent variables output by the encoder into the decoder. In addition, it includes a Triplet Loss, which evaluates the distances between the latent variables of the positive pair and of the negative pair. The architecture thereby learns to place the latent variables of positive-example inputs close together and those of negative-example inputs far apart.
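To make the role of the Triplet Loss concrete, the sketch below evaluates one common margin-based form on latent means. This is an assumption for illustration: the text states only that a margin is used, not the exact formula, and the function name `triplet_loss`, the default margin, and the use of unsquared Euclidean distance are hypothetical choices.

```python
import math


def triplet_loss(mu_anchor, mu_positive, mu_negative, margin=1.0):
    """Hinge-style triplet loss on latent means (one common formulation).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows linearly.
    """
    d_pos = math.dist(mu_anchor, mu_positive)  # anchor-to-positive distance
    d_neg = math.dist(mu_anchor, mu_negative)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)


# Negative already far away: no penalty.
print(triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0)))  # 0.0
# Negative as close as the positive: penalty equal to the margin.
print(triplet_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0)))  # 1.0
```

In training, a term of this kind would be added to the Reconstruction Loss so that the encoder both preserves image content and separates different persons in the latent space.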

The model learning unit 16 uses the data set generated by the learning data set generation unit 15 to update the network weights so as to minimize the loss, and records them in the face feature extraction model 24.
A trained model learned from a large-scale face image data set, such as that of Non-Patent Document 6, may be used as the initial values of the weights of the face feature extraction model 24.

FIG. 6 is a diagram showing a processing example of the model learning unit 16.
The model learning unit 16 acquires data from the data set generated by the learning data set generation unit 15 and performs the learning process, that is, updates the weights of the deep learning model. The weights updated by the learning process are those of the Encoder Network and the Decoder Network.

Using the Reconstruction Loss, the model learning unit 16 trains the model so that images can be reconstructed; using the Triplet Loss, it trains the model so that the distance between latent variables (for example, the mean μ of the Gaussian distribution) is small for positive examples and large for negative examples.
Although three Encoder Networks and three Decoder Networks are drawn in this example, they all share the same weights.

The feature vector generation unit 17 uses the face feature extraction model 24 updated by the model learning unit 16 to update the face feature DB 25, which records face feature data of known persons.
The face feature DB 25 stores a user ID, a centroid vector, and a person ID list.
The person IDs in the face image DB 22 and the person pair DB 23 are assigned one after another to persons as they come into a camera's frame, so the same actual person does not necessarily keep the same person ID. The feature vector generation unit 17 therefore uses the Euclidean distance between feature vectors extracted from the face images to classify the persons contained in the face image DB 22 as follows, and records the classification result in the face feature DB 25 as user IDs.

First, the feature vector generation unit 17 acquires face images from the face image DB 22 and converts them into feature vectors by inputting them into the Encoder Network of the face feature extraction model 24 updated by the model learning unit 16. The feature vector is the output of the Encoder Network that corresponds to the mean.

Next, the feature vector generation unit 17 computes a centroid vector for each person ID from the acquired feature vectors.
The feature vector generation unit 17 then takes any two of the per-person-ID centroid vectors and compares their Euclidean distance with a preset person feature threshold. If the Euclidean distance between the two vectors is less than the person feature threshold, the two person IDs are regarded as the same person; if it is equal to or greater than the threshold, they are regarded as different persons. For person IDs regarded as the same person, the feature vector generation unit 17 recomputes the centroid vector over all of them together and registers it in the face feature DB 25.
The person feature threshold may be determined based on the margin value used when the model learning unit 16 computes the Triplet Loss.
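The centroid computation and merging of person IDs can be sketched as below. The function names, the dictionary layout of the output, the numbering of user IDs, and the treatment of merging as transitive (if A matches B and B matches C, all three are grouped) are assumptions for illustration; the text does not specify how chains of matches are resolved.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(len(vectors[0])))


def build_face_feature_db(features, threshold):
    """Merge person IDs whose centroids are closer than `threshold`.

    features: dict person_id -> list of feature vectors.
    Returns a list of {"user_id", "centroid", "person_ids"} records.
    """
    centroids = {pid: centroid(vs) for pid, vs in features.items()}
    groups = []  # disjoint sets of person IDs regarded as one user
    for pid in sorted(features):
        linked = [g for g in groups
                  if any(math.dist(centroids[pid], centroids[q]) < threshold for q in g)]
        merged = {pid}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]

    db = []
    for user_id, group in enumerate(sorted(groups, key=min), start=1):
        # Recompute the centroid over every vector of the merged person IDs.
        vectors = [v for pid in sorted(group) for v in features[pid]]
        db.append({"user_id": user_id,
                   "centroid": centroid(vectors),
                   "person_ids": sorted(group)})
    return db


features = {1: [(0.0, 0.0), (0.0, 2.0)], 2: [(5.0, 5.0)],
            3: [(5.0, 6.0)], 4: [(10.0, 0.0)]}  # invented 2-D vectors
for record in build_face_feature_db(features, threshold=2.0):
    print(record["user_id"], record["person_ids"], record["centroid"])
```

With these invented vectors, person IDs 2 and 3 are merged into one user while person IDs 1 and 4 each remain a separate user.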

FIG. 7 is a diagram showing a processing example of the feature vector generation unit 17.
First, the feature vector generation unit 17 acquires the face images with person IDs 1 to 4 from the face image DB 22, converts them into feature vectors using the Encoder Network of the face feature extraction model 24 trained by the model learning unit 16, and computes the centroid vector for each person ID.

Next, the feature vector generation unit 17 computes the Euclidean distance between each pair of centroid vectors and compares it with the person feature threshold. In this example, the Euclidean distance between the centroid vectors of person IDs 2 and 3 is less than the person feature threshold, so they are regarded as the same person and recorded as such in the face feature DB 25.
Although the feature vectors are drawn in a two-dimensional space here, higher-dimensional feature vectors may be used. The distance between centroid vectors has been described as the Euclidean distance, but it is not limited to this; another distance, such as the Mahalanobis distance, may be used.

The face feature extraction unit 18 converts the face images obtained from the correspondence estimation unit 13 into feature vectors using the Encoder Network of the face feature extraction model 24. This conversion is the same as that performed by the feature vector generation unit 17.
When the face images input from the correspondence estimation unit 13 include multiple images with the same person ID, the face feature extraction unit 18 either selects any one of them and converts it into a feature vector, or applies statistical processing (for example, averaging) to the feature vectors of all or some of the images and uses the result as the feature vector for that person ID.
Using the averaged feature vector can be expected to improve identification accuracy, but it increases the processing load and processing time, so the method may be chosen according to the situation.

The person estimation unit 19 compares each feature vector input from the face feature extraction unit 18 with the centroid vector of each user ID registered in the face feature DB 25 by computing the Euclidean distance. When it finds a user ID whose distance is less than the person feature threshold, it estimates that the person in the face image is that user. If the face feature DB 25 contains no centroid vector whose Euclidean distance to the feature vector obtained from the face feature extraction unit 18 is less than the person feature threshold, the person estimation unit 19 estimates that the input face image belongs to an unknown user not present in the face feature DB 25.

FIG. 8 is a diagram showing a processing example of the person estimation unit 19.
In this example, centroid vectors for three users, A to C, are recorded in the face feature DB 25, and feature vectors for three persons, with person IDs 14 to 16, are input from the face feature extraction unit 18.

For each input feature vector, the person estimation unit 19 checks whether the face feature DB 25 contains a centroid vector whose Euclidean distance is less than the person feature threshold. In this example, the feature vector of person ID 14 and the centroid vector of user B, and the feature vector of person ID 15 and the centroid vector of user C, satisfy this condition, so the person with person ID 14 is estimated to be user B and the person with person ID 15 is estimated to be user C. The face feature DB 25 contains no centroid vector whose Euclidean distance to the feature vector of person ID 16 is less than the person feature threshold, so the person with person ID 16 is estimated to be an unknown user.
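The lookup illustrated in FIG. 8 can be sketched as a thresholded nearest-centroid search. The function name `estimate_user`, the coordinates, and the threshold value are invented for illustration, and choosing the nearest centroid when more than one lies within the threshold is an assumption; the text only requires the distance to be below the person feature threshold.

```python
import math


def estimate_user(feature, face_feature_db, threshold):
    """Return the user ID whose centroid is nearest and closer than `threshold`.

    face_feature_db: dict user_id -> centroid vector.
    Returns None for an unknown user.
    """
    best_user, best_dist = None, threshold
    for user_id, centroid_vec in face_feature_db.items():
        d = math.dist(feature, centroid_vec)
        if d < best_dist:  # strictly less than the threshold (or current best)
            best_user, best_dist = user_id, d
    return best_user


face_feature_db = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0)}  # invented
inputs = {14: (4.2, 0.1), 15: (0.3, 3.9), 16: (8.0, 8.0)}              # invented
for person_id, vec in inputs.items():
    print(person_id, estimate_user(vec, face_feature_db, threshold=1.0))
```

With these invented values, person ID 14 maps to user B, person ID 15 to user C, and person ID 16 to `None`, that is, an unknown user.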

The model learning function of the person identification device 1 consists of the learning data set generation unit 15, the model learning unit 16, and the feature vector generation unit 17, and various timings for performing the learning are conceivable. Examples include running it periodically at a fixed interval, running it when the person estimation unit 19 finds a face image estimated to be an unknown user, and running it when the person estimation unit 19 estimates face images with different person IDs to be the same user.

FIG. 9 is a flowchart illustrating the flow of the person identification method.
As an example, it shows the case where learning is performed when the person estimation unit 19 estimates face images with different person IDs to be the same user.

In step S1, the control unit 10 captures and acquires an image from each camera of the camera array.
In step S2, the face detection unit 11 detects face regions in each acquired image.

In step S3, the control unit 10 determines whether a face region has been detected. If the determination is YES, the process proceeds to step S4; if it is NO, that is, if no face region was detected at all, the process returns to step S1.

In step S4, the person direction estimation unit 12 estimates the person direction from the center coordinates of each detected face region.
In step S5, the correspondence estimation unit 13 refers to the person memory 21 and estimates the correspondence with persons in past frames, that is, whether each detection is the same person.

In step S6, the control unit 10 determines whether the person correspondence could be estimated uniquely. If the determination is YES, the process proceeds to step S7; if it is NO, the process returns to step S1.

In step S7, the learning data recording unit 14 records the face images and person IDs in the face image DB 22, and records the pairs of person IDs that were present at the same time (pairs of different persons) in the person pair DB 23.

In step S8, the face feature extraction unit 18 converts the face images into feature vectors using the face feature extraction model 24.
In step S9, the person estimation unit 19 compares the feature vectors obtained in step S8 with the centroid vector of each user recorded in the face feature DB 25 and performs person estimation.

In step S10, the control unit 10 determines whether persons with different person IDs have been estimated to be the same user. If the determination is YES, the process proceeds to step S11 in order to train the face feature extraction model 24; if it is NO, the process ends.

In step S11, the learning data set generation unit 15 acquires data from the face image DB 22 and the person pair DB 23 and generates a data set for learning.
In step S12, the model learning unit 16 trains the weights of the face feature extraction model 24 using the data set for learning.
In step S13, the feature vector generation unit 17 converts the face images recorded in the face image DB 22 into feature vectors using the face feature extraction model 24 and updates the face feature DB 25.

According to the present embodiment, the person identification device 1 detects face image regions of persons from a group of images captured by cameras, determines the same person from two images captured within a time shorter than a threshold based on the coordinates of the face images, assigns the same person ID to face images of the same person, and assigns different person IDs to the other face images. The person identification device 1 records each face image in association with its person ID and records pairs of person IDs obtained from images captured at the same time as different persons. It then generates Triplet-format learning data consisting of two face images with the same person ID and one face image whose person ID forms a pair with that person ID.
The person identification device 1 can thus train a deep learning model using Triplet-format learning data composed of three images: a face image of a certain person, another face image of the same person, and a face image of a different person.

That is, using the face feature extraction model 24 trained on the generated learning data, the person identification device 1 converts face images into feature vectors and records the centroid of the feature vectors for each user ID, where a user ID integrates person IDs based on the distances between feature vectors. When a face image contained in a newly captured image is converted into a feature vector by the face feature extraction model 24, the person identification device 1 compares this feature vector with the centroid vector of each user ID and outputs the user ID whose distance is less than the threshold as the person estimation result.
As a result, the person identification device 1 automatically records a feature vector (centroid vector) for each user ID without requiring users to register their faces in advance, and can identify surrounding persons with high accuracy.

The person identification device 1 derives person directions from the coordinates of the face images, and when two images captured at the same time yield person directions whose difference is less than a threshold, it assigns the same person ID to the two face images corresponding to those directions.
The person identification device 1 can therefore appropriately determine the same person from face images obtained from multiple camera images.
Furthermore, when two images captured within a time shorter than a threshold yield person directions whose difference is less than a threshold, the person identification device 1 assigns the same person ID to the two corresponding face images and assigns different person IDs to the other face images.
The person identification device 1 can therefore appropriately associate face images that appear in the same direction within a short time as the same person.

[Second Embodiment]
A second embodiment of the present invention is described below.
This embodiment is a person identification device that automatically identifies persons acting in spaces where cameras are installed, with a camera installed in each room of a residence, office, or the like. Unlike the camera array of the first embodiment, it assumes that the same person is never captured by more than one camera at the same time.

FIG. 10 is a diagram showing the functional configuration of the person identification device 2 in the present embodiment.
Compared with the first embodiment, the present embodiment has no person direction estimation unit 12, the function of the correspondence estimation unit 13 differs, and the data recorded in the person memory 21 differs.

The correspondence estimation unit 13 compares the coordinate information of the face regions output from the face detection unit 11 with the coordinate data of past persons stored in the person memory 21 to estimate the person correspondence, outputs the result to the learning data recording unit 14 and the face feature extraction unit 18, and updates the data in the person memory 21.
The person memory 21 records a person ID, a face image, a camera ID, coordinates, and an update time, and its contents are updated each time the correspondence estimation unit 13 performs its processing. The camera ID is the identifier of the camera that captured the person, and the coordinates are the center coordinates of the person's face region in the image. The person ID, face image, and update time are the same as in the first embodiment.

The correspondence estimation unit 13 compares the center coordinates of each face region with the coordinates of past persons stored in the person memory 21 to estimate whether a person obtained by the face detection unit 11 is the same as a person recorded in the person memory 21. The estimation assumes that when a person's face is at certain coordinates in the image of a certain camera, the person will be near those coordinates in the next image captured by the same camera.

Specifically, when the Euclidean distance between the face-area coordinates obtained from the face detection unit 11 and the coordinates of a person recorded in the person memory 21 is within a preset person movement threshold, the two are regarded as the same person and the same person ID is assigned. If a person obtained from the face detection unit 11 cannot be regarded as the same as any person recorded in the person memory 21, a new person ID that has not been used before is assigned and recorded in the person memory 21.
Deletion from the person memory 21 based on the timeout threshold is the same as in the first embodiment. The correspondence estimation unit 13 then updates the data in the person memory 21 and outputs the update information to the learning data recording unit 14 and the face feature extraction unit 18.
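The matching rule described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `MemoryEntry` and `assign_person_ids` are hypothetical, and the choice of the nearest entry and the one-to-one matching within a frame are assumptions that the description does not state.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One row of the person memory (hypothetical layout)."""
    person_id: int
    camera_id: str
    coords: tuple      # (x, y) center of the face area
    updated_at: float  # update time in seconds


def assign_person_ids(detections, memory, move_threshold, now, next_id):
    """Assign a person ID to each (camera_id, (x, y)) detection.

    A detection reuses the ID of the nearest unmatched entry from the same
    camera whose Euclidean distance is within move_threshold; otherwise it
    receives the unused ID next_id. Returns (ids, next_id).
    """
    known = list(memory)  # snapshot: entries created in this call are not matched
    used = set()          # assumption: each stored person matches at most once
    ids = []
    for camera_id, coords in detections:
        best, best_dist = None, move_threshold
        for entry in known:
            if entry.camera_id != camera_id or entry.person_id in used:
                continue
            d = math.dist(coords, entry.coords)
            if d <= best_dist:
                best, best_dist = entry, d
        if best is None:
            # No stored person is close enough: record a new person.
            best = MemoryEntry(next_id, camera_id, coords, now)
            memory.append(best)
            next_id += 1
        else:
            # Same person: refresh the stored coordinates and update time.
            best.coords, best.updated_at = coords, now
        used.add(best.person_id)
        ids.append(best.person_id)
    return ids, next_id
```

Restricting the comparison to entries with the same camera ID reflects the assumption that a person stays near the same coordinates only within images from one camera.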

FIG. 11 is a diagram showing a processing example of the correspondence estimation unit 13.
In this example, the person movement threshold is 10 and the timeout threshold is 5 seconds. The center coordinates of the face areas obtained from the face detection unit 11 are (200, 300) for camera A, (250, 280) for camera B, and (560, 250) for camera B. In the person memory 21, person ID 3 is recorded at coordinates (198, 304) of camera A, and person ID 4 at coordinates (563, 242) of camera B.

Computing the Euclidean distances between face images captured by the same camera gives 4.47 between the coordinates of data ID 01 and those of person ID 3, 315.30 between the coordinates of data ID 02 and those of person ID 4, and 3.61 between the coordinates of data ID 03 and those of person ID 4.
Since the person movement threshold is 10, the correspondence estimation unit 13 assigns person ID 3 to the person with data ID 01 and person ID 4 to the person with data ID 03. For the coordinates of data ID 02, on the other hand, the person memory 21 contains no coordinates whose Euclidean distance is within the person movement threshold, so the correspondence estimation unit 13 assigns a new person ID 5 to the person with data ID 02.
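The distances in this example can be recomputed with a few lines of Python. The dictionary layout and the function name `same_camera_distances` are illustrative assumptions. Note that, with the coordinates exactly as quoted in the text, the distance between data ID 03 and person ID 4 evaluates to about 8.54 rather than 3.61; it is still within the threshold of 10, so the ID assignment is unchanged.

```python
import math

# Person memory and detections as quoted in the example (camera ID, center coordinates).
memory = {3: ("A", (198, 304)), 4: ("B", (563, 242))}
detections = {
    "01": ("A", (200, 300)),
    "02": ("B", (250, 280)),
    "03": ("B", (560, 250)),
}


def same_camera_distances(detections, memory):
    """Euclidean distance for every (data ID, person ID) pair from the same camera."""
    out = {}
    for data_id, (camera, xy) in detections.items():
        for person_id, (mem_camera, mem_xy) in memory.items():
            if camera == mem_camera:
                out[(data_id, person_id)] = math.dist(xy, mem_xy)
    return out


distances = same_camera_distances(detections, memory)
# True where the pair is within the person movement threshold of 10.
within = {pair: d <= 10 for pair, d in distances.items()}
```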

The correspondence estimation unit 13 updates the person memory 21 with these data. In this example, no entry has an elapsed time since its update time that exceeds the 5-second timeout threshold, so nothing is deleted from the person memory 21.
The subsequent processing is the same as in the first embodiment.

Although this embodiment has been described on the premise that a plurality of cameras provide the input, this is not a limitation, and a single camera may be used. Likewise, although the cameras have been described as capturing a rectangular area, this is not a limitation; for example, an omnidirectional camera capable of capturing the entire surroundings may be used.

According to this embodiment, a plurality of cameras are arranged at positions where the same person is not captured simultaneously, and when two face images whose coordinate distance is less than a threshold are detected in two images captured within a time shorter than a threshold, the person identification device 2 assigns the same person ID to these two face images and assigns different person IDs to the other face images.
Therefore, the person identification device 2 can, for example, identify the persons in each room of a space where an unspecified number of people come and go, and can control various IoT devices (lighting or home appliances) based on the arrangement of persons in each room.

[Third Embodiment]
Hereinafter, a third embodiment of the present invention will be described.
Communication robots and IoT devices equipped with a conventional face identification device often register the user's name at the same time as the face registration processing. The person identification device of this embodiment clusters the feature vectors obtained through automatic learning into a specified number of clusters, and registers the feature vectors in a database together with the user names that are entered.

FIG. 12 is a diagram showing the functional configuration of the person identification device 3 in this embodiment.
Compared with the first embodiment, the person identification device 3 of this embodiment includes a person registration unit 17a (feature recording unit) in place of the feature vector generation unit 17, and the data recorded in the face feature DB 25 differ.
Although the functional configuration shown here is based on that of the first embodiment, a configuration obtained by modifying the feature vector generation unit 17 and the face feature DB 25 of the second embodiment in the same way may also be used.

In this embodiment, a communication robot or IoT device equipped with the person identification device 3 is operated for about one to several days, and after sufficient learning data have been recorded in the face image DB 22 and the person pair DB 23, the processing by the learning data set generation unit 15, the model learning unit 16, and the person registration unit 17a is executed. The processing of the face feature extraction unit 18 and the person estimation unit 19 is not performed until the person registration unit 17a has stored data in the face feature DB 25.

The person registration unit 17a uses the face feature extraction model 24 trained by the model learning unit 16 to convert the face images registered in the face image DB 22 into feature vectors, presents the clustering result to the user, and registers data including user names in the face feature DB 25.
The face feature DB 25 records a user ID, a user name, a center of gravity vector, and a person ID list.

First, the person registration unit 17a acquires face images from the face image DB 22 and converts them into feature vectors using the Encoder Network of the face feature extraction model 24 trained by the model learning unit 16. The feature vector is taken to be the output of the Encoder Network that corresponds to the mean.
Next, the person registration unit 17a calculates, from the obtained feature vectors, the center of gravity vector for each person ID.
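The per-person-ID center of gravity is the element-wise mean of the feature vectors that share a person ID. A minimal sketch, assuming feature vectors are plain lists of floats and using the hypothetical names `center_of_gravity` and `centers_by_person`:

```python
from collections import defaultdict


def center_of_gravity(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centers_by_person(features):
    """features: iterable of (person_id, feature_vector) pairs.

    Returns a dict mapping each person ID to its center of gravity vector.
    """
    grouped = defaultdict(list)
    for person_id, vector in features:
        grouped[person_id].append(vector)
    return {pid: center_of_gravity(vecs) for pid, vecs in grouped.items()}
```

For example, `centers_by_person([(1, [0.0, 0.0]), (1, [2.0, 4.0])])` yields one entry for person ID 1 whose vector is the midpoint of the two inputs.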

The person registration unit 17a then asks the user to enter the number of users to be identified. For example, if the communication robot or IoT device equipped with the person identification device 3 is used in a home, the number of family members is entered. The person registration unit 17a then clusters the per-person-ID centers of gravity of the feature vectors into the number of clusters entered by the user, using the k-means method or a similar technique.
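As a rough illustration of this clustering step, the following is a small pure-Python version of Lloyd's k-means algorithm applied to the per-person-ID vectors. The function name `kmeans`, the deterministic farthest-point initialization, and the iteration limit are assumptions made for this sketch; the description only states that the k-means method or the like is used.

```python
import math


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, max_iter=100):
    """Cluster per-person-ID vectors into k groups.

    points: dict mapping person ID -> vector.
    Returns a dict mapping person ID -> cluster index in range(k).
    """
    ids = sorted(points)
    if not 1 <= k <= len(ids):
        raise ValueError("k must be between 1 and the number of person IDs")
    # Assumption: deterministic farthest-point initialization, so that the
    # result does not depend on a random seed.
    centers = [points[ids[0]]]
    while len(centers) < k:
        farthest = max(
            ids, key=lambda i: min(math.dist(points[i], c) for c in centers)
        )
        centers.append(points[farthest])
    labels = {}
    for _ in range(max_iter):
        # Assignment step: each person ID goes to its nearest center.
        new_labels = {
            i: min(range(k), key=lambda j: math.dist(points[i], centers[j]))
            for i in ids
        }
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [points[i] for i in ids if labels[i] == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = mean(members)
    return labels
```

Here `k` corresponds to the number of users entered by the user, and the keys of `points` correspond to person IDs.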

Next, the person registration unit 17a randomly selects several face images from those of the person IDs belonging to each cluster and presents them to the user. The user enters the name of the person shown in the presented face images.
If the presented face images include the face of a person who is not an identification target, the user indicates this to the person identification device 3. In that case, the person registration unit 17a removes the center of gravity vector of the person ID contained in the presented images, performs clustering again on the remainder, and again presents several face images of the person IDs belonging to each cluster to the user.

Specifically, re-clustering when a non-target face image has been presented is performed, for example, by the following procedure.
(1) For each cluster, one face image is presented for each person ID belonging to it. An upper limit on the number of face images to be presented may be set in advance so that no more than this number are presented. The user checks the presented face image list of each cluster and selects a cluster that contains a non-target person.
(2) One face image for each person ID belonging to the selected cluster is presented as a list. The user selects the non-target face images from the list.
(3) The person IDs corresponding to the face images selected by the user are removed, and the result of clustering again is presented.

When names have been entered for all the presented images of every cluster, the person registration unit 17a calculates, for each cluster, the center of gravity of the center of gravity vectors of the person IDs belonging to that cluster, and registers the user ID, the user name, the center of gravity vector, and the list of person IDs included in the cluster in the face feature DB 25.
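The registration step can be sketched as follows. The record layout mirrors the four items recorded in the face feature DB 25, but the names `UserRecord` and `build_face_feature_db`, the sequential numbering of user IDs, and the in-memory list standing in for the database are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    """One entry of the face feature DB (hypothetical layout)."""
    user_id: int
    user_name: str
    center: list      # center of gravity of the per-person-ID centers
    person_ids: list  # person IDs that belong to this user


def build_face_feature_db(labels, person_centers, names):
    """Build one record per named cluster.

    labels: person ID -> cluster index.
    person_centers: person ID -> center of gravity vector.
    names: cluster index -> user name entered by the user.
    """
    records = []
    user_id = 1  # assumption: user IDs are numbered from 1 in cluster order
    for cluster in sorted(names):
        person_ids = sorted(p for p, c in labels.items() if c == cluster)
        if not person_ids:
            continue  # nothing to register for an empty cluster
        vectors = [person_centers[p] for p in person_ids]
        # Unweighted mean of the per-person-ID centers of gravity.
        center = [sum(column) / len(vectors) for column in zip(*vectors)]
        records.append(UserRecord(user_id, names[cluster], center, person_ids))
        user_id += 1
    return records
```

Because the mean is taken over the per-person-ID vectors rather than over all face images, a person ID with many images does not outweigh one with few.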

FIG. 13 is a diagram showing a processing example of the person registration unit 17a.
Face images with person IDs 1 to 5 are acquired from the face image DB 22. The person registration unit 17a converts each of these face images into a feature vector using the Encoder Network of the face feature extraction model 24 trained by the model learning unit 16, and then calculates the center of gravity vector for each person ID.

The user then enters the number of users to be identified. In this example, "3" is entered, so the person registration unit 17a clusters the per-person-ID center of gravity vectors into three clusters by the k-means method.

Next, the person registration unit 17a presents to the user several face images randomly selected from those of the person IDs included in each cluster, and has the user enter the name of the person. In this example, "Taro", "Hanako", and "Jiro" are entered. The person registration unit 17a registers in the face feature DB 25 these names together with, for each cluster, the vector that is the center of gravity of the center of gravity vectors of the person IDs included in the cluster.
By using the face feature DB 25 registered in this way, the person estimation unit 19 can also obtain the name of the estimated person.
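How a name could be looked up from the registered data can be sketched as a nearest-center search with a distance threshold. The function name `estimate_person`, the tuple layout of the database entries, and the choice of the nearest user when several are within the threshold are assumptions of this sketch.

```python
import math


def estimate_person(feature, face_feature_db, distance_threshold):
    """Return (user_id, user_name) of the nearest registered user.

    face_feature_db: list of (user_id, user_name, center_vector) tuples.
    Returns None when no center is closer than distance_threshold.
    """
    best = None
    best_dist = distance_threshold
    for user_id, user_name, center in face_feature_db:
        d = math.dist(feature, center)
        if d < best_dist:
            best, best_dist = (user_id, user_name), d
    return best
```

Returning `None` for a face that is far from every registered center lets the caller treat that person as unregistered instead of forcing a match.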

According to this embodiment, the person identification device 3 accepts input of the number of users to be identified, clusters the per-person-ID center of gravity vectors into that number of clusters, and assigns user IDs.
Therefore, the person identification device 3 can cluster the feature vectors (center of gravity vectors) based on the actual number of users, and can thereby identify users appropriately.

The person identification device 3 presents the face images corresponding to a user ID, accepts input of a user name, and records the name together with the center of gravity vector.
Therefore, the person identification device 3 can identify a face image and output the user name as the estimation result.

When a non-target face image among the presented face images is indicated, the person identification device 3 excludes the person ID that contains the indicated face image and performs clustering again.
Therefore, the person identification device 3 can record feature vectors (center of gravity vectors) from which non-target face image groups have been excluded, which improves the accuracy of person identification.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments. The effects described in the embodiments are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

In the embodiments described above, unused IDs are assigned one after another as the person IDs recorded in the face image DB 22 and the person pair DB 23, but an upper limit may be placed on the number of person IDs. In that case, the data associated with an old person ID may be deleted and that ID reused.
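One possible way to cap the number of person IDs is sketched below. The class name `PersonIdPool`, the `purge` callback, and the policy of reusing the least recently used ID are assumptions; the description only says that the data of an old person ID may be deleted and the ID reused.

```python
class PersonIdPool:
    """Hands out person IDs 1..max_ids and recycles the oldest one when full."""

    def __init__(self, max_ids):
        if max_ids < 1:
            raise ValueError("max_ids must be at least 1")
        self.max_ids = max_ids
        self._last_used = {}  # person ID -> time it was last assigned or seen
        self._next = 1

    def touch(self, person_id, now):
        """Record that person_id was observed again at time now."""
        self._last_used[person_id] = now

    def allocate(self, now, purge):
        """Return an ID for a new person.

        purge(person_id) is called before an ID is reused so that the caller
        can delete the face images and ID pairs tied to the old person.
        """
        if self._next <= self.max_ids:
            person_id = self._next
            self._next += 1
        else:
            # Assumption: "old" means least recently used.
            person_id = min(self._last_used, key=self._last_used.get)
            purge(person_id)
        self._last_used[person_id] = now
        return person_id
```

In this sketch `purge` would remove the corresponding rows from the face image DB and the person pair DB before the ID is handed out again.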

The camera used to acquire face images has been described as stationary. When the camera is mounted on, for example, a mobile robot, the person direction changes as the robot moves, so the contents of the person memory 21 are reset and processing is performed again.

The embodiments have mainly described the configuration and operation of the person identification devices (1, 2, 3), but the present invention is not limited to this and may be configured as a method or a program that includes the respective components and identifies a person.

Furthermore, the functions of the person identification devices (1, 2, 3) may be realized by recording a program for realizing those functions on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the medium.

The "computer system" referred to here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system.

The "computer-readable recording medium" may further include media that hold a program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period of time, such as the volatile memory inside a computer system that serves as the server or client in that case. The program may realize only some of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

1, 2, 3 Person identification device (learning data generation device)
10 Control unit
11 Face detection unit
12 Person direction estimation unit
13 Correspondence estimation unit
14 Learning data recording unit
15 Learning data set generation unit
16 Model learning unit
17 Feature vector generation unit (feature recording unit)
17a Person registration unit (feature recording unit)
18 Face feature extraction unit
19 Person estimation unit
20 Storage unit
21 Person memory
22 Face image DB
23 Person pair DB
24 Face feature extraction model
25 Face feature DB

Claims

A learning data generation device comprising:
a face detection unit that detects areas of face images of persons from a group of images captured by a camera;
a correspondence estimation unit that determines an identical person based on the coordinates of the face images in two images captured within a time shorter than a threshold, assigns an identical person ID to the face images of the identical person, and assigns a different person ID to any other face image;
a learning data recording unit that records the face images in association with the person IDs and records pairs of the person IDs obtained from a group of images captured at the same time; and
a learning data set generation unit that generates learning data in a triplet format consisting of two face images having an identical person ID and one face image of a person ID that forms one of the pairs with that person ID,
wherein the learning data generation device outputs the learning data for identifying a person from an input face image.

The learning data generation device according to claim 1, further comprising a person direction estimation unit that derives a person direction based on the coordinates of the face image,
wherein, when person directions whose difference is less than a threshold are derived from two images captured at the same time, the correspondence estimation unit assigns an identical person ID to the two face images corresponding to those person directions.

The learning data generation device according to claim 2, wherein, when person directions whose difference is less than a threshold are derived from two images captured at the same time, or from two images captured within a time shorter than a threshold, the correspondence estimation unit assigns an identical person ID to the two face images corresponding to those person directions and assigns a different person ID to any other face image.

The learning data generation device according to claim 1, wherein a plurality of cameras are arranged at positions where the same person is not captured simultaneously, and
when two face images whose coordinate distance is less than a threshold are detected in two images captured within a time shorter than a threshold, the correspondence estimation unit assigns an identical person ID to the two face images and assigns a different person ID to any other face image.

A person identification device comprising:
a model learning unit that trains a feature extraction model using the learning data output from the learning data generation device according to any one of claims 1 to 4;
a feature recording unit that converts the face images into feature vectors by the feature extraction model and records a center of gravity vector of the feature vectors for each user ID obtained by integrating the person IDs based on the distances between the feature vectors;
a face feature extraction unit that converts a face image included in a newly captured image into a feature vector by the feature extraction model; and
a person estimation unit that compares the feature vector obtained by the face feature extraction unit with the center of gravity vector of each user ID and outputs, as a person estimation result, a user ID whose distance is less than a threshold.

The person identification device according to claim 5, wherein the feature recording unit receives an input of the number of users to be identified, clusters the center of gravity vector for each person ID to the number of users, and assigns the user ID.

The person identification device according to claim 6, wherein the feature recording unit presents a face image corresponding to the user ID, accepts input of a user name, and records the face image together with the center of gravity vector.

The person identification device according to claim 7, wherein, when a non-target face image among the presented face images is indicated, the feature recording unit excludes the person ID that contains the indicated face image and performs clustering again.

A learning data generation method in which a computer executes:
a face detection step of detecting areas of face images of persons from a group of images captured by a camera;
a correspondence estimation step of determining an identical person based on the coordinates of the face images in two images captured within a time shorter than a threshold, assigning an identical person ID to the face images of the identical person, and assigning a different person ID to any other face image;
a learning data recording step of recording the face images in association with the person IDs and recording pairs of the person IDs obtained from a group of images captured at the same time; and
a learning data set generation step of generating learning data in a triplet format consisting of two face images having an identical person ID and one face image of a person ID that forms one of the pairs with that person ID,
the method outputting the learning data for identifying a person from an input face image.

The learning data generation program for operating a computer as the learning data generation device according to any one of claims 1 to 4.