JP6819633B2

JP6819633B2 - Personal identification device and feature collection device

Info

Publication number: JP6819633B2
Application number: JP2018042204A
Authority: JP
Inventors: 純平松永
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2018-03-08
Filing date: 2018-03-08
Publication date: 2021-01-27
Anticipated expiration: 2038-03-08
Also published as: WO2019171780A1; JP2019154575A

Description

The present invention relates to a personal identification technique for identifying a person who appears in a captured image.

Face recognition is one known personal identification technique, but it cannot identify a subject who is not facing the camera. Within a household there is the further problem that family members have genetically similar faces and are therefore hard to tell apart. One conceivable remedy is to identify people using human body features (posture, silhouette) and behavioral features (movement-path patterns, places of stay), so that identification remains possible even in scenes where the face cannot be captured.

To perform identification using human body features and behavioral features, these features must be collected for each subject and used for learning. Behavioral features may also change over time or with the environment, so it is desirable to collect the features continuously and update the registered information used for identification in order to follow such changes.

Collecting human body features and behavioral features requires identifying who the detected person is, but there are situations in which image-based identification alone cannot identify the individual accurately. As noted above, face recognition cannot be used when the face is not captured, and identification based on human body features or behavioral features is also inaccurate when those features have not yet been learned sufficiently.

Patent Document 1 discloses outputting a calling voice to a person and selecting the optimum face to be registered based on the change in the face before and after the calling voice is output. For example, it discloses selecting the face that has turned around in response to the call. However, Patent Document 1 only discloses a method of acquiring face images for learning, and does not address the problem that accurate personal identification is impossible unless an image showing the face is obtained at the time of identification.

Patent Document 1: Japanese Unexamined Patent Publication No. 2013-182325

The present invention has been made in view of the above circumstances, and its object is to provide a personal identification technique capable of accurately identifying a person in an image.

To achieve the above object, the present invention identifies a person based on both image-based identification and identification based on the voice with which the person responds to an output voice.

Specifically, a first aspect of the present invention includes voice output means, voice input means, image input means, detection means, and person identification means. The voice output means outputs voice. The voice input means acquires voice uttered in response to the output voice. The image input means acquires a moving image. The detection means detects a human body in the input moving image; any human body detection method may be used. The person identification means has first identification means that performs image-based identification and second identification means that performs voice-based identification. The first identification means can use, for example, facial features, human body features (posture, silhouette), or behavioral features (movement-path patterns, places of stay), but any other method may be used as long as it identifies the person from features obtained from the image. The second identification means can use acoustic features obtained by waveform analysis of the input voice, or word and sentence features obtained by natural language analysis, but any other method may be used as long as it identifies the person from the input voice. The person identification means identifies the detected person based on the identification result of the first identification means (hereinafter, the first identification result) and the identification result of the second identification means (hereinafter, the second identification result).

With this configuration, even when a person cannot be identified accurately from the image alone, the person can be identified accurately by also taking the voice-based identification result into account.

In this aspect, the person identification means may be configured to take the second identification means into account when the first identification means cannot identify the person with high reliability. That is, when the reliability of the first identification means is less than a first threshold, the person identification means may perform identification with the second identification means and identify the person based on the first identification result and the second identification result. In identification based on both results, the person identification means may, for example, confirm the identification result when the first and second identification results match, and leave the result undetermined when the two differ.

Also in this aspect, when the reliability of the first identification means is equal to or greater than the first threshold, the first identification result may be used as the identification result for the person.
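The threshold rule described above can be illustrated with a minimal Python sketch. The names `IdResult`, `identify_person`, and `identify_by_voice`, and the value 0.8 for the first threshold, are assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IdResult:
    person: str         # label of the best-matching registered person
    reliability: float  # 0.0 to 1.0, higher means more certain


def identify_person(
    first: IdResult,
    identify_by_voice: Callable[[], IdResult],
    th1: float = 0.8,  # hypothetical value for the first threshold
) -> Optional[str]:
    """Return the identified person, or None if the result stays undetermined."""
    if first.reliability >= th1:
        # The image alone is reliable enough: no utterance, no voice processing.
        return first.person
    # Otherwise speak, obtain the response, and run voice-based identification.
    second = identify_by_voice()
    if second.person == first.person:
        return first.person  # both results agree, so the result is confirmed
    return None              # results differ, so the result is left undetermined
```

Passing the voice step as a callable keeps it from running at all when the image-based result is already reliable, which corresponds to the case in which the spoken question is omitted.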

With this configuration, when identification can be performed with high reliability from the image alone, the amount of processing is reduced and the spoken question is omitted, so the user is not forced to respond.

In this aspect, the voice output means may output voice at a predetermined timing. The predetermined timing can be at least one of: the timing at which the identification reliability for the person falls below a threshold; the timing at which a first predetermined time has elapsed since the person was detected; the timing at which a state with substantially no temporal change in the person has continued for a second predetermined time; the timing at which the person leaves the imaging range; and the timing at which the distance between the information processing device and the person becomes equal to or less than a predetermined distance. The identification reliability for the person may be the reliability of identification by the first identification means, or a reliability obtained by integrating the identification results of the first and second identification means. The first predetermined time may be a fixed time or, when features are acquired from the image, a time determined according to the number of feature acquisitions or the amount of feature data.
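As a rough illustration of these timing conditions, the following Python sketch checks all five of them against a snapshot of the tracked person. The `PersonState` fields, the function name, and every numeric default are hypothetical; the specification requires only that at least one such condition be used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PersonState:
    reliability: float             # current identification reliability
    seconds_since_detected: float  # time since the person was first detected
    seconds_without_change: float  # time with substantially no change
    leaving_imaging_range: bool    # person is about to leave the camera view
    distance_m: float              # distance between device and person


def utterance_triggers(
    s: PersonState,
    th: float = 0.8,      # hypothetical reliability threshold
    t1: float = 30.0,     # hypothetical first predetermined time (seconds)
    t2: float = 60.0,     # hypothetical second predetermined time (seconds)
    near_m: float = 1.0,  # hypothetical predetermined distance (meters)
) -> List[str]:
    """Return the names of all timing conditions that currently hold."""
    triggers = []
    if s.reliability < th:
        triggers.append("low_reliability")
    if s.seconds_since_detected >= t1:
        triggers.append("first_time_elapsed")
    if s.seconds_without_change >= t2:
        triggers.append("no_change")
    if s.leaving_imaging_range:
        triggers.append("leaving_range")
    if s.distance_m <= near_m:
        triggers.append("near_device")
    return triggers
```

A caller would speak when the returned list is non-empty; returning the condition names rather than a single boolean makes it easier to log why an utterance was made.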

In this aspect, the person identification means preferably determines the content of the output voice according to the reliability of the first identification means. For example, when the reliability is less than a second threshold, the person identification means can choose, as the content of the output voice, content that includes the name by which the person in the first identification result is addressed, or content that asks who the person is. The reliability may also be divided into three or more levels, with the content of the output voice determined for each level. With three levels, for example, the high level may use natural content that does not include the person's name, the middle level may use content that includes the person's name, and the low level may ask directly who the person is. As one example, the high level could use "What are you doing there?", the middle level "Mom, what are you doing there?", and the low level "Who is there?".
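The three-level example can be written as a short Python sketch. The function name and the two boundary values are assumptions; the example phrases are taken from the text above.

```python
def choose_utterance(
    reliability: float,
    name: str,
    high_th: float = 0.8,  # hypothetical boundary between high and middle
    low_th: float = 0.4,   # hypothetical boundary between middle and low
) -> str:
    """Pick the utterance content from the image-based reliability level."""
    if reliability >= high_th:
        # High level: natural content without the person's name.
        return "What are you doing there?"
    if reliability >= low_th:
        # Middle level: content that includes the estimated person's name.
        return f"{name}, what are you doing there?"
    # Low level: ask directly who the person is.
    return "Who is there?"
```

For instance, `choose_utterance(0.6, "Mom")` yields the middle-level phrase, whose response can confirm or contradict the estimated name.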

Also in this aspect, when the first identification result and the second identification result do not match, the person identification means may output a new voice and have the second identification means perform identification based on the input voice that responds to the new output voice. In this case, the content of the new output voice should confirm the person more directly than the content of the previous output voice. When the two results do not match, it cannot be determined who the person is, so it is advisable to output voice that confirms more directly who the person is and to identify the person from the response.

A second aspect of the present invention is a feature collecting device including: the personal identification device described above; feature acquisition means for acquiring, from the moving image input to the image input means, at least one of features relating to the human body or the behavior of the detected person; and feature registration means for registering the features acquired by the feature acquisition means in association with the person identified by the person identification means. Features collected in this way can be used for training a classifier.
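The association between acquired features and the identified person might look like the following Python sketch. The class `FeatureRegistry` and its methods are hypothetical, and skipping registration when the person is undetermined is an assumption of this sketch rather than something the specification states.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class FeatureRegistry:
    """Stores acquired features keyed by the identified person."""

    def __init__(self) -> None:
        self._features: Dict[str, List[Any]] = defaultdict(list)

    def register(self, person: Optional[str], feature: Any) -> bool:
        """Register a feature; returns False if no person was identified."""
        if person is None:
            # Assumption: undetermined identifications are not registered.
            return False
        self._features[person].append(feature)
        return True

    def training_pairs(self) -> List[Tuple[Any, str]]:
        """Return (feature, person) pairs usable for training a classifier."""
        return [(f, p) for p, fs in self._features.items() for f in fs]
```

The `training_pairs` output pairs each feature with its person label, which is the form a supervised classifier typically expects.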

The present invention can also be regarded as an information processing system having at least part of the above configurations or functions. It can further be regarded as an information processing method or a method of controlling an information processing system that includes at least part of the above processing, as a program for causing a computer to execute such a method, or as a computer-readable recording medium on which such a program is recorded non-transitorily. The above configurations and processes can be combined with one another to constitute the present invention as long as no technical contradiction arises.

According to the present invention, a person in an image can be identified accurately.

FIG. 1 is a block diagram showing a configuration example of a personal identification device to which the present invention is applied.
FIG. 2 is a block diagram showing a configuration example of the feature collecting device according to the first embodiment.
FIG. 3 is a flowchart showing an example of the feature collection process in the first embodiment.
FIG. 4 is a flowchart showing an example of the utterance content determination process in the first embodiment.
FIGS. 5A and 5B are diagrams explaining the skeleton information of a human body detected from an input image and posture detection based on the skeleton.
FIG. 6 is a flowchart showing an example of the utterance content determination process in the second embodiment.

<Application Example>
An application example of the present invention will be described. When an individual's human body features and behavioral features are collected, it is necessary to determine who the person being observed is. Face recognition is one image-based personal identification method, but the face cannot always be captured, so accurate identification is not always possible. Identification based on human body features and behavioral features is also possible, but when the purpose of the collection is to enable identification by those very features, it is unrealistic to rely entirely on them to identify the person during collection.

FIG. 1 is a block diagram showing a configuration example of a personal identification device 10 to which the present invention is applied. The personal identification device 10 identifies a person in a moving image from the input moving image and voice. The personal identification device 10 has an image input unit 11, a human body detection unit 12, a person identification unit 13, a voice output unit 14, and a voice input unit 15.

The image input unit 11 acquires a captured moving image (moving image data). For example, the image input unit 11 is an input terminal to which image data is input. The image input unit 11 is an example of the image input means of the present invention.

The human body detection unit 12 detects a region that appears to be a human body (human body region) from the frame image to be processed in the moving image data. Human body detection may be performed using a detector trained with any existing algorithm. The human body detection unit 12 is an example of the detection means of the present invention.

The voice output unit 14 converts the utterance content obtained from the person identification unit 13 into voice data and outputs it. The voice output unit 14 is an example of the voice output means of the present invention.

The voice input unit 15 acquires voice data. The voice input unit 15 is an example of the voice input means of the present invention.

The person identification unit 13 identifies the person in the moving image based on the moving image acquired by the image input unit 11 and the voice acquired by the voice input unit 15. The person identification unit 13 also determines the content and timing of the utterance used to elicit the response voice needed for voice-based identification. The person identification unit 13 is an example of the person identification means of the present invention.

The first identification unit 131 identifies the person detected by the human body detection unit 12 in the moving image acquired by the image input unit 11. The first identification unit 131 identifies the person based on the input image; specifically, it identifies the detected person based on at least one of facial features, human body features, and behavioral features in the image. The first identification unit 131 is an example of the first identification means of the present invention.

The second identification unit 133 identifies the person detected by the human body detection unit 12 based on the voice acquired by the voice input unit 15. The second identification unit 133 may perform identification based on acoustic features obtained by waveform analysis of the voice, or based on linguistic features obtained by natural language analysis. Identification based on acoustic features can be performed according to the degree of agreement with acoustic features registered in advance. Identification based on linguistic features can be performed according to the meaning of the content of the output voice and the content of the response voice. The second identification unit 133 is an example of the second identification means of the present invention.
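Matching against pre-registered acoustic features can be sketched in Python as a nearest-match search. Representing each voice as a fixed-length numeric vector and scoring agreement by cosine similarity are assumptions of this sketch; the specification says only that the degree of agreement with registered acoustic features is used, and the function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(
    sample: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
) -> Tuple[Optional[str], float]:
    """Return the enrolled person whose features best agree with the sample.

    The similarity score doubles as a simple stand-in for the reliability.
    """
    best_name: Optional[str] = None
    best_score = -1.0
    for name, reference in enrolled.items():
        score = cosine_similarity(sample, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

In practice the vectors would come from whatever waveform analysis the device performs; the sketch covers only the comparison step.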

The person identification unit 13 determines the timing at which the voice output unit 14 speaks and the content of the utterance. In this application example, the voice output unit 14 speaks (outputs voice) in order to obtain a spoken response from the detected person and perform voice-based identification. The person identification unit 13 therefore chooses, as the utterance timing, a timing at which voice-based identification is expected to be necessary. Typically, the person identification unit 13 may choose a timing at which the reliability of identification by the first identification unit 131 is low (less than a threshold TH1). The person identification unit 13 may also set the utterance timing so that it speaks periodically, speaks when the person is not moving, or speaks when the identification results of the first identification unit 131 and the second identification unit 133 differ. The utterance content may be determined according to the reliability of the person identification. For example, when the reliability is high, an utterance for natural communication may be made, and when the reliability is low, an utterance that includes the name of the estimated person, or one that asks directly who the detected person is, may be made. Here, an utterance that includes the person's name and an utterance that asks who the person is are both examples of utterances that confirm who the person is.

The utterance and response may be repeated several times, with the second identification unit 133 performing identification on each response voice. In this case, for example, the first utterance is a natural conversational one, and the second identification unit 133 performs identification based on the acoustic features of the response voice. If the identification results of the first identification unit 131 and the second identification unit 133 match, or if the reliability of the result from the second identification unit 133 is high, the person is considered to have been identified correctly. A second utterance is then unnecessary, or natural communication may simply continue. If, on the other hand, the result of the second identification unit 133 based on the first response differs from the result of the first identification unit 131, or the identification reliability is low, the second utterance should confirm the person more directly than the first. In the second round the person can be identified from the content of the response (semantic interpretation by natural language analysis), so a highly reliable identification result can be expected.
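The decision after the first response can be sketched in Python as follows. The function name, the threshold value, and the wording of the direct question are assumptions. The text leaves one case open, namely matching results with low voice reliability; this sketch treats that case as correctly identified, which is an assumption.

```python
from typing import Optional


def follow_up_utterance(
    first_person: str,
    voice_person: str,
    voice_reliability: float,
    th: float = 0.8,  # hypothetical threshold for "high" voice reliability
) -> Optional[str]:
    """Return a more direct second utterance, or None if none is needed."""
    if voice_person == first_person or voice_reliability >= th:
        # Results match, or the voice-based result is highly reliable:
        # the person is considered correctly identified.
        return None
    # Results differ and the voice-based reliability is low:
    # confirm the person more directly than in the first utterance.
    return "Who is there?"
```

The response to the direct question would then be interpreted by its meaning rather than by its acoustic features.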

As described above, the personal identification device 10 of this application example identifies the person in the image using two identification methods, image and voice, and can therefore identify the person accurately even when the image alone does not allow reliable identification. Spoken questions for personal identification can also be kept to a minimum, which reduces annoyance to the user. In addition, determining the utterance content as described above allows natural communication even when the user is questioned, which further reduces annoyance.

(First Embodiment)
The first embodiment of the present invention is a feature collecting device that collects human body features and behavioral features obtained by imaging a person, and it is mounted on a home communication robot 1 (hereinafter also simply called the robot 1). The feature collecting device collects these features for use in learning for personal identification.

[Configuration]
FIG. 2 is a diagram showing the configuration of the robot 1. The robot 1 includes a feature collecting device 100, a camera 200, a speaker 300, and a microphone 400. The robot 1 has a processor (arithmetic unit) such as a CPU, a main storage device, an auxiliary storage device, a communication device, and the like, and each process of the feature collecting device 100 is carried out by the processor executing a program.

The camera 200 captures images continuously using visible or non-visible light (for example, infrared light) and inputs the captured image data to the feature collecting device 100. The imaging range of the camera 200 may be fixed or variable. The imaging range may be changed by changing the orientation of the camera 200 or by the robot 1 moving autonomously.

The speaker 300 converts voice data input from the feature collecting device 100 into acoustic waves and outputs them. The microphone 400 converts acoustic waves, such as speech uttered by the user, into voice data and outputs the data to the feature collecting device 100. The microphone 400 may be configured as a microphone array so that sound sources can be separated.

From an image captured by the camera 200, the feature collecting device 100 acquires at least one of the posture, gestures, silhouette, movement path, and place of stay of the person shown in the image as features of that person. The feature collecting device 100 also identifies the person in the image and stores the feature information in association with the individual. To perform these operations, the feature collecting device 100 has a personal identification device 110, a feature acquisition unit 120, and a feature registration unit 130.

The personal identification device 110 identifies a person in a moving image from the moving image input from the camera 200 and the voice input from the microphone 400. The personal identification device 110 has an image input unit 111, a human body detection unit 112, a person identification unit 113, a voice output unit 114, and a voice input unit 115. The person identification unit 113 has a first identification unit 1131, an utterance control unit 1132, a second identification unit 1133, and a person specifying unit 1134.

The image input unit 111 is an interface that receives image data from the camera 200. In this embodiment the image input unit 111 receives image data directly from the camera 200, but it may instead receive image data via a communication device or via a recording medium.

The human body detection unit 112 detects a human body region in the input image. Any existing human body detection algorithm may be used. The human body detection unit 112 may also detect a human body that has already been detected once by means of tracking. For tracking, the human body detection unit 112 may store the type, shape, color, and so on of the clothes and accessories of a person it has detected, and use these features for detection during a predetermined period (for example, from one hour to several hours). The type, shape, and color of clothes and accessories may also be used for identification by the first identification unit 1131. Because a person is unlikely to change clothes within a short time, these features are effective for tracking.
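The time-limited use of clothing features for tracking can be illustrated with the Python sketch below. The `Appearance` fields, the class `AppearanceMemory`, exact-match comparison, and the one-hour default are all simplifying assumptions; a real tracker would compare appearance features with some tolerance.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Appearance:
    clothing_type: str  # e.g. "sweater"
    color: str          # e.g. "red"


class AppearanceMemory:
    """Remembers appearances of detected persons for a limited time."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds  # hypothetical retention period
        self._seen: Dict[Appearance, Tuple[int, float]] = {}

    def remember(self, track_id: int, appearance: Appearance, now: float) -> None:
        """Store the appearance of a detected person with a timestamp."""
        self._seen[appearance] = (track_id, now)

    def match(self, appearance: Appearance, now: float) -> Optional[int]:
        """Return the track with this appearance, if seen recently enough."""
        entry = self._seen.get(appearance)
        if entry is None:
            return None
        track_id, seen_at = entry
        if now - seen_at > self.ttl_seconds:
            # Too old: the person may have changed clothes since then.
            del self._seen[appearance]
            return None
        return track_id
```

Timestamps are passed in explicitly so that the expiry behavior can be exercised without waiting on a real clock.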

The person identification unit 113 identifies (specifies) who the person detected by the human body detection unit 112 (the detected person) is.

The first identification unit 1131 identifies the detected person based on the image acquired by the image input unit 111. Specifically, it identifies the detected person based on at least one of facial features, human body features, and behavioral features in the image. To make this possible, the facial features, human body features, and behavioral features of each person to be identified are registered in advance. As described above, when the human body detection unit 112 has acquired the type, shape, and color of the detected person's clothes and accessories, these features may also be used for identification. The first identification unit 1131 also calculates the reliability of the identification result (an index of how likely the result is to be correct).

The utterance control unit 1132 determines the timing at which the voice output unit 114 speaks and the content of the utterance. The voice output unit 114 speaks (outputs voice) in order to obtain a spoken response from the detected person and perform voice-based identification. The person identification unit 113 therefore chooses, as the utterance timing, a timing at which voice-based identification is expected to be necessary. The utterance content should be something the user does not find annoying and that allows identification based on the response voice. Specific methods of determining the utterance timing and content are described later.

第２識別部１１３３は、音声入力部１１５が取得する音声に基づいて、人体検出部１１２によって検出された人物の識別を行う。第２識別部１１３３は、音声の波形解析を行って得られる音響特徴量に基づいて識別を行ってもよいし、自然言語解析を行って得られる言語特徴量に基づいて識別を行ってもよい。音響特徴量に基づく識別は、あらかじめ登録された音響特徴量との一致度合いに基づいて行える。言語特徴量に基づく識別は、出力音声の内容と応答音声の内容の意味に基づいて行える。第２識別部１１３３は、識別結果の信頼度も算出する。 The second identification unit 1133 identifies the person detected by the human body detection unit 112 based on the voice acquired by the voice input unit 115. The second identification unit 1133 may perform the identification based on acoustic features obtained by waveform analysis of the voice, or based on language features obtained by natural language analysis. Identification based on acoustic features can be performed based on the degree of agreement with acoustic features registered in advance. Identification based on language features can be performed based on the meaning of the content of the output utterance and of the response voice. The second identification unit 1133 also calculates the reliability of its identification result.

人物特定部１１３４は、第１識別部１１３１の識別結果（第１識別結果）と第２識別部１１３３の識別結果（第２識別結果）に基づいて、検出人物が誰であるかを最終的に特定する。例えば、人物特定部１１３４は、第１識別結果と第２識別結果が一致する場合に、その識別結果を最終的な結果としてもよいし、第１識別結果または第２識別結果いずれかを採用してもよい。人物特定部１１３４は、第１識別部１１３１の識別信頼度や第２識別部１１３３の識別信頼度を考慮して、検出人物が誰であるかを特定してもよい。例えば、第１識別部１１３１の識別信頼度が十分に高い場合には、第２識別部１１３３による識別を行わずに、第１識別部１１３１の識別結果を最終的な結果としてもよい。人物特定部１１３４は、識別結果の信頼度も算出する。この信頼度は、例えば、発話制御部１１３２による発話タイミングや発話内容の決定に用いられる。 The person identification unit 1134 makes the final determination of who the detected person is, based on the identification result of the first identification unit 1131 (first identification result) and the identification result of the second identification unit 1133 (second identification result). For example, when the first and second identification results match, the person identification unit 1134 may use that result as the final result, or it may adopt either the first or the second identification result. The person identification unit 1134 may also take into account the identification reliability of the first identification unit 1131 and of the second identification unit 1133 when determining who the detected person is. For example, when the identification reliability of the first identification unit 1131 is sufficiently high, the result of the first identification unit 1131 may be used as the final result without performing identification by the second identification unit 1133. The person identification unit 1134 also calculates the reliability of the final identification result. This reliability is used, for example, by the utterance control unit 1132 to determine the utterance timing and the utterance content.
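The combination rule described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `IdResult` and `determine_person`, the skip threshold of 0.8, and the halving of the reliability on disagreement are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdResult:
    person: Optional[str]  # None means "unknown"
    reliability: float     # 0.0 to 1.0


def determine_person(first: IdResult,
                     second: Optional[IdResult],
                     skip_threshold: float = 0.8) -> IdResult:
    """Combine the image-based (first) and voice-based (second) results.

    skip_threshold and the 0.5 penalty below are assumed values.
    """
    # Image-based result is reliable enough, or no voice result exists:
    # use the first identification result as it is.
    if second is None or first.reliability >= skip_threshold:
        return first
    # Both results agree: keep the label and the higher reliability.
    if first.person == second.person:
        return IdResult(first.person, max(first.reliability, second.reliability))
    # Results disagree: adopt the more reliable one, but lower the combined
    # reliability so that a later utterance can ask more directly.
    better = first if first.reliability >= second.reliability else second
    return IdResult(better.person, better.reliability * 0.5)
```

For instance, `determine_person(IdResult("mother", 0.6), IdResult("mother", 0.7))` keeps the label "mother", whereas conflicting labels yield a result with a reduced reliability.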

音声出力部１１４は、発話制御部１１３２から発話音声のテキストデータを取得し、それを音声合成処理により音声データに変換して、スピーカ３００から出力する。 The voice output unit 114 acquires text data of the utterance voice from the utterance control unit 1132, converts it into voice data by voice synthesis processing, and outputs it from the speaker 300.

音声入力部１１５は、マイク４００から音声信号を受け取り、音声データに変換して第２識別部１１３３に出力する。音声入力部１１５は、雑音除去や話者分離などの前処理を施してもよい。 The voice input unit 115 receives a voice signal from the microphone 400, converts it into voice data, and outputs it to the second identification unit 1133. The voice input unit 115 may perform preprocessing such as noise removal and speaker separation.

特徴取得部１２０は、人体検出部１１２によって検出された人物の特徴を、入力画像から取得する。特徴取得部１２０が取得する特徴は、例えば、人体に関する特徴（例えば、人体の部位、骨格、姿勢、シルエットに関する特徴）と、行動に関する特徴（仕草、動線、滞在場所に関する特徴）の少なくともいずれかを含む。以下では、特徴取得部１２０が取得する特徴を、人体・行動特徴とも称する。これらの特徴が人体検出部１１２や第１識別部１１３１において既に算出されている場合には、特徴取得部１２０は算出済みの特徴を取得すればよい。逆に、特徴取得部１２０が算出した特徴を、第１識別部１１３１が使用してもよい。 The feature acquisition unit 120 acquires, from the input image, the features of the person detected by the human body detection unit 112. The features acquired by the feature acquisition unit 120 include, for example, at least one of features related to the human body (for example, features related to body parts, skeleton, posture, and silhouette) and features related to behavior (features related to gestures, flow lines, and places of stay). Hereinafter, the features acquired by the feature acquisition unit 120 are also referred to as human body / behavioral features. When these features have already been calculated by the human body detection unit 112 or the first identification unit 1131, the feature acquisition unit 120 may simply obtain the calculated features. Conversely, the first identification unit 1131 may use the features calculated by the feature acquisition unit 120.

図５Ａ、図５Ｂを参照して人体・行動特徴の取得について説明する。特徴取得部１２０は、入力画像５０から、検出人物５１の骨格を示す骨格情報を取得する。骨格情報は、例えば、ＯｐｅｎＰｏｓｅなどを使って取得される。骨格情報は、人体を示す特徴情報でもあるし、人体の部位（頭、首、肩、肘、手、腰、膝、足首、目、耳、指先、等）を示す特徴情報でもある。図５Ａでは、画像５０から、検出人物５１の骨格（骨格情報）５２が検出されている。 The acquisition of human body / behavioral features will be described with reference to FIGS. 5A and 5B. The feature acquisition unit 120 acquires, from the input image 50, skeleton information indicating the skeleton of the detected person 51. The skeleton information is acquired using, for example, OpenPose. The skeleton information is feature information indicating the human body, and also feature information indicating parts of the human body (head, neck, shoulders, elbows, hands, hips, knees, ankles, eyes, ears, fingertips, etc.). In FIG. 5A, the skeleton (skeleton information) 52 of the detected person 51 is detected from the image 50.

図５Ｂに示すように、骨格の形状は姿勢に依存する。そのため、特徴取得部１２０は、人体の部位の位置関係（骨格情報）から人物の姿勢情報を取得できる。姿勢情報は、人体の各部位の相対的な位置関係を列挙した情報であってもよいし、人体の部位の位置関係を分類した結果であってもよい。姿勢の分類として、例えば、直立、猫背、Ｏ脚、Ｘ脚などが挙げられる。また、特徴取得部１２０は、検出人物のシルエット情報を取得してもよく、シルエット情報から姿勢を求めてもよい。 As shown in FIG. 5B, the shape of the skeleton depends on the posture. Therefore, the feature acquisition unit 120 can acquire the posture information of the person from the positional relationship (skeleton information) of the parts of the human body. The posture information may be information listing the relative positional relationships of each part of the human body, or may be the result of classifying the positional relationships of the parts of the human body. Examples of posture classifications include upright, stoop, O-legs, and X-legs. Further, the feature acquisition unit 120 may acquire the silhouette information of the detected person, or may obtain the posture from the silhouette information.

上述の骨格情報・姿勢情報・シルエット情報はいずれも本発明における人体特徴に相当する。 The above-mentioned skeletal information, posture information, and silhouette information all correspond to the human body features in the present invention.

また、特徴取得部１２０は、各フレームの姿勢情報の変化から動作情報（仕草）を検出できる。仕草の検出結果として、例えば、歩行、屈伸、寝転び、腕組み、等を示す情報が得られる。腕組みの場合と腕組みでない場合との間で、上腕と前腕の間の位置関係などは異なる。このように、各部位の位置関係は仕草に依存する。そのため、骨格情報に基づいて、各部位の位置関係から仕草を検出できる。歩行や屈伸などの動きを伴う仕草は、互いに異なる時間に撮像された複数の画像にそれぞれ対応する複数の骨格情報を用いて検出されてもよい。歩行については、歩幅と肩幅の比率を示す情報が得られてもよい。 In addition, the feature acquisition unit 120 can detect motion information (gesture) from changes in posture information of each frame. As the detection result of the gesture, for example, information indicating walking, bending and stretching, lying down, arms folded, and the like can be obtained. The positional relationship between the upper arm and the forearm differs between the case where the arms are folded and the case where the arms are not folded. In this way, the positional relationship of each part depends on the gesture. Therefore, the gesture can be detected from the positional relationship of each part based on the skeleton information. Gestures accompanied by movements such as walking and bending and stretching may be detected by using a plurality of skeletal information corresponding to a plurality of images captured at different times. For walking, information indicating the ratio of stride length to shoulder width may be obtained.
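As one way to picture the stride-to-shoulder-width ratio mentioned above, the following sketch computes it from per-frame keypoints. The keypoint names (`left_ankle` and so on), the use of the maximum ankle separation as the stride, and the function names are assumptions made for illustration; they are not taken from the source or from any particular pose-estimation library.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Frame = Dict[str, Point]  # hypothetical layout: keypoint name -> (x, y)


def distance(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def stride_to_shoulder_ratio(frames: List[Frame]) -> float:
    """Ratio of stride length to shoulder width over a walking sequence.

    Assumption: the stride is approximated by the largest ankle-to-ankle
    distance observed, and the shoulder width by its mean over the frames.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    stride = max(distance(f["left_ankle"], f["right_ankle"]) for f in frames)
    shoulder = sum(
        distance(f["left_shoulder"], f["right_shoulder"]) for f in frames
    ) / len(frames)
    if shoulder == 0:
        raise ValueError("shoulder width is zero")
    return stride / shoulder
```

Because the ratio is normalized by the shoulder width, it is less sensitive to the distance between the person and the camera than the raw stride length in pixels.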

また、特徴取得部１２０は、検出人物の滞在場所や動線（移動経路）を特徴として取得してもよい。滞在場所に関する特徴は、過去数分間などの所定期間における人物位置の時間変化に基づいて、当該所定期間の長さに対する滞在時間の比率（滞在率）を、滞在判定領域ごとに算出すれば得られる。滞在場所の検出結果は、各滞在判定領域の滞在率を示す滞在マップ（ヒートマップ）の形式で得られる。動線に関する特徴は、滞在位置を時系列で示した情報であり、滞在マップと同様の手法により得られる。 The feature acquisition unit 120 may also acquire the detected person's places of stay and flow line (movement path) as features. The feature related to places of stay is obtained by calculating, for each stay determination area, the ratio of the stay time to the length of a predetermined period (the stay rate), based on how the person's position changes over that period, such as the past several minutes. The detection result is obtained in the form of a stay map (heat map) showing the stay rate of each stay determination area. The feature related to the flow line is information showing the stay positions in chronological order, and can be obtained by the same method as the stay map.
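The stay-rate calculation can be illustrated with a small sketch. It assumes, purely for illustration, that the stay determination areas form a square grid of side `cell_size` and that positions are sampled at a fixed interval, so that the fraction of samples in a cell equals the fraction of time spent there. The function name `stay_rate_map` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def stay_rate_map(positions: List[Tuple[float, float]],
                  cell_size: float) -> Dict[Tuple[int, int], float]:
    """Return the stay rate of each grid cell over the observed period.

    positions: (x, y) samples taken at a fixed interval within the period.
    The grid layout of stay determination areas is an assumption.
    """
    if not positions:
        return {}
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in positions
    )
    total = len(positions)
    # Stay rate = time spent in the cell / length of the period.
    return {cell: n / total for cell, n in counts.items()}
```

Keeping the same samples in time order, mapped to cells, would give the flow-line feature described in the text.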

上述した、動作、滞在場所、および動線に関する情報は、いずれも本発明における行動特徴に相当する。 The above-mentioned information on movements, places of stay, and flow lines all correspond to behavioral features in the present invention.

特徴登録部１３０は、特徴取得部１２０によって取得された人体・行動特徴を、個人識別装置１１０（人物識別部１１３）によって特定された人物と関連付けて照合用データベースなどの記憶部に登録する。特徴登録部１３０が登録を行うタイミングは任意であってよいが、例えば、人物識別部１１３による識別が信頼度高く行えたタイミングや、人物の追跡が完了したタイミングであってもよい。 The feature registration unit 130 associates the human body / behavioral features acquired by the feature acquisition unit 120 with the person identified by the personal identification device 110 (person identification unit 113) and registers them in a storage unit such as a collation database. The feature registration unit 130 may perform registration at any timing, for example, when the person identification unit 113 has identified the person with high reliability or when tracking of the person has been completed.

［処理］ [Processing]
図３は、特徴収集装置１００が行う特徴収集処理の全体的流れを示すフローチャートである。以下、図３を参照しながら本実施形態における特徴収集処理について説明する。なお、このフローチャートは本実施形態における特徴収集処理を概念的に説明するものであり、実施形態においてこのフローチャートの通りの処理が実装される必要はないことに留意されたい。 FIG. 3 is a flowchart showing the overall flow of the feature collection process performed by the feature collection device 100. Hereinafter, the feature collection process in the present embodiment will be described with reference to FIG. 3. Note that this flowchart explains the feature collection process conceptually, and the embodiment does not have to implement the process exactly as shown in the flowchart.

ステップＳ１０において、人体検出部１１２が、入力画像から人体の検出を行う。人体が検出されなかった場合（Ｓ１１−ＮＯ）には、ステップＳ１０に戻って次の処理対象フレーム画像から人体検出を行う。一方、人体が検出された場合（Ｓ１１−ＹＥＳ）には、処理はステップＳ１２に進む。 In step S10, the human body detection unit 112 detects the human body from the input image. If the human body is not detected (S11-NO), the process returns to step S10 to detect the human body from the next frame image to be processed. On the other hand, when the human body is detected (S11-YES), the process proceeds to step S12.

ステップＳ１２において、特徴取得部１２０が検出人物の人体・行動特徴を取得する。 In step S12, the feature acquisition unit 120 acquires the human body / behavioral features of the detected person.

ステップＳ１３において、第１識別部１１３１が人体・行動特徴に基づく個人識別を行う。なお、ステップＳ１２において顔特徴も取得して、ステップＳ１３において第１識別部１１３１が顔特徴に基づく個人識別を行ってもよい。 In step S13, the first identification unit 1131 performs personal identification based on the human body / behavioral characteristics. The face feature may also be acquired in step S12, and the first identification unit 1131 may perform personal identification based on the face feature in step S13.

ここで、画像に基づく識別の信頼度について簡単に説明する。顔特徴に基づく識別は比較的精度良く（信頼度高く）行えることが期待できるが、ユーザがロボット１と正対しているときしか行えない。一方、人体特徴や行動特徴に基づく識別は、ユーザの身体が写っていれば行えるが、必ずしも精度がよいとは限らない。特に、人体・行動特徴の初期学習のための特徴を収集している段階では、信頼度の高い識別はできないことが想定される。 Here, the reliability of the identification based on the image will be briefly described. It can be expected that identification based on facial features can be performed with relatively high accuracy (high reliability), but it can be performed only when the user is facing the robot 1. On the other hand, identification based on human body characteristics and behavioral characteristics can be performed if the user's body is shown, but the accuracy is not always good. In particular, it is assumed that highly reliable identification cannot be performed at the stage of collecting features for initial learning of human body / behavioral features.

第１識別部１１３１は、動画像に含まれる人物の識別を継続して行い、各フレームにおける識別の信頼度を総合して現時点での識別信頼度を算出する。この際、第１識別部１１３１は、最近に行った識別の信頼度に対して大きな重みを付けた加重平均を、現時点での識別信頼度としてもよい。 The first identification unit 1131 continuously identifies the person included in the moving image, and calculates the current identification reliability by totaling the identification reliability in each frame. At this time, the first identification unit 1131 may use a weighted average that gives a large weight to the reliability of the recent identification as the current identification reliability.
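A weighted average that favors recent frames can be written, for example, with exponentially decaying weights. The exponential form and the `decay` parameter are assumptions chosen for this sketch; the text only requires that more recent identifications receive larger weights.

```python
from typing import Sequence


def current_reliability(per_frame: Sequence[float], decay: float = 0.8) -> float:
    """Weighted average of per-frame reliabilities, ordered oldest first.

    The newest frame has weight 1, the one before it `decay`, then
    `decay**2`, and so on (0 < decay <= 1). The decay value is assumed.
    """
    if not per_frame:
        return 0.0
    n = len(per_frame)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * r for w, r in zip(weights, per_frame))
    return weighted_sum / sum(weights)
```

With `decay=1.0` this reduces to the plain mean; smaller values make the current reliability follow the latest frames more closely.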

ステップＳ１４において、発話制御部１１３２は、発話を行うタイミングであるか否かを判断する。発話を行うタイミングはあらかじめ条件として設定しておけばよく、ステップＳ１４では、人物識別部１１３は、設定した条件に該当しているか否か判断すればよい。 In step S14, the utterance control unit 1132 determines whether or not it is time to make an utterance. The utterance timing may be set in advance as a condition, and in step S14 the person identification unit 113 only needs to determine whether or not the set condition is met.

発話を行う条件として次のようなものが採用できる。 The following conditions can be adopted for making an utterance.
（１）時間 (1) Time
例えば、人体検出から所定時間（例えば１０分）経過後に１回目の発話を行い、それ以降所定の間隔で発話を行う。 For example, the first utterance is made after a predetermined time (for example, 10 minutes) has elapsed since the human body was detected, and subsequent utterances are made at predetermined intervals.
（２）データ量 (2) Amount of data
例えば、あらかじめ定められたデータ量（例えば１００回分のデータ）の特徴が取得されたら発話を行う。 For example, an utterance is made once a predetermined amount of feature data (for example, 100 acquisitions' worth of data) has been collected.
（３）行動停止 (3) Behavior stop
検出人物の行動に一定時間変化がない場合。例えば、ソファーに座ってテレビを見始めた後に一定時間経過した場合が相当する。 When the behavior of the detected person has not changed for a certain period of time, for example, when a certain time has passed after the person sat down on the sofa and started watching TV.
（４）撮像範囲外への移動 (4) Movement outside the imaging range
検出人物が撮像範囲外に移動することが予測される場合。例えば、検出人物が現在の部屋から他の部屋へ移動・外出した場合が相当する。 When the detected person is predicted to move outside the imaging range, for example, when the detected person moves from the current room to another room or goes out.
（５）発話のしやすい状況 (5) Situation in which it is easy to speak
ロボットが検出人物と対話を行うのに適した状況に達した場合。例えば、検出人物とロボットが向かい合っており（ロボットが検出人物の顔を検出でき）、かつ、検出人物とロボットの間の距離が所定距離（例えば３メートル）以内のとき。 When a situation suitable for the robot to interact with the detected person is reached, for example, when the detected person and the robot are facing each other (the robot can detect the face of the detected person) and the distance between them is within a predetermined distance (for example, 3 meters).
（６）識別信頼度が低い場合 (6) When the identification reliability is low
人物識別部１１３は、第１識別部１１３１と第２識別部１１３３の両方の識別結果を用いて最終的な識別結果を確定する。そこで、第１識別部１１３１による識別信頼度が閾値ＴＨ１以上であれば、その結果を人物識別部１１３の識別結果として確定し、識別信頼度が閾値ＴＨ１未満であれば、第２識別部１１３３による識別を行うために発話を行うようにしてもよい。あるいは、第１識別部１１３１と第２識別部１１３３の両方の識別結果を考慮した上で識別信頼度が閾値ＴＨ１未満となる場合に、さらに第２識別部１１３３による識別を行うために発話を行うようにしてもよい。 The person identification unit 113 determines the final identification result using the identification results of both the first identification unit 1131 and the second identification unit 1133. Accordingly, if the identification reliability of the first identification unit 1131 is equal to or higher than a threshold TH1, that result may be fixed as the identification result of the person identification unit 113, and if the identification reliability is lower than the threshold TH1, an utterance may be made so that the second identification unit 1133 can perform identification. Alternatively, an utterance may be made for further identification by the second identification unit 1133 when the identification reliability remains below the threshold TH1 even after the results of both the first identification unit 1131 and the second identification unit 1133 have been taken into account.

上記の条件は複数組み合わせてもよく、例えば、上記複数の条件のいくつかの何れかが成立するときに発話するようにしてもよいし、上記複数の条件のいくつかが同時に成立するときに発話するようにしてもよい。さらに、検出人物が眠っているときや集中しているときなど、話しかけることが適切ではない状況では、上記条件を満たしても発話しないようにしてもよい。 The above conditions may be combined. For example, an utterance may be made when any one of several selected conditions is satisfied, or only when several of the conditions are satisfied at the same time. Furthermore, in situations where it is not appropriate to talk to the detected person, such as when the person is sleeping or concentrating, the utterance may be withheld even if the above conditions are satisfied.
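The utterance conditions (1) to (6) and their combination can be sketched as a simple predicate. Everything named here is hypothetical: the `Situation` fields, the function `should_speak`, and the numeric thresholds other than the 10 minutes, 100 acquisitions, and 3 meters given as examples in the text. Condition (1) is simplified to the first utterance only.

```python
from dataclasses import dataclass

# Assumed values: the text gives 10 minutes, 100 data items and 3 m as
# examples; the idle time and TH1 below are placeholders.
FIRST_UTTERANCE_SEC = 600.0
REQUIRED_SAMPLES = 100
IDLE_SEC = 300.0
MAX_DISTANCE_M = 3.0
TH1 = 0.8


@dataclass
class Situation:
    seconds_since_detection: float
    samples_collected: int
    seconds_without_motion: float
    leaving_predicted: bool
    face_visible: bool
    distance_m: float
    reliability: float
    do_not_disturb: bool  # e.g. the person is asleep or concentrating


def should_speak(s: Situation, require_all: bool = False) -> bool:
    # Never speak when talking to the person would be inappropriate.
    if s.do_not_disturb:
        return False
    conditions = [
        s.seconds_since_detection >= FIRST_UTTERANCE_SEC,    # (1) time
        s.samples_collected >= REQUIRED_SAMPLES,             # (2) amount of data
        s.seconds_without_motion >= IDLE_SEC,                # (3) behavior stop
        s.leaving_predicted,                                 # (4) leaving the range
        s.face_visible and s.distance_m <= MAX_DISTANCE_M,   # (5) easy to speak
        s.reliability < TH1,                                 # (6) low reliability
    ]
    return all(conditions) if require_all else any(conditions)
```

A real system would likely select a subset of the conditions rather than all six; the `require_all` switch only illustrates the two combination styles mentioned above.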

なお、ここでの発話は、個人識別のための応答音声を得ることを目的としたものであるので、上記のようなタイミングで行うようにしているが、それ以外のタイミングでの発話を禁止するものではない。例えば、上記の条件を満たさないタイミングにおいて、コミュニケーションのためにユーザに話しかけるようにしても構わない。 The utterances described here are intended to obtain a response voice for personal identification and are therefore made at the timings described above, but this does not prohibit utterances at other timings. For example, the robot may talk to the user for the purpose of communication at a timing that does not satisfy the above conditions.

ステップＳ１４において発話タイミングの条件を満たすと判断された場合は、処理はステップＳ１５に進み、そうでない場合には、ステップＳ１９に進む。 If it is determined in step S14 that the utterance timing condition is satisfied, the process proceeds to step S15, and if not, the process proceeds to step S19.

ステップＳ１５では、発話制御部１１３２が、発話の内容（発話テキスト）を決定する。本実施形態では、発話制御部１１３２は、現時点での識別信頼度に基づいて発話内容を決定する。図４は、ステップＳ１５の発話内容決定処理を説明するフローチャートである。 In step S15, the utterance control unit 1132 determines the content of the utterance (speech text). In the present embodiment, the utterance control unit 1132 determines the utterance content based on the current identification reliability. FIG. 4 is a flowchart illustrating the utterance content determination process in step S15.

図４に示すように、ステップＳ１５１において、発話制御部１１３２は、現在の識別信頼度に応じて発話レベルを決定する。本実施形態では、例えば、識別信頼度が０．８以上の高信頼度、識別信頼度が０．５以上０．８未満の中信頼度、識別信頼度が０．５未満の低信頼度の３つのレベルに分類する。この閾値は例示に過ぎず、システム要求に応じて適宜決定すればよい。また、閾値は状況に応じて変化するものであってもよい。また、本実施形態ではレベルを３段階に分けているが、２段階あるいは４段階以上に分けても構わない。 As shown in FIG. 4, in step S151, the utterance control unit 1132 determines the utterance level according to the current identification reliability. In the present embodiment, the reliability is classified into three levels, for example: high reliability (identification reliability of 0.8 or more), medium reliability (0.5 or more and less than 0.8), and low reliability (less than 0.5). These thresholds are merely examples and may be set as appropriate according to the system requirements. The thresholds may also vary depending on the situation. Although three levels are used in this embodiment, two levels, or four or more levels, may be used instead.

識別信頼度が高い場合は、ステップＳ１５２に進み、発話制御部１１３２は、コミュニケーションの自然さを重視して発話内容を決定する。例えば、発話制御部１１３２は、検出人物の呼称を含まない内容の発話内容を決定する。人物が検出された場所が台所である場合には、発話制御部１１３２は、例えば「台所でいま何してるの？」を発話内容として決定する。 If the identification reliability is high, the process proceeds to step S152, and the utterance control unit 1132 determines the utterance content with an emphasis on natural communication. For example, the utterance control unit 1132 chooses utterance content that does not include the name of the detected person. When the person was detected in the kitchen, the utterance control unit 1132 chooses, for example, "What are you doing in the kitchen right now?" as the utterance content.

識別信頼度が中程度の場合は、ステップＳ１５３に進み、発話制御部１１３２は、不自然とはならない程度の内容で、検出人物が誰であるかを確かめるように発話内容を決定する。発話制御部１１３２は、例えば、第１識別部１１３１の識別結果の呼称を含めた内容を発話内容とする。第１識別部１１３１の識別結果が「母」である場合には、発話制御部１１３２は、例えば「お母さん、台所で何してるの？」を発話内容として決定する。 If the identification reliability is medium, the process proceeds to step S153, and the utterance control unit 1132 determines utterance content that confirms who the detected person is while still sounding reasonably natural. For example, the utterance control unit 1132 uses content that includes the name given by the identification result of the first identification unit 1131. When the identification result of the first identification unit 1131 is "mother", the utterance control unit 1132 chooses, for example, "Mom, what are you doing in the kitchen?" as the utterance content.

識別信頼度が低い場合は、ステップＳ１５４に進み、発話制御部１１３２は、検出人物が誰であるかをより直接的に問いかける内容を発話内容とする。発話制御部１１３２は、例えば「台所にいるのは誰ですか？」を発話内容として決定する。 If the identification reliability is low, the process proceeds to step S154, and the utterance control unit 1132 uses utterance content that asks more directly who the detected person is. The utterance control unit 1132 chooses, for example, "Who is in the kitchen?" as the utterance content.
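The three-level selection of utterance content can be sketched as below, using the example thresholds 0.8 and 0.5 from the text. The function names and the English phrasing of the utterances are illustrative assumptions.

```python
from typing import Optional


def utterance_level(reliability: float,
                    high: float = 0.8,
                    medium: float = 0.5) -> str:
    """Classify the current identification reliability into three levels."""
    if reliability >= high:
        return "high"
    if reliability >= medium:
        return "medium"
    return "low"


def utterance_text(reliability: float, place: str,
                   name: Optional[str] = None) -> str:
    """Choose the utterance according to the reliability level."""
    level = utterance_level(reliability)
    if level == "high":
        # Natural utterance without the person's name.
        return f"What are you doing in the {place} right now?"
    if level == "medium" and name is not None:
        # Include the name from the image-based result to confirm it.
        return f"{name}, what are you doing in the {place}?"
    # Low reliability (or no candidate name): ask directly.
    return f"Who is in the {place}?"
```

For example, `utterance_text(0.6, "kitchen", "Mom")` produces the name-confirming question, while a reliability of 0.3 produces the direct question.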

ステップＳ１５において決定された発話内容のテキストデータは、発話制御部１１３２から音声出力部１１４に渡される。ステップＳ１６において、音声出力部１１４は、発話テキストを音声合成により音声データに変換して、スピーカ３００から出力する。ステップＳ１７において、システム発話に対する応答を音声入力部１１５がマイク４００から取得する。 The text data of the utterance content determined in step S15 is passed from the utterance control unit 1132 to the voice output unit 114. In step S16, the voice output unit 114 converts the spoken text into voice data by voice synthesis and outputs it from the speaker 300. In step S17, the voice input unit 115 acquires the response to the system utterance from the microphone 400.

ステップＳ１８において、第２識別部１１３３が入力音声に基づく個人識別を行う。第２識別部１１３３は、音響特徴（音響解析）に基づく識別と、言語特徴（意味解析）に基づく識別を行う。音響特徴に基づく識別は応答音声が得られれば結果を得られるが、言語特徴に基づく識別は問いかけと応答の内容によっては誰であるか不明となる。ただし、言語特徴に基づく識別では「（私は）母です」といった意味を考慮できるため識別結果は信頼できると考えられる。 In step S18, the second identification unit 1133 performs personal identification based on the input voice. The second identification unit 1133 performs identification based on acoustic features (acoustic analysis) and identification based on language features (semantic analysis). Identification based on acoustic features yields a result whenever a response voice is obtained, whereas identification based on language features may fail to determine who the person is, depending on the content of the question and the response. However, because identification based on language features can take into account meanings such as "(I am) the mother", its result is considered reliable when one is obtained.

ステップＳ１９において、人物特定部１１３４は、第１識別部１１３１による画像に基づく識別結果（Ｓ１３）と、第２識別部１１３３による音声に基づく識別結果（Ｓ１８）を考慮して、検出人物を特定する。 In step S19, the person identification unit 1134 determines who the detected person is, taking into account the image-based identification result of the first identification unit 1131 (S13) and the voice-based identification result of the second identification unit 1133 (S18).

発話内容が「台所で何しているの？」である場合、応答として「いま料理中」が得られることが想定される。この場合、第２識別部１１３３は、言語特徴に基づく識別は行えないが、音響特徴に基づいて識別結果が得られる。ここで、第２識別部１１３３の識別結果が第１識別部１１３１の識別結果と一致すれば、人物特定部１１３４は第１識別部１１３１の識別結果が正しいことを確認でき、これを最終的な特定結果とする。一方、これら２つの識別結果が異なった場合には、人物特定部１１３４は、いずれかの識別結果を採用してもよいし、検出人物が不明であるとしてもよい。この場合は、人物特定部１１３４は識別信頼度を低く設定して、次回の発話においてより直接的に検出人物が誰であるかを確認するようにしてもよい。 If the utterance is "What are you doing in the kitchen?", a response such as "I'm cooking right now" is expected. In this case, the second identification unit 1133 cannot perform identification based on language features, but it obtains an identification result based on acoustic features. If the identification result of the second identification unit 1133 matches that of the first identification unit 1131, the person identification unit 1134 can confirm that the result of the first identification unit 1131 is correct and adopts it as the final result. If the two identification results differ, the person identification unit 1134 may adopt either result or may treat the detected person as unknown. In that case, the person identification unit 1134 may set the identification reliability low so that the next utterance confirms more directly who the detected person is.

発話内容が「お母さん、台所で何しているの？」である場合、人物が実際に「母」である場合には、応答として「いま料理中」が得られることが想定され、「母」ではない場合には、「私はお母さんじゃないよ」や「私は姉だよ」といった応答が得られることが想定される。いずれの場合も、第２識別部１１３３は、音響特徴に基づく識別と、言語特徴に基づく識別の両方が行える。言語特徴に基づく識別では、「母」という呼称を用いた問いかけに対して、応答文にそれを否定する語句が含まれているかいないかに基づいて、相手が「母」であるか否かが識別できる。応答文に自分が誰であるかを示す語句が含まれていれば、それに基づいて検出人物を識別できる。この際、音響特徴に基づく識別結果と言語特徴に基づく識別結果に相違が生じることも考えられるが、言語特徴に基づく識別結果を優先してもよいし、それぞれの識別信頼度を考慮して判断してもよい。 If the utterance is "Mom, what are you doing in the kitchen?" and the person really is the "mother", a response such as "I'm cooking right now" is expected; if the person is not the "mother", a response such as "I'm not Mom" or "I'm the older sister" is expected. In either case, the second identification unit 1133 can perform both identification based on acoustic features and identification based on language features. In identification based on language features, whether or not the other party is the "mother" can be determined from whether the response to a question using the name "mother" contains a phrase that denies it. If the response contains a phrase indicating who the speaker is, the detected person can be identified on that basis. The identification result based on acoustic features and the one based on language features may differ; in that case, the result based on language features may be given priority, or the decision may be made in consideration of the respective identification reliabilities.
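The denial and self-introduction check can be illustrated with a deliberately simple keyword-based sketch. It works on English stand-in replies and a hypothetical list of family roles; the source does not specify how the natural language analysis is implemented, and a real system would need proper language processing.

```python
import re
from typing import Optional, Sequence

KNOWN_ROLES = ("mother", "father", "sister", "brother")  # hypothetical


def identify_from_reply(asked_name: str, reply: str,
                        known: Sequence[str] = KNOWN_ROLES) -> Optional[str]:
    """Identify the speaker from a reply to a question that used asked_name.

    Returns the self-declared role if one is found, asked_name if the reply
    does not deny it, or None if it is denied without naming anyone else.
    """
    text = reply.lower()
    # Does the reply deny the name used in the question?
    denied = re.search(
        rf"\bnot (?:the |your )?{re.escape(asked_name)}\b", text
    ) is not None
    # Does the reply state who the speaker is?
    for name in known:
        if name == asked_name and denied:
            continue
        pattern = rf"\b(?:i am|i'm) (?:the |your )?{re.escape(name)}\b"
        if re.search(pattern, text):
            return name
    return None if denied else asked_name
```

In this sketch a reply without any denial, such as "I'm cooking right now", is taken as consistent with the name used in the question, which mirrors the reasoning in the paragraph above.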

発話内容が「台所にいるのは誰ですか？」である場合、応答は「私は○○です」が得られることが想定される。したがって、第２識別部１１３３は、言語特徴に基づく識別を行って検出人物を識別すればよい。検出人物が誰であるかを直接的に問い合わせる発話内容を採用しているため、応答の意味内容から検出人物をより確実に識別できる。この際にも上記と同様に、第２識別部１１３３は音響特徴に基づく識別を行ってもよい。 If the utterance is "Who is in the kitchen?", a response such as "I am XX" is expected. Therefore, the second identification unit 1133 only needs to identify the detected person by performing identification based on language features. Because the utterance asks directly who the detected person is, the detected person can be identified more reliably from the meaning of the response. In this case as well, the second identification unit 1133 may additionally perform identification based on acoustic features, as described above.

ステップＳ２０では、特徴収集を継続するか否かが判断される。引き続き特徴収集を行う場合は、ステップＳ１０に戻って次のフレームを処理する。 In step S20, it is determined whether or not to continue feature collection. If the feature collection is to be continued, the process returns to step S10 to process the next frame.

特徴収集を終了する場合には、ステップＳ２１において、特徴登録部１３０が、取得された人体・行動特徴を検出された人物の識別結果と関連付けて記憶部（不図示）に登録する。なお、図３のフローチャートでは、特徴収集を終了するタイミングで特徴登録を行っているが、検出人物の追跡が完了したタイミングで特徴登録を行ってもよい。特徴登録部１３０は、追跡の開始から終了までに得られた特徴の全てを、一人の人物と関連付けて登録する。ただし、特徴登録部１３０は、一つの追跡期間を複数に分割して、それぞれの期間について、得られた特徴を当該期間内の識別結果の人物と関連付けて登録してもよい。この処理は、人物が途中で入れ替わったのを正しく認識できず人体検出部１１２は同一人物として検出していたが、人物識別部１１３によって異なる人物として識別された場合に行われうる。 When feature collection is to end, in step S21 the feature registration unit 130 registers the acquired human body / behavioral features in a storage unit (not shown) in association with the identification result of the detected person. In the flowchart of FIG. 3, feature registration is performed when feature collection ends, but it may instead be performed when tracking of the detected person is completed. The feature registration unit 130 registers all the features obtained from the start to the end of tracking in association with one person. However, the feature registration unit 130 may divide one tracking period into a plurality of periods and, for each period, register the obtained features in association with the person given by the identification result within that period. This may be done when the human body detection unit 112 failed to recognize that the person was replaced partway through and kept detecting the same person, while the person identification unit 113 identified different persons.
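Splitting one tracking period by identification result before registration can be sketched with `itertools.groupby`. The data layout (a time-ordered list of `(features, person)` pairs) and the dictionary used as the storage unit are assumptions made for this illustration.

```python
from itertools import groupby
from typing import Any, Dict, List, Optional, Tuple

Track = List[Tuple[Any, Optional[str]]]  # (features, identified person)


def split_by_identity(track: Track) -> List[Tuple[Optional[str], List[Any]]]:
    """Split a time-ordered track into contiguous runs of the same person."""
    segments = []
    for person, group in groupby(track, key=lambda item: item[1]):
        segments.append((person, [features for features, _ in group]))
    return segments


def register(track: Track, database: Dict[str, List[Any]]) -> None:
    """Register each segment's features under its identified person."""
    for person, features in split_by_identity(track):
        if person is not None:  # skip segments whose person is unknown
            database.setdefault(person, []).extend(features)
```

If the identification result never changes during tracking, `split_by_identity` returns a single segment, which corresponds to registering all features for one person.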

［本実施形態の有利な効果］ [Advantageous effects of this embodiment]
本実施形態によれば、画像に基づく個人識別と音声に基づく個人識別を行い、両方を総合して最終的な識別結果を得られるため、精度のよい識別が行える。特に、画像だけからは精度のよい識別が行えない場合に、システム発話を行ってユーザからの音声応答を取得して音声に基づく識別を行うことで、精度のよい識別を可能としている。さらに、画像に基づく識別結果が信頼できない場合のみに発話を行ったり、画像に基づく識別信頼度に応じて発話内容を決定したりすることで、ユーザが煩わしさを感じることを最小限にできる。 According to the present embodiment, personal identification based on images and personal identification based on voice are both performed, and the final identification result is obtained by combining them, so that accurate identification can be performed. In particular, when accurate identification cannot be performed from images alone, accurate identification is made possible by making a system utterance, acquiring a voice response from the user, and performing identification based on that voice. Furthermore, by speaking only when the image-based identification result is unreliable, or by determining the utterance content according to the image-based identification reliability, annoyance to the user can be minimized.

（第２の実施形態） (Second Embodiment)
第１の実施形態では、図４に示すように識別信頼度に応じて発話内容を決定している。しかしながら、ユーザから音声による応答が得られれば、少なくとも音響特徴に基づく識別ができることと、１回の対話で複数回の発話が可能であることを考慮して、本実施形態では、発話内容の決定処理を第１の実施形態から変更する。以下、第１の実施形態との相違点について主に説明する。 In the first embodiment, the utterance content is determined according to the identification reliability, as shown in FIG. 4. In the present embodiment, however, the utterance content determination process is changed from that of the first embodiment, in view of the facts that identification based on at least acoustic features is possible once a voice response is obtained from the user, and that multiple utterances can be made in one dialogue. The following mainly describes the differences from the first embodiment.

図６は、本実施形態における発話内容決定処理の流れを示すフローチャートである。本実施形態では複数回の発話を行うことも想定しており、したがって、図６に示す処理は図３におけるステップＳ１５の処理そのものではないことに留意されたい。 FIG. 6 is a flowchart showing the flow of the utterance content determination process in the present embodiment. It should be noted that the present embodiment also assumes that the utterance is performed a plurality of times, and therefore the process shown in FIG. 6 is not the process itself of step S15 in FIG.

発話制御部１１３２は、発話を行うタイミングになったら、ステップＳ３１に示すように、自然なコミュニケーションとなるような発話内容を決定する。したがって、発話内容には検出人物の呼称を含める必要はなく、「台所でいま何してるの？」といった内容が発話内容として決定される。 When it is time to make an utterance, the utterance control unit 1132 determines utterance content that makes for natural communication, as shown in step S31. The utterance content therefore does not need to include the name of the detected person, and content such as "What are you doing in the kitchen right now?" is chosen.

ステップＳ３２では、ステップＳ３１において決定された内容をスピーカ３００から出力して、その結果としてマイク４００が取得するユーザの応答音声に対して、第２識別部１１３３が識別を行う。ここでは、少なくとも音響特徴に基づく識別が行われればよい。言語特徴に基づく識別も可能であれば当然実施してもよい。 In step S32, the content determined in step S31 is output from the speaker 300, and as a result, the second identification unit 1133 identifies the user's response voice acquired by the microphone 400. Here, at least the identification based on the acoustic characteristics may be performed. Of course, identification based on language characteristics may be carried out if possible.

ステップＳ３３では、第１識別部１１３１による識別結果と、第２識別部１１３３による識別結果が一致するか判断する。一致する場合には、検出人物が特定できるので、それ以上の発話を行う必要はない。一方、２つの識別結果が相違する場合には、人物をより確実に識別するために、ステップＳ３４においてさらなる発話内容を決定する。発話制御部１１３２は、２回目の発話内容を、１回目の発話と比較して、より直接的に検出人物が誰であるかを確認する内容として決定する。いまの場合は、例えば、「台所にいるのはお母さんじゃないの？」という発話内容を採用できる。あるいは、第１の実施形態の信頼度が中レベルまたは低レベルの時と同様に、検出人物の呼称を含む発話内容（例：「お母さん、そこで何してるの？」）や、誰であるかを直接問い合わせる発話内容（例：「台所にいるのは誰ですか？」）を２回目の発話内容として決定してもよい。 In step S33, it is determined whether the identification result of the first identification unit 1131 matches that of the second identification unit 1133. If they match, the detected person can be determined, so no further utterance is needed. If the two identification results differ, further utterance content is determined in step S34 in order to identify the person more reliably. The utterance control unit 1132 chooses, as the second utterance, content that confirms who the detected person is more directly than the first utterance did. In the present example, an utterance such as "Isn't that Mom in the kitchen?" can be used. Alternatively, as in the medium or low reliability cases of the first embodiment, utterance content that includes the name of the detected person (e.g., "Mom, what are you doing there?") or that asks directly who the person is (e.g., "Who is in the kitchen?") may be chosen as the second utterance.
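The two-step dialogue of steps S31 to S34 can be sketched as follows. The callback `ask`, which stands in for speaking an utterance and running the second identification unit on the reply, is hypothetical, as is the choice to return the second reply's result as the final answer.

```python
from typing import Callable, Optional


def identify_by_dialogue(image_result: str,
                         ask: Callable[[str], Optional[str]]) -> Optional[str]:
    """Start with a natural utterance and escalate only on disagreement.

    ask(text) is assumed to speak the text, wait for the reply, and return
    the voice-based identification result (or None if unknown).
    """
    # S31/S32: natural utterance; at least acoustic identification is done.
    voice_result = ask("What are you doing in the kitchen right now?")
    # S33: the results agree, so no further utterance is needed.
    if voice_result == image_result:
        return image_result
    # S34: the results differ, so ask more directly who the person is.
    return ask("Who is in the kitchen?")
```

Extending this to three or more utterances would amount to iterating over a list of progressively more direct prompts until the results agree.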

According to the present embodiment, the same effects as in the first embodiment are obtained and, in addition, a more natural dialogue is possible when performing voice-based identification.

In the above description, two utterances are made in one dialogue, but three or more utterances may be made in one dialogue. In that case, the first several utterances may be natural ones, and an utterance that confirms who the detected person is may be made when the person cannot be identified based on the voice.

The processing of the present embodiment is also applicable to the first embodiment in the case where the image-based identification reliability is high, an utterance with natural content is made, and the voice-based identification result differs from the image-based identification result.

(Other)
Each of the embodiments described above is merely an example of the present invention. The present invention is not limited to the specific forms described above, and various modifications are possible within the scope of its technical idea.

The above description covered the timing and content of the utterances by which the feature collection device 100 (home robot) speaks to the user; this processing concerns utterances made in order to obtain a voice-based response from the user. When the home robot is implemented as a communication robot, the above processing need not be applied to utterances made in order to communicate with the user. In addition, the above description gives utterances that do not include the other party's name as examples of natural communication, but in a scene or with a person for whom an utterance including the name is natural, utterance content including the name is a natural utterance.

In the above embodiments, an example in which the feature collection device 100 is mounted on a home robot was described, but it may instead be mounted on a surveillance camera or the like. Further, the personal identification device 110 need not be mounted on the feature collection device 100 that collects features for learning; it may be implemented on its own and used to identify individuals.

(Supplementary note)
A personal identification device comprising:
voice output means (14, 114) for outputting voice;
voice input means (15, 115) for acquiring voice that responds to the output voice;
image input means (11, 111);
detection means (12, 112) for detecting a person from a moving image input to the image input means; and
person identification means (13, 113) for identifying the person detected by the detection means,
wherein the person identification means
has first identification means (131, 1131) for identifying the person based on the moving image, and
second identification means (133, 1133) for identifying the person based on input voice obtained as a response to the output voice, and
identifies the person based on a first identification result, which is the identification result of the first identification means, and a second identification result, which is the identification result of the second identification means.

10: Personal identification device  11: Image input unit  12: Human body detection unit
13: Person identification unit  131: First identification unit  133: Second identification unit
14: Voice output unit  15: Voice input unit
100: Feature collection device
110: Personal identification device  111: Image input unit  112: Human body detection unit
113: Person identification unit  1131: First identification unit  1132: Utterance control unit
1133: Second identification unit  1134: Human body identification unit
114: Voice output unit  115: Voice input unit
120: Feature acquisition unit  130: Feature registration unit
200: Camera  300: Speaker  400: Microphone

Claims

A personal identification device comprising:
voice output means for outputting voice;
voice input means for acquiring voice that responds to the output voice;
image input means;
detection means for detecting a person from a moving image input to the image input means; and
person identification means for identifying the person detected by the detection means,
wherein the person identification means
has first identification means for identifying the person based on the moving image, and
second identification means for identifying the person based on input voice obtained as a response to the output voice,
identifies the person based on a first identification result, which is the identification result of the first identification means, and a second identification result, which is the identification result of the second identification means,
determines, when the reliability of the first identification means is less than a first threshold and greater than or equal to a second threshold smaller than the first threshold, content that includes, as a form of address, the name of the person given by the first identification result as the content of the output voice, and
determines, when the reliability of the first identification means is less than the second threshold, content that asks who the person is as the content of the output voice.

The personal identification device according to claim 1, wherein, when the reliability of the first identification means is equal to or higher than the first threshold, the person identification means uses the first identification result as the identification result of the person.
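
As a rough illustration of the threshold conditions stated in the claims above, the choice of output voice content can be sketched as follows. The names `UtteranceKind`, `select_utterance_kind`, and `build_utterance`, as well as the example threshold values 0.8 and 0.5, are hypothetical and are not taken from the specification; the sentence templates merely reuse the examples given in the description.

```python
from enum import Enum
from typing import Optional


class UtteranceKind(Enum):
    ACCEPT_FIRST_RESULT = "accept_first_result"  # reliability >= first threshold
    ADDRESS_BY_NAME = "address_by_name"          # second <= reliability < first
    ASK_WHO = "ask_who"                          # reliability < second threshold


def select_utterance_kind(
    reliability: float, first_threshold: float, second_threshold: float
) -> UtteranceKind:
    """Map the first identification's reliability to one of three cases."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    if reliability >= first_threshold:
        return UtteranceKind.ACCEPT_FIRST_RESULT
    if reliability >= second_threshold:
        return UtteranceKind.ADDRESS_BY_NAME
    return UtteranceKind.ASK_WHO


def build_utterance(kind: UtteranceKind, name: str, place: str) -> Optional[str]:
    """Return example output voice content, or None if no confirmation is needed."""
    if kind is UtteranceKind.ADDRESS_BY_NAME:
        return f"{name}, what are you doing there?"
    if kind is UtteranceKind.ASK_WHO:
        return f"Who is in the {place}?"
    return None


# Example with assumed thresholds 0.8 and 0.5.
kind = select_utterance_kind(0.6, first_threshold=0.8, second_threshold=0.5)
print(build_utterance(kind, "Mom", "kitchen"))
```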

The personal identification device according to claim 1 or 2, wherein the voice output means outputs the output voice at a predetermined timing, and the predetermined timing is at least one of:
a timing at which the identification reliability of the person falls below a threshold;
a timing at which a first predetermined time has elapsed since the person was detected;
a timing at which a state in which the person shows almost no change over time has continued for a second predetermined time;
a timing at which the person moves out of the imaging range; and
a timing at which the distance between the person and the personal identification device becomes equal to or less than a predetermined distance.
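
The timing conditions listed above could be checked, for example, as in the following sketch. It is an assumption-laden illustration only: the `Observation` and `TimingConfig` fields and all default values are hypothetical, the sketch evaluates all five conditions although a device need only use at least one of them, and it tests the current state rather than detecting the moment a condition first becomes true.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    reliability: float               # current identification reliability
    seconds_since_detection: float   # time since the person was detected
    seconds_nearly_still: float      # how long the person has barely changed
    in_imaging_range: bool           # False once the person has left the frame
    distance_m: float                # distance between person and device


@dataclass
class TimingConfig:
    # All values below are assumed for illustration.
    reliability_threshold: float = 0.8
    first_predetermined_s: float = 10.0
    second_predetermined_s: float = 5.0
    predetermined_distance_m: float = 1.0


def should_speak(obs: Observation, cfg: TimingConfig) -> bool:
    """Return True if any of the five timing conditions currently holds."""
    return any((
        obs.reliability < cfg.reliability_threshold,
        obs.seconds_since_detection >= cfg.first_predetermined_s,
        obs.seconds_nearly_still >= cfg.second_predetermined_s,
        not obs.in_imaging_range,
        obs.distance_m <= cfg.predetermined_distance_m,
    ))
```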

The personal identification device according to any one of claims 1 to 3, wherein, when the first identification result and the second identification result do not match, the person identification means causes a new output voice to be output and performs identification by the second identification means based on input voice that responds to the new output voice, and
the content of the new output voice confirms the person more directly than the content of the previous output voice.

The personal identification device according to any one of claims 1 to 4, wherein the second identification means identifies the person by performing at least one of waveform analysis and language analysis on the input voice.

The personal identification device according to any one of claims 1 to 5, wherein the first identification means identifies the person based on at least one of facial features, human body features, and behavioral features obtained from the moving image.

A feature collection device comprising:
the personal identification device according to any one of claims 1 to 6;
feature acquisition means for acquiring, from the moving image input to the image input means, a feature relating to at least one of the human body and the behavior of the detected person; and
feature registration means for registering the feature acquired by the feature acquisition means in association with the person identified by the person identification means.

A personal identification method performed by a computer, comprising:
a detection step of detecting a person from a moving image;
a first identification step of identifying the person based on the moving image;
a voice output step of outputting voice;
a voice input step of acquiring voice that responds to the output voice;
a second identification step of identifying the person based on input voice obtained as a response to the output voice; and
a third identification step of identifying the person based on a first identification result, which is the identification result of the first identification step, and a second identification result, which is the identification result of the second identification step,
wherein, in the voice output step,
when the reliability in the first identification step is less than a first threshold and greater than or equal to a second threshold smaller than the first threshold, content that includes, as a form of address, the name of the person given by the first identification result is determined as the content of the output voice, and
when the reliability in the first identification step is less than the second threshold, content that asks who the person is is determined as the content of the output voice.

The personal identification method according to claim 8, wherein the voice output step, the voice input step, and the second identification step are performed when the reliability in the first identification step is lower than the first threshold.

The personal identification method according to claim 9, wherein, when the reliability in the first identification step is equal to or higher than the first threshold, the first identification result is used as the identification result of the person in the third identification step.

A program for causing a computer to execute each step of the method according to any one of claims 8 to 10.