JP2012212071A

JP2012212071A - Face image authentication device

Info

Publication number: JP2012212071A
Application number: JP2011078429A
Authority: JP
Inventors: Naoyuki Takada; 直幸高田; Masanori Onozuka; 正則小野塚
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2011-03-31
Filing date: 2011-03-31
Publication date: 2012-11-01
Anticipated expiration: 2031-03-31
Also published as: JP5751889B2

Abstract

PROBLEM TO BE SOLVED: To provide a face image authentication device which outputs a voice at an appropriate timing for each user when the faces of a plurality of users are verified sequentially. SOLUTION: A face image authentication device 1 comprises: a voice output unit 3 for outputting a voice; face detection means 11 for extracting an input face image from an input image; face verification means 13 for verifying the input face image against a registered face image to determine whether they show the same person; voice output determination means 14 for determining whether the voice output unit is outputting a voice; output voice information creation means 15 for creating output voice information; voice synthesis means 17 for synthesizing an output voice signal from the output voice information; and control means 16 for causing the voice output unit to output a voice based on the output voice signal. At the time when the face verification means performs the determination, the output voice information creation means creates, as the output voice information, standard voice information when the voice output unit is outputting no voice, and shortened voice information, shorter than the standard voice information, when the voice output unit is outputting a voice.

Description

The present invention relates to a face image authentication device, and more particularly to a face image authentication device that outputs voice to a user.

Face image authentication devices have been developed that are installed at, for example, the entrance of a company office and that authenticate whether a passing user (an employee, etc.) is permitted to enter the room by verifying a captured image of the user's face against face images registered in advance. Conventional face image authentication devices verified the face image when the user stopped in front of the entrance door and entered a PIN. In recent years, however, walk-through face image authentication devices have been developed that capture and verify the face of a user who is walking and authenticate the user before the user reaches the entrance. With such a walk-through device, when multiple users approach the entrance one after another, as at the start of the working day, each user must be authenticated before reaching the entrance; otherwise the users cannot enter smoothly and convenience is impaired.
Patent Document 1 therefore proposes a person recognition device that shortens the face verification time by changing the accuracy of the face verification process according to the number of people passing. This person recognition device counts the faces in the input image and changes the face image resolution, the verification target region, and so on according to that count, so that the more people are passing, the shorter the face verification time per person becomes.

Patent Document 1: JP 2007-156541 A

The person recognition device described in Patent Document 1 can reduce the face verification time per person when many people are passing, so each user can enter the room smoothly. However, when a walk-through face image authentication device is to output, for example, a voice message to the user according to the result of the face verification process, the time needed to output that voice is generally much longer than the time needed for the face verification process itself. Consequently, when multiple users pass at the same time, shortening the face verification time for each user has little effect on the time until the voice output for each user is completed, and the voice output may not finish before the user enters the room.

Accordingly, an object of the present invention is to provide a face image authentication device that can optimize the timing of the voice output for each user when it verifies multiple users in sequence and outputs a voice message to each of them.

To solve this problem, the present invention provides a face image authentication device comprising: an imaging unit that sequentially acquires input images in which a verification subject is captured; a storage unit that stores registered face images of registrants in advance; a voice output unit that outputs voice to the verification subject; face detection means for extracting an image of the verification subject's face region as an input face image each time an input image is acquired; face verification means for verifying the input face image against the registered face images and determining whether they show the same person; voice output determination means for determining whether the voice output unit is currently outputting voice; output voice information creation means for creating output voice information based on the determination result of the voice output determination means; voice synthesis means for synthesizing an output voice signal from the output voice information; and control means for causing the voice output unit to output the output voice signal as voice. In this face image authentication device, the output voice information creation means creates, as the output voice information, standard voice information when the voice output unit is not outputting voice at the time of the determination by the face verification means, and creates shortened voice information, which is shorter than the standard voice information, when the voice output unit is outputting voice at that time.

In the face image authentication device according to the present invention, it is also preferable that the storage unit further stores an authentication fixed phrase and, in association with each registrant's registered face image, that registrant's personal name; and that the output voice information creation means creates the standard voice information from the authentication fixed phrase and the personal name of the registrant whom the face verification means has determined to be the same person as the verification subject shown in the input face image, and creates the shortened voice information from that registrant's personal name alone.
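
The choice between the two forms can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name, the boolean flag, and the English rendering of the fixed phrase are all assumptions made for the example.

```python
# Hypothetical sketch: the names and the phrase text are illustrative assumptions.
AUTH_FIXED_PHRASE = "Thank you for your hard work"  # assumed rendering of the fixed phrase


def make_output_voice_info(personal_name: str, is_outputting: bool) -> str:
    """Return the text to synthesize for a successfully verified registrant."""
    if is_outputting:
        # Voice output is already in progress: shortened form, personal name only.
        return personal_name
    # No voice output in progress: standard form, fixed phrase plus personal name.
    return f"{AUTH_FIXED_PHRASE}, {personal_name}"
```

With these assumptions, a registrant verified while the speaker is idle receives the full greeting, while one verified during an ongoing announcement hears only his or her name.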

In the face image authentication device according to the present invention, it is also preferable that the control means stores the output voice information in the storage unit each time the output voice information creation means creates it and, when the voice output unit completes the voice output of an output voice signal, deletes the output voice information corresponding to that signal from the storage unit; and that the voice output determination means determines that the voice output unit is outputting voice when output voice information is stored in the storage unit, and determines that the voice output unit is not outputting voice when none is stored.
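
One way to picture this bookkeeping is a simple first-in, first-out store whose emptiness answers the "is voice being output?" question. The class and method names below are hypothetical; the source only states when entries are stored, when they are deleted, and how the determination is made.

```python
from collections import deque


class OutputTable:
    """Pending output voice information, oldest first (hypothetical structure)."""

    def __init__(self):
        self._entries = deque()

    def store(self, info):
        # Called each time output voice information is created.
        self._entries.append(info)

    def delete_completed(self):
        # Called when the voice output for the oldest entry has completed.
        return self._entries.popleft()

    def is_outputting(self):
        # Voice output is considered in progress while any entry remains stored.
        return len(self._entries) > 0
```

Because an entry stays in the store until its playback finishes, a queued but not yet started announcement also counts as "outputting", which is what causes later announcements to be shortened.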

In the face image authentication device according to the present invention, it is also preferable that the storage unit stores an attribute for each registrant together with a priority, predetermined according to that attribute, that defines the order in which the control means causes the voice output unit to output the output voice signals; and that, among the output voice information stored in the storage unit, the control means causes the voice output unit to output the output voice signals for the output voice information whose voice output has not yet started in descending order of priority.
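
A minimal sketch of this priority rule follows. The `Entry` fields, the convention that a larger number means a higher priority, and the tie-breaking by storage order are assumptions for illustration; the source specifies only that not-yet-started entries are output in descending order of priority.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    priority: int          # larger value = higher priority (assumed convention)
    started: bool = False  # True once voice output for this entry has begun


def next_to_output(entries):
    """Return the not-yet-started entry with the highest priority, or None."""
    pending = [e for e in entries if not e.started]
    if not pending:
        return None
    # max() returns the first maximal element, so ties go to the earliest stored entry.
    return max(pending, key=lambda e: e.priority)
```

An entry whose output has already started is never preempted in this sketch; only the order of the waiting entries changes.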

Because the face image authentication device according to the present invention can optimize the timing of the voice output for each user when it verifies multiple users in sequence and outputs a voice message to each of them, each user can recognize that he or she has been authenticated by hearing the voice addressed to him or her, which improves usability.

A schematic configuration diagram of a face image authentication device to which the present invention is applied.
A schematic diagram showing an installation example of the imaging unit when the face image authentication device is installed at the entrance of an office building.
A schematic diagram of the registration table.
(a) is a schematic diagram showing the relationship between input images and output voice when multiple users pass at intervals through a passage in which a face image authentication device is installed; (b) is a schematic diagram showing the same relationship when multiple users pass through the passage in succession.
A schematic diagram of the output table.
A schematic diagram of the fixed phrase table.
A flowchart showing the operation of the notification information registration process performed by the face image authentication device.
A flowchart showing the operation of the voice output process performed by the face image authentication device.
A schematic diagram showing the relationship between input images and output voice when multiple users pass in succession through a passage in which the face image authentication device of this embodiment is installed.
A schematic diagram of the output table in a modification of the present invention.
(a) and (b) are schematic diagrams of the output table for explaining how the order of voice output is changed according to the priority of the registrants.

A face image authentication device according to one embodiment of the present invention is described below with reference to the drawings.
The face image authentication device to which the present invention is applied verifies multiple verification subjects in sequence and outputs, to each subject, a voice message corresponding to the verification result. If the device is already outputting voice when the verification of a subject is completed, it makes the newly output voice shorter than the normal output voice. In this way, the device aims to optimize the timing of the voice output for multiple verification subjects.

FIG. 1 shows the schematic configuration of a face image authentication device 1 to which the present invention is applied. As shown in FIG. 1, the face image authentication device 1 has an imaging unit 2, a voice output unit 3, an interface unit 4, a storage unit 5, and a processing unit 6. Each part of the face image authentication device 1 is described in detail below.

The imaging unit 2 is installed so that it can photograph the face of a verification subject, in a manner suited to the environment in which the face image authentication device 1 is operated. The imaging unit 2 generates an image showing the verification subject as an input image. For this purpose, the imaging unit 2 has, for example, a camera with an optical system that forms an image of the verification subject's face on a two-dimensional array of solid-state image sensors such as CCD or CMOS sensors.
The imaging unit 2 outputs the generated input image to the interface unit 4.

The imaging unit 2 may generate a multi-level color image as the input image, or it may be a camera that is sensitive in the near-infrared range and generates a grayscale image. The image sensor array of the imaging unit 2 preferably has enough pixels to distinguish facial features such as the eyes, nose, and mouth of the verification subject shown in the input image.

FIG. 2 schematically shows an installation example of the imaging unit 2 when the face image authentication device 1 is installed at the entrance of an office building. As shown in FIG. 2, when the face image authentication device 1 is installed at an entrance 200 of an office building, for example, the imaging unit 2 is mounted on the upper part of the wall in which the entrance 200 is installed, or on the ceiling, with its shooting direction pointed slightly downward and toward the passage, so that the passage leading to the entrance 200 is included in the monitoring area. This allows the imaging unit 2 to capture verification subjects 202 to 204 heading toward the entrance 200 (in travel direction 201) at a predetermined time interval (for example, 200 msec).

The voice output unit 3 is, for example, a speaker. It is connected to the interface unit 4, emits voice corresponding to the voice signal received from the interface unit 4, and thereby notifies the verification subject. As shown in FIG. 2, the voice output unit 3 is mounted near the imaging unit 2, on the upper part of the wall in which the entrance 200 is installed or on the ceiling, with its output direction pointed slightly downward and toward the passage. The volume of the voice output unit 3 is adjusted so that the verification subject can hear the output voice.

The interface unit 4 is an interface circuit connected to the imaging unit 2 and the voice output unit 3. It has, for example, a video interface and an audio interface, or an interface circuit conforming to a serial bus such as Universal Serial Bus. The interface unit 4 passes the input image received from the imaging unit 2 to the processing unit 6, and passes the voice signal received from the processing unit 6 to the voice output unit 3.

The storage unit 5 has at least one of a semiconductor memory, a magnetic recording medium and its access device, and an optical recording medium and its access device. The storage unit 5 stores the computer program for controlling the face image authentication device 1, together with various parameters and data. The storage unit 5 also stores a registration table that holds the registration information for each registrant, a fixed phrase table that holds the fixed phrases of the voice output by the face image authentication device 1, and an output table that manages information about the voice that the face image authentication device 1 is to output. In addition, the storage unit 5 stores, in the registration table, face data, which is data relating to each registrant's face image. The registration table, fixed phrase table, and output table are described in detail later.

The processing unit 6 has one or more processors and their peripheral circuits. The processing unit 6 executes the face verification process for each verification subject and controls the output of voice according to the verification result. For this purpose, the processing unit 6 has, as functional modules implemented by software running on the processor, face detection means 11, face tracking means 12, face verification means 13, voice output determination means 14, output voice information creation means 15, control means 16, and voice synthesis means 17.
Each of these parts of the processing unit 6 may instead be implemented as an independent integrated circuit, firmware, a microprocessor, or the like.
Each part of the processing unit 6 is described in detail below.

Each time an input image is acquired, the face detection means 11 detects an input face region, that is, a region of the input image in which the face of a verification subject appears, and cuts that region out of the input image to create an input face image. To do so, the face detection means 11 detects edge pixels in the input image using, for example, an edge detection filter such as a Sobel filter. The face detection means 11 then applies a generalized Hough transform to the detected edge pixels to detect an elliptical contour resembling the outline of a human face, and takes the region enclosed by that contour as the input face region.
Alternatively, the face detection means 11 may detect the input face region using an AdaBoost classifier. For this method, see the paper by P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features" (Proc. IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001). In this case, the AdaBoost classifier is trained in advance, using multiple sample images that show a human face and multiple sample images that do not, to output a "face present" result for images that show a face and a "face absent" result for images that do not. The face detection means 11 then cuts out a region of a predetermined size from the input image while shifting its position, feeds each region to the AdaBoost classifier, and detects the input face region from the classifier's result on whether a face appears in that region.
When multiple verification subjects appear in the input image, the face detection means 11 detects an input face region and creates an input face image for each subject.
When the face detection means 11 succeeds in detecting an input face region in the input image, it notifies the face tracking means 12 of information representing that region.
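
The window scan used with the classifier can be sketched as below. This is an illustrative outline only: `classify` is a hypothetical callable standing in for the trained classifier, and the window size and step are arbitrary placeholder values, not parameters given in the source.

```python
def sliding_windows(width, height, win, step):
    """Yield (x, y, win, win) for every window position that fits inside the image."""
    for y in range(0, height - win + 1, step):
        for x in range(0, width - win + 1, step):
            yield (x, y, win, win)


def detect_faces(image, width, height, classify, win=24, step=4):
    """Return the windows that the supplied classifier labels as containing a face.

    `classify(image, window)` is a hypothetical stand-in for the trained classifier
    and must return True when a face appears in the window.
    """
    return [w for w in sliding_windows(width, height, win, step) if classify(image, w)]
```

A practical detector would also scan at several window sizes and merge overlapping detections; both are omitted here to keep the scan itself visible.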

The face tracking means 12 applies a known tracking technique to the input face regions detected by the face detection means 11 across the multiple input images acquired successively at the predetermined time interval, and associates the input face regions that show the face of the same person.
For example, the face tracking means 12 obtains the distance between the centroid of an input face region detected in the latest input image (hereinafter, an input face region of the current frame) and the centroid of an input face region detected in the input image one frame earlier (hereinafter, an input face region of the previous frame), and associates the two regions as belonging to the same person when that distance is at or below a predetermined threshold. Note that the displacement of the input face region in the input image differs between a subject who moves a given distance while far from the imaging unit 2 and a subject who moves the same distance while close to it. For this reason, setting the predetermined threshold to, for example, the size of the input face region makes it possible to evaluate appropriately whether the persons shown in the input face regions of the current and previous frames are the same, regardless of the subject's position in the monitoring area. When multiple input face regions have been extracted, the face tracking means 12 checks whether the input face regions whose centroids are closest to each other can be associated.
Alternatively, the face tracking means 12 may track the input face regions using a method such as optical flow or a particle filter.

An input face region of the current frame that could not be associated with any input face region of the previous frame is treated as belonging to a new verification subject and becomes a target of subsequent tracking. An input face region of the previous frame that could not be associated with any input face region of the current frame is removed from subsequent tracking.
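
The centroid-based association can be sketched as follows. Several details are assumptions made for the example: boxes are `(x, y, w, h)` tuples, the threshold is taken to be the width of the current face region, and matching is done greedily in list order.

```python
import math


def centroid(box):
    """Center of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def associate(prev_boxes, curr_boxes):
    """Map each current box index to the matched previous box index, or None if new."""
    matches = {}
    used = set()
    for ci, cbox in enumerate(curr_boxes):
        cx, cy = centroid(cbox)
        best, best_dist = None, None
        for pi, pbox in enumerate(prev_boxes):
            if pi in used:
                continue
            px, py = centroid(pbox)
            d = math.hypot(cx - px, cy - py)
            if best_dist is None or d < best_dist:
                best, best_dist = pi, d
        # Threshold scales with face size (here the box width, an assumption),
        # so nearby subjects with larger faces are allowed larger displacements.
        if best is not None and best_dist <= cbox[2]:
            matches[ci] = best
            used.add(best)
        else:
            matches[ci] = None  # treated as a new verification subject
    return matches
```

Previous-frame boxes whose index never appears among the returned values correspond to subjects that drop out of tracking.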

For each input face region being tracked by the face tracking means 12, if the input face image cut out from that region has not yet been verified, the face verification means 13 verifies it against each registered face image in the registration table stored in the storage unit 5 and determines whether they show the same person.

FIG. 3 is a schematic diagram of the registration table, stored in the storage unit 5, that holds the registration information for each registrant. In the registration table 300 shown in FIG. 3, each row corresponds to one registrant.
Each cell in the leftmost column holds the registrant's identification information 301. The identification information 301 is, for example, the registrant's user name, user identification number, or password. Alternatively, the identification information 301 may be anything that uniquely identifies the registrant, such as an employee number or a sequential positive integer. Each cell in the second column from the left holds face data 302. The face data 302 is data relating to the registrant's face image; in this embodiment, the registered face image is stored as the face data 302.

The face verification means 13 can use any of various known verification methods. For example, the face verification means 13 performs pattern matching between the input face image extracted by the face detection means 11 and a registered face image. While shifting the relative position of the input face image and the registered face image, the face verification means 13 calculates the sum of squared differences between the luminance value of each pixel of the input face image and that of the corresponding pixel of the registered face image, divides the smallest of the calculated sums by the number of pixels in the input face image to normalize it, and takes the reciprocal of the normalized value as the similarity. If the highest of the similarities obtained for the registered face images exceeds a predetermined verification threshold, the face verification means 13 determines that the verification subject shown in the input face image is the registrant whose registered face image yielded the highest similarity (verification success). If none of the similarities exceeds the verification threshold, the face verification means 13 determines that the verification subject shown in the input face image is not a registrant (verification failure). The verification threshold is set as appropriate for the environment in which the face image authentication device 1 is installed, its purpose, and so on.
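
The similarity computation and threshold decision described above can be sketched in plain Python. The sketch makes simplifying assumptions that are not in the source: images are lists of rows of luminance values, the registered image is at least as large as the input image, and "shifting" means sliding the input image over every offset at which it fits entirely inside the registered image. A perfect match would give a normalized sum of zero, so the sketch returns infinity in that case.

```python
def ssd_similarity(input_img, registered_img):
    """Reciprocal of the smallest per-pixel sum of squared luminance differences."""
    ih, iw = len(input_img), len(input_img[0])
    rh, rw = len(registered_img), len(registered_img[0])
    best = None
    for dy in range(rh - ih + 1):
        for dx in range(rw - iw + 1):
            ssd = sum(
                (input_img[y][x] - registered_img[y + dy][x + dx]) ** 2
                for y in range(ih)
                for x in range(iw)
            )
            if best is None or ssd < best:
                best = ssd
    normalized = best / (ih * iw)  # divide by the number of input-image pixels
    return float("inf") if normalized == 0 else 1.0 / normalized


def verify(input_img, registered, threshold):
    """Return the ID of the best-matching registrant, or None on verification failure."""
    best_id, best_sim = None, None
    for reg_id, reg_img in registered.items():
        sim = ssd_similarity(input_img, reg_img)
        if best_sim is None or sim > best_sim:
            best_id, best_sim = reg_id, sim
    if best_sim is not None and best_sim > threshold:
        return best_id
    return None
```

Here `registered` is a hypothetical dictionary from registrant ID to registered face image, standing in for the registration table.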

あるいは、顔照合手段１３は、顔の特徴的な部分である顔特徴点により類似度を求めてもよい。その場合、顔照合手段１３は、入力顔画像及び登録顔画像から両目尻、両目領域中心、鼻尖点、口点、口角点などの顔特徴点を複数抽出する。例えば、顔照合手段１３は、入力顔画像及び登録顔画像に対してエッジ抽出処理を行って周辺画素との輝度差が大きいエッジ画素を抽出する。そして顔照合手段１３は、エッジ画素の位置、パターンなどに基づいて求めた特徴量が、両目尻、両目領域中心、鼻尖点、口点、口角点などの部位について予め定められた条件を満たすか否かを調べて各部位の位置を特定することにより、顔特徴点として抽出する。そして顔照合手段１３は、抽出した顔特徴点毎に入力顔画像及び登録顔画像上の各顔特徴点の位置情報（例えば、入力顔画像及び登録顔画像の左上端部を原点とする２次元座標値）を算出する。そして顔照合手段１３は、入力顔画像及び登録顔画像の対応する特徴点間の距離の総和を位置ずれ量として求め、その位置ずれ量の逆数を類似度とする。
あるいは、顔照合手段１３は、抽出した顔特徴点毎にその顔特徴点の近傍の局所領域について輝度又は色差の平均値を算出する。その場合、顔照合手段１３は、入力顔画像及び登録顔画像の対応する局所領域毎に、算出した平均値の差の絶対値を求め、その総和の逆数を類似度としてもよい。この場合、登録テーブル３００の顔データ３０２は、登録顔画像に代えて予め登録顔画像について算出された、顔特徴点毎の特徴量（顔特徴点の位置情報、顔特徴点の近傍の局所領域の輝度又は閾値の平均値等）としてもよい。 Alternatively, the face collating unit 13 may obtain the similarity based on face feature points that are characteristic parts of the face. In that case, the face collating means 13 extracts a plurality of facial feature points such as both eye corners, both eye region centers, nose tip, mouth point, mouth corner point from the input face image and the registered face image. For example, the face matching unit 13 performs edge extraction processing on the input face image and the registered face image, and extracts edge pixels having a large luminance difference from surrounding pixels. Then, the face collating unit 13 determines whether the feature amount obtained based on the position and pattern of the edge pixel satisfies a predetermined condition for parts such as both eye corners, both eye region centers, nose apex, mouth point, and mouth corner point. It is extracted as a face feature point by checking whether or not and specifying the position of each part. Then, the face collating unit 13 outputs, for each extracted face feature point, position information of each face feature point on the input face image and the registered face image (for example, two-dimensional with the upper left end of the input face image and the registered face image as the origin. (Coordinate value) is calculated. Then, the face collating unit 13 obtains the sum of the distances between corresponding feature points of the input face image and the registered face image as a positional deviation amount, and uses the reciprocal of the positional deviation amount as the similarity.
Alternatively, the face collating means 13 may calculate, for each extracted face feature point, the average luminance or color difference of a local region near that feature point. In that case, the face collating means 13 obtains, for each pair of corresponding local regions in the input face image and the registered face image, the absolute value of the difference between the calculated averages, and may use the reciprocal of the sum of these absolute differences as the similarity. In this case, the face data 302 of the registration table 300 may hold, instead of the registered face image, feature amounts calculated in advance from the registered face image for each face feature point (the position information of the face feature point, the average luminance or color difference of the local region near it, and so on).
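The two reciprocal-based similarity measures described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function names, the representation of feature points as plain coordinate tuples, and the small `eps` term that avoids division by zero when the two faces coincide exactly do not appear in the source.

```python
import math


def landmark_similarity(input_points, registered_points, eps=1e-9):
    """Reciprocal of the summed distances between corresponding feature points.

    Both arguments are equal-length lists of (x, y) coordinates, ordered so
    that index i refers to the same facial part in both images.
    eps is an assumed guard against division by zero (not in the source).
    """
    if len(input_points) != len(registered_points):
        raise ValueError("feature point lists must have the same length")
    deviation = sum(math.dist(p, q) for p, q in zip(input_points, registered_points))
    return 1.0 / (deviation + eps)


def local_mean_similarity(input_means, registered_means, eps=1e-9):
    """Reciprocal of the summed absolute differences of local-region averages.

    Each list holds one average (luminance or color difference) per feature
    point, in the same order for both images.
    """
    if len(input_means) != len(registered_means):
        raise ValueError("average lists must have the same length")
    total = sum(abs(a - b) for a, b in zip(input_means, registered_means))
    return 1.0 / (total + eps)
```

With either function, a larger value means the input face lies closer to the registered face, so the result can be compared against a decision threshold chosen for the installation.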

顔照合手段１３は、照合対象者が登録者であると判定すると、照合成功を示す結果通知を音声出力判定手段１４に送り、照合対象者が登録者でないと判定すると、照合失敗を示す結果通知を音声出力判定手段１４に送る。 When the face collating means 13 determines that the person to be collated is a registrant, it sends a result notification indicating collation success to the voice output determination means 14; when it determines that the person to be collated is not a registrant, it sends a result notification indicating collation failure to the voice output determination means 14.

音声出力判定手段１４は、顔照合手段１３から照合成功又は照合失敗を示す結果通知を受け取ると、音声出力部３に出力させる音声の語句をテキストデータで示した出力音声用情報の作成要求を出力音声用情報作成手段１５に送る。
まず、照合対象者が照合成功となったときに顔画像認証装置１が出力する出力音声の語句を定型句「お疲れ様です」及び照合対象者の個人名とした場合について、顔画像認証装置が設置された通路に複数の利用者が通行する例を用いて説明する。図４（ａ）に、顔画像認証装置１が設置された通路に複数の利用者が離散的に通行する場合の入力画像と出力音声の関係の例を示す。図４（ａ）において、画像４００は時刻tにおける入力画像であり、画像４０１は時刻t+1における入力画像であり、画像４０２は時刻t+2における入力画像であり、画像４０３は時刻t+3における入力画像であり、画像４０４は時刻t+4における入力画像である。図４（ａ）に示す例では、時刻tにおける入力画像４００に人物４１０（Ａ役員）が写っている。そのため、この顔画像認証装置は時刻tにおいて照合処理を行い、照合結果に応じた出力音声「お疲れ様です、Ａ役員」を時刻tから出力する。そして、時刻t+2においては、入力画像４０２に照合対象者が存在しないため、音声は出力されない。そして時刻t+3における入力画像４０３には新たに人物４１１（Ｂさん）が写っている。そのため、顔画像認証装置は時刻t+3において照合処理を行い、照合結果に応じた出力音声「お疲れ様です、Ｂさん」を時刻t+3から出力する。 On receiving a result notification indicating collation success or collation failure from the face collating means 13, the voice output determination means 14 sends the output voice information creating means 15 a request to create output voice information, which expresses as text data the phrase that the voice output unit 3 is to speak.
First, the case where the output voice that the face image authentication device 1 produces on a successful collation consists of the fixed phrase “Thank you for your hard work” (お疲れ様です) and the personal name of the person to be collated is explained, using examples in which several users walk along the passage where the face image authentication device is installed. FIG. 4A shows an example of the relationship between the input images and the output voice when users pass through the passage at intervals. In FIG. 4A, image 400 is the input image at time t, image 401 at time t+1, image 402 at time t+2, image 403 at time t+3, and image 404 at time t+4. In the example of FIG. 4A, a person 410 (Officer A) appears in the input image 400 at time t. The face image authentication device therefore performs the collation process at time t and, according to the collation result, outputs the voice “Thank you for your hard work, Officer A” starting at time t. At time t+2, no voice is output because the input image 402 contains no person to be collated. The input image 403 at time t+3 newly shows a person 411 (Mr. B). The face image authentication device therefore performs the collation process at time t+3 and outputs the voice “Thank you for your hard work, Mr. B” starting at time t+3.

一方、図４（ｂ）に、顔画像認証装置１が設置された通路に複数の利用者が連続的に通行する場合の入力画像と出力音声の関係の例を示す。図４（ｂ）において、画像４２０は時刻tにおける入力画像であり、画像４２１は時刻t+1における入力画像であり、画像４２２は時刻t+2における入力画像であり、画像４２３は時刻t+3における入力画像であり、画像４２４は時刻t+4における入力画像である。図４（ｂ）に示す例では、時刻tにおける入力画像４２０に人物４１０（Ａ役員）が写っている。そのため、この顔画像認証装置は時刻tにおいて照合処理を行い、照合結果に応じた出力音声「お疲れ様です、Ａ役員」を時刻tから出力する。一方、時刻t+1における入力画像４２１には新たに人物４１１（Ｂさん）が写っている。この場合、顔画像認証装置は時刻t+1において照合処理を行うが、時刻t+1においては音声を出力中であるため、出力音声「お疲れ様です、Ｂさん」を出力することができない。そのため、顔画像認証装置は出力音声「お疲れ様です、Ａ役員」の出力が完了した後、時刻t+2から出力音声「お疲れ様です、Ｂさん」を出力する。しかし、時刻t+3には人物４１１（Ｂさん）は既に入室しており、人物４１１（Ｂさん）の入室までに音声出力を完了させることができない。また、時刻t+2における入力画像４２２には新たに人物４１２（Ｃさん）が写っている。この場合、顔画像認証装置は時刻t+2において照合処理を行うが、時刻t+2においては音声を出力中であるため、出力音声「お疲れ様です、Ｃさん」を出力することができない。そのため、顔画像認証装置は出力音声「お疲れ様です、Ｂさん」の出力が完了した後、時刻t+4から出力音声「お疲れ様です、Ｃさん」を出力する。しかし、時刻t+4には人物４１２（Ｃさん）は既に入室しており、人物４１２（Ｃさん）の入室までに音声を出力することができない。 In contrast, FIG. 4B shows an example of the relationship between the input images and the output voice when users pass one right after another through the passage where the face image authentication device 1 is installed. In FIG. 4B, image 420 is the input image at time t, image 421 at time t+1, image 422 at time t+2, image 423 at time t+3, and image 424 at time t+4. In the example of FIG. 4B, the person 410 (Officer A) appears in the input image 420 at time t. The face image authentication device therefore performs the collation process at time t and outputs the voice “Thank you for your hard work, Officer A” starting at time t. The input image 421 at time t+1, however, newly shows the person 411 (Mr. B). The face image authentication device performs the collation process at time t+1, but because a voice is still being output at time t+1, it cannot yet output “Thank you for your hard work, Mr. B”. It therefore outputs “Thank you for your hard work, Mr. B” from time t+2, after the output of “Thank you for your hard work, Officer A” has completed. By time t+3, however, the person 411 (Mr. B) has already entered the room, so the voice output cannot be completed before that person enters. Likewise, the input image 422 at time t+2 newly shows a person 412 (Mr. C). The face image authentication device performs the collation process at time t+2, but because a voice is being output at time t+2, it cannot output “Thank you for your hard work, Mr. C”. It therefore outputs “Thank you for your hard work, Mr. C” from time t+4, after the output of “Thank you for your hard work, Mr. B” has completed. By time t+4, however, the person 412 (Mr. C) has already entered the room, so no voice can be output before that person enters.
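The backlog in FIG. 4B can be reproduced with a small timeline calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that every utterance lasts two frame intervals, a value inferred from the figure description (the greeting for Officer A occupies t and t+1) rather than stated in the source.

```python
def playback_schedule(detections, duration):
    """Return (name, start, end) for utterances that cannot overlap.

    detections: list of (name, detection_time) in order of arrival.
    duration:   assumed length of every utterance, in frame intervals.
    """
    schedule = []
    free_at = 0  # earliest time at which the voice output unit is idle
    for name, detected in detections:
        start = max(detected, free_at)  # wait until the previous voice ends
        end = start + duration
        schedule.append((name, start, end))
        free_at = end
    return schedule


# Times are offsets from t: A is detected at t, B at t+1, C at t+2.
for name, start, end in playback_schedule([("A", 0), ("B", 1), ("C", 2)], 2):
    print(f"{name}: starts at t+{start}, ends at t+{end}")
```

Under this assumption the greeting for B starts at t+2 and the greeting for C at t+4, matching the delays described for FIG. 4B: each later arrival waits longer, because the queue grows faster than it drains.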

このように、音声出力処理は、一般に照合処理よりも長時間を要し、多数の利用者が連続して照合された場合には、全員に対する音声の出力が間に合わない場合がある。従って、複数の利用者が連続して通行している場合には、二人目以降の照合対象者に対する出力音声は、一人目の照合対象者に対する出力音声より短くすることが好ましい。
そこで本実施形態の音声出力判定手段１４は、音声出力部３が音声を出力中であるか否かを判定する。そして音声出力判定手段１４は、音声出力部３が音声を出力中でない場合には通常通りの出力音声用情報（以下、標準音声用情報と称する）を出力音声用情報作成手段１５に作成させるために標準音声用情報の作成要求を出力音声用情報作成手段１５に送る。一方、音声出力判定手段１４は、音声出力部３が音声を出力中である場合には標準音声用情報より短い出力音声用情報（以下、短縮音声用情報と称する）を出力音声用情報作成手段１５に作成させるために短縮音声用情報の作成要求を出力音声用情報作成手段１５に送る。
例えば、図４（ｂ）に示した例では、音声出力部３は、出力音声「お疲れ様です、Ａ役員」、「お疲れ様です、Ｂさん」、「お疲れ様です、Ｃさん」を順次出力している。この出力音声のうち「お疲れ様です」の部分は、定型句であり、図４（ｂ）のように、照合対象者が連続する場合は、先頭の一人分について出力すれば十分な面がある一方で、出力するだけの時間がかかるため、全員分を出力しきれない原因となっている。そこで、本実施形態の出力音声用情報作成手段１５は、標準音声用情報から定型句を省略したテキストを短縮音声用情報として作成する。
また、顔照合手段１３から受け取った結果通知が照合失敗を示す場合は、照合対象者を特定できないため、出力音声用情報に個人名を含ませることができない。そのため、この場合、音声出力判定手段１４は、未登録者向けの定型句のみからなる出力音声用情報（以下、未登録者向け音声用情報と称する）を出力音声用情報作成手段１５に作成させるために未登録者向け音声用情報の作成要求を出力音声用情報作成手段１５に送る。 As described above, the voice output process generally requires a longer time than the collation process, and when a large number of users are collated in succession, the voice output to all of the users may not be in time. Therefore, when a plurality of users pass continuously, it is preferable that the output voice for the second and subsequent verification target persons is shorter than the output voice for the first verification target person.
Therefore, the voice output determination means 14 of this embodiment determines whether the voice output unit 3 is currently outputting a voice. When the voice output unit 3 is not outputting a voice, the voice output determination means 14 sends the output voice information creating means 15 a request to create ordinary output voice information (hereinafter referred to as standard voice information). When the voice output unit 3 is outputting a voice, on the other hand, the voice output determination means 14 sends the output voice information creating means 15 a request to create output voice information shorter than the standard voice information (hereinafter referred to as shortened voice information).
For example, in the case shown in FIG. 4B, the voice output unit 3 outputs “Thank you for your hard work, Officer A”, “Thank you for your hard work, Mr. B”, and “Thank you for your hard work, Mr. C” in sequence. The “Thank you for your hard work” part of each output voice is a fixed phrase. When persons to be collated arrive in succession as in FIG. 4B, speaking it once for the first person is largely sufficient, yet speaking it every time consumes output time and is one reason the output cannot be finished for everyone. The output voice information creating means 15 of this embodiment therefore creates, as the shortened voice information, the text of the standard voice information with the fixed phrase omitted.
When the result notification received from the face collating means 13 indicates collation failure, the person to be collated cannot be identified, so a personal name cannot be included in the output voice information. In this case, the voice output determination means 14 therefore sends the output voice information creating means 15 a request to create output voice information consisting only of a fixed phrase for unregistered persons (hereinafter referred to as voice information for unregistered persons).

また、音声出力判定手段１４は、音声出力部３が音声を出力中であるか否かを記憶部５に記憶された出力テーブルに出力音声用情報が格納されているか否かにより判定する。つまり、音声出力判定手段１４は、出力テーブルに出力音声用情報が格納されている場合は、音声出力部３が音声を出力中であると判定し、出力音声用情報が格納されていない場合は、音声出力部３が音声出力中でないと判定する。
図５は、記憶部５に記憶される出力テーブルの模式図である。図５に示された出力テーブル５００において、一つの行が顔追跡手段１２により追跡処理がされている一人の照合対象者に対応する。この出力テーブルの各行の情報は、照合対象者の照合処理が完了し、出力音声用情報作成手段１５によってその照合結果に応じた出力音声用情報が作成されたときに制御手段１６によって追加される。また、この出力テーブルの各行の情報は、出力音声用情報から合成された音声信号がインターフェース部４に出力され終わったときに制御手段１６によって削除される。従って、出力テーブル５００に出力音声用情報が格納されていない場合には、音声出力部３が音声を出力中でないと判断することができる。
なお、上述した通り、出力音声用情報作成手段１５は、標準音声用情報から定型句を省略したテキストを短縮音声用情報として作成する。そのため、短縮音声用情報に示される語句を音声出力するためには、その直前に標準音声用情報に示される語句が音声出力されていること、又は標準音声用情報に示される語句に続けて短縮音声用情報に示される語句が音声出力されていることが好ましい。そこで、本実施形態の音声出力判定手段１４は、出力テーブル５００に未登録者向け音声用情報しか格納されていない場合、つまり標準音声用情報又は短縮音声用情報が格納されていない場合は、標準音声用情報の作成要求を出力音声用情報作成手段１５に送る。 The voice output determination means 14 determines whether the voice output unit 3 is outputting a voice according to whether output voice information is stored in the output table held in the storage unit 5. That is, when output voice information is stored in the output table, the voice output determination means 14 determines that the voice output unit 3 is outputting a voice; when no output voice information is stored, it determines that the voice output unit 3 is not outputting a voice.
FIG. 5 is a schematic diagram of the output table stored in the storage unit 5. In the output table 500 shown in FIG. 5, each row corresponds to one person to be collated who is being tracked by the face tracking means 12. The control means 16 adds a row to the output table when the collation process for a person to be collated has completed and the output voice information creating means 15 has created output voice information according to the collation result. The control means 16 deletes the row when the voice signal synthesized from that output voice information has been completely output to the interface unit 4. Accordingly, when no output voice information is stored in the output table 500, it can be determined that the voice output unit 3 is not outputting a voice.
As described above, the output voice information creating means 15 creates, as the shortened voice information, the text of the standard voice information with the fixed phrase omitted. For the phrase in shortened voice information to be spoken, it is therefore preferable that the phrase in standard voice information has been spoken immediately before it, or that a phrase in shortened voice information has been spoken following the phrase in standard voice information. For this reason, when the output table 500 stores only voice information for unregistered persons, that is, when it stores neither standard voice information nor shortened voice information, the voice output determination means 14 of this embodiment sends a request to create standard voice information to the output voice information creating means 15.
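The decision made by the voice output determination means 14 can be summarized as a small function. This is a sketch under stated assumptions, not the embodiment itself: the `InfoType` enum, the function name `choose_request`, and the representation of the output table as a list of the information types of its pending rows are all introduced here for illustration.

```python
from enum import Enum


class InfoType(Enum):
    STANDARD = "standard voice information"
    SHORTENED = "shortened voice information"
    UNREGISTERED = "voice information for unregistered persons"


def choose_request(collation_succeeded, pending_types):
    """Pick which kind of output voice information to request.

    pending_types: information types of the rows currently held in the
    output table (an empty list means no voice is being output).
    """
    if not collation_succeeded:
        # The person cannot be identified, so no personal name is available.
        return InfoType.UNREGISTERED
    greeting_pending = any(
        t in (InfoType.STANDARD, InfoType.SHORTENED) for t in pending_types
    )
    # A shortened phrase only makes sense right after a standard greeting,
    # so rows for unregistered persons alone do not trigger shortening.
    return InfoType.SHORTENED if greeting_pending else InfoType.STANDARD
```

For example, `choose_request(True, [InfoType.UNREGISTERED])` yields `InfoType.STANDARD`, which corresponds to the case where the output table holds only voice information for unregistered persons.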

出力音声用情報作成手段１５は、音声出力判定手段１４から受け取った作成要求に従って出力音声用情報を作成する。
以下に、出力音声用情報作成手段１５が作成する出力音声用情報について説明するために、登録テーブル及び定型句テーブルに格納される情報について説明する。
図３に示した登録テーブル３００において、左から３番目の列の各欄には、登録者の個人名３０３が示される。個人名３０３は、登録者の本名、名字、通称名などであり、テキストデータとして格納される。また個人名３０３として名字を用いる場合、同じ名字の登録者が複数いる場合に限り氏名を記録しておくようにしてもよい。
そして左から４番目の列の各欄には、登録者の属性３０４が示される。属性３０４は、社員、業者、顧客等を識別するためのものであり、属性３０４が社員である場合には、さらに役員、部長、担当等の役職により識別される。
そして左から５番目の列の各欄には、登録者の個人名３０３を音声で出力するときに要する出力時間を示す個人名時間長３０５が示される。例えば、個人名時間長３０５は、その個人名３０３を音声にした音声データを作成し、音声再生に要する時間を計測することにより求めることができる。あるいは、個人名時間長３０５は、その個人名３０３を平仮名表記したときの文字数により定めてもよい。また、処理を容易にするために、最も長い個人名３０３を音声で出力するときに要する出力時間を全ての登録者の個人名時間長３０５として共通に適用してもよい。 The output voice information creating unit 15 creates output voice information in accordance with the creation request received from the voice output determining unit 14.
To explain the output voice information created by the output voice information creating means 15, the information stored in the registration table and the fixed phrase table is described first.
In the registration table 300 shown in FIG. 3, each cell in the third column from the left shows the personal name 303 of a registrant. The personal name 303 is the registrant's real name, surname, common name, or the like, and is stored as text data. When surnames are used as the personal name 303, the full name may be recorded only where several registrants share the same surname.
Each cell in the fourth column from the left shows the attribute 304 of the registrant. The attribute 304 distinguishes employees, contractors, customers, and so on; when the registrant is an employee, the attribute 304 further distinguishes the position, such as officer, department manager, or staff.
Each cell in the fifth column from the left shows the personal name time length 305, which is the output time required to speak the registrant's personal name 303. For example, the personal name time length 305 can be obtained by creating voice data of the personal name 303 and measuring the time required to play it back. Alternatively, the personal name time length 305 may be determined from the number of characters when the personal name 303 is written in hiragana. To simplify processing, the output time required to speak the longest personal name 303 may also be applied in common as the personal name time length 305 of every registrant.
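Two of the options above, estimating the time length from the hiragana character count and applying the longest value to every registrant, can be sketched as follows. The function names and the `seconds_per_char` speaking rate are assumptions made for illustration; the source does not specify how a character count is converted into a time.

```python
def estimate_time_length(hiragana_reading, seconds_per_char=0.15):
    """Estimate the speaking time from the number of hiragana characters.

    seconds_per_char is an assumed constant; a real system would calibrate
    it against its own speech synthesizer.
    """
    return len(hiragana_reading) * seconds_per_char


def common_time_length(hiragana_readings, seconds_per_char=0.15):
    """Use the longest estimated time for every registrant."""
    return max(estimate_time_length(r, seconds_per_char) for r in hiragana_readings)
```

Applying one common value overestimates the time for short names, but it removes the need to store and look up a separate time length per registrant.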

登録テーブル３００において、例えば、一番上の行には、その登録者についての識別情報３０１が'1'であること、個人名３０３が「Ａ」であること、属性３０４が「役員」であること、及び個人名時間長３０５が'tm1'であることが示されている。また、上から二番目の行には、その登録者についての識別情報３０１が'2'であること、個人名３０３が「Ｂ」であること、属性３０４が「担当」であること、及び個人名時間長３０５が'tm2'であることが示されている。また、上から三番目の行には、その登録者についての識別情報３０１が'3'であること、個人名３０３が「Ｃ」であること、属性３０４が「部長」であること、及び個人名時間長３０５が'tm3'であることが示されている。また、上から四番目の行には、その登録者についての識別情報３０１が'4'であること、個人名３０３が「Ｄ」であること、属性３０４が「顧客」であること、及び個人名時間長３０５が'tm4'であることが示されている。また、上から五番目の行には、その登録者についての識別情報３０１が'5'であること、個人名３０３が「Ｅ」であること、属性３０４が「業者」であること、及び個人名時間長３０５が'tm5'であることが示されている。 In the registration table 300, for example, the top row shows that the identification information 301 of the registrant is '1', the personal name 303 is “A”, the attribute 304 is “officer”, and the personal name time length 305 is 'tm1'. The second row from the top shows that the identification information 301 is '2', the personal name 303 is “B”, the attribute 304 is “staff”, and the personal name time length 305 is 'tm2'. The third row shows that the identification information 301 is '3', the personal name 303 is “C”, the attribute 304 is “department manager”, and the personal name time length 305 is 'tm3'. The fourth row shows that the identification information 301 is '4', the personal name 303 is “D”, the attribute 304 is “customer”, and the personal name time length 305 is 'tm4'. The fifth row shows that the identification information 301 is '5', the personal name 303 is “E”, the attribute 304 is “contractor”, and the personal name time length 305 is 'tm5'.

図６は、記憶部５に記憶される、音声出力部３が出力する音声の定型句の情報を格納する定型句テーブルの模式図である。図６に示すように、定型句テーブル６００の左端の各欄には、定型句種別６０１が示される。本実施形態では定型句種別６０１として、照合対象者の照合が成功したときの出力音声の語句のうち定型句の部分である認証定型句６１１と、照合対象者の照合が失敗したときの出力音声の語句のうち定型句の部分である非認証定型句６１２と、出力音声の語句のうち照合対象者の個人名の後に続ける部分である敬称句６１３とが登録されている。本実施形態において、認証定型句６１１、非認証定型句６１２及び敬称句６１３は、テキストデータで管理される。
また左から２番目の列の各欄には、定型句として登録された語句６０２が示される。語句６０２は、テキストデータとして格納される。認証定型句６１１の語句として「おはようございます」、「お疲れ様です」、「お疲れ様でした」、「いらっしゃいませ」が登録されている。例えば、照合対象者の属性３０４が「顧客」である場合には「いらっしゃいませ」が用いられ、照合対象者の属性３０４が「役員」、「部長」、「担当」、「業者」のうちの何れかである場合には、午前中は「おはようございます」、午後の業務時間中は「お疲れ様です」、業務時間後は「お疲れ様でした」が用いられるのが好ましい。 FIG. 6 is a schematic diagram of the fixed phrase table, stored in the storage unit 5, that holds information on the fixed phrases of the voice output by the voice output unit 3. As shown in FIG. 6, each cell in the leftmost column of the fixed phrase table 600 shows a fixed phrase type 601. In this embodiment, three fixed phrase types 601 are registered: the authentication fixed phrase 611, which is the fixed-phrase part of the output voice when collation of the person to be collated succeeds; the non-authentication fixed phrase 612, which is the fixed-phrase part of the output voice when collation fails; and the honorific phrase 613, which is the part of the output voice that follows the personal name of the person to be collated. In this embodiment, the authentication fixed phrase 611, the non-authentication fixed phrase 612, and the honorific phrase 613 are managed as text data.
Each cell in the second column from the left shows a phrase 602 registered as a fixed phrase. The phrase 602 is stored as text data. “Good morning” (おはようございます), “Thank you for your hard work” (お疲れ様です), “Thank you for your hard work today” (お疲れ様でした), and “Welcome” (いらっしゃいませ) are registered as phrases of the authentication fixed phrase 611. For example, “Welcome” is used when the attribute 304 of the person to be collated is “customer”. When the attribute 304 is “officer”, “department manager”, “staff”, or “contractor”, it is preferable to use “Good morning” in the morning, “Thank you for your hard work” during afternoon business hours, and “Thank you for your hard work today” after business hours.

また、非認証定型句６１２の語句として「ご来館ありがとうございます」、「ご用の方は内線電話にて呼び出してください」、「カード操作をしてください」が登録されている。例えば、照合対象者が顧客である場合には「ご来館ありがとうございます」、「ご用の方は内線電話にて呼び出してください」等が用いられるのが好ましい。一方、照合対象者が社員のみであり、顧客がほとんど来ることのない場所に顔画像認証装置１が設置され、かつ顔画像認証装置１とは別にＩＤカードによる認証装置が備えられている場合には、「カード操作をしてください」が用いられるのが好ましい。
また、敬称句６１３の語句として「さん」、「役員」、「部長」、「様」が登録されている。例えば、照合対象者が顧客又は業者である場合には「様」が用いられ、照合対象者が社員である場合には役職に応じて「役員」、「部長」、「さん」が用いられるのが好ましい。あるいは、処理を容易にするために、全ての照合対象者に共通に「さん」を用いるようにしてもよい。 “Thank you for visiting” (ご来館ありがとうございます), “If you have business here, please call by extension phone” (ご用の方は内線電話にて呼び出してください), and “Please operate your card” (カード操作をしてください) are registered as phrases of the non-authentication fixed phrase 612. For example, when the person to be collated is a customer, it is preferable to use “Thank you for visiting”, “If you have business here, please call by extension phone”, or the like. On the other hand, when the persons to be collated are employees only, the face image authentication device 1 is installed at a location that customers rarely visit, and an ID card authentication device is provided separately from the face image authentication device 1, it is preferable to use “Please operate your card”.
“san” (さん), “officer” (役員), “department manager” (部長), and “sama” (様) are registered as phrases of the honorific phrase 613. For example, it is preferable to use “sama” when the person to be collated is a customer or a contractor, and to use “officer”, “department manager”, or “san” according to position when the person to be collated is an employee. Alternatively, to simplify processing, “san” may be used in common for every person to be collated.
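The preferred choices of greeting and honorific can be written as two lookup functions. This is a minimal sketch under stated assumptions: the function names, the noon boundary between morning and afternoon, and the 17:30 close of business are assumptions for illustration, since the source gives no clock times. The attribute values and phrases are the Japanese strings from the registration table and the fixed phrase table.

```python
from datetime import time


def greeting_for(attribute, now, close_of_business=time(17, 30)):
    """Select the authentication fixed phrase for a verified person."""
    if attribute == "顧客":  # customer
        return "いらっしゃいませ"
    if now < time(12, 0):  # morning (assumed to end at noon)
        return "おはようございます"
    if now < close_of_business:  # afternoon business hours (assumed boundary)
        return "お疲れ様です"
    return "お疲れ様でした"  # after business hours


def honorific_for(attribute):
    """Select the honorific phrase that follows the personal name."""
    if attribute in ("顧客", "業者"):  # customer or contractor
        return "様"
    if attribute in ("役員", "部長"):  # position doubles as the honorific
        return attribute
    return "さん"  # other employees, e.g. 担当 (staff)
```

For instance, `greeting_for("役員", time(14, 0))` returns 「お疲れ様です」 and `honorific_for("担当")` returns 「さん」.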

そして、左から３番目の列の各欄には、各定型句の語句を音声で出力するときに要する出力時間を示す時間長６０３が示される。例えば、各時間長６０３は、各定型句の語句を音声にした音声データを作成し、音声再生に要する時間を計測することにより求めることができる。あるいは、各時間長６０３は、その語句を平仮名表記したときの文字数により定めてもよい。また、処理を容易にするために、各種別において最も長い語句を音声で出力するときに要する出力時間をその種別の全ての語句の時間長６０３として共通に適用してもよい。以下、認証定型句６１１の時間長６０３を認証定型句時間長と称し、非認証定型句６１２の時間長６０３を非認証定型句時間長と称し、敬称句６１３の時間長６０３を敬称句時間長と称する。
例えば、認証定型句６１１については、「おはようございます」の認証定型句時間長は'tg1'であり、「お疲れ様です」の認証定型句時間長は'tg2'であり、「お疲れ様でした」の認証定型句時間長は'tg3'であり、「いらっしゃいませ」の認証定型句時間長は'tg4'であることが示されている。また、非認証定型句６１２については、「ご来館ありがとうございます」の非認証定型句時間長は'tn1'であり、「ご用の方は内線電話にて呼び出してください」の非認証定型句時間長は'tn2'であり、「カード操作をしてください」の非認証定型句時間長は'tn3'であることが示されている。また、敬称句６１３については、「さん」の敬称句時間長は'tr1'であり、「役員」の敬称句時間長は'tr2'であり、「部長」の敬称句時間長は'tr3'であり、「様」の敬称句時間長は'tr4'であることが示されている。 Each cell in the third column from the left shows a time length 603, which is the output time required to speak the phrase of each fixed phrase. For example, each time length 603 can be obtained by creating voice data of the phrase and measuring the time required to play it back. Alternatively, each time length 603 may be determined from the number of characters when the phrase is written in hiragana. To simplify processing, the output time required to speak the longest phrase of each type may also be applied in common as the time length 603 of every phrase of that type. Hereinafter, the time length 603 of the authentication fixed phrase 611 is referred to as the authentication fixed phrase time length, the time length 603 of the non-authentication fixed phrase 612 as the non-authentication fixed phrase time length, and the time length 603 of the honorific phrase 613 as the honorific phrase time length.
For example, for the authentication fixed phrase 611, the table shows that the authentication fixed phrase time length of “Good morning” (おはようございます) is 'tg1', that of “Thank you for your hard work” (お疲れ様です) is 'tg2', that of “Thank you for your hard work today” (お疲れ様でした) is 'tg3', and that of “Welcome” (いらっしゃいませ) is 'tg4'. For the non-authentication fixed phrase 612, the non-authentication fixed phrase time length of “Thank you for visiting” is 'tn1', that of “If you have business here, please call by extension phone” is 'tn2', and that of “Please operate your card” is 'tn3'. For the honorific phrase 613, the honorific phrase time length of “san” is 'tr1', that of “officer” is 'tr2', that of “department manager” is 'tr3', and that of “sama” is 'tr4'.

なお、図６に示した認証定型句６１１、非認証定型句６１２、敬称句６１３は、それぞれ例示であり、顔画像認証装置１の設置場所、運用方針、管理者の嗜好等に応じて語句を変更したり、増減してもよい。 Note that the authentication fixed phrase 611, the non-authentication fixed phrase 612, and the honorific phrase 613 shown in FIG. 6 are only examples; phrases may be changed, added, or removed according to the installation location of the face image authentication device 1, the operation policy, the administrator's preferences, and so on.

以下、出力音声用情報作成手段１５が作成する出力音声用情報について説明する。出力音声用情報作成手段１５は、出力音声用情報として、音声出力判定手段１４から受け取った作成要求に従って標準音声用情報、短縮音声用情報又は未登録者向け音声用情報を作成する。出力音声用情報作成手段１５は、標準音声用情報を作成する場合、認証定型句６１１と個人名３０３と敬称句６１３とから出力音声用情報を作成する。例えば、午後の業務時間中に識別情報３０１が'1'の登録者であると認証された照合対象者に対する標準音声用情報は「お疲れ様です、Ａ役員」となる。
一方、出力音声用情報作成手段１５は、短縮音声用情報を作成する場合、個人名３０３と敬称句６１３のみから出力音声用情報を作成する。例えば、午後の業務時間中に識別情報３０１が'1'の登録者であると認証された照合対象者に対する短縮音声用情報は「Ａ役員」となる。短縮音声用情報は、認証定型句６１１がない分、標準音声用情報よりも短くなり、短縮音声用情報に示される語句の音声出力は、標準音声用情報に示される語句の音声出力よりも短時間に行われる。
また、出力音声用情報作成手段１５は、未登録者向け音声用情報を作成する場合、非認証定型句６１２のうちの何れかの語句を選択して出力音声用情報を作成する。非認証定型句６１２から選択される語句は、例えば、顔画像認証装置１の設置場所、運用方針等に応じて定められる。
出力音声用情報作成手段１５は、出力音声用情報を作成すると、作成した出力音声用情報を制御手段１６に送る。 The output voice information created by the output voice information creating means 15 is described below. In accordance with the creation request received from the voice output determination means 14, the output voice information creating means 15 creates standard voice information, shortened voice information, or voice information for unregistered persons as the output voice information. When creating standard voice information, the output voice information creating means 15 builds the output voice information from the authentication fixed phrase 611, the personal name 303, and the honorific phrase 613. For example, the standard voice information for a person to be collated who is authenticated during afternoon business hours as the registrant whose identification information 301 is '1' is “Thank you for your hard work, Officer A” (お疲れ様です、Ａ役員).
When creating shortened voice information, on the other hand, the output voice information creating means 15 builds the output voice information from the personal name 303 and the honorific phrase 613 only. For example, the shortened voice information for a person to be collated who is authenticated during afternoon business hours as the registrant whose identification information 301 is '1' is “Officer A” (Ａ役員). Because it lacks the authentication fixed phrase 611, the shortened voice information is shorter than the standard voice information, and speaking its phrase takes less time than speaking the phrase of the standard voice information.
When creating voice information for unregistered persons, the output voice information creating means 15 selects one of the phrases of the non-authentication fixed phrase 612 and creates the output voice information from it. The phrase selected from the non-authentication fixed phrase 612 is determined according to, for example, the installation location and operation policy of the face image authentication device 1.
After creating the output voice information, the output voice information creating means 15 sends it to the control means 16.
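The three kinds of output voice information differ only in which parts are concatenated, as the following sketch shows. The function name, the string labels for the three kinds, and the keyword parameters are hypothetical; the full-width comma 「、」 between the greeting and the name follows the example 「お疲れ様です、Ａ役員」 given in the text.

```python
def build_output_text(kind, greeting="", name="", honorific="", unregistered_phrase=""):
    """Assemble the text of the output voice information.

    kind is "standard", "shortened", or "unregistered" (labels assumed
    here for illustration).
    """
    if kind == "standard":
        # authentication fixed phrase + personal name + honorific phrase
        return f"{greeting}、{name}{honorific}"
    if kind == "shortened":
        # personal name + honorific phrase only
        return f"{name}{honorific}"
    if kind == "unregistered":
        # one non-authentication fixed phrase, no personal name
        return unregistered_phrase
    raise ValueError(f"unknown kind: {kind}")
```

Called with the greeting 「お疲れ様です」, the name 「Ａ」, and the honorific 「役員」, the standard form gives 「お疲れ様です、Ａ役員」 and the shortened form gives 「Ａ役員」.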

制御手段１６は、出力音声用情報作成手段１５が作成した出力音声用情報を記憶部５の出力テーブル５００に格納する。上述した通り、音声出力部３は、ある照合対象者に対する照合が完了したときに、その前に照合した照合対象者に対して音声を出力している場合がある。この場合、音声出力部３は、現在の音声出力が完了しなければ、次の照合対象者に対する音声を出力することができない。そこで、制御手段１６は、出力音声用情報作成手段１５から出力音声用情報を取得すると、取得した出力音声用情報を一旦、記憶部５の出力テーブル５００に格納する。 The control unit 16 stores the output voice information created by the output voice information creation unit 15 in the output table 500 of the storage unit 5. As described above, when the collation for a certain collation target person is completed, the voice output unit 3 may output a voice to the collation target person collated before that. In this case, the voice output unit 3 cannot output the voice for the next person to be collated unless the current voice output is completed. Therefore, when the control unit 16 acquires the output voice information from the output voice information creation unit 15, the control unit 16 temporarily stores the acquired output voice information in the output table 500 of the storage unit 5.

以下、図５に示した出力テーブル５００について説明する。上述した通り、出力テーブル５００において、一つの行が顔追跡手段１２により追跡処理がされている一人の照合対象者に対応する。以下、一つの行に示される情報を通知情報と称する。
そして左端の列の各欄には、情報番号５０１が示される。この情報番号５０１は、通知情報毎、つまり照合対象者毎に付与される識別番号である。例えば、情報番号５０１は、通知情報が追加される毎にインクリメントされる連続した正の整数である。
また左から２番目の列の各欄には、情報種別５０２が示される。この情報種別５０２は、出力音声用情報が、標準音声用情報であるか、短縮音声用情報であるか、又は未登録者向け音声用情報であるかを区別する情報である。
また左から３番目の列の各欄には、出力音声用情報５０３が格納される。
また左から４番目の列の各欄には、出力時間５０４が示される。この出力時間５０４は、出力音声用情報５０３に示される語句を音声出力するのに要する時間である。つまり情報種別５０２が標準音声用情報である場合、出力時間５０４は、認証定型句時間長と個人名時間長３０５と敬称句時間長の和となる。一方、情報種別５０２が短縮音声用情報である場合、出力時間５０４は、個人名時間長３０５と敬称句時間長の和となる。また、情報種別５０２が未登録者向け音声用情報である場合、出力時間５０４は、非認証定型句時間長となる。
また左から５番目の列の各欄には、作成時刻５０５が示される。この作成時刻５０５は、各通知情報が作成され、出力テーブル５００に記録された時刻である。 The output table 500 shown in FIG. 5 is described below. As described above, each row of the output table 500 corresponds to one person to be collated who is being tracked by the face tracking means 12. Hereinafter, the information shown in one row is referred to as notification information.
Each cell in the leftmost column shows an information number 501. The information number 501 is an identification number assigned to each piece of notification information, that is, to each person to be collated. For example, the information number 501 is a consecutive positive integer that is incremented each time notification information is added.
Each cell in the second column from the left shows an information type 502. The information type 502 indicates whether the output voice information is standard voice information, shortened voice information, or voice information for unregistered persons.
Each cell in the third column from the left stores the output voice information 503.
Each cell in the fourth column from the left shows an output time 504. The output time 504 is the time required to speak the phrase given in the output voice information 503. That is, when the information type 502 is standard voice information, the output time 504 is the sum of the authentication fixed phrase time length, the personal name time length 305, and the honorific phrase time length. When the information type 502 is shortened voice information, the output time 504 is the sum of the personal name time length 305 and the honorific phrase time length. When the information type 502 is voice information for unregistered persons, the output time 504 is the non-authentication fixed phrase time length.
Each cell in the fifth column from the left shows a creation time 505. The creation time 505 is the time at which the notification information was created and recorded in the output table 500.
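The rule for the output time 504 reduces to a sum whose terms depend on the information type 502. The sketch below is illustrative: the function name, the string labels for the three types, and the numeric values in the usage example are assumptions, since the source gives the time lengths only symbolically (for example 'tg2', 'tm1', 'tr2').

```python
def output_time(info_type, greeting_len=0.0, name_len=0.0,
                honorific_len=0.0, unregistered_len=0.0):
    """Time needed to speak one row of the output table.

    info_type is "standard", "shortened", or "unregistered" (labels
    assumed here); the other arguments are time lengths in seconds.
    """
    if info_type == "standard":
        return greeting_len + name_len + honorific_len
    if info_type == "shortened":
        return name_len + honorific_len
    if info_type == "unregistered":
        return unregistered_len
    raise ValueError(f"unknown information type: {info_type}")


# Assumed example values standing in for tg2, tm1 and tr2.
standard = output_time("standard", greeting_len=1.2, name_len=0.4, honorific_len=0.6)
shortened = output_time("shortened", name_len=0.4, honorific_len=0.6)
print(standard > shortened)
```

Whatever the actual values, the shortened form is faster than the standard form by exactly the authentication fixed phrase time length, which is the time the device saves for each consecutive arrival.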

例えば、出力テーブル５００に通知情報が存在しておらず、直前に存在していた通知情報の情報番号５０１が'260'である場合において、午後の業務時間中に識別情報３０１が'1'の登録者について照合が成功したものとする。その場合、出力テーブル５００に標準音声用情報又は短縮音声用情報を含む通知情報が存在していないので、音声出力判定手段１４は、出力音声用情報作成手段１５に識別情報３０１が'1'の登録者に対する標準音声用情報の作成要求を送る。音声出力判定手段１４から作成要求を受けると、出力音声用情報作成手段１５は、認証定型句「お疲れ様です」と、識別情報３０１が'1'の登録者の個人名「Ａ」と、識別情報３０１が'1'の登録者の敬称句「役員」とを組み合わせて標準音声用情報「お疲れ様です、Ａ役員」を作成する。出力音声用情報作成手段１５からこの標準音声用情報を受けると、制御手段１６は、出力テーブル５００に、情報番号５０１を'261'とし、情報種別５０２を「標準音声用情報」とし、出力音声用情報５０３を「お疲れ様です、Ａ役員」とする通知情報を追加する。さらに制御手段１６は、その通知情報の出力時間５０４に「お疲れ様です」の認証定型句時間長'tg2'と、個人名「Ａ」の個人名時間長'tm1'と、敬称句「役員」の敬称句時間長'tr2'の総和'tg2+tm1+tr2'を記録し、作成時刻５０５に現在時刻'11:31:48'を記録する。 For example, suppose that no notification information exists in the output table 500, that the information number 501 of the most recent notification information was '260', and that collation succeeds during afternoon business hours for the registrant whose identification information 301 is '1'. In that case, because the output table 500 contains no notification information that includes standard voice information or shortened voice information, the voice output determination means 14 sends the output voice information creating means 15 a request to create standard voice information for the registrant whose identification information 301 is '1'. On receiving the creation request from the voice output determination means 14, the output voice information creating means 15 combines the authentication fixed phrase “Thank you for your hard work” (お疲れ様です), the personal name “A” of the registrant whose identification information 301 is '1', and that registrant's honorific phrase “officer” (役員) to create the standard voice information “Thank you for your hard work, Officer A” (お疲れ様です、Ａ役員).
Upon receiving this standard voice information from the output voice information creation means 15, the control means 16 sets the information number 501 as “261”, the information type 502 as “standard voice information”, and the output voice in the output table 500. Notification information is added with the business information 503 as “Thank you, A officer”. Furthermore, the control means 16 outputs the notification format time length “tg2” of “Thank you” for the notification information output time 504, the personal name time length “tm1” of the personal name “A”, and the honorific phrase “officer”. Record the sum 'tg2 + tm1 + tr2' of honorific phrase time length 'tr2', and record the current time '11: 31: 48 'at creation time 505.

If, from this state, verification succeeds for the registrant whose identification information 301 is '2', the output table 500 contains notification information including standard voice information, so the voice output determination means 14 sends the output voice information creation means 15 a request to create shortened voice information for the registrant whose identification information 301 is '2'. Upon receiving the creation request from the voice output determination means 14, the output voice information creation means 15 combines the personal name "B" of that registrant and the honorific phrase "Mr." of that registrant to create the shortened voice information "Mr. B". Upon receiving this shortened voice information from the output voice information creation means 15, the control means 16 adds to the output table 500 notification information whose information number 501 is '262', whose information type 502 is "shortened voice information", and whose output voice information 503 is "Mr. B". The control means 16 further records, as the output time 504 of that notification information, the sum 'tm2+tr1' of the personal name time length 'tm2' of the personal name "B" and the honorific phrase time length 'tr1' of the honorific phrase "Mr.", and records the current time '11:31:49' as the creation time 505.
If, from this state, verification then succeeds for the registrant whose identification information 301 is '3', the control means 16, as in the case of the registrant whose identification information 301 is '2', adds to the output table 500 notification information whose information number 501 is '263', whose information type 502 is "shortened voice information", and whose output voice information 503 is "Manager C". The control means 16 further records, as the output time 504 of that notification information, the sum 'tm3+tr3' of the personal name time length 'tm3' of the personal name "C" and the honorific phrase time length 'tr3' of the honorific phrase "Manager", and records the current time '11:31:50' as the creation time 505.

If, in this state, verification of a person to be verified fails, the voice output determination means 14 sends the output voice information creation means 15 a request to create voice information for unregistered persons. Upon receiving the creation request from the voice output determination means 14, the output voice information creation means 15 creates the voice information for unregistered persons "Thank you for visiting". Upon receiving this voice information for unregistered persons from the output voice information creation means 15, the control means 16 adds to the output table 500 notification information whose information number 501 is '264', whose information type 502 is "voice information for unregistered persons", and whose output voice information 503 is "Thank you for visiting". The control means 16 further records the non-authentication fixed phrase time length 'tn1' of "Thank you for visiting" as the output time 504 of that notification information, and records the current time '11:31:51' as the creation time 505.
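
As an illustration only, the four entries of this example could be modelled as follows. The class and field names are hypothetical and not part of the embodiment, the sequential numbering is inferred from the example values '260' to '264', and the English phrases are glosses of the spoken text. Output times are kept as the symbolic sums of the named time lengths, so entry 263 carries tm3+tr3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NotificationInfo:
    info_number: int   # information number 501
    info_type: str     # information type 502
    text: str          # output voice information 503
    output_time: str   # output time 504 (symbolic here)
    created_at: str    # creation time 505


@dataclass
class OutputTable:
    # Number of the notification information that existed most recently.
    last_number: int = 260
    rows: List[NotificationInfo] = field(default_factory=list)

    def add(self, info_type, text, output_time, created_at):
        """Append one piece of notification information with the next number."""
        self.last_number += 1
        row = NotificationInfo(self.last_number, info_type, text,
                               output_time, created_at)
        self.rows.append(row)
        return row


table = OutputTable()
table.add("standard", "Thank you for your hard work, Executive A",
          "tg2+tm1+tr2", "11:31:48")
table.add("shortened", "Mr. B", "tm2+tr1", "11:31:49")
table.add("shortened", "Manager C", "tm3+tr3", "11:31:50")
table.add("unregistered", "Thank you for visiting", "tn1", "11:31:51")
```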

The control means 16 also controls voice output by the voice output unit 3. When the voice output unit 3 is not outputting voice, the control means 16 reads the output voice information 503 stored in the output table 500, sends it to the voice synthesis means 17, and receives from the voice synthesis means 17 the voice signal generated from that output voice information 503. The control means 16 then outputs the received voice signal to the interface unit 4 and causes the voice output unit 3 to output the voice.

At this time, the control means 16 reads the output voice information 503 stored in the output table 500 in order from the top. However, if the words of voice information for unregistered persons are output after the words of standard voice information, and the words of shortened voice information are output after that, the successively output voice forms an inappropriate sentence. For example, in the example shown in FIG. 4, if verification of the person 411 (Mr. B) fails, voice is output in the order "Thank you for your hard work, Executive A", "Thank you for visiting", "Mr. C", so that the non-authentication fixed phrase 612 is sandwiched between "Thank you for your hard work" and "Mr. C". The control means 16 therefore reads voice information for unregistered persons only after output of the voice generated from all the standard voice information and shortened voice information stored in the output table 500 has been completed. In addition, since voice information for unregistered persons consists only of a fixed phrase, outputting voice for each entry is redundant when the output table 500 contains a plurality of pieces of voice information for unregistered persons. Therefore, when the output table 500 contains a plurality of pieces of voice information for unregistered persons, the control means 16 causes the voice synthesis means 17 to generate a voice signal for the voice information for unregistered persons only once, outputs it to the interface unit 4, and causes the voice output unit 3 to output the voice.

When output of the voice signal generated from standard voice information or shortened voice information is completed, the control means 16 deletes the notification information corresponding to that standard voice information or shortened voice information from the output table 500. When output of the voice signal generated from voice information for unregistered persons is completed, the control means 16 deletes from the output table 500 all notification information that includes voice information for unregistered persons.
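
The read-out order and deletion rules described above can be sketched as follows. This is a simplified, synchronous model under stated assumptions: the table is a plain list of (information type, text) tuples in insertion order, the function names are hypothetical, and speaking an entry is represented by appending its text to a list.

```python
REGISTERED_TYPES = ("standard", "shortened")


def select_next(table):
    """Return the entry to speak next, or None if the table is empty.

    Standard and shortened entries are taken first, in stored order;
    an unregistered entry is taken only when none of those remain.
    """
    for entry in table:
        if entry[0] in REGISTERED_TYPES:
            return entry
    for entry in table:
        if entry[0] == "unregistered":
            return entry
    return None


def remove_after_output(table, entry):
    """Delete what has just been spoken from the table (in place)."""
    if entry[0] == "unregistered":
        # One utterance covers every unregistered entry, so drop them all.
        table[:] = [e for e in table if e[0] != "unregistered"]
    else:
        table.remove(entry)


def drain(table):
    """Speak everything currently in the table and return the spoken texts."""
    spoken = []
    entry = select_next(table)
    while entry is not None:
        spoken.append(entry[1])
        remove_after_output(table, entry)
        entry = select_next(table)
    return spoken


queue = [
    ("standard", "Thank you for your hard work, Executive A"),
    ("unregistered", "Thank you for visiting"),
    ("shortened", "Mr. C"),
    ("unregistered", "Thank you for visiting"),
]
order = drain(queue)
```

In this example the greeting for unregistered persons is spoken once, after "Mr. C", so it no longer interrupts the sequence addressed to registrants.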

The voice synthesis means 17 generates a voice signal from the output voice information 503 received from the control means 16 by using a known speech synthesis technique, and sends it back to the control means 16. For example, the voice synthesis means 17 performs speech synthesis using a corpus-based speech synthesis technique. In that case, an intonation database, in which intonation data (for example, the pitch frequency, power, and phoneme duration of each phoneme) are accumulated for each phrase of a predetermined unit (for example, an accent phrase) over a large number of sentences, and a speech database, in which speech segments are accumulated for each intonation and reading, are prepared in the storage unit 5 in advance. The voice synthesis means 17 first analyzes the output voice information 503 to identify the words and determine their readings. Next, the voice synthesis means 17 searches the intonation database for the phrase that matches or is most similar to each phrase in the output voice information 503 and determines the intonation positions of the sentence. The voice synthesis means 17 then generates a voice signal by retrieving appropriate speech segments from the speech database on the basis of the intonation and readings and concatenating them. Alternatively, the voice synthesis means 17 may generate the voice signal by generating artificial speech from the output voice information 503 using a formant synthesis technique.

The operation of the notification information registration process performed by the face image authentication device 1 to which the present invention is applied is described below with reference to the flowchart shown in FIG. 7. The flow of the operation described below is controlled by the processing unit 6. The operation described below is performed at each acquisition time of an input image (for each frame).
First, the face detection means 11 of the processing unit 6 acquires, from the imaging unit 2 via the interface unit 4, an input image in which a person to be verified appears (step S701). The face detection means 11 then detects, in the acquired input image, the input face area, which is the area in which the face of the person to be verified appears, and creates an input face image (step S702). The face detection means 11 then determines whether detection of the input face area succeeded (step S703). If the face detection means 11 failed to detect an input face area, the processing unit 6 executes the processing from step S701 again.
If, on the other hand, the face detection means 11 succeeded in detecting an input face area, the face detection means 11 notifies the face tracking means 12 of information representing the input face area. The face tracking means 12 performs tracking processing on the input face areas extracted by the face detection means 11 and associates with one another the input face areas in which the face of the same person appears (step S704).

The processing of steps S705 to S711 below is performed for each input face image that was created from an input face area extracted by the face detection means 11 and tracked by the face tracking means 12, and that has not yet been verified. The face verification means 13 verifies each such input face image against each registered face image in the registration table 300 stored in the storage unit 5 (step S705) and determines whether they belong to the same person (step S706).
If the face verification means 13 determines that the person to be verified is not a registrant (No branch of step S706), it sends a result notification indicating verification failure to the voice output determination means 14. Upon receiving the result notification indicating verification failure from the face verification means 13, the voice output determination means 14 sends a request to create voice information for unregistered persons to the output voice information creation means 15. Upon receiving that request from the voice output determination means 14, the output voice information creation means 15 creates voice information for unregistered persons (step S707).
If, on the other hand, the face verification means 13 determines that the person to be verified is a registrant (Yes branch of step S706), it sends a result notification indicating verification success to the voice output determination means 14. Upon receiving the result notification indicating verification success from the face verification means 13, the voice output determination means 14 determines whether notification information including standard voice information or shortened voice information is stored in the output table 500 (step S708). If neither standard voice information nor shortened voice information is stored in the output table 500 (Yes branch of step S708), the voice output determination means 14 sends a request to create standard voice information to the output voice information creation means 15. Upon receiving that request from the voice output determination means 14, the output voice information creation means 15 creates standard voice information (step S709). If, on the other hand, either standard voice information or shortened voice information is stored in the output table 500 (No branch of step S708), the voice output determination means 14 sends a request to create shortened voice information to the output voice information creation means 15. Upon receiving that request from the voice output determination means 14, the output voice information creation means 15 creates shortened voice information (step S710).
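
The branching of steps S706 to S710 can be summarized by the following minimal sketch. The function name and the string labels are assumptions made for illustration; `pending_types` stands for the information types 502 of the notification information currently stored in the output table 500.

```python
def choose_request(is_registrant, pending_types):
    """Return which kind of output voice information to request.

    is_registrant  -- result of step S706 (True on verification success)
    pending_types  -- information types currently stored in the output table
    """
    if not is_registrant:
        # No branch of S706: request voice information for unregistered persons (S707).
        return "unregistered"
    if any(t in ("standard", "shortened") for t in pending_types):
        # A greeting to a registrant is still queued or being output (S710).
        return "shortened"
    # Nothing addressed to a registrant is pending (S709).
    return "standard"


pending = []
requests = []
for verified in (True, True, False, True):
    kind = choose_request(verified, pending)
    requests.append(kind)
    pending.append(kind)
```

Note that, as described, a pending entry for an unregistered person alone does not cause the shortened form to be chosen; only pending standard or shortened voice information does.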

When the output voice information creation means 15 has created the output voice information 503, it sends the created output voice information 503 to the control means 16. Upon receiving the output voice information 503 from the output voice information creation means 15, the control means 16 generates an information number 501, calculates the output time 504, obtains the creation time 505, creates notification information from the information number 501, the information type 502, the output voice information 503, the output time 504, and the creation time 505, and adds it to the output table 500 (step S711).
When the control means 16 has created the notification information and added it to the output table 500, it is determined whether the notification information registration process has been performed for all input face images that were created from tracked input face areas and had not yet been verified. If an input face image remains for which the notification information registration process has not yet been performed, control returns to step S705 and the processing of steps S705 to S711 is repeated. If the notification information registration process has been performed for all of those input face images, the processing unit 6 executes the processing from step S701 again.

The operation of the voice output process performed by the face image authentication device 1 to which the present invention is applied is described below with reference to the flowchart shown in FIG. 8. The flow of the operation described below is controlled by the processing unit 6.
First, the control means 16 of the processing unit 6 determines whether notification information is stored in the output table 500 (step S801). If no notification information is stored in the output table 500 (No branch of step S801), the voice synthesis means 17 waits at step S801 until notification information is newly stored in the output table 500. If, on the other hand, notification information is stored in the output table 500 (Yes branch of step S801), the control means 16 determines whether any of the notification information includes standard voice information or shortened voice information as its output voice information 503 (step S802).

If any of the notification information includes standard voice information or shortened voice information as its output voice information 503 (Yes branch of step S802), the control means 16 reads the output voice information 503 (standard voice information or shortened voice information) of whichever such notification information is stored first (step S803). The control means 16 then sends the read output voice information 503 to the voice synthesis means 17. Upon receiving the output voice information 503 from the control means 16, the voice synthesis means 17 generates a voice signal from it and sends the voice signal back to the control means 16. Upon receiving the voice signal from the voice synthesis means 17, the control means 16 outputs it to the interface unit 4 and causes the voice output unit 3 to output the voice (step S804). When output of the voice signal generated from the output voice information 503 is completed, the control means 16 deletes the notification information corresponding to that output voice information 503 from the output table 500 (step S805). After deleting the notification information from the output table 500, the control means 16 executes the processing from step S801 again.

If, on the other hand, none of the notification information includes standard voice information or shortened voice information as its output voice information 503 (No branch of step S802), the control means 16 reads, from among the notification information that includes voice information for unregistered persons as its output voice information 503, the output voice information 503 (voice information for unregistered persons) that is stored first (step S806). The control means 16 then sends the read output voice information 503 to the voice synthesis means 17. Upon receiving the output voice information 503 from the control means 16, the voice synthesis means 17 generates a voice signal from it and sends the voice signal back to the control means 16. Upon receiving the voice signal from the voice synthesis means 17, the control means 16 outputs it to the interface unit 4 and causes the voice output unit 3 to output the voice (step S807). When output of the voice signal generated from the output voice information 503 is completed, the control means 16 deletes from the output table 500 all notification information that includes voice information for unregistered persons (step S808). After deleting all such notification information from the output table 500, the control means 16 executes the processing from step S801 again.

The operation of the face image authentication device 1 of the present embodiment when a plurality of users pass through a passage in which the face image authentication device is installed is described below.
FIG. 9 shows an example of the relationship between input images and output voice when a plurality of users pass in succession through the passage in which the face image authentication device 1 is installed. In FIG. 9, the image 900 is the input image at time t, the image 901 is the input image at time t+1, the image 902 is the input image at time t+2, the image 903 is the input image at time t+3, and the image 904 is the input image at time t+4. In the example shown in FIG. 9, the person 910 (Executive A) appears in the input image 900 at time t. The face image authentication device 1 therefore performs verification processing at time t and, between time t and time t+1, outputs the output voice "Thank you for your hard work, Executive A" as the verification result. The person 911 (Mr. B) newly appears in the input image 901 at time t+1. In this case, the face image authentication device 1 performs verification processing at time t+1 and, at time t+2, when output of the output voice "Thank you for your hard work, Executive A" is completed, outputs the output voice "Mr. B", from which "Thank you for your hard work" is omitted. Further, the person 912 (Mr. C) newly appears in the input image 902. In this case, the face image authentication device 1 performs verification processing at time t+2 and, at time t+3, when output of the output voice "Mr. B" is completed, outputs the output voice "Mr. C", from which "Thank you for your hard work" is omitted. The face image authentication device 1 can thereby complete the voice output before the person 911 (Mr. B) and the person 912 (Mr. C) enter the room.
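
The timing relationship of FIG. 9 can be reproduced with a small simulation. This is only a sketch: the function name is hypothetical, times are expressed in frame intervals relative to time t, and the durations (2 intervals for the full greeting, 1 interval for each shortened form) are assumed values chosen so that the outputs end at the times stated in the text.

```python
def schedule(items):
    """Compute when each utterance starts and ends.

    items -- list of (label, ready_time, duration) in queue order, where
             ready_time is when verification of that person finished.
    Returns a list of (label, start, end). An utterance starts as soon as
    it is ready and the previous utterance has finished.
    """
    timeline = []
    free_at = 0  # time at which the voice output unit becomes idle
    for label, ready, duration in items:
        start = max(ready, free_at)
        end = start + duration
        timeline.append((label, start, end))
        free_at = end
    return timeline


timeline = schedule([
    ("Thank you for your hard work, Executive A", 0, 2),  # verified at t
    ("Mr. B", 1, 1),                                      # verified at t+1
    ("Mr. C", 2, 1),                                      # verified at t+2
])
```

Under these assumptions "Mr. B" starts at t+2 and "Mr. C" at t+3. If every utterance used the full form with duration 2, the third greeting would not start until t+4, which illustrates why the shortened form helps the output keep pace with arriving users.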

As described above, the face image authentication device 1 to which the present invention is applied determines whether the person to be verified who appears in an input face image, extracted from an input image capturing that person's face, is the same person as a registrant registered in advance. The face image authentication device 1 then creates output voice information for the person to be verified, generates a voice signal from the output voice information, and outputs it. When verification of a person to be verified is completed, the face image authentication device 1 makes the output voice information that it creates while it is outputting voice shorter than the output voice information that it creates while it is not outputting voice. Accordingly, when there are a plurality of persons to be verified, the face image authentication device 1 can make the voice output to the second and subsequent persons verified shorter than the normal voice. The face image authentication device 1 can therefore output voice at an appropriate timing for each person to be verified when it verifies a plurality of such persons sequentially and outputs voice to each of them.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. For example, the data stored in the registration table, the fixed phrase table, and the output table need not be text data and may instead be voice signals created in advance, for example in wav or mp3 format. In that case, voice signals are stored in the storage unit as the personal names in the registration table and as the words of each fixed phrase in the fixed phrase table, and the output voice information creation means creates, as the output voice information, a voice signal obtained by joining these voice signals together, in place of text data. The control means then outputs the voice signal created by the output voice information creation means directly to the interface unit without sending it to the voice synthesis means.

The shortened voice information may be anything that is shorter than the standard voice information. For example, it may be text in which the personal name and honorific phrase are omitted from the standard voice information, that is, text consisting only of the fixed phrase. Alternatively, the words of the shortened voice information may be kept the same as those of the standard voice information, and the voice signal based on the shortened voice information may be compressed in time using a known speech processing technique, so that the output voice based on the shortened voice information is spoken faster than the output voice based on the standard voice information and is therefore output in a shorter time. For example, the voice synthesis means compresses the voice signal based on the shortened voice information in time by using the SOLA (synchronized overlap-add) technique to calculate the correlation between neighboring frames of data, shifting to the position of greatest correlation, and cross-fading. Alternatively, the voice synthesis means may compress the voice signal based on the shortened voice information in time using the PICOLA (Pointer Interval Controlled OverLap and Add) technique, a phase vocoder technique, or the like.
In this modification, when voice signals are stored in the storage unit as the personal names in the registration table and the words of each fixed phrase in the fixed phrase table, the control means sends the (joined) voice signal of the shortened voice information created by the output voice information creation means to the voice synthesis means. The voice synthesis means then compresses the received voice signal in time and sends it back to the control means.

The face verification means need not determine verification failure whenever a single face verification process does not result in verification success. It may instead determine verification failure when verification success has not been determined over a plurality of face verification processes (over a plurality of frames), or when a timeout occurs. In that case, the processing of steps S707 to S711 shown in FIG. 7 is not performed for an input face image for which neither verification success nor verification failure has been determined. To this end, the face verification means does not send a result notification to the voice output determination means for an input face image for which it determined neither verification success nor verification failure in step S706.
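
One way to picture this modification is a verifier with three outcomes per frame: success, failure, or undecided. The sketch below is an assumption-laden illustration; the class name, the attempt limit, the timeout value, and the use of a track identifier to group frames of the same person are all hypothetical and are not specified in the description.

```python
class MultiFrameVerifier:
    """Decide success or failure over several frames for each tracked face."""

    def __init__(self, max_attempts=5, timeout=2.0):
        self.max_attempts = max_attempts  # assumed limit on verification attempts
        self.timeout = timeout            # assumed time limit in seconds
        self._state = {}                  # track_id -> (attempts, time of first attempt)

    def update(self, track_id, matched, now):
        """Record one verification attempt.

        Returns True on verification success, False on verification failure,
        and None while undecided. No result notification is sent for None.
        """
        attempts, first_seen = self._state.get(track_id, (0, now))
        attempts += 1
        if matched:
            self._state.pop(track_id, None)
            return True
        if attempts >= self.max_attempts or now - first_seen >= self.timeout:
            self._state.pop(track_id, None)
            return False
        self._state[track_id] = (attempts, first_seen)
        return None


verifier = MultiFrameVerifier(max_attempts=3, timeout=10.0)
results = [verifier.update("track-1", False, t) for t in (0.0, 0.1, 0.2)]
```

Here the first two unsuccessful frames leave the result undecided, and only the third produces a failure, so a single poor frame no longer triggers the greeting for unregistered persons.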

Further, for example, the face image authentication device may output no voice to a person to be verified whose matching was judged to have failed. In that case, even when the voice output determination means receives a result notification indicating a matching failure from the face matching means, it does not send a request to create voice information for unregistered persons to the output voice information creating means. Accordingly, in this case there are two types of output voice information: standard voice information and shortened voice information.

In the present embodiment, the voice output determination means decides which type of output voice information the output voice information creating means is to create according to whether the face image authentication device is outputting a voice at the time it receives the result notification from the face matching means, that is, at the time the face matching means completes the face matching process. However, the present invention is not limited to this. The voice output determination means may instead make this decision according to whether the face image authentication device is outputting a voice at another timing, for example when the face matching means starts the face matching process. In that case, the face matching means sends a start notification to the voice output determination means before starting the face matching process, and the voice output determination means determines, on receiving that start notification, whether the face image authentication device is outputting a voice.

The control means need not store notification information in the output table in the order in which it was created (the order in which matching was performed). For example, it may reorder the entries so that notification information for standard voice information or shortened voice information is stored ahead of notification information for voice information for unregistered persons. In that case, simply by outputting voices in the order of the notification information stored in the output table, the control means can output voices based on standard or shortened voice information before voices based on voice information for unregistered persons.
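
The reordering described above can be sketched as follows. The dictionary layout, the `kind` labels, and the function name `store_notification` are assumptions made for this example; the sketch also ignores whether an entry is already being played, which the paragraph above does not address.

```python
def store_notification(table, new):
    """Store `new` in `table` (a list of dicts, modified in place).

    Notifications for standard or shortened voice information are placed
    ahead of any notification for unregistered persons; notifications for
    unregistered persons are simply appended.
    """
    if new["kind"] != "unregistered":
        for i, entry in enumerate(table):
            if entry["kind"] == "unregistered":
                table.insert(i, new)
                return
    table.append(new)


# Hypothetical table contents for illustration.
output_table = [
    {"number": 1, "kind": "standard"},
    {"number": 2, "kind": "unregistered"},
]
store_notification(output_table, {"number": 3, "kind": "shortened"})
store_notification(output_table, {"number": 4, "kind": "unregistered"})
```

After these two calls the table holds the entries in the order 1, 3, 2, 4, so reading it front to back outputs the voices for registered persons first.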

In the present embodiment, the voice output determination means determines whether the face image authentication device is outputting a voice according to whether standard voice information or shortened voice information is stored in the output table, but the present invention is not limited to this. For example, the voice output determination means may make the determination from the time at which voice output started and the time required for the voice output. FIG. 10 is a schematic diagram of the output table in that case. As shown in FIG. 10, the output table 1000 of this modification holds an output start time 1005 in place of the creation time 505 of the output table 500 shown in FIG. 5. The output start time 1005 is the time at which the control means output the voice signal based on the output voice information 1003 to the interface unit, and it is recorded by the control means.

When the control means of this modification starts outputting the voice signal based on the output voice information 1003 to the interface unit, it records the time at that moment as the output start time 1005. In addition, when notification information containing standard voice information is newly stored in an output table 1000 that until then held no notification information containing standard or shortened voice information, the control means records the output start time 1005 of that notification information in the storage unit as the standard voice output start time.
For shortened voice information recorded after standard voice information, the time at which output of the voice signal to the interface unit starts depends on the output times of the output voice information recorded before it, so the output start time 1005 is left blank. Alternatively, the time at which the voice signal was actually output to the interface unit may be recorded after the fact.
Meanwhile, when the voice output determination means receives a result notification indicating matching success or matching failure from the face matching means, it acquires the latest standard voice output start time from the storage unit. It also acquires the output time 1004 of the notification information containing the standard voice information that corresponds to that standard voice output start time, and the output times 1004 of all notification information containing shortened voice information created after that notification information. The voice output determination means then calculates the sum of all the acquired output times 1004 (hereinafter referred to as the total output time).
For example, in the output table 1000 shown in FIG. 10, '11:31:48', the output start time of the notification information whose information number 1001 is '261', is the standard voice output start time. The total output time is the sum of 'tg2+tm1+tr2', 'tm2+tr1', and 'tm3+tr3', which are the output times 1004 of the notification information whose information numbers 1001 are '261', '262', and '263', respectively.

If the time at which the voice output determination means receives the result notification from the face matching means is earlier than the standard voice output start time plus the total output time, it determines that the face image authentication device is outputting a voice and sends a request to create shortened voice information to the output voice information creating means. If that time is at or after the standard voice output start time plus the total output time, it determines that the face image authentication device is not outputting a voice and sends a request to create standard voice information to the output voice information creating means. Alternatively, the voice output determination means may determine whether the face image authentication device is outputting a voice according to whether the time at which it receives a start notification, rather than a result notification, from the face matching means is earlier than the standard voice output start time plus the total output time.
So that whether the face image authentication device is outputting a voice can be determined from the time voice output started and the time required for voice output, the control means does not individually delete notification information from the output table 1000 when output of the voice signal generated from its standard or shortened voice information completes. Instead, the control means deletes all notification information containing standard or shortened voice information once output of all voices generated from all the standard and shortened voice information in the output table 1000 has completed. Alternatively, when notification information containing standard voice information is newly added to the output table 1000, the control means deletes the notification information containing standard or shortened voice information whose voice output has already completed. When deleting such notification information, the control means may also delete the standard voice output start time.
In this modification too, the face image authentication device can properly determine whether it is outputting a voice, and the same effect is obtained as when the determination is based on whether output voice information is stored in the output table. Moreover, in this modification the face image authentication device determines whether a voice is being output from the output time of each piece of output voice information and changes the length of the output voice accordingly. Therefore, even when the length of the output voice information differs for each person to be verified, for example when it contains that person's personal name, a voice can be output at an appropriate timing for each person.
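
The time-based determination of this modification can be sketched as follows, covering only the case of a successful match. The class and function names, the representation of times as seconds since midnight, and the numeric durations (2.5 s, 1.5 s, 1.5 s, standing in for the symbolic output times of FIG. 10) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Notification:
    number: int                            # information number 1001
    kind: str                              # "standard" or "shortened"
    output_time: float                     # output time 1004, in seconds
    output_start: Optional[float] = None   # output start time 1005, if recorded


def total_output_time(table: List[Notification], standard_start: float) -> float:
    """Sum the output time of the standard entry that started at
    `standard_start` and of every shortened entry stored after it."""
    matches = [i for i, n in enumerate(table)
               if n.kind == "standard" and n.output_start == standard_start]
    if not matches:
        return 0.0
    idx = matches[-1]
    later = sum(n.output_time for n in table[idx + 1:] if n.kind == "shortened")
    return table[idx].output_time + later


def requested_kind(table: List[Notification],
                   standard_start: Optional[float], now: float) -> str:
    """Return which voice information to request at time `now`."""
    if standard_start is None:
        return "standard"          # nothing has been output yet
    end = standard_start + total_output_time(table, standard_start)
    return "shortened" if now < end else "standard"


# Entries 261 to 263 of FIG. 10 with hypothetical durations in seconds.
start = 11 * 3600 + 31 * 60 + 48   # 11:31:48 as seconds since midnight
fig10 = [
    Notification(261, "standard", 2.5, start),
    Notification(262, "shortened", 1.5),
    Notification(263, "shortened", 1.5),
]
```

With these assumed durations the total output time is 5.5 seconds, so a result notification received 4 seconds after 11:31:48 leads to a request for shortened voice information, while one received 5.5 seconds or more after it leads to a request for standard voice information.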

In the present embodiment, the face image authentication device outputs voices corresponding to the matching results in the order in which matching was performed on the persons to be verified, but the present invention is not limited to this. For example, a priority for the order of voice output may be set in advance according to the attribute of each registrant, and the face image authentication device may output the voices for the persons to be verified in descending order of priority. In that case, for example, registrants whose attribute is "customer" are given the priority "high", registrants whose attribute is "vendor" are given the priority "medium", and registrants whose attribute is "executive", "department manager", or "staff member" (that is, "employee") are given the priority "low". The priority is stored in the storage unit in association with the attribute of the registrant. When storing notification information in the output table, the control means uses the creation time (or output start time) and the output time in the output table to extract, from the notification information for standard or shortened voice information already stored there, the entries whose voice output has not yet started. The control means then inserts the new notification information before those extracted entries whose priority is lower than that of the new notification information. The control means then outputs the voices based on the output voice information in the order of the notification information stored in the output table. In this way, the control means can output voices in descending order of priority for the standard or shortened voice information stored in the output table whose voice output has not yet started.

FIGS. 11(a) and 11(b) are schematic diagrams of the output table that illustrate how the order of voice output is changed according to the priority of the registrants. First, as shown in FIG. 11(a), the output table 1100 holds notification information for standard voice information whose information number 1101 is '131' and whose creation time 1105 is '09:15:29', and notification information for shortened voice information whose information number 1101 is '132' and whose creation time 1105 is '09:15:30'; both have the attribute "employee". When notification information for shortened voice information with the attribute "customer", the information number 1101 '133', and the creation time 1105 '09:15:31' is then newly created, the control means inserts it, as shown in FIG. 11(b), before the notification information whose information number 1101 is '132', because the attribute "employee" of that entry has a lower priority than "customer". The control means then reads the output voice information in the order of the notification information stored in the output table 1100, has the voice synthesis means generate the voice signals, and outputs them to the interface unit.
Because the voice for the notification information at the head of the output table 1100 is assumed to be currently being output, newly created notification information is placed second or later regardless of how high its priority is. In this example, the newly created notification information is therefore inserted after the notification information whose information number 1101 is '131'.
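
The insertion shown in FIGS. 11(a) and 11(b) can be sketched as follows. The numeric priority values, the `Entry` class, and the function name `insert_by_priority` are assumptions; for simplicity the sketch treats the head entry as being played and every later entry as not yet started, whereas the text above derives this from the recorded times.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical numeric encoding of the priorities "high", "medium", "low".
PRIORITY = {"customer": 2, "vendor": 1, "employee": 0}


@dataclass
class Entry:
    number: int       # information number 1101
    attribute: str    # attribute of the registrant


def insert_by_priority(table: List[Entry], new: Entry) -> None:
    """Insert `new` before the first not-yet-started entry of lower priority.

    Index 0 is assumed to be the entry currently being output, so the new
    entry is never placed ahead of it.
    """
    for i in range(1, len(table)):
        if PRIORITY[table[i].attribute] < PRIORITY[new.attribute]:
            table.insert(i, new)
            return
    table.append(new)


# The situation of FIG. 11(a): two "employee" entries are already stored.
fig11 = [Entry(131, "employee"), Entry(132, "employee")]
insert_by_priority(fig11, Entry(133, "customer"))
order_after_customer = [e.number for e in fig11]
```

Here `order_after_customer` is `[131, 133, 132]`, which matches the arrangement of FIG. 11(b): the customer's entry moves ahead of entry 132 but stays behind entry 131, whose voice is assumed to be playing.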

When the order of voice output is rearranged according to priority, the order within the output table 1100 may instead be kept as the order in which the notification information was created (the order in which matching was performed), and the values of the information numbers 1101 may be swapped. In that case, the control means can output voices in descending order of priority by outputting the voices based on the output voice information in order of information number 1101 rather than in the order in which they are stored in the output table 1100.
The face image authentication device can thus give priority to the voice corresponding to the matching result for a customer or the like and output it early, which allows fine-grained control of the order of voice output.

As described above, those skilled in the art can make various modifications within the scope of the present invention to suit the mode in which it is implemented.

DESCRIPTION OF SYMBOLS
1 Face image authentication device
2 Imaging unit
3 Voice output unit
4 Interface unit
5 Storage unit
6 Processing unit
11 Face detection means
12 Face tracking means
13 Face matching means
14 Voice output determination means
15 Output voice information creating means
16 Control means
17 Voice synthesis means

Claims

1. A face image authentication device comprising:
an imaging unit that sequentially acquires input images capturing a person to be verified;
a storage unit that stores a registered face image of a registrant in advance;
a voice output unit that outputs a voice to the person to be verified;
face detection means for extracting, each time an input image is acquired, an image of the face region of the person to be verified as an input face image;
face matching means for matching the registered face image against the input face image and determining whether they show the same person;
voice output determination means for determining whether the voice output unit is outputting a voice;
output voice information creating means for creating output voice information based on the determination result of the voice output determination means;
voice synthesis means for synthesizing an output voice signal from the output voice information; and
control means for causing the voice output unit to output a voice based on the output voice signal,
wherein the output voice information creating means creates, as the output voice information,
standard voice information if the voice output unit is not outputting a voice at the time of the determination by the face matching means, and
shortened voice information shorter than the standard voice information if the voice output unit is outputting a voice at the time of the determination.

2. The face image authentication device according to claim 1, wherein
the storage unit further stores an authentication fixed phrase and a personal name of the registrant in association with the registered face image of the registrant, and
the output voice information creating means
creates the standard voice information from the authentication fixed phrase and the personal name of the registrant determined by the face matching means to be the same person as the person to be verified shown in the input face image, and
creates the shortened voice information from the personal name of the registrant determined by the face matching means to be the same person as the person to be verified shown in the input face image.

3. The face image authentication device according to claim 1, wherein
the control means stores the output voice information in the storage unit each time the output voice information creating means creates the output voice information, and deletes from the storage unit the output voice information corresponding to an output voice signal when the voice output unit completes voice output of that output voice signal, and
the voice output determination means determines that the voice output unit is outputting a voice if the output voice information is stored in the storage unit, and determines that the voice output unit is not outputting a voice if the output voice information is not stored in the storage unit.

4. The face image authentication device according to claim 3, wherein
the storage unit stores an attribute for each registrant and stores, in association with the attribute, a priority that determines in advance the order in which the control means causes the voice output unit to output the output voice signals, and
the control means causes the voice output unit to output, in descending order of the priority, the output voice signals for those pieces of output voice information stored in the storage unit for which voice output by the voice output unit has not yet started.