JP4730812B2 - Personal authentication device, personal authentication processing method, program therefor, and recording medium

Info

Publication number: JP4730812B2
Application number: JP2005086974A
Authority: JP
Inventors: 宏幸作山; 博顕鈴木; 泰稔太田; 亨水納; 一成戸波; 浩久稲本; 繁生森島
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2005-03-24
Filing date: 2005-03-24
Publication date: 2011-07-20
Anticipated expiration: 2025-03-24
Also published as: JP2006268563A

Description

The present invention relates to a personal authentication device and processing method that use face images, to a program therefor, and to a recording medium.

Personal authentication systems have long been used in various fields. In recent years, many authentication systems based on face images have been proposed because users feel little psychological resistance to this form of authentication. For example, Patent Document 1 proposes recognizing a customer with a face authentication system and presenting customer information based on the recognition result.

However, personal authentication that relies on a single still face image can produce false authentications, for example in the case of twins. Patent Document 2 therefore proposes combining two or more of face image authentication, voice authentication, ID number authentication, fingerprint authentication, and the like.

Patent Document 1: JP 2004-326208 A
Patent Document 2: JP 2000-259828 A

Authentication that combines different modalities is effective, but because it involves several entirely different kinds of operation, the device configuration and processing become complicated. It is therefore preferable to improve authentication accuracy using face images alone. One way to do so is to use a moving image rather than a single still image, since authentication based on time-series face images can be expected to be more reliable than authentication based on a single face image.

With moving images, one could, for example, authenticate a user against a registered moving image of a smile. However, because a moving image is time-series data, the observed facial expression itself may not be sufficiently reproducible. In that case, recognition accuracy may fail to improve even though a moving image is used.

An object of the present invention is to provide a device and processing method that achieve highly accurate personal authentication using time-series moving images, together with a program therefor and a recording medium.

To improve the reproducibility of the moving image of a facial expression, the present invention has the user utter a keyword or the like and performs authentication using the moving image of the face during that utterance. In other words, the utterance is imposed as a constraint on how the expression is produced, which makes the expression more reproducible. This differs from a combination with voice authentication as in Patent Document 2.

Specifically, the invention according to claim 1 is a personal authentication device that performs person authentication based on time-series face images captured during an utterance, comprising: means for acquiring a moving image; means for acquiring voice; means for dividing the voice into phonemes and obtaining the central time of the interval in which each phoneme is pronounced; means for extracting, from the moving image, a frame image that constitutes the face image immediately before the utterance (hereinafter, the reference frame) and a plurality of frame images that constitute the face images at the central time of the interval in which each phoneme is pronounced (hereinafter, phoneme frames); means for extracting feature points of the face image from the reference frame and from each phoneme frame and calculating, for each phoneme frame, the displacement of the feature points relative to the reference frame (hereinafter, the feature point displacement); and means for combining the feature point displacements of all phoneme frames into a single vector, calculating the similarity between that vector and a template vector registered in advance, and performing person authentication.

The authentication system may prompt the timing and number of utterances, or may itself designate the words to be uttered each time. Alternatively, one or more words to be uttered may be determined in advance, and the system may prompt for, or the user may utter, any one of them. The predetermined words are preferably specific keywords that are hard to forget, such as a name, address, date of birth, or password.

Accordingly, the invention according to claim 2 is the personal authentication device according to claim 1, further comprising means for prompting an utterance when a face image is included in the moving image. The object of the invention according to claim 2 is to make uttering more convenient.

The invention according to claim 3 is the personal authentication device according to claim 1, wherein the content to be uttered is unspecified. The object of the invention according to claim 3 is to make it unnecessary for the user to memorize what to utter.

The invention according to claim 4 is the personal authentication device according to claim 1, wherein the content to be uttered is a specific keyword. The object of the invention according to claim 4 is to make the content to be uttered easier to remember.

The inventions according to claims 5 to 8 provide personal authentication processing methods corresponding to the personal authentication devices according to claims 1 to 4. Their object is to obtain the effects corresponding to claims 1 to 4, respectively.

The inventions according to claims 9 and 10 provide a program for causing a computer to execute the personal authentication processing method according to claims 5 to 8, and a recording medium on which the program is recorded. Their object is to realize, by installing the program on a computer, a system having the configuration corresponding to each of claims 1 to 4.

The personal authentication device, personal authentication processing method, program, or recording medium of the present invention provides the following effects.

(1) Time-series face images corresponding to each phoneme of an utterance are acquired, and person authentication is performed based on the displacement of the feature points of those face images relative to the face image immediately before the utterance. This makes the user's facial expression more reproducible and thereby improves the accuracy of person authentication.

(2) Providing means for prompting an utterance makes uttering more convenient.

(3) Leaving the content to be uttered unspecified makes it unnecessary for the user to memorize it.

(4) Using a specific keyword as the content to be uttered makes it easier to remember.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows an example device configuration according to the first embodiment of the present invention. In FIG. 1, moving image capturing means 10, face detection means 11, frame extraction means 12, face feature point displacement calculation means 13, and similarity evaluation means 14 are connected via a data bus. Face detection means 11, voice acquisition means 15, and voice division means 16 are likewise connected, and voice acquisition means 15 is further connected to utterance prompting means 17. The means 11 to 17, that is, all means other than moving image capturing means 10, are generally internal components of a computer.

FIG. 2 shows the overall processing flow of the first embodiment. The first embodiment is described in detail below with reference to FIG. 2.

Moving image capturing means 10 continuously captures video of a place where users of the authentication system can be filmed, for example an entrance (step 101), and the video is fed to face detection means 11 as it is captured.

Face detection means 11 determines whether a single frame contains a face and, if it does, cuts out the face area (mainly the area from the chin to the forehead) (step 102). FIG. 3 shows an example of detection by face detection means 11; the area enclosed by the white frame is detected as the face area.

Face detection means 11 can be implemented, for example, by the method described in Kotaro Sabe and Kenichi Hidai, "Learning a real-time arbitrary posture face detector using pixel difference features," Proceedings of the Image Sensing Symposium 2005. In that method, the pixel difference values of arbitrary pairs of points in training images (that is, many face images) are input to a boosting learner to learn thresholds. The learned thresholds are then compared with the pixel difference values of arbitrary pairs of points in the input image to determine whether it contains a face.

When face detection means 11 detects a face in a frame, it sends the frame image to frame extraction means 12 with the area information attached (for example, the frame information of FIG. 3). When it detects no face, it sends the frame image to frame extraction means 12 without face area information. Frame extraction means 12 buffers a predetermined number of the input frames.

When face detection means 11 detects a face, it also notifies voice acquisition means 15, which then starts operating (step 103). The first embodiment uses the utterance of a specific keyword. If the user does not speak within a predetermined time after voice acquisition means 15 starts operating, voice acquisition means 15 notifies utterance prompting means 17, which then prompts the user to utter the keyword with an announcement such as "Please say your full name." When voice acquisition means 15 acquires the user's utterance, it sends the voice information to voice division means 16.

Voice division means 16 decomposes the input voice information into phonemes using known DP matching (dynamic programming matching) or a similar technique (step 104). Here, phonemes are the roughly 24 units used to represent Japanese speech, consisting of the vowels (a, i, u, e, o) and the consonants, including voiced consonants (k, s, t, n, h, m, y, r, w, g, z, d, b, p, ...). In DP matching, the features of all phonemes registered in advance (for example, the spectrum of a phoneme over about 20 ms) serve as templates. The time axis of the input voice information is stretched and compressed, the spectrum of the warped voice information is computed and scored against each template, and the input is recognized as the phoneme of the best-matching template. Because a spectrum is a vector whose elements are the amplitudes in each frequency band, two spectra can be judged to match when the inner product of their vectors is at least a predetermined value. Since the first embodiment assumes that a predetermined keyword is uttered, the spectra used as templates are fixed from the start, and the task reduces to finding the time-axis warping that gives the best match.
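The time-axis warping described above can be sketched as a standard dynamic-programming alignment. This is a minimal illustration rather than the patented procedure: the names `dtw_align` and `phoneme_centers` are hypothetical, each template phoneme is reduced to one feature vector, and Euclidean distance is assumed as the local cost in place of the inner-product spectrum match mentioned in the text.

```python
import math

def dtw_align(template, signal):
    """Align two sequences of feature vectors by dynamic programming.

    Returns (total_cost, path); path is a list of (i, j) pairs that
    match template element i to signal frame j.
    """
    n, m = len(template), len(signal)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance (an assumption of this sketch).
            d = math.dist(template[i - 1], signal[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

def phoneme_centers(path, n_phonemes):
    """Middle signal-frame index of the span aligned to each template phoneme."""
    spans = [[] for _ in range(n_phonemes)]
    for i, j in path:
        spans[i].append(j)
    return [s[len(s) // 2] for s in spans]
```

Multiplying a center index by the audio frame period would give the central time of each phoneme interval that is passed on in step 105.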

After dividing the voice input into phonemes, voice division means 16 sends the time immediately before the utterance and the central time of the interval in which each phoneme is pronounced to frame extraction means 12 (step 105). Frame extraction means 12 extracts, from the buffered images, the frames at those times that carry face area information (step 106) and sends them to face feature point displacement calculation means 13. That is, face feature point displacement calculation means 13 receives the frame immediately before the utterance (hereinafter, the reference frame) and one frame per phoneme (hereinafter, phoneme frames).
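As a rough illustration of step 106, the sketch below picks frames out of a buffer by timestamp. The buffer layout `(timestamp, frame, face_box)` and the function name `select_frames` are assumptions made for this example, as is the choice of the nearest timestamp when no frame falls exactly on the requested time.

```python
def select_frames(buffer, t_before_speech, phoneme_center_times):
    """Pick the reference frame and one frame per phoneme from a frame buffer.

    buffer: list of (timestamp, frame, face_box); face_box is None when
    no face was detected in that frame.
    """
    # Only frames that carry face area information are candidates.
    faces = [entry for entry in buffer if entry[2] is not None]

    def nearest(t):
        return min(faces, key=lambda entry: abs(entry[0] - t))

    reference = nearest(t_before_speech)
    phoneme_frames = [nearest(t) for t in phoneme_center_times]
    return reference, phoneme_frames
```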

As shown in FIG. 4, face feature point displacement calculation means 13 sets windows for detecting the eyebrows, eyes, and mouth (dotted lines in FIG. 4) inside the face area of the frame (solid rectangle in FIG. 4) (step 107). It applies an edge detection operator such as the Laplacian to the pixel values in each window and binarizes the result with a predetermined threshold to obtain a line image of the edges, as shown in FIG. 5 (step 108). Each window in this embodiment is defined by a position and size relative to the rectangle detected as the face area, and the aspect ratio of each window is proportional to that of the face area. Consequently, the larger the detected face area, the proportionally larger the windows.
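Steps 107 and 108 can be sketched as follows. This is only an illustration under stated assumptions: windows are given as fractions of the face rectangle, the 4-neighbour discrete Laplacian stands in for "an edge detection operator such as the Laplacian," and the names `scale_window` and `laplacian_edges` are hypothetical.

```python
def scale_window(face_box, rel_window):
    """Convert a window given relative to the face box into pixel coordinates.

    face_box: (x, y, w, h) in pixels.
    rel_window: (rx, ry, rw, rh) as fractions of the face box.
    """
    x, y, w, h = face_box
    rx, ry, rw, rh = rel_window
    return (round(x + rx * w), round(y + ry * h), round(rw * w), round(rh * h))

def laplacian_edges(gray, threshold):
    """Apply a 4-neighbour Laplacian and binarize; border pixels stay 0.

    gray: 2D list of pixel intensities. Returns a 2D list of 0/1 values.
    """
    rows, cols = len(gray), len(gray[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            lap = (gray[r - 1][c] + gray[r + 1][c] +
                   gray[r][c - 1] + gray[r][c + 1] - 4 * gray[r][c])
            out[r][c] = 1 if abs(lap) >= threshold else 0
    return out
```

Because the window is expressed as fractions of the face rectangle, a larger detected face yields a proportionally larger window without any extra logic.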

The line drawing in each window is subjected to isolated point removal as needed and then orthogonally projected onto the x-axis and the y-axis (in FIG. 5, arrows indicate the projections and thick lines the projection results) (step 109). The points at both ends of each projection (both ends of each thick line) give the top, bottom, left, and right end points (circles in FIG. 5) of the left and right eyebrows, the left and right eyes, and the mouth, for a total of 20 points.
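The projection step can be illustrated on a small binary window. In this sketch, projecting onto the x-axis amounts to asking which columns contain any edge pixel, and likewise for rows on the y-axis; the function name `projection_endpoints` and the return layout are assumptions of the example.

```python
def projection_endpoints(edges):
    """End points of the orthogonal projections of a binary edge window.

    edges: 2D list of 0/1 values.
    Returns (left, right, top, bottom) in window coordinates,
    or None if the window contains no edge pixel.
    """
    cols = [c for c in range(len(edges[0])) if any(row[c] for row in edges)]
    rows = [r for r, row in enumerate(edges) if any(row)]
    if not cols:
        return None
    return cols[0], cols[-1], rows[0], rows[-1]
```

Running this on the five windows (two eyebrows, two eyes, mouth) gives 5 x 4 = 20 coordinates, matching the 20 feature points in the text.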

In this way, face feature point displacement calculation means 13 first detects the coordinates of the 20 feature points in the frame immediately before the utterance (the reference frame) and in the frame corresponding to each phoneme (the phoneme frames). An end point projected onto the x-axis has only an x coordinate, and an end point projected onto the y-axis has only a y coordinate.

Next, for each end point, face feature point displacement calculation means 13 calculates the difference between its coordinates in the reference frame and in the phoneme frame, that is, the distance moved along the x-axis or y-axis. It normalizes this distance by the size of the face area (the width of the face area for movement along the x-axis, the height for movement along the y-axis) to obtain the feature point displacement (step 110), and sends the result to similarity evaluation means 14. Twenty feature point displacements are calculated for each phoneme frame. One utterance is therefore represented as an array of 20-dimensional vectors, one per phoneme, and this array is sent to similarity evaluation means 14.
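A minimal sketch of step 110 follows, assuming each facial feature is stored as a `(left, right, top, bottom)` tuple of end-point coordinates. The function name and data layout are illustrative, not taken from the patent.

```python
def displacement_vector(ref_points, cur_points, face_w, face_h):
    """Normalized displacement of each end point relative to the reference frame.

    ref_points, cur_points: one (left, right, top, bottom) tuple per feature
    (eyebrows, eyes, mouth). x displacements are divided by the face width,
    y displacements by the face height.
    """
    vec = []
    for (l0, r0, t0, b0), (l1, r1, t1, b1) in zip(ref_points, cur_points):
        vec += [(l1 - l0) / face_w, (r1 - r0) / face_w,
                (t1 - t0) / face_h, (b1 - b0) / face_h]
    return vec
```

With five features the result has 20 elements, one per feature point, so an utterance of n phonemes yields n such vectors.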

In the first embodiment, the 20-dimensional vector for each phoneme, obtained when each individual uttered a specific keyword (consisting of, say, n phonemes) in advance, is registered as a template in predetermined storage means of the authentication system. For each phoneme uttered by the user, similarity evaluation means 14 calculates the inner product of the vector calculated by face feature point displacement calculation means 13 and the corresponding template vector as their similarity (step 111), and then calculates the sum of squares of these inner products over the n phonemes (step 112). This sum of squared inner products serves as the similarity index between the expression during the utterance and the template, and the template that maximizes it is selected. It is then determined whether that value (the similarity) is at least a predetermined value (step 113). If it is, the user is authenticated as the person registered for that template (step 114); otherwise, the user is rejected (step 115).
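Steps 111 to 115 can be sketched as below. The dictionary of templates keyed by a user ID, the function names, and the return value are assumptions made for illustration; the scoring rule (sum over phonemes of the squared inner product) follows the description above.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity(observed, template):
    """Sum over phonemes of the squared inner product (steps 111 and 112)."""
    return sum(dot(o, t) ** 2 for o, t in zip(observed, template))

def authenticate(observed, templates, threshold):
    """Select the best-matching template and apply the threshold (steps 113 to 115).

    templates: dict mapping a user ID to that user's list of per-phoneme vectors.
    Returns (user_id or None, best_score).
    """
    best_id = max(templates, key=lambda uid: similarity(observed, templates[uid]))
    score = similarity(observed, templates[best_id])
    return (best_id if score >= threshold else None), score
```

A raw inner product grows with vector magnitude, so an implementation might normalize the vectors first; the text does not say whether it does.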

Because the first embodiment extracts face images captured while the user is producing the same phonemes as in the template, the face images used for authentication are more reproducible and authentication accuracy can be improved.

The predetermined value is a constant that should be chosen according to the acceptable false authentication rate, that is, the rate at which the authentication device authenticates someone other than the genuine person as that person. The false authentication rate and the threshold are negatively correlated.
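The relation between the threshold and the false authentication rate can be illustrated with a small sketch. It assumes a list of similarity scores obtained from impostor attempts and a set of candidate thresholds; both, like the function names, are hypothetical and not part of the original text.

```python
def false_accept_rate(impostor_scores, threshold):
    """Fraction of impostor attempts whose score reaches the threshold."""
    accepted = sum(1 for s in impostor_scores if s >= threshold)
    return accepted / len(impostor_scores)

def pick_threshold(impostor_scores, candidates, max_rate):
    """Smallest candidate threshold whose false accept rate is within max_rate."""
    ok = [t for t in sorted(candidates)
          if false_accept_rate(impostor_scores, t) <= max_rate]
    return ok[0] if ok else None
```

Raising the threshold can only lower or keep the false accept rate, which is the negative correlation noted above. Choosing the smallest acceptable threshold avoids rejecting genuine users more often than necessary.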

FIG. 6 shows an example device configuration according to the second embodiment of the present invention. The second embodiment does not rely on the utterance of a specific keyword; that is, the content to be uttered is unspecified. It uses utterance designation means 18 as the utterance prompting means 17 of FIG. 1. The rest of the configuration is the same as in FIG. 1.

FIG. 7 shows the overall processing flow of the second embodiment. The description below focuses on the differences from the first embodiment.

In the second embodiment, after face detection means 11 detects a face, utterance designation means 18, which is a kind of utterance prompting means, tells the user which words to utter, for example, "Please say 'konnichiwa' (hello)" (step 204).

Voice acquisition means 15 acquires the user's utterance and sends the voice information to voice division means 16. In the second embodiment, utterance designation means 18 also passes the words the user is to utter to voice division means 16, so the spectra to be referenced as templates are determined from that input, and again the time-axis warping that gives the best match is found.

As in the first embodiment, frame extraction means 12 extracts a frame for each phoneme obtained by voice division means 16 (step 207), and face feature point displacement calculation means 13 calculates the feature point displacement (the feature point displacement vector) relative to the reference frame (step 211) and sends it to similarity evaluation means 14. In the second embodiment, 20-dimensional vectors for all phonemes are registered in advance for each user as templates in predetermined storage means of the authentication system. Similarity evaluation means 14 calculates the inner product of the template vector corresponding to each phoneme of the designated words and the vector calculated by face feature point displacement calculation means 13, and then calculates and evaluates the similarity in the same way as in the first embodiment (steps 212 to 216).

The first and second embodiments perform authentication using the sum of squares of the per-frame similarities. The third embodiment instead forms a single vector of dimension 20 × (number of phonemes) from the elements of the feature point displacement vectors of all phoneme frames, computes its inner product with a template vector of the same dimension registered in the device in advance, and uses that inner product to calculate and evaluate the similarity. If the content to be uttered is a specific keyword, the template is already registered; if it is unspecified, a single vector can be generated from the registered per-phoneme feature vectors. FIG. 8 shows the overall processing flow of the third embodiment. Apart from steps 310, 311, and 312, the overall flow is the same as in the first embodiment.
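The single-vector comparison of the third embodiment can be sketched as follows. The function name `concat_similarity` and the list-of-lists input layout are assumptions of this example.

```python
def concat_similarity(observed, template):
    """Inner product of the concatenated per-phoneme displacement vectors.

    observed, template: lists of per-phoneme vectors; each list is flattened
    into one vector of dimension 20 x (number of phonemes) before comparison.
    """
    flat_o = [x for frame in observed for x in frame]
    flat_t = [x for frame in template for x in frame]
    if len(flat_o) != len(flat_t):
        raise ValueError("observed and template dimensions differ")
    return sum(a * b for a, b in zip(flat_o, flat_t))
```

The inner product of the concatenated vectors equals the plain sum of the per-phoneme inner products, so it differs from the earlier embodiments in that the per-phoneme terms are not squared.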

The fourth embodiment uses the known framework of recognizing time-series vector data with a hidden Markov model (HMM). In this example, the user is assumed to utter a specific keyword.

FIG. 9 is a conceptual diagram of the HMM. The HMM in this example has n states S, equal in number to the n phonemes that make up the keyword. Each state S has probabilities a of transitioning to other states, and state transitions occur probabilistically as time advances. In each state, the HMM outputs an observable symbol O. In this example, the symbols O are feature point displacement vectors, and there are as many of them as phonemes: O_1, O_2, ..., O_n. In FIG. 9, a_ij denotes the probability of a transition from state i to state j, and b_i(t) is the probability of outputting O_t in state i. The HMM also has a probability distribution over which of S_1 to S_n is its initial state.

To use HMMs for recognition or authentication, an HMM must first be trained for each user of the authentication system. In this example, as many HMMs as there are users (m) are prepared, and each HMM is trained for one user in a one-to-one correspondence. FIG. 10 shows the flow of HMM training.

One HMM is prepared per user (step 400). Each user is first made to utter the keyword (step 401), yielding the feature point displacement vectors O_1, O_2, ..., O_n used for training (step 402). Then the three HMM parameters, namely a_ij, b_i(t), and the initial-state probability distribution, are estimated so that the HMM is likely to generate the training sequence O_1, O_2, ..., O_n (step 403). Here, the known Baum-Welch parameter estimation algorithm (the Forward-Backward algorithm) is applied repeatedly until the change in likelihood becomes small enough to be regarded as convergence. This yields HMM parameters that are likely to generate the training displacements O_1, O_2, ..., O_n. The procedure is repeated for all users, at which point training ends (step 404).

FIG. 11 shows an example of authentication using the HMMs. At authentication time, after the user speaks (step 501), face feature point displacement calculation means 13 calculates the sequence of feature point displacement vectors from the phoneme frames (step 502), and similarity evaluation means 14 calculates the probability that each of the m HMMs outputs that sequence (step 503). These probabilities are computed recursively by the known Forward algorithm. Of the m probabilities obtained, the owner of the training data used for the HMM that gives the highest probability is the candidate identity for authentication.

Similarity evaluation means 14 therefore determines whether the largest of the m probabilities is at least a predetermined threshold (step 504). If it is, the user under test is authenticated as the person associated with the HMM that gave the largest probability (step 505). Otherwise, authentication is refused (step 506). Here, the predetermined value is a constant that should be chosen according to the acceptable false authentication rate, that is, the rate at which the authentication system authenticates someone other than the genuine person as that person. The false authentication rate and the threshold are negatively correlated. Because the HMMs are trained on real data, HMM-based authentication can improve authentication accuracy.
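Steps 503 to 506 can be sketched with the Forward recursion. This is an illustration under assumptions: each user's model is given as a tuple `(pi, A, B)`, where `B[i][t]` is the emission probability b_i(t) already evaluated for the observed vector O_t (how b_i(t) is modelled for continuous vectors is not specified in the text), and the function names are hypothetical.

```python
def forward_probability(pi, A, B):
    """Probability that an HMM outputs the observed sequence (Forward algorithm).

    pi[i]   : probability that the initial state is i
    A[i][j] : probability a_ij of a transition from state i to state j
    B[i][t] : probability b_i(t) that state i outputs the t-th observation
    """
    n = len(pi)
    T = len(B[0])
    alpha = [pi[i] * B[i][0] for i in range(n)]
    for t in range(1, T):
        # The comprehension reads the previous alpha before it is rebound.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][t]
                 for j in range(n)]
    return sum(alpha)

def authenticate_hmm(models, threshold):
    """Pick the user whose HMM gives the highest probability, then threshold it.

    models: dict mapping a user ID to that user's (pi, A, B) tuple.
    Returns (user_id or None, best_probability).
    """
    scores = {uid: forward_probability(*m) for uid, m in models.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores[best]
```

For longer sequences the probabilities become very small, so a practical implementation would typically work with log probabilities or scaled alphas; that refinement is omitted here.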

In the example above, the HMMs are built from the sequence of feature point displacements of the phoneme frames relative to the reference frame. Instead of the displacement from the reference frame, the HMMs may be built from the sequence of feature point changes relative to the immediately preceding phoneme frame. FIG. 12 shows the authentication flow in that case. Only step 602 differs from FIG. 11; everything else is the same as in FIG. 11.
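The variant using changes between consecutive frames can be sketched as below. It assumes that the first phoneme frame is compared with the reference frame, which the text does not state explicitly, and it omits the normalization by face size for brevity; the function name is hypothetical.

```python
def consecutive_changes(points_per_frame):
    """Change of each feature coordinate relative to the preceding frame.

    points_per_frame[0] is taken to be the reference frame (an assumption);
    the remaining entries are the phoneme frames in utterance order.
    """
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(points_per_frame, points_per_frame[1:])]
```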

It goes without saying that some or all of the processing functions of the means in the devices shown in FIGS. 1 and 6 can be implemented as a computer program and executed on a computer to realize the present invention, and that the processing procedures shown in FIGS. 2, 7, 8, and 10 to 12 can likewise be implemented as a computer program and executed by a computer. A program for realizing those processing functions on a computer, or for causing a computer to execute those processing procedures, can be recorded on a computer-readable recording medium, such as an FD, MO, ROM, memory card, CD, DVD, or removable disk, for storage or distribution, and can also be distributed over a network such as the Internet.

FIG. 1 is a diagram showing an example of the apparatus configuration of Example 1 of the present invention.
FIG. 2 is a diagram showing the flow of processing in Example 1.
FIG. 3 is a diagram showing an example of face detection.
FIG. 4 is a diagram showing the setting of windows on a face region.
FIG. 5 is a diagram showing an example of converting the image in a window to a line drawing and detecting end points by orthogonal projection.
FIG. 6 is a diagram showing an example of the apparatus configuration of Example 2 of the present invention.
FIG. 7 is a diagram showing the flow of processing in Example 2.
FIG. 8 is a diagram showing the flow of processing in Example 3.
FIG. 9 is a conceptual diagram of the hidden Markov model (HMM) used in Example 4 of the present invention.
FIG. 10 is a diagram showing the flow of the HMM learning process.
FIG. 11 is a diagram showing the flow of the authentication process using the HMM (part 1).
FIG. 12 is a diagram showing the flow of the authentication process using the HMM (part 2).

DESCRIPTION OF SYMBOLS
10 Moving image capturing means
11 Face detection means
12 Frame extraction means
13 Face feature point displacement calculation means
14 Similarity evaluation means
15 Audio acquisition means
16 Audio division means
17 Utterance prompting means
18 Utterance designation means

Claims

In a personal authentication device that performs person authentication based on time-series face images at the time of speech,
Means for acquiring a moving image;
Means for acquiring audio;
Means for dividing the audio into phonemes and obtaining the center time of the section in which each phoneme is pronounced;
Means for extracting, from the moving image, a frame image that constitutes a face image immediately before the utterance (hereinafter referred to as a reference frame) and a plurality of frame images that constitute face images at the center time of the section in which each phoneme of the utterance is pronounced (hereinafter referred to as phoneme frames);
Means for extracting feature points of the face image of the reference frame and each phoneme frame, and calculating a displacement amount of the feature point with respect to the reference frame (hereinafter referred to as feature point displacement amount) for each phoneme frame;
Means for generating a single vector by combining the feature point displacements of all phoneme frames, calculating the similarity between the vector and a pre-registered template vector, and performing person authentication;
A personal authentication device characterized by comprising:

The personal authentication apparatus according to claim 1, further comprising means for prompting speech when a moving image includes a face image.

The personal authentication device according to claim 1, wherein the utterance target is unspecified.

The personal authentication apparatus according to claim 1, wherein the utterance target is a specific keyword.

In a personal authentication method for performing person authentication based on time-series face images at the time of speech,
Acquiring a moving image;
Acquiring audio;
Dividing the audio into phonemes and obtaining the center time of the section in which each phoneme is pronounced;
Extracting, from the moving image, a frame image that constitutes a face image immediately before the utterance (hereinafter referred to as a reference frame) and a plurality of frame images that constitute face images at the center time of the section in which each phoneme of the utterance is pronounced (hereinafter referred to as phoneme frames);
Extracting a feature point of the face image of the reference frame and each phoneme frame, and calculating a displacement amount of the feature point with respect to the reference frame (hereinafter referred to as a feature point displacement amount) for each phoneme frame;
Collecting the feature point displacements of all phoneme frames to generate one vector, calculating the similarity between the vector and a pre-registered template vector, and performing person authentication;
A personal authentication method characterized by comprising:

6. The personal authentication processing method according to claim 5, further comprising a step of prompting speech when a moving image includes a face image.

7. The personal authentication processing method according to claim 5, wherein the utterance target is unspecified.

8. The personal authentication processing method according to claim 5, wherein the utterance target is a specific keyword.

A program for causing a computer to execute the personal authentication processing method according to any one of claims 5 to 8.

A recording medium storing a program for causing a computer to execute the personal authentication processing method according to claim 5.