JP5153434B2

JP5153434B2 - Information processing apparatus and information processing method

Info

Publication number: JP5153434B2
Application number: JP2008111843A
Authority: JP
Inventors: 崇士鈴木; 克彦森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2008-04-22
Filing date: 2008-04-22
Publication date: 2013-02-27
Anticipated expiration: 2028-04-22
Also published as: JP2009265774A

Description

The present invention relates to an information processing apparatus and an information processing method.

Personal recognition from face images is attracting attention as a recognition technique that, compared with fingerprint or vein recognition, works without contact and does not give users the negative impression associated with, for example, criminal investigation. In face recognition, feature amounts are extracted in advance from face images of the persons to be recognized, and recognition is performed based on the dictionary data generated from them. The discrimination capability therefore depends on the illumination conditions and pose variation of the face images in the dictionary data.

Because face recognition also works when the person to be recognized is at a sufficient distance, it is expected to enable systems that recognize several people simultaneously in an open environment. Here, an open environment means a general environment that imposes no constraints on the person to be recognized, such as a specified standing position, directly facing the imaging device, optimized lighting conditions, or a specified facial expression. In other words, face recognition must be robust against variations in size, pose, illumination, and facial expression.

For these reasons, face recognition techniques introduce a mechanism that tolerates variation by acquiring images with various variations when the dictionary data is created (see Non-Patent Document 1). More specifically, to build dictionary data that tolerates pose variation, for example, multiple images of the registrant in various poses are acquired. Expression variation and illumination variation are handled in the same way.

Patent Document 1 discloses a technique in which knowledge for removing illumination variation is acquired in advance by learning from a plurality of arbitrary images. Based on that learned knowledge of illumination variation, feature amounts unaffected by illumination variation are extracted from the input image and the dictionary image and are then compared and matched.

Patent Document 2 discloses a technique for preventing accuracy from becoming unstable because of a difference in face resolution between the input image and the dictionary image. More specifically, the technique normalizes the images with reference to, for example, the center positions of the eyes so that the resolutions are the same, thereby stabilizing the accuracy.

Patent Document 3 discloses a technique that extracts feature amounts related to pose variation and expression variation from a series of images, determines whether those feature amounts are suitable for recognition processing, and performs recognition using the image frames determined to be suitable. A frame determined to be suitable is an image in which the subject directly faces the imaging device, or in which the eyes are not open in an unnatural manner.

Patent Document 1: JP 2004-145576 A
Patent Document 2: JP 2005-084979 A
Patent Document 3: JP H06-259534 A
Non-Patent Document 1: "Face Recognition by Computer", IEICE Transactions D-2, 1997, Vol. J80, No. 8, pp. 2031-2046

Patent Document 1 aims to stabilize recognition accuracy against variation by using a means for acquiring knowledge of variation patterns by learning and a means for normalizing the variation. However, acquiring knowledge of variation patterns exhaustively by learning requires a large amount of data and is difficult in practice.

Patent Document 2 likewise aims to stabilize recognition accuracy against variation by using a means for normalizing the variation. However, variation patterns such as pose (in a two-dimensional image) or facial expression are difficult to normalize, and recognition accuracy degrades depending on the accuracy of the normalization processing.

Patent Document 3 extracts, from a series of images, a plurality of valid frames with small variation and performs recognition using the extracted frames. It rests on the premise that a plurality of valid frames with small variation can be obtained from the series of images. In a face recognition system operating in an open environment, however, that premise does not necessarily hold. Patent Document 3 is therefore insufficient for realizing a face recognition system in a general environment.

Thus, the conventional techniques described above do not provide sufficient countermeasures against variation patterns such as pose variation or expression variation in an open imaging environment.

The present invention has been made in view of these problems, and its object is to realize accurate and stable object recognition even when an image containing variation is input.

To that end, the present invention is characterized by comprising: receiving means for receiving time-series images including an object; feature point extraction means for extracting a plurality of feature points relating to the object from each image of the time-series images; expression determination means for determining the expression of the object based on the plurality of feature points extracted by the feature point extraction means; matching means for, when the expression determination means determines that the object has an expression, setting a plurality of regions in accordance with the expression based on the coordinate values of the plurality of feature points extracted by the feature point extraction means, generating a feature vector including arrangement information or shape information of the set regions, matching the feature vector against dictionary data relating to objects, and calculating a reliability of the matching result; cumulative value calculation means for accumulating the reliability calculated by the matching means for each image over a plurality of images of the time-series images, thereby calculating a cumulative value relating to the reliability of the matching result for the object; and output determination means for determining, based on the cumulative value calculated by the cumulative value calculation means, whether to output the matching result of the matching means.

According to the present invention, accurate and stable object recognition can be realized even when an image containing variation is input.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

<First Embodiment>
FIG. 1 is a diagram illustrating an example of the hardware configuration of an object recognition apparatus, which is an example of an information processing apparatus (computer). As illustrated in FIG. 1, the object recognition apparatus includes an imaging unit 100, a control unit 101, a face detection unit 102, a feature extraction unit 103, a face matching unit 104, and a dictionary data DB unit 105. The object recognition apparatus further includes a cumulative reliability calculation unit 106, an output determination unit 107, a storage unit 108, and a display unit 109.

The imaging unit 100 captures images based on a control signal from the control unit 101 and thereby acquires (or receives) input image data, that is, time-series images including an object. The control unit 101 performs overall control and is connected to the imaging unit 100, the face detection unit 102, the feature extraction unit 103, the face matching unit 104, the cumulative reliability calculation unit 106, the output determination unit 107, the storage unit 108, and the display unit 109. The imaging unit 100 may instead be externally connected to the object recognition apparatus, in which case the object recognition apparatus receives the input image data from the imaging unit 100.

The face detection unit 102 extracts a face region from the image data acquired by the imaging unit 100. It detects the position, number, size, and orientation of faces, cuts out an image around each face region (face region image), and outputs the cut-out face region image to the feature extraction unit 103. The processing of the face detection unit 102 is described later. The feature extraction unit 103 extracts the coordinate values of facial parts such as the eyes, nose, and mouth from the face region image cut out by the face detection unit 102. The face matching unit 104 performs matching based on the information extracted by the feature extraction unit 103 and the dictionary data in the dictionary data DB unit 105, and determines whose face has been detected. Based on the matching results of the face matching unit 104, the cumulative reliability calculation unit 106 calculates, over a series of images, a cumulative value of a quantity such as the matching similarity (cumulative value calculation). The output determination unit 107 determines whether to output the matching result by comparing the cumulative value with a predetermined threshold. The storage unit 108 records the acquired input image data, intermediate output values of each component, and the like. The display unit 109 is a CRT, an LCD, or the like, and displays the input image captured by the imaging unit 100 or various calculation results.
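The roles of the cumulative reliability calculation unit 106 and the output determination unit 107 can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the class name `CumulativeReliability`, the per-candidate dictionary of running totals, and the threshold value are assumptions made for the example. The text only states that a value such as the matching similarity is accumulated over a series of images and compared with a predetermined threshold.

```python
class CumulativeReliability:
    """Hypothetical helper: accumulates per-frame matching scores per candidate."""

    def __init__(self, threshold):
        self.threshold = threshold  # predetermined threshold (the value is an assumption)
        self.totals = {}            # candidate id -> accumulated similarity

    def reset(self):
        # Initialize the cumulative reliability before a new image series.
        self.totals.clear()

    def update(self, candidate, similarity):
        # Add this frame's matching similarity to the candidate's running total.
        self.totals[candidate] = self.totals.get(candidate, 0.0) + similarity
        return self.totals[candidate]

    def decide(self):
        # Return the best candidate once its total reaches the threshold, else None.
        if not self.totals:
            return None
        best = max(self.totals, key=self.totals.get)
        return best if self.totals[best] >= self.threshold else None


# Example: four frames of (hypothetical) matching results.
acc = CumulativeReliability(threshold=2.0)
frame_results = [("person_a", 0.9), ("person_b", 0.4), ("person_a", 0.8), ("person_a", 0.7)]
decisions = []
for candidate, similarity in frame_results:
    acc.update(candidate, similarity)
    decisions.append(acc.decide())
```

In this sketch no result is output for the first three frames; only after the fourth frame does the accumulated value for one candidate reach the threshold. Deferring the output in this way is what allows a single poorly matched frame to be tolerated.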

FIG. 2 is a flowchart illustrating an example of the overall processing of the object recognition apparatus according to the first embodiment.
In step S200, the cumulative reliability calculation unit 106 initializes the cumulative reliability. In step S201, the face detection unit 102 receives image data including a face from the imaging unit 100. Next, in step S202, the face detection unit 102 performs processing for detecting a face in the input image data (face detection processing). Details of the face detection processing are described later.

As a face detection method, the face detection unit 102 may be configured to receive information on a face region that the user designates with a pointing device or the like. The face detection unit 102 may also detect faces using a filter such as a face detection template. In this embodiment, however, the face detection unit 102 performs face detection using a neural-network-based face detection technique. An example of the face detection processing using a neural network in the face detection unit 102 of this embodiment is described with reference to FIG. 3.

FIG. 3 is a diagram for explaining an example of the face detection processing in the face detection unit 102 using a neural network. As shown in FIG. 3, the neural network used for face detection in this embodiment has a hierarchical structure and recognizes features in sequence, from low-order features to high-order features. That is, at the first hierarchical level 301, the detection layer extracts primitive features (for example, edges) from the input image data 300, and the results are combined in the integration layer; using the integrated results, the second hierarchical level 302 detects higher-order features (for example, the edges that make up the eyes and mouth). Similarly, the third hierarchical level 303 detects higher-order features using the integration results of the second hierarchical level 302. Finally, the fourth hierarchical level 304 detects the face using the integration results of the third hierarchical level 303.

Next, the features detected by the face detection processing are described with reference to FIG. 4. FIG. 4 is a diagram for explaining the features detected by the face detection processing.
The human face used to explain the face detection processing is the model face of FIG. 4(a). Hereinafter, descriptions concerning the face refer to the model face of FIG. 4(a).
In this embodiment, in the course of the face detection processing, the face detection unit 102 acquires the firing distributions of neurons in the vicinity of features: the inner and outer corners of both eyes, both ends of the mouth, the eyes, and the mouth. These are called intermediate output distributions or detection output distributions. The first hierarchical level 301 detects the lowest-order features that preserve the facial features (the face detection neural network used in this embodiment has eight detection units, the first through the eighth, at the first hierarchical level). Feature detection at the first hierarchical level 301 may be at the level of, for example, extracting luminance changes or line segment directions.

Next, the second hierarchical level 302 outputs distributions such as the right-open V-shaped edge detection output distribution of FIG. 4(b), the left-open V-shaped edge detection output distribution of FIG. 4(c), the line segment edge 1 detection output distribution of FIG. 4(d), and the line segment edge 2 detection output distribution of FIG. 4(e). In the right-open V-shaped edge detection output distribution of FIG. 4(b), the black regions show the detection output distribution of the V-shaped edges and the gray line segments show the facial parts; the figure shows the result of detecting the outer corner of the left eye, the inner corner of the right eye, the left ends of the eyebrows, and the left end of the mouth. Likewise, in 402 to 407 the detection output distributions are shown as black regions and the facial features as gray line segments. The V-shaped edge features are thus effective for detecting the left and right end features 408 and 409 of the mouth, the outer eye corner features 410 and 411 of both eyes, the inner eye corner features 412 and 413 of both eyes, and the ends of the eyebrows 418. The line segment edge 1 and line segment edge 2 features are effective for detecting the upper and lower eyelids 414 and 415 and the upper and lower lips 416 and 417.

Next, the third hierarchical level 303 receives the feature detection results of the second hierarchical level 302 and outputs the eye detection output distribution of FIG. 4(f) and the mouth detection output distribution 406 of FIG. 4(g). Finally, the fourth hierarchical level 304 outputs the face detection output distribution of FIG. 4(h) from the eye and mouth detection results of the third hierarchical level 303. Receptive field structures suited to each face size and each amount of rotation are prepared for detecting faces in the face detection layer of the fourth hierarchical level 304. When the face detection processing finds that a face is present, the face detection unit 102 can obtain face data such as the size and orientation of the face from which receptive field structure detected it.
The above is the series of detection output distributions generated in step S202 using the face detection neural network.
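One way to read the description of the fourth hierarchical level is that each receptive field structure is tied to a face size and a rotation amount, and the structure that responds determines the face data. The sketch below assumes, purely for illustration, that each structure yields a single scalar response and that the strongest response above a detection threshold is chosen. The function name, the dictionary layout, and the threshold are hypothetical and are not taken from the patent.

```python
def face_data_from_responses(responses, detection_threshold=0.5):
    """Pick the receptive field structure with the strongest response.

    responses maps (size_px, rotation_deg) -> peak response of that structure.
    Both the key layout and the threshold are assumptions for this sketch.
    """
    if not responses:
        return None
    key = max(responses, key=responses.get)
    if responses[key] < detection_threshold:
        return None  # no receptive field structure reported a face
    size_px, rotation_deg = key
    return {"size": size_px, "rotation": rotation_deg, "score": responses[key]}


# Example: four hypothetical receptive field structures.
responses = {(32, 0): 0.20, (64, 0): 0.85, (64, 30): 0.40, (128, 0): 0.10}
face = face_data_from_responses(responses)
```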

Note that the face detection method in step S201 is not limited to the method described above; a method such as EigenFace may be used, for example.
Further, in step S202, the face detection unit 102 cuts out a face region image around the face region using the face size acquired in the face detection processing. If the determination in step S203 is OK, the face detection unit 102 outputs the cut-out face region image to the feature extraction unit 103.

In step S204, the feature extraction unit 103 extracts feature points of the parts that make up the face, such as the eyes, mouth, and nose, and calculates their coordinate values. The feature points of the eyes, mouth, and nose may be extracted by applying a technique such as template matching, in which templates of facial parts are scanned over the image; in this embodiment, however, the extraction uses the eye detection output distribution of FIG. 4(f) and the mouth detection output distribution of FIG. 4(g). More specifically, the feature extraction unit 103 calculates the centroid of each detection output distribution and takes the coordinate values of each centroid as the coordinate values of the feature point. The feature extraction unit 103 may also binarize each detection output distribution with a predetermined threshold and calculate the centroid of the binarized distribution.
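The centroid calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the detection output distribution is represented as a list of rows of non-negative numbers, coordinates are (x, y) = (column, row), and the function name `centroid` is hypothetical. Passing a threshold corresponds to the optional binarization mentioned in the text.

```python
def centroid(distribution, threshold=None):
    """Return the (x, y) centroid of a 2-D detection output distribution.

    If threshold is given, the distribution is binarized first: cells at or
    above the threshold count as 1, all others as 0.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(distribution):
        for x, value in enumerate(row):
            if threshold is None:
                weight = float(value)
            else:
                weight = 1.0 if value >= threshold else 0.0
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0:
        return None  # nothing detected, so no feature point can be placed
    return (sum_x / total, sum_y / total)


# Example: a small distribution with two active cells of different strength.
eye_distribution = [
    [0, 0, 0],
    [0, 1, 3],
    [0, 0, 0],
]
weighted_point = centroid(eye_distribution)                 # pulled toward the stronger cell
binary_point = centroid(eye_distribution, threshold=0.5)    # both cells count equally
```

The weighted centroid follows the strongest responses, whereas the binarized variant depends only on where the distribution exceeds the threshold, which makes it less sensitive to the absolute response values.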

FIG. 5 is a diagram illustrating an example of the feature points extracted by the feature extraction unit 103. In FIG. 5, the black dots on parts such as the eyes, mouth, and nose are the feature points of the respective parts. The extracted feature points are referred to as the right eye feature point 500, the left eye feature point 501, the nose feature point 502, and the mouth feature point 503.

Next, in step S205, the feature extraction unit 103 checks the arrangement of the feature points (their coordinate values) extracted in step S204. If the coordinate values of the extracted feature points are inappropriate for describing the target object, the feature extraction unit 103 returns the processing to step S201. More specifically, in this embodiment the feature extraction unit 103 judges the coordinate values to be inappropriate when the positions of both eyes are below the center of the face region extracted in the face detection step S202. This condition does not hold, however, if upside-down face images are accepted as input; that depends on the scenario in which the system is used. In any case, the consistency of the features must be checked against an arrangement rule appropriate to the type of object to be recognized. That is, in step S205 the feature extraction unit 103 determines whether the arrangement of the feature point coordinate values satisfies the arrangement rule, based on an arrangement rule (arrangement rule file) stored in, for example, the storage unit 108.
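The single arrangement rule given in the text can be sketched as a small predicate. The sketch assumes image coordinates in which y increases downward, so "below the center" means a larger y value, and it reads the rule as rejecting the frame when both eyes lie below the center. The function name and the tuple layout are hypothetical; a real arrangement rule file would presumably hold further rules that the text does not specify.

```python
def eyes_plausible(right_eye, left_eye, face_center):
    """Check the eye positions against the center of the face region.

    Points are (x, y) tuples in image coordinates with y increasing downward
    (an assumption of this sketch). Returns False when both eyes lie below
    the face center, which the text treats as an inappropriate arrangement.
    """
    center_y = face_center[1]
    both_below = right_eye[1] > center_y and left_eye[1] > center_y
    return not both_below


# Example: eyes above the center are accepted, eyes below it are rejected.
upright = eyes_plausible((40, 45), (80, 44), (60, 60))
inverted = eyes_plausible((40, 75), (80, 74), (60, 60))
```

When the predicate returns False, the flow described above would discard the frame and return to receiving the next image.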

Next, in step S206, the face matching unit 104 normalizes the size and rotation of the input image data. For example, as shown in FIG. 6, the face matching unit 104 applies an affine transformation so that the inter-eye distance 600, calculated from the coordinate values 602 and 603 of the feature points of the two eyes extracted in step S204, is the same for all images. FIG. 6 is a diagram for explaining the normalization of size and rotation variation. The face matching unit 104 also detects the slope 601 of the straight line connecting the two eyes and adds a correction for rotation to the affine transformation. In this way, the face matching unit 104 normalizes size variation and in-plane rotation variation.
Next, in step S207, the face matching unit 104 matches the input image data (or the face of a person, which is an example of an object included in the input image data) against the dictionary data in the dictionary data DB unit 105.
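The size and rotation normalization of step S206 can be illustrated with a similarity transform (a special case of an affine transform) computed from the two eye feature points. The sketch below is an assumption-laden illustration rather than the patent's procedure: the function names, the choice to keep the midpoint between the eyes fixed, and the target distance are all hypothetical. It only shows that one 2x3 matrix can both equalize the inter-eye distance and level the line connecting the eyes.

```python
import math


def eye_normalization_matrix(eye_a, eye_b, target_distance):
    """Return a 2x3 matrix that levels the eye line and rescales it.

    The transform is p' = s * R(-angle) * (p - m) + m, where m is the midpoint
    between the eyes, angle is the slope of the eye line, and s scales the
    inter-eye distance to target_distance.
    """
    dx = eye_b[0] - eye_a[0]
    dy = eye_b[1] - eye_a[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        raise ValueError("eye feature points coincide")
    scale = target_distance / distance
    angle = math.atan2(dy, dx)
    c = scale * math.cos(angle)
    s = scale * math.sin(angle)
    mx = (eye_a[0] + eye_b[0]) / 2.0
    my = (eye_a[1] + eye_b[1]) / 2.0
    return [
        [c, s, mx - c * mx - s * my],
        [-s, c, my + s * mx - c * my],
    ]


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


# Example: tilted eyes 50 pixels apart are mapped to a level line 100 pixels apart.
eye_a, eye_b = (30.0, 40.0), (70.0, 70.0)
m = eye_normalization_matrix(eye_a, eye_b, target_distance=100.0)
new_a = apply_affine(m, eye_a)
new_b = apply_affine(m, eye_b)
```

In practice the same matrix would be used to resample the whole face image, so that every input and dictionary image shares the same eye positions before matching.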

Details of the processing in step S207 are described with reference to FIG. 7. FIG. 7 is a flowchart illustrating an example of the matching processing.
First, in step S701, the face matching unit 104 acquires detection output distributions such as those shown in FIG. 4 from the feature extraction unit 103 and sets regions in the input image data according to the detection output distributions.
The regions set by the face matching unit 104 include local regions and global regions. The local regions and the global regions are described below.

The information that indicates individual differences lies in the shapes of features such as the eyes, nose, and mouth and in the arrangement of those features. Local regions are set in order to extract the shape information of features that indicate individual differences, such as the eyes, nose, and mouth. FIG. 8 is a diagram (part 1) illustrating an example of the local regions set by the face matching unit 104.

In FIG. 8, the local regions 800 of the eye region are set near the left and right end points of both eyes and near the apexes of the upper and lower eyelids. The local regions 801 of the mouth region are set near both end points of the mouth and near the apexes of the upper and lower lips. The local regions 802 around the nose are set near the apex of the nose and near the feature points of the left and right nostril wings. The information contained in these local regions, for example luminance information, reflects the shapes of features such as the eyes, nose, and mouth. For instance, the upper eyelid line and the lower eyelid line of a person with large eyes tend to intersect at a more obtuse angle than those of a person with narrow eyes. The information in the local regions around the eye end points therefore reflects the shape of the eye. Likewise, the local regions near the outer corners of the eyes reflect shape characteristics such as long, narrow eyes.

The face matching unit 104 can therefore extract eye shape information from the plurality of local regions placed at both end points of the eyes and at the apexes of the upper and lower eyelids. Similarly, for the mouth and the nose, the face matching unit 104 can extract the respective shape information from the plurality of local regions set for each. The placement of the local regions is not limited to that shown in FIG. 8; local regions may also be set on the eyebrows and elsewhere. As shown in FIG. 9, the face matching unit 104 may also set a right eye local region 900, a left eye local region 901, a nose local region 902, and a mouth local region 903. With this setting, however, the object recognition apparatus becomes sensitive to pose variation and to shape variation of the eyes, mouth, and so on. FIG. 9 is a diagram (part 2) illustrating an example of the local regions set by the face matching unit 104.
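Setting a local region around a feature point and reading out its luminance values can be sketched as follows. The text does not specify the shape or size of a local region, so the square window, the `half_size` parameter, the clipping at the image border, and the flattening of the patch into a list are all assumptions of this sketch; the function names are hypothetical.

```python
def local_region(point, half_size, image_width, image_height):
    """Return a square window (left, top, right, bottom) centered on a feature point.

    The window is clipped to the image; right and bottom are exclusive.
    """
    x = int(round(point[0]))
    y = int(round(point[1]))
    left = max(0, x - half_size)
    top = max(0, y - half_size)
    right = min(image_width, x + half_size + 1)
    bottom = min(image_height, y + half_size + 1)
    return (left, top, right, bottom)


def region_vector(image, box):
    """Flatten the luminance values inside a window into a list, row by row."""
    left, top, right, bottom = box
    return [image[row][col] for row in range(top, bottom) for col in range(left, right)]


# Example: a 6x6 synthetic luminance image whose value encodes (row, column).
image = [[row * 10 + col for col in range(6)] for row in range(6)]
box = local_region((2, 3), half_size=1, image_width=6, image_height=6)
patch = region_vector(image, box)
corner_patch = region_vector(image, local_region((0, 0), 1, 6, 6))
```

Concatenating such patches from several feature points would give one possible form of a shape feature vector; how the patent actually combines them is described elsewhere in the specification.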

Next, the global regions are described.
A global region contains a plurality of features and is set in order to acquire arrangement information between features such as the eyes, nose, and mouth. FIG. 10 is a diagram (part 1) illustrating an example of the global regions set by the face matching unit 104. Because the global region 1000 between the eyes contains both eyes, the face matching unit 104 can extract from it information on how far apart the eyes are, using an image normalized by a predetermined distance between feature points. Because the eye-to-mouth global region 1001 is set based on the eye feature points and the mouth feature point, the face matching unit 104 can likewise extract information on how far apart the eyes and mouth are, using the normalized image. The placement of the global regions is not limited to that of FIG. 10. As shown in FIG. 11, the face matching unit 104 may set a global region 1101 that contains the entire face, or a global region 1103 that contains the contour of the jaw in addition to features such as the eyes and mouth. FIG. 11 is a diagram (part 2) illustrating an example of the global regions set by the face matching unit 104.

In the detection output distribution acquisition processing of step S701, the face matching unit 104 acquires shape information and/or arrangement information from the local regions and/or the global regions.
The face matching unit 104 sets the local regions from which shape information is acquired based on one edge-extraction-like detection output distribution that preserves feature shapes, chosen from the eight detection output distributions of the first hierarchical level shown in FIG. 3.

FIG. 12 is a diagram illustrating an example of an edge-extraction-like detection output distribution. An example of such a distribution is the solid black lines in (a) of FIG. 12, which reflect the contours of the eyes and the mouth. (b) of FIG. 12 shows a plurality of local regions applied to the detection output distribution of (a) of FIG. 12. The local regions in (b) of FIG. 12 are set based on the right-eye feature point 1201, the left-eye feature point 1202, the mouth feature point 1203, and the nose feature point 1204.

On the other hand, the global region 1000 between the eyes in FIG. 10, which is used to acquire arrangement information, is set based on either or both of the right-opening V-shaped edge detection output distribution of FIG. 4(b) and the left-opening V-shaped edge detection output distribution of FIG. 4(c).

FIG. 13 is a diagram illustrating an example of setting the global region between the eyes. FIGS. 4(b) and 4(c) are the results of detecting V-shaped edges, which corresponds to detecting both end points of the eyes or both end points of the mouth. Therefore, as shown in FIG. 13, the face matching unit 104 can acquire the arrangement information of the eyes by setting the global region 1303 between the eyes with its origin at the inter-eye midpoint 1301, which is the midpoint between the right-eye feature point and the left-eye feature point. The detection output distribution of FIG. 13(a) corresponds to the superposition of the left and right V-shaped edge detection output distributions. In this way, the face matching unit 104 can acquire the width information of each eye.

By the same reasoning, the eye-mouth global region 1001 used to acquire arrangement information is set based on either or both of the line-segment edge 1 detection output distribution of FIG. 4(d) and the line-segment edge 2 detection output distribution 404 of FIG. 4(e).

FIG. 14 is a diagram illustrating an example of setting the eye-mouth global region. FIG. 4(d) and FIG. 4(e) are the results of detecting line segments, which corresponds to detecting the upper and lower eyelids or the upper and lower lips. Therefore, as shown in FIG. 14, the face matching unit 104 can acquire the eye-mouth arrangement information by setting the eye-mouth global region 1404 with the inter-eye midpoint 1401 and the mouth feature point 1402 as origins.
As described above, the face matching unit 104 sets regions according to the detection output distribution.

Next, the high-dimensional feature vector acquisition process of step S702 is described. The data used to obtain the high-dimensional feature vector is generated from the regions that the face matching unit 104 applied to each detection output distribution as described above. FIG. 15 is a diagram illustrating the definition of the high-dimensional feature vector. FIG. 15 shows, as an example, how the feature vector of the local region 1500 is generated, together with the definition of the high-order feature vector. As shown in FIG. 15, the high-order feature vector F is generated by concatenating into one vector the one-dimensional vector data f_k generated from the local regions and/or the global regions. Here, the face matching unit 104 generates the one-dimensional vector data f_k by scanning the two-dimensional array of detection output values f_i within the local region 1500 in a predetermined direction.
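The scan-and-concatenate construction described for FIG. 15 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each region is an axis-aligned rectangle given as (top, left, height, width) and that the predetermined scan direction is row-major. The names `region_vector` and `high_order_vector` and the toy values are hypothetical.

```python
def region_vector(output_map, top, left, height, width):
    """Scan one rectangular region row by row into a 1-D vector f_k.

    output_map is a 2-D list of detection output values f_i.
    The row-major scan order is an assumption; the source only says
    "a predetermined direction".
    """
    return [output_map[r][c]
            for r in range(top, top + height)
            for c in range(left, left + width)]


def high_order_vector(output_map, regions):
    """Concatenate the per-region vectors f_k into one vector F."""
    F = []
    for region in regions:
        F.extend(region_vector(output_map, *region))
    return F


# Toy 4x4 detection output distribution with values 0..15.
demo_map = [[4 * r + c for c in range(4)] for r in range(4)]
# Two hypothetical regions: a 2x2 block and a 2x3 block.
demo_regions = [(0, 0, 2, 2), (2, 1, 2, 3)]
F = high_order_vector(demo_map, demo_regions)
```

The length of F is the sum of the region areas, so the same region layout has to be used for the dictionary data and the input data for the vectors to be comparable.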

Next, the matching process of step S703 is described. In step S703, the face matching unit 104 inputs the high-order feature vector F to a support vector machine (hereinafter, SVM) and performs matching using the dictionary data in the dictionary data DB unit 105.
Here, an SVM is a type of learning algorithm. In the present embodiment, lisvm is used as an example of an SVM.

Next, the process of classifying input data (a high-order feature vector) using dictionary data is described. The dictionary data generation process is described later. lisvm takes the form of a set of classifiers (which may also be regarded as classification functions), each of which performs two-class classification. For example, assume four registrants (dictionary creators, or learners) assigned to classes A, B, C, and D, with everyone other than the registrants assigned to class E. A two-class classifier here is, for example, a classifier that decides by thresholding which of A or B is more likely (the threshold is generated for each pair of classes when the dictionary data is generated). The input feature vector is therefore subjected to two-class classification between every pair of classes (A or B, A or C, A or D, A or E, B or C, and so on), and the final class is determined by a majority vote over those classification results.
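The pairwise classification and majority vote described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function `decide(a, b)` stands in for a trained two-class classifier and is hypothetical, as is the toy decision rule used in the demonstration. Ties are broken by the order in which classes first receive a vote, which the source does not specify.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(classes, decide):
    """Run one two-class decision per pair of classes and take a majority vote.

    decide(a, b) must return either a or b (hypothetical stand-in for a
    trained two-class classifier applied to the input feature vector).
    Returns (final_class, votes_in_favor, all_votes).
    """
    votes = Counter(decide(a, b) for a, b in combinations(classes, 2))
    final_class, votes_in_favor = votes.most_common(1)[0]
    return final_class, votes_in_favor, votes


classes = ["A", "B", "C", "D", "E"]
# Toy rule for illustration only: the alphabetically earlier class wins.
final_class, votes_in_favor, votes = pairwise_vote(classes, min)
```

With five classes there are ten pairs, so a class can receive at most four votes; the number of votes received by the winning class is the value that the later reliability calculation compares against a threshold.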

An example of dictionary data generation is described with reference to FIG. 16. FIG. 16 is a flowchart illustrating an example of the dictionary data generation process.
In step S1600, for example, the control unit 101 or the like determines whether there is a learner. If a learner exists, the control unit 101 or the like advances the process to step S1601; if no learner exists, it advances the process to step S1611.

In step S1611, for example, the face matching unit 104 or the like executes a learning process using lisvm. In step S1601, on the other hand, the control unit 101 or the like determines whether an image of the learner exists in a predetermined area or the like. If an image of the learner exists, the control unit 101 or the like advances the process to step S1602; if not, it returns the process to step S1600.

In step S1602, the face detection unit 102 secures image data containing a face in memory. The subsequent processing from step S1603 to step S1607 is the same as the processing from step S202 to step S206 in FIG. 2.
The processing of steps S1608 and S1609 is the same as that of steps S701 and S702 in FIG. 7.

The high-order feature vector generated by the processing up to step S1609 is recorded in step S1610 in a storage unit such as a memory by, for example, the control unit 101 or the like.
After the high-order feature vector is recorded, the object recognition apparatus acquires another image (image file) and executes step S1601. At this time, in step S1601, if there is no further image of the person being processed, the control unit 101 or the like performs processing to search for images of another person.

Next, the cumulative reliability calculation process of step S208 in FIG. 2 is described. In this process, the cumulative reliability calculation unit 106 cumulatively calculates the reliability, described below, over time-series images. In the present embodiment, the cumulative reliability calculation unit 106 calculates the reliability based on the output values of the SVM used in the matching process of step S703. The details of the reliability calculation are as follows.

When the number of votes in favor of the final determination result decided by the majority vote of the SVM exceeds a predetermined threshold, the cumulative reliability calculation unit 106 regards the result as a highly reliable determination and obtains the reliability by incrementing it. This predetermined threshold is set, for example, to at least a majority of the number of discrimination classes (when the input is to be classified into five classes, the number of discrimination classes is 5). The number of votes in favor refers to the number of classifiers, among the plurality of classifiers, whose result is the same as the final determination result. The process of calculating the reliability is described more specifically with reference to the flowchart of FIG. 17.

FIG. 17 is a flowchart illustrating an example of the process of calculating the reliability. The flowchart shown in FIG. 17 is executed for each of the faces that were processed in the subject matching of step S207.
In step S1701, the cumulative reliability calculation unit 106 initializes the variable face, which indicates the element number assigned to each face, by assigning zero to it.

In step S1702, the cumulative reliability calculation unit 106 checks whether the processing has been executed for all faces. Here, the constant faceNum indicates the total number of faces to be processed. FIG. 18 is a diagram illustrating how element numbers are assigned to faces. Image data 1800 in FIG. 18(a) shows element numbers being assigned to the faces in the acquired input image. While calculating the cumulative reliability, the cumulative reliability calculation unit 106 assigns the same number (element number) to the same face. That is, suppose that when the image data 1800 changes over time to the image data 1801, each face in the image data 1801 has moved from the dotted-line face to the solid-line face. In that case, the solid-line face is assigned the same element number as the dotted-line face, that is, the past face. This operation therefore requires a technique for tracking faces within the screen. Tracking is performed using a known technique. In the present embodiment, the cumulative reliability calculation unit 106 obtains movement vectors calculated from optical flow and tracks faces based on those movement vectors.

If the cumulative reliability calculation unit 106 determines that the processing has been executed for all faces, it ends the processing shown in FIG. 17; if it determines that it has not, it proceeds to step S1703.
In step S1703, the cumulative reliability calculation unit 106 initializes the variable voteCount by assigning zero to it.
Next, step S1704 is described. Assume, for example, that there are five discrimination classes (five patterns A, B, C, D, and E, labeled label = 1, label = 2, label = 3, label = 4, and label = 5, respectively). In this case there are ten two-class classifiers, as shown in FIG. 19. FIG. 19 is a diagram illustrating the determination results of the two-class classifiers. In FIG. 19, the A-or-B classifier, which classifies whether the input data is A or B, corresponds to the cell where A and B intersect in the comparison-target-class columns. The result of each classifier is shown in italics in FIG. 19. That is, in FIG. 19 the classification result of the A-or-B classifier is A. The final determination result is determined to be A by majority vote. From the classification results of the classifiers in FIG. 19, the number of votes in favor of the final determination result is 4. Finally, in step S1704, the cumulative reliability calculation unit 106 assigns the acquired number of votes in favor to the variable voteCount, which stores the number of votes in favor of the final determination result.

Next, step S1705 is described. In step S1705, the cumulative reliability calculation unit 106 compares the acquired variable voteCount with a predetermined threshold. In the present embodiment, as described above, the predetermined threshold is set to a value larger than a majority of the number of determination classes. However, a value obtained by a statistical method or heuristically may also be used as the predetermined threshold. In step S1705, the predetermined threshold is denoted Th_conf. If the variable voteCount is larger than Th_conf, the cumulative reliability calculation unit 106 regards the final determination result as reliable and, in step S1706, increments by 1 the reliability variable for the label of the final determination result. The reliability variable is the variable confidence in the flowchart. The variable confidence is a two-dimensional array indexed by face number and label.

If the variable voteCount is not larger than Th_conf, the cumulative reliability calculation unit 106 increments the value of the variable face by 1 in step S1707 and returns to the processing of step S1702.
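The loop of FIG. 17 can be sketched as follows, using the variable names face, faceNum, voteCount, and confidence from the description. This is a sketch under assumptions, not the patent's implementation: the input format `results` (one `(label, vote_count)` pair per tracked face) is hypothetical, labels are taken to run from 1 to n, and the sketch assumes that face is also incremented after step S1706, which the text does not state explicitly.

```python
def update_confidence(confidence, results, th_conf):
    """Accumulate per-face, per-label reliability for one input image.

    confidence: 2-D list indexed as confidence[face][label - 1].
    results:    hypothetical list with one (label, vote_count) pair per
                face, i.e. the final class and its votes in favor.
    th_conf:    the threshold Th_conf.
    """
    face = 0                                  # S1701: initialize face
    face_num = len(results)                   # faceNum
    while face < face_num:                    # S1702: all faces done?
        vote_count = 0                        # S1703: initialize voteCount
        label, vote_count = results[face]     # S1704: votes in favor
        if vote_count > th_conf:              # S1705: compare with Th_conf
            confidence[face][label - 1] += 1  # S1706: increment reliability
        face += 1                             # S1707 (assumed on both paths)
    return confidence


# Two tracked faces, five classes, Th_conf = 3 (toy values).
confidence = [[0] * 5 for _ in range(2)]
update_confidence(confidence, [(1, 4), (3, 2)], th_conf=3)  # frame at time t
update_confidence(confidence, [(1, 4), (3, 4)], th_conf=3)  # frame at time t+1
```

Because the same element number is kept for the same face across frames, calling the function once per frame accumulates the reliability over the time series.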

Next, the output determination process of step S209 in FIG. 2 is described. In step S209, the output determination unit 107 compares the cumulative reliability described above with a predetermined threshold. More specifically, the output determination unit 107 performs the output determination by evaluating expression (1) for each label of each detected face.

(Expression (1) is not reproduced in this text.)

In expression (1), the cumulative reliability term represents the cumulative reliability of a given label up to the input image at a given time t, Th_output represents a predetermined output determination threshold, and n represents the number of discrimination classes. The comparison of expression (1) is performed for every face that was processed in step S207 and can be tracked. The predetermined threshold is obtained by a statistical method or heuristically. If expression (1) is true, the output determination unit 107 proceeds to step S210. If expression (1) is false, the output determination unit 107 proceeds to step S201. If, in one piece of image data containing a plurality of faces, the evaluation of step S210 yields both OK and NG results, only the NG faces are subsequently processed by the flowchart of FIG. 2. The OK faces are tracked within the screen without undergoing matching or other processing.

Next, step S210 in FIG. 2 is described. In step S210, the display unit 109 (or the control unit 101) outputs the matching result for each target face whose cumulative reliability exceeds the predetermined threshold. More specifically, in step S210, the display unit 109 executes a process of outputting, near the target face, the registered name corresponding to the label that exceeded the predetermined threshold.

As described above, in the first embodiment, calculating a cumulative reliability based on the number of votes in favor obtained from the results of a plurality of classifiers makes it possible to realize face recognition with a high degree of certainty. It is therefore possible to provide a face recognition system with reduced misrecognition, even for face recognition in situations where no constraints such as posture are imposed.

<Second Embodiment>
The second embodiment is described below. Its basic configuration follows that of the first embodiment; the differences are described below. In the second embodiment, variations in the state of the face, such as expression, posture, or illumination, are estimated, and subject matching is performed based on the estimation result. A further key point of the second embodiment is that the estimation result is also used in calculating the cumulative reliability. In the second embodiment, the state of the object to be estimated is limited to facial expression, and a more specific description is given below with reference to FIG. 20.

FIG. 20 is a flowchart illustrating an example of the overall processing of the object recognition apparatus in the second embodiment.
Steps S2000, S2001, S2002, S2003, S2004, S2005, and S2009 are the same as the corresponding steps of FIG. 2 described in the first embodiment, so their description is omitted.

First, the processing of step S2006 is described. Representative examples of facial expressions are shown in FIG. 21. FIG. 21 is a diagram illustrating representative examples of facial expressions. A face subject to expression determination is a face showing an expression that does not exist in the dictionary data. For example, if the expression of the registrant for whom the dictionary images are generated is the neutral face of FIG. 21(a), then a face with the mouth open or the eyes closed is a target of expression determination. The present embodiment is described more specifically for the case where the registrants' faces in the dictionary data are all the expressionless neutral face of FIG. 21(a), or where the neutral face of FIG. 21(a) makes up the large majority.

The neutral face of FIG. 21(a) is the expression a person normally has: the eyes are open and the mouth is closed. In contrast, the faces other than the neutral face (FIGS. 21(b) to 21(j)) are ones in which the eye or mouth features of the neutral face show larger or smaller changes in shape. More specifically, the mouth-open face of FIG. 21(b) and the mouth-half-open face of FIG. 21(c) correspond to cases where the mouth shape has changed relative to the neutral face of FIG. 21(a). This change occurs naturally when the person in the image is speaking, or when the person intentionally opens the mouth wide.

The both-eyes-closed face of FIG. 21(d), the both-eyes-half-closed face of FIG. 21(e), the one-eye-closed face of FIG. 21(f), and the one-eye-half-closed face of FIG. 21(g) correspond to cases where the eye shape has changed. The faces of FIGS. 21(d) and 21(e) appear as expressions as a result of the eye-closing motion that people perform naturally. The faces of FIGS. 21(f) and 21(g) appear as expressions as a result of an intentional action by the person, such as a wink. Furthermore, the both-eyes-closed, mouth-open face of FIG. 21(h), the one-eye-closed, mouth-open face of FIG. 21(i), and the one-eye-half-closed, one-eye-closed face of FIG. 21(j) are representative examples of faces in which shape changes of the eyes and mouth are combined. Next, the method for determining these expressions is described.

The expression determination performed in the present embodiment is realized using known techniques. More specifically, as shown in FIG. 22, the object recognition apparatus (for example, the feature extraction unit 103 or the like) detects feature points around the contours of the eyes, nose, mouth, eyebrows, or jaw, and detects the expression from the arrangement of the feature points or from their dynamic changes. FIG. 22 is a diagram illustrating an example of feature point extraction. The black circles 2201 in FIG. 22 are examples of contour feature points; around the lips, they are extracted not only from the part adjoining the facial surface but also from the part adjoining the oral cavity. One method of extracting the contour feature points of each feature is, for example, to detect them by extracting the contour of each feature through edge extraction. In the present embodiment, the feature extraction unit 103 uses an AAM (Active Appearance Model). An AAM can extract feature points such as those shown in FIG. 22 by fitting to the face a model that uses the relative positions of the feature points and the luminance value at each feature point as evaluation values.

Whichever of the above known techniques for contour feature point extraction is used, its processing is performed simultaneously with, or in stages relative to, the eye, mouth, and nose feature point extraction process of step S2004 described above. Next, the expression determination method is described more specifically.

Expressions are determined by a discriminator (hereinafter, the expression discriminator) that has learned the arrangement of the contour feature points for each expression. The parameters the expression discriminator uses for learning are the distances or angles of the mesh formed by connecting feature points. FIG. 23 is a diagram for explaining the mesh and the parameters. As shown in (a) of FIG. 23, the mesh is set on the face around the eyes or the mouth, where expressions appear. As shown in (b) of FIG. 23, the parameters are the side lengths a, b, and c and the angles θ, φ, and ψ of a unit mesh. The distance parameters must, however, be normalized, for example by the distance between the eyes, to remove individual differences. As shown in FIG. 24, the expression discriminator 2400 consists of a plurality of classifiers, each of which identifies one expression, and an output integrator 2401 that integrates the outputs of the expression classifiers. FIG. 24 is a diagram illustrating an example of the expression discriminator. Each of the expression classifiers 2402, 2403, and 2404 is constructed by computing the above parameters for all meshes for its expression and learning from them. The output integrator 2401 compares the outputs of the expression classifiers and determines the final expression.
In the present embodiment, each expression classifier is trained using a neural network, which is a known technique.
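The mesh parameters of FIG. 23 can be illustrated with a short sketch. It assumes that a unit mesh is a non-degenerate triangle with vertices p1, p2, and p3, which is consistent with three side lengths and three angles but is not stated explicitly in the text; which angle belongs to which vertex, the function name `triangle_parameters`, and the sample coordinates are hypothetical.

```python
import math


def triangle_parameters(p1, p2, p3, eye_distance):
    """Return normalized side lengths (a, b, c) and angles (theta, phi, psi).

    a is the side opposite p1, b the side opposite p2, c the side opposite p3.
    theta, phi, psi are the interior angles at p1, p2, p3 in radians.
    Side lengths are divided by the inter-eye distance to reduce
    individual differences.
    """
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)

    def angle(opposite, s1, s2):
        # Law of cosines; clamp against floating-point overshoot.
        cos_value = (s1 * s1 + s2 * s2 - opposite * opposite) / (2 * s1 * s2)
        return math.acos(max(-1.0, min(1.0, cos_value)))

    theta = angle(a, b, c)
    phi = angle(b, a, c)
    psi = math.pi - theta - phi
    lengths = (a / eye_distance, b / eye_distance, c / eye_distance)
    return lengths, (theta, phi, psi)


# Toy 3-4-5 triangle with an inter-eye distance of 10 pixels.
lengths, angles = triangle_parameters((0, 0), (4, 0), (0, 3), eye_distance=10.0)
```

Angles are unaffected by uniform scaling of the face, which is why only the distance parameters need the normalization mentioned above.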

When the expression determination is finished, the process moves to step S2007. In step S2007, the object recognition apparatus (or the feature extraction unit 103) selects which local regions and global regions to use, based on the result of step S2006.

The region selection step S2007 is now described. If the result of the expression determination step S2006 is, for example, the mouth-open face 2101 shown in FIG. 21, the shape of the mouth feature has changed. The information in the local region 801 set on the mouth, shown in FIG. 8, may therefore no longer be similar to that of a neutral mouth shape. The object recognition apparatus therefore excludes the local region set on the mouth from the feature extraction regions. If the expression determination of step S2006 yields the both-eyes-closed, mouth-open face 2107, the object recognition apparatus, for the same reason, excludes the local regions set on the eyes (both eyes) and the local region set on the mouth from the feature extraction regions. For the global regions, on the other hand, the object recognition apparatus performs no selection, because the appearance of an expression does not disturb the arrangement of the features. In this way, in step S2007 the object recognition apparatus adaptively selects the feature extraction regions based on the result of step S2006.
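The selection rule of step S2007 can be sketched as a simple set operation. The region names, the expression names, and the mapping `CHANGED_FEATURES` are hypothetical labels introduced for illustration; only the rule itself (drop local regions on features whose shape changed, keep all global regions) comes from the description.

```python
# Hypothetical region names for illustration.
LOCAL_REGIONS = {"right_eye", "left_eye", "nose", "mouth"}
GLOBAL_REGIONS = {"between_eyes", "eye_mouth"}

# Hypothetical mapping from a determined expression to the features
# whose shape differs from the neutral face.
CHANGED_FEATURES = {
    "neutral": set(),
    "mouth_open": {"mouth"},
    "both_eyes_closed_mouth_open": {"right_eye", "left_eye", "mouth"},
}


def select_regions(expression):
    """Return the feature extraction regions to use for an expression.

    Local regions on changed features are excluded; global regions are
    always kept, since the arrangement of features is treated as
    unaffected by the expression.
    """
    changed = CHANGED_FEATURES[expression]
    return (LOCAL_REGIONS - changed) | GLOBAL_REGIONS


selected = select_regions("mouth_open")
```

A frozen copy of the returned set (for example `frozenset(selected)`) could serve as a lookup key for dictionary data built from the same combination of regions.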

Next, in step S2008, the object recognition apparatus (or the face matching unit 104) selects dictionary data suited to the selected regions from the dictionary data DB unit 105. More specifically, in step S2008 the object recognition apparatus searches for the dictionary data that was constructed from the regions selected in step S2007. The dictionary data therefore needs to be constructed in advance for combinations of a plurality of local regions and global regions. The construction of this group of dictionary data is described later.

Next, in step S2010, the face matching unit 104 matches the input data against the extracted dictionary data. The matching method is the same as in the first embodiment and follows the procedure shown in FIG. 7. The difference from the first embodiment is that the face matching unit 104 generates the vector from the regions selected in the region selection process of step S2007 and performs matching on that vector.

Next, the flow of dictionary data generation in the second embodiment is described with reference to the flowchart of FIG. 25. FIG. 25 is a flowchart illustrating an example of the overall processing of the object recognition apparatus in the second embodiment. Before executing the flowchart of FIG. 25, the object recognition apparatus sorts the still images or moving images of each registrant and of non-registrants into folders in advance. The registrants' faces are assumed to show almost no expression.

In step S2500, for example, the control unit 101 or the like determines whether there is a learner. If a learner exists, the control unit 101 or the like advances the process to step S2501; if no learner exists, it advances the process to step S2511.
In step S2511, for example, the face matching unit 104 or the like executes a learning process using lisvm. In step S2501, on the other hand, the control unit 101 or the like determines whether an image of the learner exists in a predetermined area or the like. If an image of the learner exists, the control unit 101 or the like advances the process to step S2502; if not, it returns the process to step S2500.

In step S2502, the face detection unit 102 loads image data including a face into memory. The processing from step S2503 to step S2507 is the same as the processing from step S202 to step S206 in FIG. 2.
First, the process of step S2508 is described.
Here, for example, the face collation unit 104 sets a plurality of region patterns, each a selection of the local regions set on the eye, mouth, and nose features shown in FIG. 8. FIG. 26 is a diagram illustrating examples of region patterns. The region pattern of FIG. 26(a) corresponds to the state in which an expression is detected at the left eye and the mouth, so the local regions set on the left eye and the mouth are excluded. The remaining region patterns are described below with reference to FIG. 26.

Region pattern 2601 is the pattern used when only the mouth changes shape.
Region pattern 2602 is the pattern used when only the right eye (or the left eye) changes shape.
Region pattern 2603 is the pattern used when both eyes change shape.
Region pattern 2604 is the pattern used when both eyes and the mouth change shape.
Region pattern 2605 is the pattern used when the nose changes shape.
Region pattern 2606 is the pattern used when the mouth is stretched to both sides.
Region pattern 2607 is the standard pattern used when there is no shape change.
As for the global regions, every pattern uses both the between-eyes region and the eye-to-mouth region. The reason is that, although a facial expression changes the shape of features such as the eyes, the arrangement information of the eyes, mouth, and nose is not displaced by the expression as long as the global regions are set on the basis of feature points such as those in FIG. 4. The position information extracted from the global regions therefore still reflects individual differences when an expression is present, and is an effective feature under expression variation.
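The rule behind these patterns (drop the local regions of features that changed shape, always keep both global regions) can be sketched as follows. This is a simplified illustration under stated assumptions: the region names and the `build_region_pattern` function are hypothetical, and special cases such as pattern 2606 (mouth stretched to both sides) are not modeled.

```python
from typing import Iterable, List

# Hypothetical region names; the patent identifies regions only by figure.
LOCAL_REGIONS = ("left_eye", "right_eye", "nose", "mouth")
GLOBAL_REGIONS = ("between_eyes", "eye_to_mouth")


def build_region_pattern(changed_features: Iterable[str]) -> List[str]:
    """Exclude local regions whose feature changed shape; keep all global regions."""
    changed = set(changed_features)
    unknown = changed - set(LOCAL_REGIONS)
    if unknown:
        raise ValueError(f"unknown feature names: {sorted(unknown)}")
    kept_local = [region for region in LOCAL_REGIONS if region not in changed]
    # Global regions encode feature arrangement, which the text treats as
    # stable under expression changes, so they are never excluded.
    return kept_local + list(GLOBAL_REGIONS)


# Case of FIG. 26(a): expression detected at the left eye and the mouth.
pattern_a = build_region_pattern({"left_eye", "mouth"})
# Standard case: no shape change, so every region is used.
pattern_standard = build_region_pattern([])
```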

In step S2508, the face collation unit 104 sets the plurality of region patterns described above on the input face.
Next, step S2509, which generates a dictionary vector for each region pattern, is described. For example, the face collation unit 104 applies the region pattern group set in step S2508 to the detection output distributions shown in FIGS. 11, 12, and 13 and generates higher-order feature vectors. In step S2510, the generated higher-order feature vectors are recorded in a storage unit such as a memory by, for example, the control unit 101. After recording, the control unit 101 searches for image data in order to read another image file. If no image data remains for the person being processed, the control unit 101 searches for images of another person.

Next, the process of step S2011 in FIG. 20 is described. In the second embodiment, the cumulative reliability calculation unit 106 calculates the reliability based on the number of regions selected in step S2007. A more specific calculation method is as follows.
The object recognition apparatus of the present embodiment holds the certainty factor as the lookup table shown in FIG. 27. FIG. 27 is a diagram illustrating an example of a certainty factor lookup table. In the lookup table 2700 of FIG. 27, the first and third columns give the number of effective regions N, and the second and fourth columns give the reliability. The number of effective regions is the number of regions used for feature extraction, as determined by the selection in step S2007 based on the result of step S2006. The certainty factor is set to increase monotonically with the number of regions, on the reasoning that more regions provide more information and therefore better collation accuracy. The certainty factor obtained here is then integrated into the output value of a classifier that performs, for example, an "A or !A" (A or not A) classification and has been trained by a known learning algorithm such as the subspace method. The cumulative reliability calculation unit 106 takes the value obtained by calculating this integrated value over a predetermined time as the cumulative reliability.
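The calculation can be sketched as a table lookup followed by accumulation over frames. The sketch makes two assumptions that the text does not confirm: the certainty factor is combined with the classifier output by multiplication, and the per-frame values are summed over the time window. The table values are placeholders, not the values of FIG. 27, and all names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Placeholder certainty factors that increase with the number of effective
# regions N. These are NOT the values of the lookup table in FIG. 27.
CERTAINTY_LUT: Dict[int, float] = {1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}


def frame_reliability(num_effective_regions: int, classifier_output: float) -> float:
    """Weight the classifier output by the certainty factor (assumed: a product)."""
    return CERTAINTY_LUT[num_effective_regions] * classifier_output


def cumulative_reliability(frames: Iterable[Tuple[int, float]]) -> float:
    """Accumulate per-frame reliabilities over a time window (assumed: a sum)."""
    return sum(frame_reliability(n, out) for n, out in frames)


# Three frames as (number of effective regions, "A or !A" classifier output).
window = [(4, 1.0), (2, 1.0), (3, 0.0)]
total = cumulative_reliability(window)  # 1.0 + 0.5 + 0.0
```

Under these assumptions, a frame in which many regions had to be excluded contributes less to the cumulative value than a frame matched on all regions.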

As described above, according to the second embodiment, a reliable parameter can be calculated even when the facial expression varies, which makes it possible to reduce misrecognition by the face recognition system under expression variation (or posture variation and the like).

<Third Embodiment>
A third embodiment is described next. The third embodiment includes processing for the case in which the input face shows no variation such as a facial expression. When there is no variation, the similarity is high because the state of the faces in the dictionary data image group and the state of the input face are considered to be almost the same. Therefore, when the input face is determined to be free of variation, the selection process for the predetermined regions is not executed. To explain this more specifically, a flowchart of the third embodiment is shown in FIG. 28.

FIG. 28 is a flowchart illustrating an example of overall processing of the object recognition apparatus according to the third embodiment.
The flowchart of FIG. 28 differs from the flowchart of FIG. 20 in two respects: step S2800, which initializes the top-candidate information, is added, and step S2006, which performs expression determination, is replaced by step S2807, which branches on the result of the expression determination. The flowchart of FIG. 28 also adds a series of processes for the case in which step S2807 determines that an expression is present. The added or changed processing steps are described below.

In step S2800, for example, the control unit 101 initializes the top-candidate information described later. Next, in step S2807, the facial expression discriminator performs the same processing as step S2006 in the processing flow of the second embodiment, with only a branching function added.

Next, in step S2815, for example, the cumulative reliability calculation unit 106 extracts the top-candidate information of step S2813 obtained while the time-series images are processed. More specifically, the top-candidate information consists of the names of a predetermined number of top-ranked candidates in a candidate list ordered by the cumulative reliability of the collation results for a given input face. The candidate list is created in the form of the candidate list 2901 shown in FIG. 29(b). FIG. 29 is a diagram illustrating an example of a candidate list and the like. The candidate list 2901 is the list for face number 0 in the image data 2900 at a certain time. It is a table ranked by the reliability in its third column, with each rank associated with an ID label. Here, LABEL_A and LABEL_C denote registrant A and registrant C among the registrants A, B, C, and so on in the dictionary data. REJECT denotes a non-registrant, that is, a person other than the registrants.

In step S2816, for example, the cumulative reliability calculation unit 106 acquires the ID labels of the top entries of the candidate list 2901, for example the top three. Alternatively, in step S2816 the cumulative reliability calculation unit 106 may take the differences in reliability between adjacent ranks in order from the top and acquire the ID labels ranked above the position at which the difference is equal to or greater than a predetermined threshold. For example, when the difference in reliability between fourth place and fifth place exceeds the threshold, the cumulative reliability calculation unit 106 acquires the ID labels of the top four candidates.
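Both selection rules of step S2816 can be sketched in one function. This is an illustrative sketch, not the patented implementation: the function name, its parameters, and the sample scores are hypothetical, and cutting at the first sufficiently large gap from the top is an assumption based on the fourth-place and fifth-place example.

```python
from typing import List, Optional, Sequence, Tuple


def top_candidates(
    candidates: Sequence[Tuple[str, float]],
    top_n: int = 3,
    gap_threshold: Optional[float] = None,
) -> List[str]:
    """Pick top-ranked ID labels by fixed count or by the first large reliability gap."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    if gap_threshold is None:
        # Rule 1: take a fixed number of top entries.
        return [label for label, _ in ranked[:top_n]]
    # Rule 2: walk down the ranking and cut where adjacent reliabilities
    # differ by at least the threshold.
    for i in range(len(ranked) - 1):
        if ranked[i][1] - ranked[i + 1][1] >= gap_threshold:
            return [label for label, _ in ranked[: i + 1]]
    # No gap reached the threshold, so every candidate is kept.
    return [label for label, _ in ranked]


# Hypothetical cumulative reliabilities for one input face.
sample = [
    ("LABEL_A", 0.9), ("LABEL_C", 0.85), ("REJECT", 0.8),
    ("LABEL_B", 0.78), ("LABEL_D", 0.3),
]
fixed = top_candidates(sample)                        # top three labels
by_gap = top_candidates(sample, gap_threshold=0.4)    # cut between 4th and 5th
```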

Next, in step S2817, for example, the cumulative reliability calculation unit 106 acquires, from the dictionary data DB unit 105, the dictionary data of the candidates extracted in step S2816. In the present embodiment, the dictionary data is a classifier constructed with an SVM that separates registrant A from everyone other than registrant A, that is, an "A or !A" classifier. In addition, a classifier that decides whether a person is a registrant or a non-registrant, that is, a "reject or !reject" classifier, is also prepared as dictionary data.

Next, in step S2819, for example, the cumulative reliability calculation unit 106 performs collation using only the dictionary data of the candidates acquired in step S2817. More specifically, when the top candidates are A, C, and reject, the cumulative reliability calculation unit 106 feeds the input data to the "A or !A", "C or !C", and "reject or !reject" classifiers. A classifier constructed with an SVM produces a binary output. Thus, for example, if the input face is A, the "A or !A" classifier produces a positive output.
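The restricted collation of step S2819 can be sketched by evaluating only the classifiers of the top candidates. In this illustration each binary "X or !X" classifier is represented by a plain Python callable returning `True` or `False`; a trained SVM would take its place in practice. The function name and the toy threshold classifiers are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

# Stand-in for a binary "X or !X" classifier: True means "this is X".
Classifier = Callable[[Sequence[float]], bool]


def collate_with_candidates(
    feature_vector: Sequence[float],
    classifiers: Dict[str, Classifier],
    candidate_labels: Sequence[str],
) -> List[str]:
    """Evaluate only the candidates' classifiers and return the labels that fire."""
    matches = []
    for label in candidate_labels:
        if classifiers[label](feature_vector):
            matches.append(label)
    return matches


# Toy classifiers over a one-element feature vector (purely illustrative).
toy_classifiers: Dict[str, Classifier] = {
    "A": lambda v: v[0] > 0.5,
    "B": lambda v: v[0] > 0.0,
    "C": lambda v: v[0] < 0.2,
    "REJECT": lambda v: v[0] < 0.0,
}
# Top candidates are A, C, and REJECT, so the classifier for B is never run.
result = collate_with_candidates([0.7], toy_classifiers, ["A", "C", "REJECT"])
```

Because the loop runs over the candidate labels rather than over the whole dictionary, the number of classifier evaluations depends on the size of the candidate list and not on the number of registrants.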

As described above, according to the third embodiment, using the top-candidate information determined by the cumulative reliability makes it possible to provide high-speed collation against large-scale dictionary data in an open face recognition system.

<Other Embodiments>
The embodiments above were described using an example in which each function of the information processing apparatus (object recognition apparatus) is implemented as hardware, but some of these functions may be implemented as software. More specifically, the control unit 101, the face detection unit 102, the feature extraction unit 103, the face collation unit 104, the cumulative reliability calculation unit 106, the output determination unit 107, and the like may be implemented in the information processing apparatus as software. That is, these functions may be realized by the CPU of the information processing apparatus reading the corresponding programs from a memory or the like and executing them.

As described above, according to each of the embodiments, accurate and stable object recognition can be achieved even when an image containing variation is input.

Preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.

A diagram showing an example of the hardware configuration of an object recognition apparatus, which is an example of an information processing apparatus (computer).
A flowchart showing an example of overall processing of the object recognition apparatus in the first embodiment.
A diagram for explaining an example of face detection processing in the face detection unit 102 using a neural network.
A diagram for explaining the features detected by the face detection processing.
A diagram showing an example of the feature points extracted by the feature extraction unit 103.
A diagram for explaining size normalization and rotation variation.
A flowchart showing an example of collation processing.
A diagram (part 1) showing an example of local regions set by the face collation unit 104.
A diagram (part 2) showing an example of local regions set by the face collation unit 104.
A diagram (part 1) showing an example of global regions set by the face collation unit 104.
A diagram (part 2) showing an example of global regions set by the face collation unit 104.
A diagram showing an example of an edge-extraction-like detection output distribution.
A diagram showing an example of setting the global region between both eyes.
A diagram showing an example of setting the global region between the eyes and the mouth.
A diagram showing the definition of a high-dimensional feature vector.
A flowchart showing an example of dictionary data generation processing.
A flowchart showing an example of processing for calculating the reliability.
A diagram showing how element numbers are assigned to faces.
A diagram showing the determination results of a two-class classifier.
A flowchart showing an example of overall processing of the object recognition apparatus in the second embodiment.
A diagram showing representative examples of facial expressions.
A diagram showing an example of feature point extraction.
A diagram for explaining the mesh and the parameters.
A diagram showing an example of a facial expression determiner.
A flowchart showing an example of overall processing of the object recognition apparatus in the second embodiment.
A diagram showing examples of region patterns.
A diagram showing an example of a certainty factor lookup table.
A flowchart showing an example of overall processing of the object recognition apparatus in the third embodiment.
A diagram showing an example of a candidate list and the like.

Explanation of symbols

100 Imaging unit
101 Control unit
102 Face detection unit
103 Feature extraction unit
104 Face collation unit
105 Dictionary data DB
106 Cumulative reliability calculation unit
107 Output determination unit
108 Storage unit
109 Display unit

Claims

Receiving means for receiving a time-series image including an object;
Feature point extracting means for extracting a plurality of feature points related to the object from each image of the time-series image;
Facial expression determination means for determining facial expression of the object based on a plurality of feature points extracted by the feature point extraction means;
Collation means for, when the facial expression determination means determines that the object has a facial expression, setting a plurality of regions according to the facial expression based on the coordinate values of the plurality of feature points extracted by the feature point extraction means, generating a feature vector including arrangement information or shape information of the set regions, collating the feature vector with dictionary data relating to objects, and calculating the reliability of the collation result;
Cumulative value calculation means for accumulating the reliability calculated by the matching means for each image over a plurality of images of the time-series image and calculating a cumulative value related to the reliability of the matching result for the object;
Output determination means for determining whether or not to output a collation result in the collation means based on the cumulative value calculated by the cumulative value calculation means;
An information processing apparatus comprising:

The information processing apparatus according to claim 1, further comprising dictionary data generation means for generating the dictionary data based on an image including an object.

The information processing apparatus according to claim 1, wherein, when the facial expression determination means determines that the object has no facial expression, the collation means collates the feature vector with the dictionary data of objects for which the cumulative value calculated by the cumulative value calculation means ranks higher.

An information processing method in an information processing apparatus,
Receiving a time-series image including an object;
A feature point extracting step of extracting a plurality of feature points related to the object from each image of the time series image;
A facial expression determination step for determining a facial expression of the object based on a plurality of feature points extracted in the feature point extraction step;
A collation step of, when it is determined in the facial expression determination step that the object has a facial expression, setting a plurality of regions according to the facial expression based on the coordinate values of the plurality of feature points extracted in the feature point extraction step, generating a feature vector including arrangement information or shape information of the set regions, collating the feature vector with dictionary data relating to objects, and calculating the reliability of the collation result;
A cumulative value calculating step of accumulating the reliability calculated in the collating step for each image over a plurality of images of the time-series image to calculate a cumulative value related to the object;
An output determination step for determining whether to output a collation result in the collation step based on the cumulative value calculated in the cumulative value calculation step;
An information processing method comprising:

The information processing method according to claim 4, further comprising a dictionary data generation step of generating the dictionary data based on an image including an object.

The information processing method according to claim 4, wherein, in the collation step, when it is determined in the facial expression determination step that the object has no facial expression, the feature vector is collated with the dictionary data of objects for which the cumulative value calculated in the cumulative value calculation step ranks higher.