JP2013003706A

JP2013003706A - Facial-expression recognition device, method, and program

Info

Publication number: JP2013003706A
Application number: JP2011132120A
Authority: JP
Inventors: Shiro Kumano; 史朗熊野; Kazuhiro Otsuka; 和弘大塚; Dan Mikami; 弾三上; Atsushi Yamato; 淳司大和; Yoichi Sato; 洋一佐藤; Ro-Me Soh; 鷺梅蘇
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2011-06-14
Filing date: 2011-06-14
Publication date: 2013-01-07

Abstract

PROBLEM TO BE SOLVED: To recognize a facial expression category accurately and early, even when the expression changes at various speeds. SOLUTION: A facial expression learning unit 16 learns, for each expression category, an early classifier for identifying at an early stage whether an expression belongs to that category, on the basis of reference time-series data recorded when the face changes from a neutral expression to that category. An input unit 10 acquires time-series data indicating a change from the neutral expression. For each expression category, a second dynamic time warping unit 24 generates time-warped time-series data by stretching or compressing the acquired time-series data in the time direction so that its changes correspond to those of the reference time-series data for the change from the neutral expression to that category. An early recognition unit 26 recognizes the expression category of the person's face on the basis of the time-warped time-series data generated for each expression category and the early classifier for each expression category.

Description

The present invention relates to a facial expression recognition device, method, and program, and more particularly to a facial expression recognition device, method, and program that recognize the expression category of a person's face at an early stage.

Facial expressions are said to be the most basic nonverbal behavior by which people communicate emotions to one another. For this reason, image-based facial expression recognition has been studied actively, mainly in the field of computer vision. However, conventional methods focus on recognizing which category an expression belongs to when a single face image is given. The problem of how early the expression category can be recognized in a video in which the expression changes gradually from a neutral expression (no expression) has not been addressed so far.

For example, as one attempt to correctly estimate subtle expressions using a classifier of expression categories trained on exaggerated expression data, a method has been proposed that amplifies the expression intensity of the data to be recognized before applying the classifier (Non-Patent Document 1). This method linearly amplifies the deformation of the face image caused by the expression change, based on motion information in the image.

In addition, a method has been proposed for early recognition of the type of a person's action, such as handwritten character recognition and recognition of car driving behavior (Non-Patent Document 2).

Non-Patent Document 1: Sungsoo Park, Daijin Kim: "Subtle facial expression recognition using motion magnification", Pattern Recognition Letters, 30, pp. 708-716, 2009.
Non-Patent Document 2: Katsuhiko Ishiguro, Hiroshi Sawada, Akira Sakano: "Multi-class early recognition boosting method and its application", Proceedings of the 13th Symposium on Image Recognition and Understanding, pp. 1452-1458, 2010.

The technique described in Non-Patent Document 1 does not consider which expression category the input data belongs to when determining the direction in which the data is amplified. However, the temporal and spatial deformation of the face caused by expressions is complex, so simply amplifying the expression information linearly from motion information alone, without considering the expression category, may produce data that resembles an expression of the wrong category. Furthermore, the method of Non-Patent Document 1 uses a single, empirically determined value for how much to amplify along the determined direction. It therefore cannot be expected to correctly recognize the categories of expressions displayed at various intensities.

The method described in Non-Patent Document 2 assumes that the input data begins at the moment an action starts to be displayed and that the action proceeds at a predetermined speed. It is therefore difficult for this method to correctly recognize behavior, such as a facial expression, whose speed changes every time it is displayed.

The present invention has been made in view of the above circumstances, and its object is to provide a facial expression recognition device, a facial expression recognition method, and a program that can recognize an expression category accurately and early even when the change of expression category occurs at various speeds.

To achieve the above object, the facial expression recognition device according to the present invention recognizes the expression category of a person's face based on time-series data of facial feature values obtained from a moving image capturing a region that includes the person's face. The device comprises: facial expression learning means for learning, for each of a plurality of expression categories, an early classifier for identifying at an early stage whether an expression belongs to that category, based on previously obtained reference time-series data for a change from a predetermined expression category to that category; data acquisition means for acquiring time-series data of the facial feature values, obtained from the moving image, that indicates a change from the predetermined expression category; time warping means for generating, for each of the plurality of expression categories, time-warped time-series data by stretching or compressing the time-series data acquired by the data acquisition means in the time direction so that the changes in the facial feature values correspond to those of the reference time-series data for the change to that category; and facial expression recognition means for recognizing the expression category of the person's face based on the time-warped time-series data generated for each of the plurality of expression categories and the early classifier for each of the plurality of expression categories.

The facial expression recognition method according to the present invention recognizes the expression category of a person's face based on time-series data of facial feature values obtained from a moving image capturing a region that includes the person's face. The method comprises the steps of: learning, by facial expression learning means, for each of a plurality of expression categories, an early classifier for identifying at an early stage whether an expression belongs to that category, based on previously obtained reference time-series data for a change from a predetermined expression category to that category; acquiring, by data acquisition means, time-series data of the facial feature values, obtained from the moving image, that indicates a change from the predetermined expression category; generating, by time warping means, for each of the plurality of expression categories, time-warped time-series data by stretching or compressing the time-series data acquired by the data acquisition means in the time direction so that the changes in the facial feature values correspond to those of the reference time-series data for the change to that category; and recognizing, by facial expression recognition means, the expression category of the person's face based on the time-warped time-series data generated for each of the plurality of expression categories and the early classifier for each of the plurality of expression categories.

According to the present invention, the facial expression learning means learns, for each of a plurality of expression categories, an early classifier for identifying at an early stage whether an expression belongs to that category, based on previously obtained reference time-series data for a change from a predetermined expression category to that category. The data acquisition means acquires time-series data of the facial feature values, obtained from the moving image, that indicates a change from the predetermined expression category.

Then, for each of the plurality of expression categories, the time warping means generates time-warped time-series data by stretching or compressing the time-series data acquired by the data acquisition means in the time direction so that the changes in the facial feature values correspond to those of the reference time-series data for the change to that category.

Then, the facial expression recognition means recognizes the expression category of the person's face based on the time-warped time-series data generated for each of the plurality of expression categories and the early classifier for each of the plurality of expression categories.

In this way, the time-series data of facial feature values indicating a change from the predetermined expression category is stretched or compressed in the time direction so that its changes correspond to those of the time-series data used when learning the early classifier of each expression category, and the expression category of the person's face is then recognized based on the early classifier for each category. As a result, the expression category can be recognized accurately and early even when the change of expression category occurs at various speeds.

The facial expression recognition device according to the present invention may further include learning time warping means for generating time-warped time-series data for learning by stretching or compressing, in the time direction, learning time-series data prepared in advance for each of the plurality of expression categories and recorded during a change from the predetermined expression category to that category, so that the changes in the facial feature values correspond to those of the reference time-series data for the change to that category. In this case, the facial expression learning means learns the early classifier for each of the plurality of expression categories based on the reference time-series data and the time-warped time-series data for learning for that category.

The facial expression recognition means according to the present invention may calculate, for each of the plurality of expression categories, a score indicating the degree to which the expression belongs to that category, based on the time-warped time-series data generated for each category and the early classifier for each category, and may recognize the expression category of the person's face based on the scores.

The facial feature value may be a vector indicating the differences between the coordinates of a set of points of interest on the person's face and the coordinates of the same set of points when the face shows the predetermined expression category.

The data acquisition means may acquire the time-series data consisting of the facial feature values at the current time when the face was recognized as showing the predetermined expression category at the immediately preceding time.

The predetermined expression category may be the neutral expression (no expression).

The program according to the present invention is a program for causing a computer to function as each means of the facial expression recognition device described above.

As described above, according to the facial expression recognition device, facial expression recognition method, and program of the present invention, the time-series data of facial feature values indicating a change from the predetermined expression category is stretched or compressed in the time direction so that its changes correspond to those of the time-series data used when learning the early classifier of each expression category, and the expression category of the person's face is then recognized based on the early classifier for each category. This has the effect that the expression category can be recognized accurately and early even when the change of expression category occurs at various speeds.

A schematic diagram showing the configuration of the facial expression recognition device according to an embodiment of the present invention.
A diagram showing an example of a plurality of feature points placed on a person's face.
A schematic diagram of time warping applied to time-series data.
A flowchart showing the learning processing routine of the facial expression recognition device according to the embodiment of the present invention.
A flowchart showing the facial expression recognition processing routine of the facial expression recognition device according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview>
First, an overview of the present invention is given. The present invention combines earlyBoost, a method that recognizes the type of an action as soon as possible after it begins under the assumption that the action proceeds at a constant speed, with stretching and compression of the data in the time direction. As a result, the expression category is recognized correctly even when expressions are displayed at various speeds, a case that has been difficult to recognize correctly until now.

In the learning process for the early classifiers, the multiple expression data sequences prepared for each expression category are first aligned in the time direction with the reference data, of which one is chosen per category. This converts expression data with differing display speeds into data with the same display speed as the reference data of its category. The resulting time-warped data set is then used as input to learn early classifiers suited to early recognition of expressions.
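The alignment step above can be sketched with classic dynamic time warping (DTW). This is an assumption: the paragraph only says that each sequence is aligned in the time direction with its category's reference. The helper names `dtw_path` and `warp_to_reference`, the Euclidean frame distance, and the averaging of frames matched to the same reference frame are illustrative choices, not the patent's procedure.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (lists of floats)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_path(seq, ref):
    """Return the optimal DTW alignment as (seq_index, ref_index) pairs."""
    n, m = len(seq), len(ref)
    inf = float("inf")
    # cost[i][j]: best cost of aligning seq[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path

def warp_to_reference(seq, ref):
    """Resample seq onto the time axis of ref (one output frame per ref frame)."""
    buckets = [[] for _ in ref]
    for i, j in dtw_path(seq, ref):
        buckets[j].append(seq[i])
    # Assumption: frames of seq matched to the same ref frame are averaged.
    return [[sum(vals) / len(frames) for vals in zip(*frames)]
            for frames in buckets]

# A slowly displayed expression is compressed to the reference's pace.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
reference = [[0.0], [1.0], [2.0]]
warped = warp_to_reference(slow, reference)
```

In this sketch the warped sequence always has the same number of frames as the reference, whatever the speed of the input.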

In the recognition process for the data to be recognized (test data), the data is first aligned in the time direction with the reference data, as in the learning process. Because the correct expression category of the test data is unknown, however, time warping is performed once per category, treating each category in turn as if it were the correct one. The time-warped data, one sequence per category, are then input to the early classifiers, and the expression category with the highest score is output as the recognition result.
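The recognition flow just described (warp once per category, score, take the maximum) can be sketched as below. The function `recognize` and its interface are hypothetical: the warping routine and the per-category scorers are passed in as callables, and the scorers here merely stand in for the learned early classifiers, whose form this passage does not specify.

```python
def recognize(test_seq, references, warp, scorers):
    """Return (best_category, scores) for one test sequence.

    references: dict mapping category -> reference sequence
    warp:       callable(seq, ref) -> warped sequence (hypothetical interface)
    scorers:    dict mapping category -> callable(warped_seq) -> float,
                a stand-in for the per-category early classifiers
    """
    scores = {}
    for category, ref in references.items():
        # Treat this category as if it were the correct one and warp to it.
        warped = warp(test_seq, ref)
        scores[category] = scorers[category](warped)
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage with an identity "warp" and distance-based stand-in scorers.
references = {0: [[0.0]], 1: [[1.0]]}
scorers = {
    c: (lambda w, r=r: -abs(w[-1][0] - r[-1][0]))
    for c, r in references.items()
}
best, scores = recognize([[0.2], [0.9]], references, lambda s, r: s, scorers)
```

Passing the warp and scorers as arguments keeps the sketch independent of any particular alignment or classifier implementation.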

In the present embodiment, the description takes as an example seven expression categories c ∈ {0, 1, …, 6}: the six basic expressions (happiness, anger, surprise, fear, disgust, and sadness) plus the neutral expression (no expression). Here c = 0 denotes the neutral expression, and the number of categories is $N_c = 7$. Any set of categories may, however, be used as the recognition target.

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to a facial expression recognition device that receives moving image data, that is, a time series of face image data, and recognizes the expression category.

The facial expression recognition device according to the present embodiment is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing programs for executing the learning processing routine and the facial expression recognition processing routine described later. Functionally, it is configured as follows. As shown in FIG. 1, the device includes an input unit 10, a learning data set storage unit 12, a reference time-series data determination unit 14, a facial expression learning unit 16, a time-warped learning data set storage unit 18, an early classifier storage unit 20, a facial expression recognition unit 22, a second dynamic time warping unit 24, and an early recognition unit 26. The input unit 10 is an example of the data acquisition means, and the second dynamic time warping unit 24 is an example of the time warping means.

The input unit 10 is connected to means for measuring at high speed the three-dimensional (D = 3) coordinate values (position information) of a plurality of feature points placed on the target person's face, as shown in FIG. 2, obtained from moving image data capturing a region that includes the face. It outputs these three-dimensional coordinate values as a time series with high temporal resolution. The input unit 10 is also connected to a keyboard and a mouse and accepts information entered by operating them.

The input unit 10 outputs, as a time series with high temporal resolution, a vector (called the feature vector) $x = [x_{1,1}, \ldots, x_{1,D}, x_{2,1}, \ldots, x_{2,D}, \ldots, x_{K,1}, \ldots, x_{K,D}]^{T} \in \mathbb{R}^{DK}$ that lists the D-dimensional coordinate values of K feature points placed around facial parts such as the eyes and mouth. Here $x_{k,d}$ denotes the d-th coordinate of the k-th feature point. High temporal resolution is important for capturing the characteristics of an expression from its very onset. FIG. 2 shows an example arrangement of feature points for K = 26.
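As a small illustration of this layout, the sketch below flattens K feature points with D coordinates each into a single list of length DK, point by point. The function name `feature_vector` and the list-of-tuples input format are assumptions made for the example.

```python
def feature_vector(points):
    """Flatten K feature points (each with D coordinates) into a DK-long list.

    The output order is x_{1,1}, ..., x_{1,D}, x_{2,1}, ..., x_{K,D}.
    """
    dims = {len(p) for p in points}
    if len(dims) != 1:
        raise ValueError("all feature points must have the same dimension D")
    return [coord for point in points for coord in point]

# K = 26 points in D = 3 dimensions give a vector of length 78.
points = [(float(k), 0.0, 1.0) for k in range(26)]
x = feature_vector(points)
```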

In the present embodiment, a motion capture system operating at, for example, 100 frames/sec is used as the means for measuring the three-dimensional (D = 3) coordinate values of the feature points at high speed. With small markers attached to the surface of the subject, the motion capture system photographs the person with multiple cameras and computes the three-dimensional coordinates of the markers from their positions in the images. One way to detect the marker positions is to apply small dots of green paint to the face as markers and detect them in color video. Another is to use a material that reflects infrared light well as the marker and detect it in images captured under infrared illumination while a filter cuts light of non-infrared wavelengths. Alternatively, feature points may be detected from the texture information of the face alone, without using such markers. The image coordinates (D = 2) of feature points in a monocular camera image may also simply be used as the position information. The number and arrangement of feature points on the face are assumed to be the same for every person.

In the input unit 10, the feature vector x described above is normalized for each person with respect to the neutral expression; that is, it is transformed so that the neutral-expression feature vector x is the same for every person. This normalization is performed as follows.

First, the neutral-expression feature vector $x^{BASE}$ of one person is selected from the learning data set described later. For that person, the feature vector of any expression is output unchanged. For every other person, the vector g(x), obtained by applying a projection g to each feature vector x, is output. The parameters of g are determined so that the person's neutral-expression feature vector comes as close as possible to $x^{BASE}$. One of the simplest choices of g is to scale each coordinate axis of the feature-point coordinate space. If D = 3, there are three parameters, and the projected vector g(x) is expressed by the following formula using a diagonal matrix.

For the three parameters $s_1$, $s_2$, $s_3$, it suffices to use the values that minimize the sum of squared errors between the reference person's neutral-expression feature vector $x^{BASE}$ and each person's projected neutral-expression feature vector g(x); that is, the values are computed according to the least-squares criterion. Alternatively, as in AAM (Active Appearance Models), one may obtain a basis for the inter-person variation of the feature vectors and compute the parameters for which a linear combination of the leading (principal) basis vectors comes as close as possible to $x^{BASE}$.
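A minimal sketch of the per-axis scaling variant follows. It assumes that g multiplies the d-th coordinate of every feature point by $s_d$; under that assumption, minimizing the squared error against $x^{BASE}$ separately per axis gives the closed form $s_d = \sum_k x_{k,d}\, x^{BASE}_{k,d} / \sum_k x_{k,d}^2$. The function names, the flat point-by-point vector layout, and the choice of coordinate origin are all assumptions of this example.

```python
def fit_axis_scales(neutral, base, dim=3):
    """Least-squares per-axis scales mapping `neutral` toward `base`.

    Both inputs are flat lists laid out point by point:
    [x_11, ..., x_1D, x_21, ..., x_KD].
    """
    if len(neutral) != len(base) or len(neutral) % dim != 0:
        raise ValueError("vectors must have equal length, a multiple of dim")
    scales = []
    for d in range(dim):
        xs = neutral[d::dim]  # d-th coordinate of every feature point
        bs = base[d::dim]
        denom = sum(x * x for x in xs)
        if denom == 0:
            raise ValueError("axis %d has no spread; scale is undefined" % d)
        scales.append(sum(x * b for x, b in zip(xs, bs)) / denom)
    return scales

def apply_scales(x, scales):
    """Apply the diagonal scaling g to a flat feature vector."""
    dim = len(scales)
    return [scales[i % dim] * v for i, v in enumerate(x)]

# Two feature points in 3-D: this person's face is half as wide and
# half as tall as the base face, and twice as deep.
base = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
neutral = [0.5, 1.0, 6.0, 1.0, 2.0, 12.0]
scales = fit_axis_scales(neutral, base)
normalized = apply_scales(neutral, scales)
```

The scales are fitted once per person on the neutral-expression vector and then applied to every frame of that person.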

The input unit 10 also outputs how far the feature vector $x_t$ at the current time t has moved from the neutral-expression feature vector $x_0$, that is, $z_t = g(x_t) - g(x_0)$. If the expression at time t = 1, when data input begins, is neutral, then $x_0 = x_t$ at that time and $z_t = 0$ is output. The data $z_t$ is an example of the facial feature value.
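The displacement feature can be sketched as below, assuming that the first frame of the sequence is the neutral-expression frame $x_0$. The function name `displacement_features` and the optional callable `g` (defaulting to the identity) are illustrative.

```python
def displacement_features(frames, g=None):
    """Compute z_t = g(x_t) - g(x_0) for every frame.

    frames: list of flat feature vectors; frames[0] is assumed to be the
            neutral-expression frame x_0.
    g:      per-person normalization (identity if omitted).
    """
    if g is None:
        g = lambda x: x
    g0 = g(frames[0])
    return [[a - b for a, b in zip(g(x), g0)] for x in frames]

frames = [[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]]
z = displacement_features(frames)
# The first output frame is all zeros because it is compared with itself.
```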

<Learning data>
As learning data, the input unit 10 prepares $M_c$ (≥ 1) time series of $z_t$ for each target expression category c. Each time series starts from a neutral expression, the target expression then begins to appear, and the series ends when that expression is displayed at its maximum. Data for the neutral category remain neutral from start to end. The total number M of time-series data is expressed by the following formula.

The m-th time series, m ∈ {1, …, M}, is represented by the following expression.
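Read together with the surrounding definitions, the total count and the m-th time series can plausibly be written as follows. The summation range (taken from c ∈ {0, …, N_c − 1}) and the bracket notation for the sequence are assumptions inferred from the text, not the formulas as originally printed.

```latex
M = \sum_{c=0}^{N_c - 1} M_c,
\qquad
Z^m = \left[ z^m_1, \ldots, z^m_{T_m} \right] \in \mathbb{R}^{DK \times T_m}
```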

Here, $T_m$ denotes the frame length of the m-th time series. The data for each expression category are obtained from expressions displayed by one or more persons. Let $c^m$ denote the expression category displayed in the m-th time series. The set of learning data is written collectively as $D = \{Z^m, c^m\}_{m=1,\ldots,M}$. The expression category attached to each learning sequence is entered manually by the user. The input unit 10 stores this learning data set in the learning data set storage unit 12.

Here, the learning data consist of sequences in which an expression other than neutral emerges from a neutral state, so the expression changes that can be recognized early are likewise changes from neutral to another expression. The framework of the present invention is, however, also applicable to early recognition of changes that start from a non-neutral expression (for example, from surprise to happiness, or from sadness to neutral).

<Expression recognition target data>
As the data whose expression is to be recognized (hereinafter also called test data), the input unit 10 outputs, for one person, the time-series data $z_{1:t}$ from time t = 1, when data input begins, to the current time t. It is assumed here that the person displays a non-neutral expression starting from neutral, returns once to neutral, and then displays a different expression. As noted above, however, the framework of the present invention is applicable to early recognition of arbitrary expression changes.

<Facial expression learning>
Facial expression learning receives from the input unit 10 one or more learning time series for each category. One reference time series is chosen for each facial expression category, and every other time series is stretched or compressed in time (time-warped) so that the change in its data z matches that of the reference time series as closely as possible. All of these time series are then used to learn, for each facial expression category, an early classifier that identifies the category at as early a stage of the expression as possible.

In the present embodiment, the reference time-series data determination unit 14 determines, for each facial expression category supplied as learning data, the time series (called the reference time-series data) that serves as the reference when the first dynamic time expansion/contraction unit 160 and the second dynamic time expansion/contraction unit 24, both described later, perform time warping. Here, for each facial expression category c, one of the M_c learning time series is drawn at random and used as the reference time-series data. The reference time-series data may instead be chosen manually in advance. The set Z^c of reference time-series data determined for facial expression category c (data length T_c) is expressed by the following expression.

Here, R denotes the set of all real numbers, and R^{DK×T_c} denotes the set of all real vectors of dimension DK × T_c.

The reference time-series data determination unit 14 stores the determined reference time-series data in the learning data set storage unit 12.
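The random choice of one reference series per category described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `choose_references`, the `(series, category)` pair layout, and the optional `rng` argument are assumptions introduced here.

```python
import random


def choose_references(dataset, rng=None):
    """Pick one reference series per category at random.

    dataset : list of (series, category) pairs, i.e. D = {(Z^m, c^m)}
    rng     : optional random.Random instance (hypothetical parameter,
              added only so that the draw can be made reproducible)
    Returns a dict mapping category -> index m of the chosen series.
    """
    rng = rng or random.Random()
    by_category = {}
    for m, (_, c) in enumerate(dataset):
        by_category.setdefault(c, []).append(m)
    # One uniform draw among the M_c series of each category c.
    return {c: rng.choice(indices) for c, indices in by_category.items()}
```

Returning indices rather than copies of the series keeps the sketch compatible with the statement that the reference may also be chosen by hand: a manual choice is simply a fixed dictionary of the same shape.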

The facial expression learning unit 16 comprises the first dynamic time expansion/contraction unit 160 and an early classifier learning unit 162. The first dynamic time expansion/contraction unit 160 is an example of the learning time expansion/contraction means.

The first dynamic time expansion/contraction unit 160 takes as input the reference time-series data Z^c and a set Z of time-series data to be warped, expressed by the following expression. It stretches or compresses the time series to be warped along the time axis so that the change in its z_t matches the change in z_t of the reference time-series data as closely as possible, and outputs the warped time series. FIG. 3 schematically illustrates this time warping.

Time warping is carried out by finding a correspondence matrix Q ∈ R^{2×L} that indicates which times of the two time series correspond to each other. The i-th row [q_i, q_i′] of Q indicates that the datum z_{q_i} at the q_i-th time of the time series to be warped corresponds to the datum z^c_{q_i′} at the q_i′-th time of the reference time-series data Z^c. The optimal correspondence matrix ^Q is obtained by defining the case in which the two time series are most similar as in equation (1) below.

Here, dist(z, z′) denotes the distance between data z and z′. The Euclidean distance is used as the distance measure here, although any distance measure may be used. L denotes the number of steps needed to align the time series. Equation (1) can be solved by dynamic programming. FIG. 3 schematically shows two time series and the correspondence obtained between them.
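Equation (1) is not reproduced in this text. Assuming it selects the correspondence that minimizes the summed distance dist(z_{q_i}, z^c_{q_i′}) over the L steps, the dynamic-programming solution can be sketched as below. The step pattern (advance either series or both by one frame) is the standard dynamic time warping choice and is an assumption, as are the names `euclidean` and `dtw_path`; indices are 0-based here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(target, reference, dist=euclidean):
    """Return (total_cost, path) aligning `target` to `reference`.

    `path` is a list of (q, q_ref) index pairs and plays the role of
    the rows [q_i, q_i'] of the correspondence matrix Q.
    """
    n, m = len(target), len(reference)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance for aligning
    # target[:i+1] with reference[:j+1].
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(target[i], reference[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = d + best_prev
    # Backtrack from the last frame of both series to recover the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path
```

For example, a target that lingers on its first value before following the reference is matched by assigning its first two frames to the first reference frame, so the length of the returned path corresponds to L in the text.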

In accordance with the obtained ^Q, the set ~Z of data in which the time series to be warped has been aligned to each time of the reference time-series data is expressed by the following expression.

If several times of the time series to be warped correspond to a given time of the reference time-series data, the aligned time series takes the average of those values at that time. If, conversely, no time of the time series to be warped corresponds to the given reference time, the value in the aligned time series is obtained by interpolating or extrapolating from the aligned values at the neighboring times.
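The averaging and interpolation rules above can be sketched as follows. The text does not say which interpolation or extrapolation scheme is used, so linear interpolation between the nearest aligned neighbors and holding the nearest aligned value at the ends are assumptions, as is the function name `align_to_reference`.

```python
def align_to_reference(target, path, ref_len):
    """Resample `target` onto the reference time axis.

    target  : list of feature vectors (the series being warped)
    path    : list of (q, q_ref) index pairs, 0-based
    ref_len : number of frames T_c in the reference series
    """
    dim = len(target[0])
    # Collect every target frame matched to each reference time.
    matched = [[] for _ in range(ref_len)]
    for q, q_ref in path:
        matched[q_ref].append(target[q])

    aligned = [None] * ref_len
    for j, frames in enumerate(matched):
        if frames:  # one or more matches: component-wise mean
            aligned[j] = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

    known = [j for j in range(ref_len) if aligned[j] is not None]
    if not known:
        raise ValueError("path does not touch any reference time")
    for j in range(ref_len):
        if aligned[j] is not None:
            continue
        before = [k for k in known if k < j]
        after = [k for k in known if k > j]
        if before and after:
            # Assumed scheme: linear interpolation between neighbors.
            a, b = before[-1], after[0]
            w = (j - a) / (b - a)
            aligned[j] = [(1 - w) * aligned[a][d] + w * aligned[b][d] for d in range(dim)]
        else:
            # Assumed scheme: extrapolate by holding the nearest value.
            k = before[-1] if before else after[0]
            aligned[j] = list(aligned[k])
    return aligned
```

With a complete path from the standard step pattern every reference time has at least one match, so the second branch matters mainly when the path is truncated, for instance when only the frames observed so far are available.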

The first dynamic time expansion/contraction unit 160 treats every time series other than the reference time-series data as a time series to be warped, and aligns each one using the reference time-series data Z^c of the category c to which it belongs. Consequently, for each facial expression category c, M_c aligned time series ~Z^m are output, counting the reference time-series data itself. The set ~D of pairs of aligned time series and facial expression category is expressed by the following expression as the learning data set after time expansion/contraction.

The first dynamic time expansion/contraction unit 160 stores the learning data set after time expansion/contraction, expressed by the above expression, in the post-expansion/contraction learning data set storage unit 18.

The early classifier learning unit 162 learns an early classifier for each facial expression category. An early classifier is a classifier that, given data of a motion assumed to be performed at a fixed speed, can identify the type of that motion as early as possible. Recognition of a facial expression using the per-category early classifiers follows the expression below and outputs the recognition result ~c_t for the input data Z.

An early classifier is prepared for each category c. Each early classifier outputs, as its final score H_c(Z) for category c, a weighted sum of the outputs of frame-based weak classifiers h_{c,t}, each of which takes the datum at one time as input and returns a per-frame score for category c. The category that maximizes the score H_c(Z) is the recognized facial expression category.
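The scoring expression itself is not reproduced in this text. The sketch below assumes it has the form H_c(Z) = sum over t of alpha_t · h_{c,t}(z_t), which is what "a weighted sum of the weak classifier outputs" suggests, with each weak classifier taken to be the linear form sign(w^T z) mentioned elsewhere in the description. The function names and the dictionary layout are illustrative assumptions.

```python
def sign(v):
    """Return +1 for non-negative input and -1 otherwise."""
    return 1 if v >= 0 else -1


def weak_output(w, z):
    """Linear weak classifier sign(w^T z) applied to one frame."""
    return sign(sum(wi * zi for wi, zi in zip(w, z)))


def early_score(weak_params, alphas, aligned):
    """H_c(Z): weighted sum of per-frame weak-classifier outputs.

    zip() stops at the shortest input, so a partial (early)
    observation is scored using only the frames available so far.
    """
    return sum(
        alpha * weak_output(w, z)
        for w, alpha, z in zip(weak_params, alphas, aligned)
    )


def recognize(classifiers, aligned_by_category):
    """Return (best_category, best_score) over all categories.

    classifiers         : dict c -> (list of w vectors, list of alphas)
    aligned_by_category : dict c -> input aligned to category c's reference
    """
    scores = {
        c: early_score(w_list, a_list, aligned_by_category[c])
        for c, (w_list, a_list) in classifiers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Each category is scored on its own aligned copy of the input, because the input is warped separately against every category's reference series before classification.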

The early classifier learning unit 162 learns the combination of these weak classifiers and their classification weights from the learning data set ~D after time expansion/contraction, and generates N_c two-class strong classifiers H_c, each of which outputs whether or not the time-series data Z to be recognized belongs to the target facial expression category c.

The frame-based weak classifier h_{c,t}(z): R^{DK} → {1, −1} is determined by maximizing the classification score r_{c,t}, as expressed by the following expression.

Here, g_c(c′): {1, …, N_c} → {1, −1} is a function that takes a category c′ as input and outputs 1 if c = c′ and −1 otherwise. D_t(m, c) ∈ R is the error weight of the m-th time series with respect to category c. Any classifier may be used as the weak classifier h_{c,t}. Here, a simple linear classifier sign(w^T z) is used, where w ∈ R^{DK} is the parameter to be optimized. Any method may be used to optimize w; for example, the simplex method, which updates the parameter values iteratively, may be used.

For each category c, this error weight is computed sequentially according to the following expression, starting at time t = 1 and increasing t by 1 up to t = T_c.

Here, α_t is the classification weight applied to the output of the weak classifier h_{c,t}, and is expressed by the following expression.
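The expressions for r_{c,t}, D_t(m, c), and α_t are not reproduced in this text. As a stand-in only, the following sketch uses the standard AdaBoost-style update with one boosting round per frame t; whether the patent uses exactly these formulas is an assumption, and the function name `boosting_step` is hypothetical.

```python
import math


def boosting_step(weights, labels, outputs):
    """One round of a standard AdaBoost-style update (an assumption).

    weights : current error weights D_t(m) over training series, summing to 1
    labels  : g_c(c^m) in {+1, -1} for each training series m
    outputs : weak-classifier outputs h_{c,t}(z_t^m) in {+1, -1}
    Returns (r, alpha, new_weights).
    """
    # Weighted agreement between weak outputs and labels, in [-1, 1].
    r = sum(d * g * h for d, g, h in zip(weights, labels, outputs))
    # Clamp so that alpha stays finite if the weak classifier is perfect.
    r_safe = max(min(r, 1 - 1e-10), -1 + 1e-10)
    alpha = 0.5 * math.log((1 + r_safe) / (1 - r_safe))
    # Raise the weight of misclassified series and lower the others,
    # then renormalize so the weights again sum to 1.
    raw = [d * math.exp(-alpha * g * h) for d, g, h in zip(weights, labels, outputs)]
    total = sum(raw)
    return r, alpha, [x / total for x in raw]
```

Under this assumed update, a weak classifier that agrees with the labels more often receives a larger weight, and series misclassified at frame t count more when the weak classifier for frame t + 1 is fitted.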

<Facial expression recognition processing>
Facial expression recognition receives time-series data z of unknown facial expression category from the input unit 10. Each category is hypothesized in turn, and the data are time-warped against that category's reference time-series data. The resulting time-warped time series, one per category, are each fed to the early classifier of the corresponding category to compute a score, and the category with the highest score is output as the estimated facial expression category ^c.

In the present embodiment, the time-series data z_{1:t} obtained up to the current time t is taken as input, and the recognition result ^c_t for the facial expression at the current time t is output as described below.

First, with t denoting the current time, if the recognition result at the immediately preceding time t−1 is expressionless, that is, ^c_{t−1} = 0, the facial expression recognition unit 22 appends the current datum z_t (the difference data of the feature vector) to the time-series data Z to be aligned. The resulting time-series data Z is input to the second dynamic time expansion/contraction unit 24.

If, on the other hand, the recognition result at the preceding time t−1 is other than expressionless, that is, ^c_{t−1} ≠ 0, the facial expression recognition unit 22 first empties the time-series data Z to be aligned. It then determines whether the current expression is expressionless, judging by whether the datum z_t at time t is close to the data observed when the face is expressionless. As one such criterion, the face is judged expressionless here if the following expression holds, and not expressionless otherwise.

Here, x_{k,d} and σ_{k,d} denote the d-th dimension component of the k-th feature point and the standard deviation of that component when the face is expressionless. The standard deviation σ_{k,d} is computed from the first frames, which are assumed to be expressionless, collected from every time series in the learning data set. If this test judges the face not to be expressionless, the current expression is recognized as being the same as the recognition result at the preceding time.

The second dynamic time expansion/contraction unit 24 treats the input time-series data Z as the time series to be warped, aligns it against the reference time-series data Z^c of every facial expression category, and obtains a time-warped result ~Z_c for each facial expression category c. The alignment method is the same as in the first dynamic time expansion/contraction unit 160 of the facial expression learning unit 16. In this case, however, it suffices to output the result only up to the time t^c of the reference time-series data that corresponds to the final (current) time t of the input time series. As a result, a set of N_c aligned time series (test data after time expansion/contraction) is output. The second dynamic time expansion/contraction unit 24 inputs each ~Z_c to the early recognition unit 26 to obtain the recognition result ^c_t.

As described below, the early recognition unit 26 recognizes the facial expression category from the set ~Z of aligned time series and outputs the recognition result ^c_t.

First, a candidate recognition result ~c_t, chosen from the categories other than expressionless, is obtained according to the following expression.

In the above expression, the time series ~Z_c obtained by aligning the input against the reference time-series data Z^c of facial expression category c (the test data after time expansion/contraction) is input to the early classifier of that category to obtain a score, and the category c with the maximum score is taken as the candidate recognition result ~c_t.

If the score of the candidate ~c_t obtained from the above expression is greater than or equal to a predetermined threshold α, the final recognition result is output as ^c_t = ~c_t; otherwise ^c_t is output as expressionless (^c_t = 0). The threshold α can be set by preparing in advance data in which the face stays expressionless throughout, inputting that data to the early classifier of each category, and taking the average of the resulting scores H_c(Z).
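The candidate selection and threshold test can be sketched as follows, assuming the per-category scores have already been computed. The names `decide` and `threshold_from_neutral` and the dictionary layout are illustrative assumptions; label 0 stands for expressionless, as in the text.

```python
def decide(scores, threshold):
    """Return the final label: a category id, or 0 for expressionless.

    scores    : dict mapping each non-expressionless category id
                to its early-classifier score H_c
    threshold : the predetermined threshold (alpha in the text)
    """
    # Candidate: the category whose early classifier scores highest.
    candidate = max(scores, key=scores.get)
    # Accept the candidate only if its score reaches the threshold.
    return candidate if scores[candidate] >= threshold else 0


def threshold_from_neutral(neutral_scores):
    """Average of the scores obtained on data that stays expressionless."""
    return sum(neutral_scores) / len(neutral_scores)
```

Falling back to label 0 below the threshold keeps a weak, ambiguous early response from being reported as an expression.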

The final recognition result ^c_t is output to the user from an output unit (not shown).

<Operation of the facial expression recognition device>
Next, the operation of the facial expression recognition device according to the present embodiment will be described. When a plurality of time series whose facial expression categories are known in advance are input, together with the facial expression category of each time series as a teacher signal, the facial expression recognition device executes the learning processing routine shown in FIG. 4.

First, in step 100, the input of a plurality of time series, each labeled with its facial expression category as a teacher signal, is accepted and stored in the learning data set storage unit 12 as the learning data set. In the next step 102, reference time-series data is determined for each facial expression category from the learning data set stored in the learning data set storage unit 12, and the result is stored in the learning data set storage unit 12.

In step 104, one of the facial expression categories given as teacher signals is set as the category to be learned. In step 106, each time series labeled with the category to be learned, other than the reference time-series data, is stretched or compressed in time so as to align with the reference time-series data of that category, and the time series after time expansion/contraction is stored in the post-expansion/contraction learning data set storage unit 18.

In step 108, the early classifier for the category to be learned is learned using the time series after time expansion/contraction obtained in step 106 and the reference time-series data, all of which are labeled with the category to be learned, and the learning result is stored in the early classifier storage unit 20.

In step 110, it is determined whether steps 104 to 108 have been executed for all facial expression categories. If any facial expression category remains for which steps 104 to 108 have not been executed, the routine returns to step 104 and sets that category as the category to be learned. When steps 104 to 108 have been executed for all facial expression categories, the learning processing routine ends.

While time-series data of the coordinate values (position information) of a plurality of feature points placed on the target person's face is being obtained from a motion capture system, the facial expression recognition device executes the facial expression recognition processing routine shown in FIG. 5.

In step 120, the routine initializes ^c_0 = 0 and Z = [ ] (the time-series data Z to be aligned is empty). In the next step 122, the variable t indicating the current time is set to 1, and in step 124 the datum z_t at time t (the difference data of the feature vector x) is acquired.

In step 126, it is determined whether the facial expression category recognized at the preceding time t−1 is expressionless, that is, whether ^c_{t−1} = 0. If ^c_{t−1} = 0, then in step 128 the datum z_t of the current time is appended to the time-series data Z to be aligned. This operation is written Z ← [Z, z_t].

In step 130, for each facial expression category, the time-series data Z (for example, Z = {z_t}) is taken as the time series to be warped and is stretched or compressed in time so as to align with the reference time-series data Z^c of that category, yielding the time series ~Z_c after time expansion/contraction for each facial expression category c.

In the next step 132, a score is computed for each facial expression category c from the time series ~Z_c after time expansion/contraction obtained in step 130 and the early classifier of that category.

In step 134, the facial expression category ^c_t at the current time t is recognized from the scores computed in step 132 and output, and the routine proceeds to step 144.

If, on the other hand, it is determined in step 126 that the facial expression category recognized at the preceding time t−1 is not expressionless, then in step 136 the time-series data Z to be aligned is emptied. This operation is written Z ← [ ].

In the next step 138, it is determined from the datum z_t at the current time t whether the facial expression category at the current time t is expressionless. If it is, then in step 140 the facial expression category at the current time t is recognized as expressionless (^c_t = 0) and output, and the routine proceeds to step 144.

If, on the other hand, it is determined in step 138 that the facial expression category at the current time t is not expressionless, then in step 142 the facial expression category ^c_t at the current time t is recognized as the same category as that recognized at the preceding time t−1 (^c_t = ^c_{t−1}) and output, and the routine proceeds to step 144.

In step 144, the time t is incremented by 1, and the routine returns to step 124.

In this way, the facial expression recognition processing routine outputs the recognition result ^c_t of the facial expression category at each time t.
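The control flow of steps 120 to 144 can be sketched as a small loop. The warping and scoring of steps 130 to 134 and the expressionless test of step 138 are passed in as callables; the names `recognition_loop`, `is_expressionless`, and `recognize_from` are hypothetical placeholders, not components named in the source.

```python
def recognition_loop(frames, is_expressionless, recognize_from):
    """Run steps 120 to 144 over `frames`; return one label per time.

    frames            : iterable of per-frame data z_t
    is_expressionless : callable z_t -> bool (the test of step 138)
    recognize_from    : callable Z -> label covering steps 130 to 134,
                        where label 0 means expressionless
    """
    prev = 0       # step 120: recognition result at time 0 is 0
    buffer = []    # step 120: Z = []
    results = []
    for z in frames:                          # steps 122, 124, 144
        if prev == 0:                         # step 126
            buffer = buffer + [z]             # step 128: Z <- [Z, z_t]
            current = recognize_from(buffer)  # steps 130 to 134
        else:
            buffer = []                       # step 136: Z <- []
            if is_expressionless(z):          # step 138
                current = 0                   # step 140
            else:
                current = prev                # step 142
        results.append(current)
        prev = current
    return results
```

Once an expression has been recognized, the loop holds that label until the face is judged expressionless again; buffering and early classification restart from the frame after that.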

As described above, the facial expression recognition device according to the present embodiment stretches or compresses in time the time-series data z_t recorded as the face changes from the expressionless state, so that the change in z_t corresponds to the reference time-series data used when the early classifier of each facial expression category was learned, and then recognizes the facial expression category of the person's face using the early classifier of each of the plurality of facial expression categories. As a result, the facial expression category can be recognized accurately and early even when the change of expression proceeds at various speeds.

Furthermore, the facial expression category can be recognized correctly immediately after the expression begins to appear, that is, even at a stage where the face has not yet changed much from expressionless. For example, even when an expression appears for an instant and is immediately concealed by another expression, the concealed expression can be recognized. Such concealment occurs when a negative expression that is rather undesirable in social situations, displayed involuntarily, momentarily, and subtly as a result of an emotion such as anger or disgust, is immediately hidden by a positive or neutral expression such as a smile. Recognizing such concealed expressions is considered important for accurately estimating the emotion of the target person.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the facial expression recognition device described above contains a computer system; where a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although the present specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
12 Learning data set storage unit
14 Reference time-series data determination unit
16 Facial expression learning unit
18 Post-expansion/contraction learning data set storage unit
20 Early classifier storage unit
22 Facial expression recognition unit
24 Second dynamic time expansion/contraction unit
26 Early recognition unit
160 First dynamic time expansion/contraction unit
162 Early classifier learning unit

Claims

A facial expression recognition device for recognizing the facial expression category of a person's face based on time-series data of facial feature values obtained from a moving image capturing a region that includes the person's face, the device comprising:
facial expression learning means for learning, for each of a plurality of facial expression categories, an early classifier for identifying at an early stage whether an expression belongs to that facial expression category, based on reference time-series data, obtained in advance, of a change from a predetermined facial expression category to that facial expression category;
data acquisition means for acquiring time-series data of the facial feature values, obtained from the moving image, indicating a change from the predetermined facial expression category;
time expansion/contraction means for generating, for each of the plurality of facial expression categories, time-expanded/contracted time-series data by expanding or contracting the time-series data acquired by the data acquisition means in the time direction so that the change in the facial feature values corresponds to the reference time-series data of the change to that facial expression category; and
facial expression recognition means for recognizing the facial expression category of the person's face based on the time-expanded/contracted time-series data generated for each of the plurality of facial expression categories and the early classifier of each of the plurality of facial expression categories.

The facial expression recognition device according to claim 1, further comprising learning time expansion/contraction means for generating, for each of the plurality of facial expression categories, learning time-expanded/contracted time-series data by expanding or contracting learning time-series data, prepared in advance, of a change from the predetermined facial expression category to that facial expression category in the time direction so that the change in the facial feature values corresponds to the reference time-series data of the change to that facial expression category,
wherein the facial expression learning means learns, for each of the plurality of facial expression categories, the early classifier based on the reference time-series data and the learning time-expanded/contracted time-series data for that facial expression category.

The facial expression recognition device according to claim 1, wherein the facial expression recognition means calculates, for each of the plurality of facial expression categories, a score indicating the degree to which the expression belongs to that facial expression category, based on the time-expanded/contracted time-series data generated for each of the plurality of facial expression categories and the early classifier of each of the plurality of facial expression categories, and recognizes the facial expression category of the person's face based on the scores.

The facial expression recognition device according to any one of claims 1 to 3, wherein the facial feature value is a vector indicating the difference between each coordinate of a set of points of interest on the person's face and the corresponding coordinate of the set of points of interest on the face in the predetermined facial expression category.

The facial expression recognition device according to claim 1, wherein, when the predetermined facial expression category was recognized at the preceding time, the data acquisition means acquires the time-series data including the facial feature values at the current time.

The facial expression recognition device according to claim 1, wherein the predetermined facial expression category is expressionless.

A facial expression recognition method for recognizing the facial expression category of a person's face based on time-series data of facial feature values obtained from a moving image capturing a region that includes the person's face, the method comprising the steps of:
learning, by facial expression learning means, for each of a plurality of facial expression categories, an early classifier for identifying at an early stage whether an expression belongs to that facial expression category, based on reference time-series data, obtained in advance, of a change from a predetermined facial expression category to that facial expression category;
acquiring, by data acquisition means, time-series data of the facial feature values, obtained from the moving image, indicating a change from the predetermined facial expression category;
generating, by time expansion/contraction means, for each of the plurality of facial expression categories, time-expanded/contracted time-series data by expanding or contracting the time-series data acquired by the data acquisition means in the time direction so that the change in the facial feature values corresponds to the reference time-series data of the change to that facial expression category; and
recognizing, by facial expression recognition means, the facial expression category of the person's face based on the time-expanded/contracted time-series data generated for each of the plurality of facial expression categories and the early classifier of each of the plurality of facial expression categories.

A program for causing a computer to function as each means of the facial expression recognition device according to any one of claims 1 to 6.