JP2012008779A

JP2012008779A - Expression learning device, expression recognition device, expression learning method, expression recognition method, expression learning program and expression recognition program

Info

Publication number: JP2012008779A
Application number: JP2010143751A
Authority: JP
Inventors: Shiro Kumano; 史朗熊野; Kazuhiro Otsuka; 和弘大塚; Dan Mikami; 弾三上; Atsushi Yamato; 淳司大和; Eisaku Maeda; 英作前田; Yoichi Sato; 洋一佐藤; Ro-Me Soh; 鷺梅蘇
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-06-24
Filing date: 2010-06-24
Publication date: 2012-01-12
Anticipated expiration: 2030-06-24
Also published as: JP5485044B2

Abstract

PROBLEM TO BE SOLVED: To recognize not only the category of a facial expression but also its dynamic properties. SOLUTION: An expression recognition device comprises: input means for outputting position information of a plurality of feature points on a face from time-series image data; category embedding means for calculating category space coordinates by projecting the position information of the plurality of feature points onto a category space of expressions; dynamic property embedding means for calculating dynamic property space coordinates by projecting the position information of the plurality of feature points onto a dynamic property space; and expression recognition means for estimating the expression category of the expression recognition target person by referring to pre-stored category manifold information and identifying the category manifold information to which the category space coordinates belong, and for estimating the dynamic property of the expression of the target person by referring to pre-stored dynamic property manifold information and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

Description

The present invention relates to a facial expression learning device, a facial expression recognition device, a facial expression learning method, a facial expression recognition method, a facial expression learning program, and a facial expression recognition program for recognizing the facial expressions of a person.

Facial expressions are said to be the most basic nonverbal behavior for communicating emotions to others. For this reason, image-based facial expression recognition has been studied actively, mainly in the field of computer vision. However, although conventional facial expression recognition methods can recognize the expression category, they can hardly be said to have reached a level at which complex dynamic properties of an expression can be recognized, that is, properties related to its change over time, such as the speed and intensity with which it is displayed.

For example, Non-Patent Document 1 proposes a method that probabilistically models how facial expressions transition over time within a manifold learned using Supervised Locality Preserving Projections, and estimates the expression category from an input video within a Bayesian estimation framework. Non-Patent Document 2 proposes a method that models how the three-dimensional shape of the face deforms as the expression changes, and estimates the expression category from the input three-dimensional face shape using a two-dimensional HMM.

Non-Patent Document 1: Caifeng Shan, Shaogang Gong, and Peter W. McOwan: "Dynamic facial expression recognition using a Bayesian temporal manifold model", In Proc. of the British Machine Vision Conf, Vol.1, pp. 297-306, 2006.
Non-Patent Document 2: Yi Sun and Lijun Yin: "Facial expression recognition based on 3D dynamic range model sequences", In Proc. of the Tenth European Conference on Computer Vision, pp. 58-71, 2008.

However, conventional facial expression recognition methods, including the two methods of Non-Patent Documents 1 and 2, have the problem that, although they can recognize the expression category, they have not reached a level at which complex dynamic properties of the expression can be recognized.

The present invention has been made in view of these circumstances. Its object is to provide a facial expression learning device, a facial expression recognition device, a facial expression learning method, a facial expression recognition method, a facial expression learning program, and a facial expression recognition program that, based on information about the movement of feature points such as the eyes and mouth on the face, make it possible to recognize not only the category of a facial expression but also complex dynamic properties relating to the speed and intensity with which the expression is displayed.

The present invention is characterized by comprising: input means for outputting, from time-series image data, training data consisting of position information of a plurality of feature points on a face; category embedding means for generating category manifold information, separated depending only on the expression category, by projecting the training data output from the input means onto a category space of expressions; category manifold information storage means for storing the category manifold information; dynamic property embedding means for generating dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the training data output from the input means onto a dynamic property space of expressions; and dynamic property manifold information storage means for storing the dynamic property manifold information.

The present invention is characterized by comprising: category manifold information storage means for storing category manifold information obtained by projecting training data for facial expression recognition onto a category space of expressions; dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the training data onto a dynamic property space of expressions; input means for outputting, from time-series image data, position information of a plurality of feature points on the face of an expression recognition target person; category embedding means for obtaining category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of expressions; dynamic property embedding means for obtaining dynamic property space coordinates by projecting the position information of the plurality of feature points output from the input means onto the dynamic property space of expressions; and expression recognition means for estimating the expression category of the target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and for estimating the dynamic property of the expression of the target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

The present invention is a facial expression learning method in a facial expression learning device comprising input means for outputting, from time-series image data, training data consisting of position information of a plurality of feature points on a face, category manifold information storage means for storing category manifold information, dynamic property manifold information storage means for storing dynamic property manifold information, category embedding means, and dynamic property embedding means. The method is characterized by comprising: a category embedding step in which the category embedding means generates category manifold information, separated depending only on the expression category, by projecting the training data output from the input means onto a category space of expressions, and stores it in the category manifold information storage means; and a dynamic property embedding step in which the dynamic property embedding means generates dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the training data output from the input means onto a dynamic property space of expressions, and stores it in the dynamic property manifold information storage means.

The present invention is a facial expression recognition method in a facial expression recognition device comprising category manifold information storage means for storing category manifold information obtained by projecting training data for facial expression recognition onto a category space of expressions, dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the training data onto a dynamic property space of expressions, input means for outputting, from time-series image data, position information of a plurality of feature points on the face of an expression recognition target person, category embedding means, dynamic property embedding means, and expression recognition means. The method is characterized by comprising: a category embedding step in which the category embedding means obtains category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of expressions; a dynamic property embedding step in which the dynamic property embedding means obtains dynamic property space coordinates by projecting the position information of the plurality of feature points output from the input means onto the dynamic property space of expressions; and an expression recognition step in which the expression recognition means estimates the expression category of the target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and estimates the dynamic property of the expression of the target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

The present invention is a facial expression learning program that causes a computer on a facial expression learning device to perform facial expression learning, the device comprising input means for outputting, from time-series image data, training data consisting of position information of a plurality of feature points on a face, category manifold information storage means for storing category manifold information, and dynamic property manifold information storage means for storing dynamic property manifold information. The program is characterized by causing the computer to execute: a category embedding step of generating category manifold information, separated depending only on the expression category, by projecting the training data output from the input means onto a category space of expressions, and storing it in the category manifold information storage means; and a dynamic property embedding step of generating dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the training data output from the input means onto a dynamic property space of expressions, and storing it in the dynamic property manifold information storage means.

The present invention is a facial expression recognition program that causes a computer on a facial expression recognition device to perform facial expression recognition, the device comprising category manifold information storage means for storing category manifold information obtained by projecting training data for facial expression recognition onto a category space of expressions, dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the training data onto a dynamic property space of expressions, and input means for outputting, from time-series image data, position information of a plurality of feature points on the face of an expression recognition target person. The program is characterized by causing the computer to execute: a category embedding step of obtaining category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of expressions; a dynamic property embedding step of obtaining dynamic property space coordinates by projecting the position information of the plurality of feature points output from the input means onto the dynamic property space of expressions; and an expression recognition step of estimating the expression category of the target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and estimating the dynamic property of the expression of the target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

According to the present invention, based on information about the movement of feature points such as the eyes and mouth on the face, it becomes possible to recognize not only the category of a facial expression but also complex dynamic properties relating to the speed and intensity with which it is displayed. In addition, expressions whose intensity is still small immediately after they begin to appear, that is, expressions that have changed only slightly from a neutral face, can also be recognized correctly. For example, it becomes possible to recognize an expression that appears for an instant and is then immediately concealed by another expression. Such concealment occurs when a negative expression that is undesirable in social situations, displayed involuntarily, momentarily, and subtly as a result of an emotion such as anger or disgust, is immediately hidden by another positive or neutral expression such as a smile. Recognizing such concealed expressions is important for accurately estimating the emotion of the target person.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a plurality of feature points arranged on the face of a target person.
FIG. 3 is a diagram showing an example of the types of facial expression motion.
FIG. 4 is a diagram showing an example of the result of embedding 3L-dimensional training data for three expression categories (happiness, surprise, and anger) into three dimensions.
FIG. 5 is a diagram showing an example of the result of embedding h-dimensional training data for three dynamic properties (normal, subtle, and exaggerated) into three dimensions.
FIG. 6 is a diagram illustrating the concept of the vector y_t.
FIG. 7 is a diagram illustrating the concept of different time window sizes in the vector y_t.

Hereinafter, a facial expression learning device and a facial expression recognition device according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the embodiment. In this figure, reference numeral 1 denotes an input unit that outputs, from time-series image data, the coordinate values (position information) of a plurality of feature points arranged on the face of the target person as shown in FIG. 2, as training data or input data (test data). Reference numeral 2 denotes a facial expression learning unit that takes as input the set of training data output from the input unit 1 and stored in the training data (X) set storage unit 4, generates manifolds for the expression categories and dynamic properties, and stores them in the category manifold (Y) storage unit 5 and the dynamic property manifold (Z) storage unit 6. The facial expression learning unit 2 is implemented as a computer device comprising a category embedding unit 21 and a dynamic property embedding unit 22.

Reference numeral 3 denotes a facial expression recognition unit. It takes as input the input data (test data) x_t (reference numeral 7) output from the input unit 1 and obtains the category space coordinates y_t (reference numeral 8) and the dynamic property space coordinates z_t (reference numeral 9). Based on which of the manifolds for the expression categories and dynamic properties generated by the facial expression learning unit 2 the input data x_t resembles, it outputs recognition result information 10 consisting of the estimated values of the expression category c and the dynamic property m for the input data. The facial expression recognition unit 3 is implemented as a computer device comprising a category embedding unit 31, a dynamic property embedding unit 32, and a category and dynamic property recognition unit 33. The facial expression learning unit 2 and the facial expression recognition unit 3 shown in FIG. 1 may be implemented on separate computer devices, or both may be implemented together on a single computer device.

Here, six basic expression categories are targeted: happiness, anger, surprise, fear, disgust, and sadness. That is, c ∈ {1, …, 6}, and the number of target categories is N_c = 6. As for dynamic properties, five are targeted, as shown in FIG. 3: normal, subtle, exaggerated, fast, and slow. That is, m ∈ {1, …, 5}, and the number of target dynamic properties is N_m = 5. Any states may be chosen as recognition targets for both the expression categories and the dynamic properties. For example, a combined state such as subtle and fast may be treated as one of the dynamic properties to be recognized.

Next, the training data will be described. Multiple displays of facial expressions covering the various categories c and dynamic properties m are prepared in advance for one or more persons. One display starts from a neutral face, shows the target expression, and ends on return to the neutral face. Each frame in such time-series data is regarded as independent data and treated as one training datum. The set of training data is denoted X = {x_i, c_i, m_i}_{i=1,…,N}, where i is the data number (ID) and N is the total number of frames of training data. The amount of training data and the variation within it affect recognition accuracy; at a minimum, the training data must contain at least one display of each category c and each dynamic property m to be recognized.

Next, the input data on which facial expression recognition is to be performed will be described. The input data is time-series data of feature point coordinates recorded while one person displays an expression of a certain category with a certain dynamic property. The feature vector measured at time t is denoted x_t.

Next, the input unit 1 shown in FIG. 1 will be described. The input unit 1 outputs, as a time series with high temporal resolution, the vector x = [x_{1,1}, …, x_{1,D}, x_{2,1}, …, x_{2,D}, …, x_{L,1}, …, x_{L,D}]^T ∈ R^{DL}, in which the D-dimensional coordinate values of L feature points arranged around facial parts such as the eyes and mouth are concatenated. This vector x is called the feature vector. Here, x_{i,d} denotes the d-th coordinate component of the i-th feature point. High temporal resolution is necessary to discriminate the complex dynamic properties of expressions. Here, as a means of measuring the three-dimensional (D = 3) coordinate values of the feature points at high speed, a motion capture system operating at, for example, 100 frames/sec is used. That is, with small markers attached to the surface of the subject, the person is filmed by multiple cameras; the footage serves as the time-series image data supplied to the input unit, and the three-dimensional coordinates of the markers are computed from their positions in each image.
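As a minimal sketch of how the feature vector x could be assembled from one frame of marker coordinates, the following Python snippet flattens L points of dimension D into a DL-dimensional list in the order x_{1,1}, …, x_{L,D}. The function name `build_feature_vector` and the sample coordinates are illustrative assumptions and do not appear in the specification.

```python
from typing import List, Sequence


def build_feature_vector(points: Sequence[Sequence[float]]) -> List[float]:
    """Flatten L feature points with D coordinates each into a DL-dimensional vector."""
    if not points:
        raise ValueError("at least one feature point is required")
    dim = len(points[0])
    if any(len(p) != dim for p in points):
        raise ValueError("all feature points must share the same dimension D")
    # Point-major order: all D coordinates of point 1, then point 2, and so on.
    return [float(coord) for point in points for coord in point]


# Hypothetical frame with L = 2 markers measured in D = 3 dimensions.
frame = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
x = build_feature_vector(frame)
```

A sequence of such vectors, one per captured frame, would then form the time series x_t that the later stages consume.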

To detect the marker positions, small dabs of green paint applied to the face can be used as markers and detected in color video. Alternatively, a material that reflects infrared light well can be used as the marker and detected in images captured under infrared illumination, with light of other wavelengths cut by a filter. If the feature points can be detected from the texture of the face alone, markers need not be used at all. The image coordinates (D = 2) of the feature points in a monocular camera image may also simply be used as the position information. The number and arrangement of feature points on the face are assumed to be the same for every person. Because known methods are used to measure these coordinate values at high speed, a detailed description is omitted here.

In the input unit 1, the feature vector x is normalized for each person with the neutral face as the reference. That is, it is transformed so that the neutral-face feature vectors x of all persons become equal. The normalization is performed as follows. First, the neutral-face feature vector x^BASE of one person is selected from the training data set described above. Feature vectors of any expression of that person are output unchanged. For every other person, the vector g(x) obtained by applying a projection g to each feature vector x is output. The parameters of the projection g are determined so that the person's neutral-face feature vector becomes as close as possible to x^BASE. One of the simplest choices for g is, for example, to scale each coordinate axis of the feature point coordinate space.

For example, if D = 3, there are three parameters, s_1, s_2, and s_3, and the projection g is expressed using a diagonal matrix. The three parameters are chosen as the values that minimize the sum of squared errors between the reference person's neutral-face feature vector x^BASE and the projected vector g(x) of each person's neutral-face feature vector, that is, according to the least-squares criterion. Alternatively, as in AAM (Active Appearance Models), a basis describing the variation of the feature vectors among individuals may be obtained, and the parameters computed so that a linear combination of the leading (principal) basis vectors is as close as possible to the feature vector x^BASE.
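The per-axis scaling can be sketched as follows, under two assumptions that the text does not state explicitly: the diagonal matrix diag(s_1, …, s_D) is applied to each feature point separately, and all coordinates are measured from a common origin. Under those assumptions the least-squares solution for each axis d has the closed form s_d = Σ_i p_{i,d} b_{i,d} / Σ_i p_{i,d}², where p_i are the person's neutral-face points and b_i the reference points. The names `fit_axis_scales` and `apply_scales` and the sample data are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def fit_axis_scales(neutral: Sequence[Point], base: Sequence[Point]) -> List[float]:
    """Least-squares scale for each axis that maps `neutral` as close as possible to `base`."""
    if not neutral or len(neutral) != len(base):
        raise ValueError("both faces need the same non-zero number of feature points")
    scales = []
    for d in range(len(base[0])):
        numerator = sum(p[d] * b[d] for p, b in zip(neutral, base))
        denominator = sum(p[d] ** 2 for p in neutral)
        if denominator == 0.0:
            raise ValueError("axis %d has no extent, so its scale is undefined" % d)
        scales.append(numerator / denominator)
    return scales


def apply_scales(points: Sequence[Point], scales: Sequence[float]) -> List[List[float]]:
    """Apply g(x): multiply every point by the diagonal scaling matrix."""
    return [[s * c for s, c in zip(scales, p)] for p in points]


# Hypothetical neutral faces with L = 2 feature points in D = 3 dimensions.
base_neutral = [(2.0, 4.0, 1.0), (-2.0, 2.0, 3.0)]   # reference person (x^BASE)
other_neutral = [(1.0, 8.0, 1.0), (-1.0, 4.0, 3.0)]  # person to be normalized
scales = fit_axis_scales(other_neutral, base_neutral)
normalized = apply_scales(other_neutral, scales)
```

Once fitted on the neutral face, the same scales would be applied to every frame of that person, so that expressions from different people are compared in a common coordinate frame.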

Next, the category embedding units 21 and 31 shown in FIG. 1 will be described. A category embedding unit embeds the input data x_t into an N_c-dimensional space in which each axis corresponds to one of the categories to be recognized (called the category space), and outputs the coordinates y_t ∈ R^{N_c} in that space. FIG. 4 shows an example of the manifolds (called category manifolds Y) formed when the training data are embedded in this low-dimensional category space. In FIG. 4, each point represents one frame. Because manifolds separated by expression category are formed in the category space, recognizing the expression category is easier there than in the original input data space. Here, it is assumed that the change in expression, that is, the state of the feature vector x_t, depends only on the intensity of the display at that moment, regardless of the dynamic property. Under this assumption, the category space is independent of the dynamic properties of the expression. That is, the manifolds formed in the category space, as in FIG. 4, are separated depending only on the expression category, and expressions of the same category with different dynamic properties appear as differences in how the point moves along the manifold of that category.

Next, the processing operation of the category embedding unit 21 of the facial expression learning unit 2 will be described. First, a distance matrix M ∈ R^{N×N} is created. The (i, j) component of this distance matrix M is the distance between the i-th and j-th learning data. The geodesic distance is used as the distance measure, and it is calculated as follows. First, a graph is constructed in which each learning datum is one node. If the j-th learning datum is judged to be adjacent to the i-th learning datum, a link is provided from the i-th learning datum to the j-th learning datum. The link carries the Euclidean distance between the two nodes as its value.

Adjacency is determined by k-nearest neighbors: for each datum, the k other data with the smallest Euclidean distances from it are selected. Finally, the (i, j) component M_{i,j} of the distance matrix M is set to the minimum, over all paths of one or more links connecting the i-th learning datum to the j-th learning datum, of the sum of the values of the links traversed. Besides k-nearest neighbors, another adjacency criterion may be used, for example linking every pair of data whose Euclidean distance is at or below a threshold.
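The graph construction and minimum-path-sum step described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `knn_graph` and `geodesic_matrix` are hypothetical, Dijkstra's algorithm is one possible way to find the minimum sum of link values, and assigning an infinite distance to unreachable pairs is an assumption the text does not address.

```python
import heapq
import math


def knn_graph(points, k):
    """Directed graph: node i links to its k nearest other points.

    Each link carries the Euclidean distance between the two nodes.
    """
    n = len(points)
    graph = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        graph.append([(j, d) for d, j in dists[:k]])
    return graph


def geodesic_matrix(points, k):
    """M[i][j] = minimum sum of link values over paths from i to j.

    Unreachable pairs get math.inf (an assumption of this sketch).
    """
    graph = knn_graph(points, k)
    n = len(points)
    M = []
    for src in range(n):
        dist = [math.inf] * n
        dist[src] = 0.0
        heap = [(0.0, src)]
        while heap:  # Dijkstra's algorithm from node src
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in graph[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        M.append(dist)
    return M
```

Because the links are directed (from each datum to its own neighbors), the resulting M is not necessarily symmetric, and a small k can leave some pairs unconnected.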

Then, based on the distance matrix M, the data are embedded into the N_c-dimensional space using Lipschitz embedding. Letting y_{i,c} denote the component along coordinate axis c when the i-th learning datum x_i is embedded in the category space,

$$y_{i,c} = \min_{j \in J_c} M_{i,j} \tag{1}$$

Here, J_c denotes the set of indices of the data in the learning data X whose expression category is c. The category manifold formed in the category space by the learning data set X in this way is denoted Y = {y_i}_{i=1,…,N}.
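As a hedged sketch, Lipschitz embedding conventionally takes the coordinate for a reference set to be the minimum distance to any member of that set. Assuming equation (1) has this form with J_c as the reference set, one row of the distance matrix can be embedded as follows. The function name `lipschitz_embed` and the example category labels are hypothetical.

```python
def lipschitz_embed(dist_row, labels, categories):
    """Embed one datum given its distances to all learning data.

    dist_row[j] is the (geodesic) distance to the j-th learning datum,
    labels[j] is that datum's expression category. The coordinate on
    axis c is the minimum distance to any learning datum of category c.
    """
    return [
        min(d for d, lab in zip(dist_row, labels) if lab == c)
        for c in categories
    ]


# Toy distance matrix for three learning data.
M = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
labels = ["joy", "joy", "anger"]
categories = ["joy", "anger"]
Y = [lipschitz_embed(row, labels, categories) for row in M]
```

In this toy case each learning datum has coordinate 0 on the axis of its own category, so the data of each category gather near their own axis origin.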

Next, the processing operation of the category embedding unit 31 of the facial expression recognition unit 3 will be described. First, the geodesic distance from the input data x_t to every learning datum is calculated. The distance from the input data x_t to the j-th learning datum is denoted M_{t,j}. For learning data that are among the k-nearest neighbors of the input data, M_{t,j} is the Euclidean distance between the input data and that learning datum. For every other learning datum, M_{t,j} is the minimum, over the k-nearest neighbors of the input data x_t, of the Euclidean distance between x_t and that neighbor plus the distance M_{k,j} from that neighbor (indexed k here) to the j-th learning datum.
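The distance computation for a new input can be sketched as follows, assuming the geodesic distance matrix M of the learning data has already been computed. The function name `distances_to_training` is hypothetical, and ties among nearest neighbors are broken by index order, which the text does not specify.

```python
import math


def distances_to_training(x, train_points, M, k):
    """Approximate geodesic distances from input x to every learning datum.

    For the k nearest learning data, the Euclidean distance is used
    directly. For the rest, the distance is the minimum over those k
    neighbors p of (Euclidean distance x -> p) + M[p][j].
    """
    n = len(train_points)
    eu = [math.dist(x, p) for p in train_points]
    nearest = sorted(range(n), key=lambda j: eu[j])[:k]
    nearest_set = set(nearest)
    out = []
    for j in range(n):
        if j in nearest_set:
            out.append(eu[j])
        else:
            out.append(min(eu[p] + M[p][j] for p in nearest))
    return out
```

The resulting row plays the role of M_{t,j} and can then be embedded in the same way as a row of the learning-data distance matrix.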

Then, the input data x_t is embedded into the N_c-dimensional space by Lipschitz embedding based on these distances. Letting y_{t,c} denote the component along coordinate axis c when the input data x_t is embedded in the category space, y_{t,c} is calculated from equation (1) with y_{t,c} and M_{t,j} substituted for y_{i,c} and M_{i,j}.

Next, the dynamic property embedding units 22 and 32 shown in FIG. 1 will be described. The dynamic property embedding units 22 and 32 embed information on how the target expression data moves in the category space into an N_m-dimensional space in which each axis corresponds to one of the dynamic properties to be recognized (referred to as the dynamic property space), and output the coordinates z_t ∈ R^{N_m} in that dynamic property space. FIG. 5 shows an example of the manifold formed when the learning data set X is embedded in such a low-dimensional dynamic property space (hereinafter referred to as the dynamic property manifold Z, with Z = {z_i}_{i=1,…,N}). In FIG. 5, each point represents one frame.

This dynamic property space is assumed not to depend on the expression category; that is, dynamic properties such as those shown in FIG. 3 are common to all categories. By creating a dynamic property space that is independent of category in this way, learning data need not be prepared for every combination of expression category and dynamic property. If learning data covering at least one expression instance is prepared for each dynamic property to be recognized, regardless of category, expressions of all categories having the same dynamic property can be recognized. However, the recognition accuracy depends on the amount of learning data and on the variation within it. In addition, compared with embedding the input data directly, in a single step, into an N_c × N_m-dimensional space in which category and dynamic property are mixed, the variation of the resulting manifolds can be reduced, so correct recognition can be expected even when learning from less data.

To create such a dynamic property space, a component corresponding to the expression intensity for a specific expression category is extracted from the input data y_t embedded in the category space, and this component is further embedded into a space with as many dimensions as there are target dynamic properties. The coordinates in the dynamic property space at this point are denoted z_t^(c).

The dynamic property embedding units 22 and 32 first take as input the coordinates y_t ∈ R^{N_c} of the input data x_t in the category space, extract from y_t the component y'_{c,t} corresponding to the expression intensity of each expression category c, and create a vector y_t^(c) that collects this component over a fixed time length (time window size) h. FIG. 6 shows a conceptual diagram of this vector. Here, the time window size h corresponds to 0.1 to 0.5 seconds, that is, h is about 10 to 50 if the frame rate of the input data is 100 frames/sec. FIG. 7 shows conceptual diagrams for different time window sizes h. The value of the time window size h may be set appropriately for the dynamic property to be recognized: a large h is suited to detecting slowly expressed facial expressions, and a small h is suited to detecting rapidly expressed ones.
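The relation between frame rate and window size, and the collection of intensity components over the window, can be illustrated with a short sketch. The function names are hypothetical, and the ordering of components inside the window (oldest first, ending at the current frame t) is an assumption, since the defining expression is not reproduced in the text.

```python
def window_size(frame_rate_hz, seconds):
    """Number of frames h covering the given duration."""
    return round(frame_rate_hz * seconds)


def window_vector(intensities, t, h):
    """Collect the h most recent intensity components ending at frame t.

    intensities[s] stands for the component y'_{c,s} of one category c.
    Returns None while fewer than h frames are available.
    """
    if t + 1 < h:
        return None
    return intensities[t - h + 1 : t + 1]
```

For 100 frames/sec input, `window_size(100, 0.1)` gives 10 and `window_size(100, 0.5)` gives 50, matching the range stated above.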

The component y'_{c,t} is calculated by equation (2).

[Equation (2)]

Here, Y_c denotes the set of vectors obtained by embedding into the category space the data in the learning data X whose expression category is c, the j-indexed term denotes the vector in Y_c with the j-th smallest geodesic distance from y, and ‖·‖ denotes the L2 norm.

By including temporally consecutive data in the vector y_t^(c) that serves as the input to the embedding, the dynamic property of the facial expression is represented. Lipschitz embedding is then applied to this vector, and the coordinates z_t^(c) in the dynamic property space are output. This Lipschitz embedding is performed in the same way as in the category embedding unit, including the slight differences in processing between the facial expression learning unit 2 and the facial expression recognition unit 3. The only differences are that y^(c) is used as the input in place of x and that z^(c) is output in place of y.

Next, the category and dynamic property recognition unit 33 shown in FIG. 1 will be described. The category and dynamic property recognition unit 33 recognizes the expression category and dynamic property for the input data x_t and outputs the recognition results, namely the estimated category ĉ and the estimated dynamic property m̂. The estimates are the category and dynamic property that maximize the joint posterior probability of the expression category and dynamic property given the input data:

$$(\hat{c}, \hat{m}) = \arg\max_{c,\,m} \; p(c \mid x_t)\, p(m \mid c, x_t) \tag{3}$$

Here, p(c | x_t) is the posterior probability of expression category c given the input data x_t, and p(m | c, x_t) is the posterior probability of dynamic property m given the input data x_t and the expression category c. In this embodiment, equation (3) is solved exactly, that is, all combinations of category and dynamic property are examined exhaustively. An approximate method may also be used. For example, the category that maximizes p(c | x_t) may first be fixed as the estimate ĉ, and the dynamic property m that maximizes p(m | ĉ, x_t) for that category may then be taken as the estimate m̂.
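The difference between the exhaustive search and the two-step approximation can be seen in a small sketch. It assumes, as the text describes, that the joint posterior is the product of p(c | x_t) and p(m | c, x_t). The function names and the probability values are hypothetical and chosen only so that the two methods disagree.

```python
def map_joint(p_c, p_m_given_c):
    """Exhaustive search over all (category, dynamic property) pairs."""
    best, best_score = None, -1.0
    for c, pc in p_c.items():
        for m, pm in p_m_given_c[c].items():
            score = pc * pm
            if score > best_score:
                best, best_score = (c, m), score
    return best


def map_greedy(p_c, p_m_given_c):
    """Approximation: fix the best category first, then the best property."""
    c = max(p_c, key=p_c.get)
    m = max(p_m_given_c[c], key=p_m_given_c[c].get)
    return c, m


# Hypothetical posteriors for a single frame.
p_c = {"joy": 0.55, "surprise": 0.45}
p_m_given_c = {
    "joy": {"fast": 0.6, "slow": 0.4},
    "surprise": {"fast": 0.9, "slow": 0.1},
}
```

Here the exhaustive search selects ("surprise", "fast") because 0.45 × 0.9 exceeds 0.55 × 0.6, while the two-step method commits to "joy" first. This illustrates how considering the dynamic property jointly can change the category estimate.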

Although temporal transitions of the expression category and dynamic property are not considered here, time-series filtering may also be applied by assuming, for example, a Markov process.

Next, the processing operation for calculating the posterior probability of the category will be described. In this embodiment, the posterior probability p(c | x_t) of each expression category for the input data is calculated based on the distance to the manifold of that category when the input data x_t is embedded in the category space. Using Bayes' rule, p(c | x_t) is expanded as follows.

$$p(c \mid x_t) \propto p(y_t \mid c)\, p(c)$$

Here, p(y_t | c) is the likelihood of the coordinates y_t of the input data in the category space with respect to expression category c. p(c) is the prior probability of expression category c and is set to the proportion of category-c data in the learning data set.
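A minimal sketch of the prior and the Bayes-rule step follows. The prior is the proportion of each category in the learning data, as stated above. The normalization over categories and the function names are assumptions of this sketch, and the likelihood values are placeholders rather than outputs of the likelihood defined in the embodiment.

```python
from collections import Counter


def category_priors(labels):
    """p(c): proportion of each category among the learning data labels."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in counts}


def posterior(likelihood, prior):
    """Posterior proportional to likelihood * prior, normalized to sum to 1."""
    unnorm = {c: likelihood[c] * prior[c] for c in prior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


priors = category_priors(["joy", "joy", "anger", "joy"])
post = posterior({"joy": 0.2, "anger": 0.6}, priors)
```

In this example the rarer category has the higher likelihood, and the two effects cancel so that both posteriors come out close to 0.5.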

In this embodiment, the likelihood p(y_t | c) is defined as follows, based on the proportion of category-c data among the learning data that are the k-nearest neighbors of the coordinates y_t of the target data in the category space.

[Equation]

Here, τ is a constant satisfying 0 < τ ≪ 1, and β is a scale factor. These two parameters are set empirically so that the following holds.

[Equation]

Q_c(y) is defined using the k-nearest neighbors of the coordinates y in the category space:

[Equation]

Here, Y is the set of vectors obtained by embedding each element of the learning data set X into the category space, and I(y, c) is a function that returns 1 if the expression category of the learning datum corresponding to y is c, and 0 otherwise.
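Since the defining expression is not reproduced here, the following is only one plausible reading of the k-nearest-neighbor proportion: the count of category-c data among the k nearest embedded learning data, divided by k. The function name, the use of Euclidean distance in the category space, and the division by k are all assumptions of this sketch.

```python
import math


def knn_category_fraction(y, embedded_train, labels, c, k):
    """Fraction of the k nearest embedded learning data whose category is c.

    The comparison labels[i] == c plays the role of the indicator
    I(y, c) described in the text.
    """
    order = sorted(
        range(len(embedded_train)),
        key=lambda i: math.dist(y, embedded_train[i]),
    )
    nearest = order[:k]
    return sum(1 for i in nearest if labels[i] == c) / k


embedded = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["joy", "joy", "anger", "anger"]
q_joy = knn_category_fraction((0.2, 0.2), embedded, labels, "joy", 3)
q_anger = knn_category_fraction((0.2, 0.2), embedded, labels, "anger", 3)
```

A fraction of exactly 0 for some category would zero out its likelihood, which may be the reason the text introduces the small constant τ.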

Next, the processing operation for calculating the posterior probability of the dynamic property will be described. p(m | c, x_t) is the posterior probability of dynamic property m given the coordinates z_t^(c) of the input data x_t in the dynamic property space for category c. Using Bayes' rule, p(m | c, x_t) is expanded as follows.

$$p(m \mid c, x_t) \propto p(z_t \mid m)\, p(m)$$

Here, p(z_t | m) is the likelihood of the coordinates z_t of the input data x_t in the embedding space with respect to dynamic property m. p(m) is the prior probability of dynamic property m of the facial expression and is set to the proportion of dynamic-property-m data in the learning samples. p(z_t | m) is defined in the same way as in the calculation of the posterior probability for the category, with y_t replaced by z_t and c replaced by m.

Calculating the likelihood of the dynamic property requires data spanning the time window size h. Therefore, while the time elapsed since the data x_t began to be input is still less than h, the dynamic property is not considered; the category that maximizes p(y_t | c) is taken as the category estimate, and only this estimate is output.
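The start-up behavior described above can be sketched as a simple guard. The function name `recognize`, the use of `None` for an unavailable dynamic property, and the callable `joint_estimate` standing in for the full joint search are hypothetical.

```python
def recognize(frames_seen, h, category_likelihood, joint_estimate):
    """Return (category, dynamic_property) for the current frame.

    While fewer than h frames have been seen, only the category that
    maximizes the likelihood p(y_t | c) is returned, and the dynamic
    property is reported as None.
    """
    if frames_seen < h:
        c = max(category_likelihood, key=category_likelihood.get)
        return c, None
    return joint_estimate()


likelihood = {"joy": 0.7, "anger": 0.3}
early = recognize(3, 10, likelihood, lambda: ("anger", "fast"))
later = recognize(10, 10, likelihood, lambda: ("anger", "fast"))
```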

As described above, embedding is performed in two stages, that is, the category embedding process followed by the dynamic property embedding process, and at each stage the input data is projected onto a space with a distinct character: a space for categories and a space for dynamic properties. As a result, the first-stage spatial embedding forms manifolds separated depending only on the expression category, and the second-stage temporal embedding forms manifolds separated depending only on the dynamic property of the expression. Consequently, the category is easier to recognize in the category space of the first stage, and the dynamic property is easier to recognize in the dynamic property space of the second stage.

For example, immediately after an expression begins to appear, while its intensity is still small, it is difficult to identify the expression by considering its category alone. According to the present invention, dynamic properties are handled at the same time and the most plausible combination of category and dynamic property is searched for, so that the expression category can be recognized correctly as a result. Embedding into the category space is embedding in the spatial direction, and embedding into the dynamic property space is embedding in the temporal direction. Here, "spatial" refers to how much the expression at the moment of interest has changed from the neutral expression, that is, information on the expression category. "Temporal" refers to how the degree of expression change has evolved over time up to the moment of interest.

The input data are time-series data, obtained at high temporal resolution, of the coordinate values of feature points placed around facial parts such as the eyes and mouth. When the expression changes, the facial parts move and deform, and the coordinate values of the feature points change accordingly. Handling data with high temporal resolution makes it possible to represent the dynamic properties of an expression in detail and to recognize them. In the first stage of processing, the input data is embedded into the expression category space using Lipschitz embedding. Next, information on how the data moves over time in the category space is further embedded into the dynamic property space, again using Lipschitz embedding. Finally, the category and dynamic property are recognized and output based on which of the category and dynamic property manifolds, formed in advance in each space from the learning data, the expression in the input data is closest to.

This makes it possible, based on information about the movement of feature points such as those of the eyes and mouth on the face, to recognize not only the expression category but also complex dynamic properties relating to the speed and intensity of the expression. It also becomes possible to correctly recognize expressions whose intensity is still small immediately after they begin to appear, that is, expressions that have not changed much from the neutral expression. For example, an expression that appears for an instant and is immediately concealed by another expression can also be recognized.

A program for realizing the functions of the processing units in FIG. 1 may be recorded on a computer-readable recording medium, and the facial expression learning process and facial expression recognition process may be performed by causing a computer system to read and execute the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The invention is applicable to uses in which it is essential to recognize, based on information about the movement of feature points such as those of the eyes and mouth on the face, not only the expression category but also complex dynamic properties relating to the speed and intensity of the expression.

DESCRIPTION OF SYMBOLS: 1 ... input unit, 2 ... facial expression learning unit, 3 ... facial expression recognition unit, 4 ... learning data set storage unit, 5 ... category manifold storage unit, 6 ... dynamic property manifold storage unit

Claims

Input means for outputting learning data composed of positional information of a plurality of feature points on the face from time-series image data;
Category embedding means for generating category manifold information separated depending only on a facial expression category by projecting the learning data output from the input means onto a facial expression category space;
Category manifold information storage means for storing the category manifold information;
Dynamic property embedding means for generating dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the learning data output from the input means onto a dynamic property space of expressions; and
An expression learning device comprising: dynamic property manifold information storage means for storing the dynamic property manifold information.

Category manifold information storage means for storing category manifold information obtained by projecting learning data for facial expression recognition onto a category space of facial expressions;
Dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the learning data onto the dynamic property space of an expression;
Input means for outputting position information of a plurality of feature points on the face of the facial expression recognition target person from time-series image data;
Category embedding means for obtaining category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of the facial expression;
Dynamic property embedding means for obtaining dynamic property space coordinates by projecting the position information of the plurality of feature points output from the input means onto the dynamic property space of the facial expression;
An expression recognition device comprising: expression recognition means for estimating the expression category of the facial expression recognition target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and for estimating the dynamic property of the expression of the facial expression recognition target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

Input means for outputting learning data consisting of position information of a plurality of feature points on the face included in each image from time-series image data, category manifold information storage means for storing category manifold information, and dynamic properties A facial expression learning method in a facial expression learning device comprising dynamic property manifold information storage means for storing manifold information, category embedding means, and dynamic property embedding means,
A category embedding step in which the category embedding means generates category manifold information, separated depending only on the expression category, by projecting the learning data output from the input means onto a category space of expressions, and stores it in the category manifold information storage means;
A dynamic property embedding step in which the dynamic property embedding means generates dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the learning data output from the input means onto a dynamic property space of expressions, and stores it in the dynamic property manifold information storage means;
A facial expression learning method characterized by comprising:

A facial expression recognition method in a facial expression recognition device comprising category manifold information storage means for storing category manifold information obtained by projecting learning data for facial expression recognition onto a category space of expressions, dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the learning data onto a dynamic property space of expressions, input means for outputting position information of a plurality of feature points on the face of a facial expression recognition target person from time-series image data, category embedding means, dynamic property embedding means, and facial expression recognition means,
A category embedding step in which the category embedding means obtains category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of the facial expression;
A dynamic property embedding step in which the dynamic property embedding means obtains dynamic property space coordinates by projecting the position information of the plurality of feature points output from the input means onto the dynamic property space of the facial expression; and
A facial expression recognition method comprising: a facial expression recognition step in which the facial expression recognition means estimates the expression category of the facial expression recognition target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and estimates the dynamic property of the expression of the facial expression recognition target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong.

Input means for outputting learning data consisting of position information of a plurality of feature points on the face from time-series image data, category manifold information storage means for storing category manifold information, and dynamic property manifold information are stored A facial expression learning program for causing a computer on a facial expression learning device comprising dynamic property manifold information storage means to perform facial expression learning,
A category embedding step of generating category manifold information, separated depending only on the expression category, by projecting the learning data output from the input means onto a category space of expressions, and storing it in the category manifold information storage means; and
A dynamic property embedding step of generating dynamic property manifold information, separated depending only on the dynamic property of the expression, by projecting the learning data output from the input means onto a dynamic property space of expressions, and storing it in the dynamic property manifold information storage means; a facial expression learning program characterized by causing the computer to perform these steps.

A facial expression recognition program for causing a computer on a facial expression recognition device to perform facial expression recognition, the device comprising category manifold information storage means for storing category manifold information obtained by projecting learning data for facial expression recognition onto a category space of expressions, dynamic property manifold information storage means for storing dynamic property manifold information obtained by projecting the learning data onto a dynamic property space of expressions, and input means for outputting position information of a plurality of feature points on the face of a facial expression recognition target person from time-series image data,
A category embedding step for obtaining category space coordinates by projecting the position information of the plurality of feature points output from the input means onto the category space of the facial expression;
A dynamic property embedding step of obtaining dynamic property space coordinates by projecting the positional information of the plurality of feature points output from the input means onto the dynamic property space of the facial expression;
A facial expression recognition step of estimating the expression category of the facial expression recognition target person by referring to the category manifold information stored in the category manifold information storage means and identifying the category manifold information to which the category space coordinates belong, and estimating the dynamic property of the expression of the facial expression recognition target person by referring to the dynamic property manifold information stored in the dynamic property manifold information storage means and identifying the dynamic property manifold information to which the dynamic property space coordinates belong; a facial expression recognition program characterized by causing the computer to perform these steps.