JP2019517693A

JP2019517693A - System and method for facial expression recognition and annotation

Info

Publication number: JP2019517693A
Application number: JP2018562947A
Authority: JP
Inventors: Aleix Martinez
Original assignee: Ohio State Innovation Foundation
Priority date: 2016-06-01
Filing date: 2017-06-01
Publication date: 2019-06-24
Anticipated expiration: 2037-06-01
Also published as: JP7063823B2; EP3465615A1; EP3465615A4; US20190294868A1; WO2017210462A1; US20220254191A1; KR102433971B1; US11314967B2; KR20190025564A

Abstract

The invention disclosed and claimed herein includes, in its aspects, systems and methods for identifying AUs and emotion categories in images. The systems and methods use a set of images that includes images of people's faces. They analyze the facial images to determine AUs and the facial color, caused by changes in facial blood flow, that represents an emotion category. In some aspects, the analysis can include Gabor transforms to determine AUs, AU intensities and emotion categories. In other aspects, the systems and methods can include color variance analysis to determine AUs, AU intensities and emotion categories. In further aspects, the analysis can include deep neural networks trained to determine AUs, emotion categories and their intensities.

Description

Government License Rights
This invention was made with government support under Grant Nos. R01-EY-020834 and R01-DC-014498 awarded by the National Eye Institute and the National Institute on Deafness and Other Communication Disorders. Both institutes are part of the National Institutes of Health. The government has certain rights in the invention.

This application claims the benefit of U.S. Provisional Patent Application No. 62/343,994, entitled "System and Method for Facial Expression Recognition and Annotation," filed June 1, 2016. The entirety of that application is incorporated herein by reference.

Basic research in face perception and emotion theory can leverage large annotated databases of images and video sequences of facial expressions of emotion. Some of the most useful and commonly needed annotations are action units (AUs), AU intensities, and emotion categories. Small and medium-sized databases can be manually annotated by expert coders over several months, but large databases cannot. For example, even if expert coders could annotate each face image very quickly (e.g., 20 seconds per image), coding one million images would take 5,556 hours, which translates to 694 eight-hour working days, or 2.66 years of uninterrupted work.
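The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sanity check: the 20 seconds per image, the one million images and the 8-hour day come from the text, while the figure of 261 working days per year is an assumption introduced here to reproduce the quoted 2.66 years.

```python
# Sanity check of the manual-annotation cost estimate.
SECONDS_PER_IMAGE = 20        # from the text
NUM_IMAGES = 1_000_000        # from the text
HOURS_PER_WORKDAY = 8         # from the text
WORKDAYS_PER_YEAR = 261       # assumption: weekdays in a typical year

total_hours = NUM_IMAGES * SECONDS_PER_IMAGE / 3600
workdays = total_hours / HOURS_PER_WORKDAY
work_years = workdays / WORKDAYS_PER_YEAR

print(f"{total_hours:,.0f} hours")   # 5,556 hours
print(f"{workdays:.0f} workdays")    # 694 workdays
print(f"{work_years:.2f} years")     # 2.66 years
```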

Existing algorithms either do not recognize all the AUs needed for all applications, do not specify AU intensity, are too computationally demanding in space and/or time to work with large databases, or are tested only within specific databases (e.g., even when multiple databases are used, training and testing are generally done within each database independently).

The present disclosure provides a computer vision and machine learning process for recognizing action units (AUs), their intensities, and a large number (23) of basic and compound emotion categories across databases. Importantly, the illustrated process is the first to provide reliable recognition of AUs and their intensities across databases, and it runs in real time (30 images/second). These capabilities facilitate the automatic annotation of a large database of one million images of facial expressions of emotion "in the wild," a feat not attained by other systems.

Additionally, the images are semantically annotated with 421 emotion keywords.

A computer vision process for the recognition of AUs and AU intensities in images of faces is presented. Among other things, the process can reliably recognize AUs and AU intensities across databases. It is also demonstrated herein that the process can be trained using several databases to successfully recognize AUs and AU intensities on an independent database of images that was not used to train the classifiers. In addition, the process is used to automatically construct and annotate a large database of images of facial expressions of emotion. The images are annotated with AUs, AU intensities and emotion categories. The result is a database of one million images that can be readily queried by AU, AU intensity, emotion category and/or emotion keyword.

In addition, the process facilitates a comprehensive computer vision process for the identification of AUs from color features. To this end, color features can be successfully exploited for the recognition of AUs, yielding results superior to those obtained with the previously mentioned system. That is, the functions defining the color change as an AU goes from inactive to active, or vice versa, are consistent within an AU and differ between AUs. In addition, the process reveals how facial color changes can be exploited to identify the presence of AUs in videos filmed under a large variety of image conditions.

Additionally, facial color is used to determine the emotion of a facial expression. As described above, facial expressions of emotion in humans are produced by contracting one's facial muscles, generally called action units (AUs). In addition, the surface of the face is innervated with a large network of blood vessels. For example, anger increases blood flow to the face, resulting in a red face, whereas fear is associated with the drainage of blood from the face, resulting in a pale face. These visible facial colors allow for the interpretation of emotion in images of facial expressions even in the absence of facial muscle activation. Because this color signal is independent of that provided by AUs, the algorithms can detect emotion from AUs and from color independently.

In addition, a global-local loss function for deep neural networks (DNNs) is presented that can be used efficiently for the fine-grained detection of similar-object landmark points of interest as well as AUs and emotion categories. The derived local and global loss yields accurate local results without the need for a patch-based approach and results in fast and desirable convergence. The global-local loss function may be used for the recognition of AUs and emotion categories.

In some embodiments, the facial recognition and annotation processes are used in clinical applications.

In some embodiments, the facial recognition and annotation processes are used for the detection and evaluation of psychopathologies.

In some embodiments, the facial recognition and annotation processes are used to screen for post-traumatic stress disorder, e.g., in a military setting or an emergency room.

In some embodiments, the facial recognition and annotation processes are used to teach children with learning disorders (e.g., autism spectrum disorder) to recognize facial expressions.

In some embodiments, the facial recognition and annotation processes are used for advertising, e.g., for the analysis of people watching ads, for the analysis of people watching a movie, or for the analysis of people's responses in sports arenas.

In some embodiments, the facial recognition and annotation processes are used for surveillance.

In some embodiments, the recognition of emotion, AUs and other annotations is used to improve or identify web searches; e.g., the system is used to identify images of faces expressing surprise, or images of a specific person with eyebrows.

In some embodiments, the facial recognition and annotation processes are used in retail to monitor, evaluate or determine customer behavior.

In some embodiments, the facial recognition and annotation processes are used to organize the electronic pictures of an institution or individual, e.g., to organize a person's personal photos by emotion or AU.

In some embodiments, the facial recognition and annotation processes are used to monitor patients' emotions, pain and mental state in hospitals or clinical settings, e.g., to determine a patient's level of discomfort.

In some embodiments, the facial recognition and annotation processes are used to monitor a driver's behavior and attention to the road and other vehicles.

In some embodiments, the facial recognition and annotation processes are used to automatically select emoji, stickers or other emotive components of text messages.

In some embodiments, the facial recognition and annotation processes are used to improve online surveys, e.g., to monitor the emotional responses of online survey participants.

In some embodiments, the facial recognition and annotation processes are used in online teaching and tutoring.

In some embodiments, the facial recognition and annotation processes are used to determine whether a job applicant is a fit for a specific company; e.g., one company may be looking for attentive participants, whereas another may be interested in joyful personalities. In another example, the facial recognition and annotation processes are used to assess an individual's competence during a job interview or in an online video resume.

In some embodiments, the facial recognition and annotation processes are used in gaming.

In some embodiments, the facial recognition and annotation processes are used to evaluate patients' responses in a psychiatrist's office, clinic or hospital.

In some embodiments, the facial recognition and annotation processes are used to monitor babies and children.

In one aspect, a computer-implemented method is disclosed (e.g., for analyzing an image, possibly in real time, to determine AUs and AU intensities). The method includes maintaining, in memory (e.g., persistent memory), one or more kernel vector spaces of configural (or other shape) features and shading features. Each kernel space is associated with one or several action units (AUs) and/or AU intensity values and/or emotion categories. The method further includes receiving an image to be analyzed (e.g., an image of a facial expression received externally or from one or more databases) and, for each received image, i) determining face space data (e.g., a face vector space) of configural, shape and shading features of a face in the image (e.g., wherein the face space includes a shape feature vector of the configural features and a shading feature vector associated with shading changes in the face), and ii) determining one or more AU values for the image by comparing the determined face space data to the plurality of kernel spaces to determine the presence of AUs, AU intensities and emotion categories.

In some embodiments, the method includes processing, in real time, a video stream comprising a plurality of images to determine AU values and AU intensity values for each of the plurality of images.

In some embodiments, the face space data includes a shape feature vector of the configural features and a shading feature vector associated with shading changes in the face.

In some embodiments, the determined face space of configural, shape and shading features comprises i) distance values (e.g., Euclidean distances) between normalized landmarks in a Delaunay triangulation formed from the image and ii) the distances, areas and angles defined by each of the Delaunay triangles corresponding to the normalized facial landmarks.

In some embodiments, the shading feature vector associated with shading changes in the face is determined by applying Gabor filters to normalized landmark points determined from the face (e.g., to model shading changes due to the local deformation of the skin).

In some embodiments, the shape feature vector of the configural features comprises landmark points derived using a deep neural network (e.g., a convolutional neural network, DNN) that includes a global-local (GL) loss function configured to backpropagate both the local and the global fit of the landmark points projected onto the image and/or of the AUs and/or of the emotion categories.

In some embodiments, the method includes, for each received image, i) determining a face space associated with the color features of the face; ii) determining one or more AU values for the image by comparing this determined color face space to a plurality of color or kernel vector face spaces; and iii) modifying the color of the image such that the face appears to express a specific emotion or to have one or more AUs active or at specific intensities.

In some embodiments, the AU values and AU intensity values, collectively, define an emotion and an emotion intensity.

In some embodiments, the image comprises a photograph.

In some embodiments, the image comprises a frame of a video sequence.

In some embodiments, the image comprises an entire video sequence.

In some embodiments, the method includes receiving an image of a facial expression in the wild (e.g., from the Internet) and processing the received image to determine the AU values, AU intensity values and emotion category of a face in the received image.

In some embodiments, the method includes receiving a first plurality of images from a first database and a second plurality of images from a second database, and processing the received first and second pluralities of images to determine, for each image, the AU values and AU intensity values of a face in that image. The first plurality of images has a first captured configuration, and the second plurality of images has a second captured configuration. The first captured configuration differs from the second (e.g., a captured configuration includes the lighting scheme and magnitude, the image background, the focal plane, the capture resolution, the storage compression level, and the pan, tilt and yaw of the capture relative to the face).

In another aspect, a computer-implemented method is disclosed (e.g., for analyzing an image to determine AUs, AU intensities and emotion categories using color variation in the image). The method includes identifying changes that define the transition of an AU from inactive to active, the changes being selected from the group consisting of chromaticity, hue and saturation, and luminance; and applying a Gabor transform to the identified chromaticity changes (e.g., to gain invariance to the minimum of these changes during a facial expression).

In another aspect, a computer-implemented method for analyzing an image to determine AUs and AU intensities is disclosed. The method includes maintaining, in memory (e.g., persistent memory), a plurality of color feature data associated with AUs and/or AU intensities; receiving an image to be analyzed; and, for each received image, i) determining configural color features of a face in the image, and ii) determining one or more AU values for the image by comparing the determined configural color features to the plurality of trained color feature data to determine the presence of the determined configural color features in one or more of the trained color feature data.

In another aspect, a computer-implemented method is disclosed (e.g., for generating a repository of a plurality of face space data, each associated with an AU value and an AU intensity value, wherein the repository is used to classify face data of an image or video frame for AUs and AU intensities). The method includes analyzing a plurality of faces in images or video frames to determine kernel space data for a plurality of AU values and AU intensity values. Each kernel space data is associated with a single AU value and a single AU intensity value, and each kernel space is linearly or non-linearly separable from the other kernel face spaces.

In some embodiments, the step of analyzing the plurality of faces to determine the kernel space data includes generating a plurality of training sets of AUs for a predetermined number of AU intensity values and performing kernel subclass discriminant analysis to determine a plurality of kernel spaces, each of which corresponds to a given AU value, AU intensity value, emotion category, and intensity of that emotion.

In some embodiments, the kernel space includes functional color space feature data of an image or a video sequence.

In some embodiments, the functional color space is determined by performing a discriminant functional learning analysis (e.g., using a maximum-margin functional classifier) on color images each derived from a given image of a plurality of images.

In another aspect, a non-transitory computer-readable medium is disclosed. The computer-readable medium has instructions stored thereon that, when executed by a processor, cause the processor to perform any of the methods described above.

In another aspect, a system is disclosed. The system comprises a processor and a computer-readable medium having instructions stored thereon that, when executed by the processor, cause the processor to perform any of the methods described above.

FIG. 1 is a diagram illustrating the output of a computer vision process for automatically annotating emotion categories and AUs in face images in the wild.

FIG. 2, comprising FIGS. 2A and 2B, illustrates the detected facial landmarks and a Delaunay triangulation of an image.

FIG. 3 is a diagram showing a hypothetical model in which sample images with an active AU are divided into subclasses.

FIG. 4 shows an exemplary component diagram of a system that uses Gabor transforms to determine AUs and emotion categories.

FIG. 5 shows a color variance system for detecting AUs using color features in video and/or still images.

FIG. 6 shows a color variance system for detecting AUs using color features in video and/or still images.

FIG. 7 shows a network system for detecting AUs using deep neural networks in video and/or still images.

FIG. 8 shows an exemplary computer system.

Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild

FIG. 1 shows the resulting database of facial expressions, which can be readily queried (e.g., sorted, organized, etc.) by AU, AU intensity, emotion category, or emotion/affect keyword. The database facilitates the design of new computer vision algorithms as well as basic, translational and clinical studies in social and cognitive psychology, social and cognitive neuroscience, neuromarketing, psychiatry, and the like.

The database is compiled from the output of a computer vision system that automatically annotates emotion categories and AUs in face images in the wild (i.e., images not already curated in an existing database). The images are downloaded using a variety of web search engines by selecting only images of faces associated with emotion keywords in WordNet or another dictionary. FIG. 1 shows three example queries of the database. The top examples are the results of two queries, obtained when retrieving all images identified as happy and as fearful. Also shown is the number of images in the database of images in the wild that were annotated as either happy or fearful. The third query shows the results of retrieving all images with AU 4 or 6 present and all images with the emotion keywords "anxiety" and "disapproval."
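The kinds of queries described for FIG. 1 can be sketched over an in-memory list of records. This is a minimal illustration only: the `AnnotatedImage` record layout, the helper names and the sample data are hypothetical and are not taken from the source, which does not specify a storage schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AnnotatedImage:
    """Hypothetical record for one annotated face image (schema is assumed)."""
    path: str
    aus: Dict[int, str] = field(default_factory=dict)  # active AU -> intensity label
    emotion: Optional[str] = None                       # None if no category applies
    keywords: Set[str] = field(default_factory=set)     # emotion keywords


def with_emotion(db: List[AnnotatedImage], emotion: str) -> List[AnnotatedImage]:
    """Images annotated with the given emotion category."""
    return [img for img in db if img.emotion == emotion]


def with_any_au(db: List[AnnotatedImage], aus) -> List[AnnotatedImage]:
    """Images in which at least one of the given AUs is active."""
    wanted = set(aus)
    return [img for img in db if wanted.intersection(img.aus)]


def with_keyword(db: List[AnnotatedImage], keyword: str) -> List[AnnotatedImage]:
    """Images tagged with the given emotion keyword."""
    return [img for img in db if keyword in img.keywords]


# Made-up sample data for illustration.
db = [
    AnnotatedImage("a.jpg", {12: "C", 25: "B"}, "happy", {"joy"}),
    AnnotatedImage("b.jpg", {4: "D"}, None, {"anxiety"}),
    AnnotatedImage("c.jpg", {6: "A", 12: "B"}, "happy", {"delight"}),
]

happy = with_emotion(db, "happy")      # a.jpg, c.jpg
au_4_or_6 = with_any_au(db, [4, 6])    # b.jpg, c.jpg
anxious = with_keyword(db, "anxiety")  # b.jpg
```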

AU and Intensity Recognition

In some embodiments, the system for recognizing AUs can process over 30 images per second and is determined to be highly accurate across databases. The system achieves high recognition accuracy across databases and can run in real time. The system can facilitate classifying facial expressions into one of 23 basic and/or compound emotion categories. The categorization of emotion is given by the detected AU activation pattern. In some embodiments, an image may not belong to any of the 23 categories. In that case, the image is annotated with AUs only, without an emotion category. If an image has no active AUs, it is classified as a neutral expression. In addition to determining the emotion and emotion intensity in a face, the illustrated process can be used to identify "not a face" in an image.
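The three outcomes described above (a matched category, AUs only, or neutral) can be sketched as a lookup on the set of active AUs. The AU prototypes in `PROTOTYPES` are placeholders chosen for illustration, and the exact-match rule is an assumption; the source states only that the category is given by the detected AU activation pattern, and its 23 category definitions are not reproduced here.

```python
from typing import Iterable, Optional

# Hypothetical AU prototypes, for illustration only.
PROTOTYPES = {
    "happy": frozenset({12, 25}),
    "surprised": frozenset({1, 2, 25, 26}),
    "happily surprised": frozenset({1, 2, 12, 25}),
}


def categorize(active_aus: Iterable[int]) -> Optional[str]:
    """Map a set of active AUs to an emotion label.

    Returns "neutral" when no AU is active, the matching category name when
    the active set equals a prototype (assumed matching rule), and None when
    the image should be annotated with AUs only.
    """
    active = frozenset(active_aus)
    if not active:
        return "neutral"
    for name, prototype in PROTOTYPES.items():
        if prototype == active:
            return name
    return None


print(categorize([]))        # neutral
print(categorize([25, 12]))  # happy
print(categorize([4]))       # None
```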

Face Space for AU and Intensity Recognition

The system starts by defining the feature space used to represent AUs in face images. Human perception of faces, and of facial expressions in particular, is known to involve a combination of shape and shading analyses. The system can define shape features that facilitate the recognition of facial expressions of emotion. The shape features can be second-order statistics of the facial landmarks (i.e., distances and angles between landmark points in the face image). Because these features define the configuration of the face, they can alternatively be called configural features. It is understood that these terms may be used interchangeably in this application.

FIG. 2(a) shows an example of the normalized facial landmarks $\hat{s}_{ijk}$ used by the proposed algorithm. Several (e.g., fifteen) of the landmarks can correspond to anatomical landmarks (e.g., the corners of the eyes, mouth and eyebrows, the tip of the nose, and the chin). The other landmarks can be pseudo-landmarks that define the edges of the eyelids, mouth, brows, lips and jaw line, as well as the midline of the nose from the tip of the nose to the horizontal line given by the centers of the two eyes. The number of pseudo-landmarks defining the contour of each facial component (e.g., the brows) is constant, which provides equivalency of landmark position across different faces or people.

FIG. 2(b) shows the Delaunay triangulation performed by the system. In this example, the number of triangles in this configuration is 107. Also shown in the image are the angles of the vector $\theta_a = (\theta_{a1}, \ldots, \theta_{aq_a})^T$ ($q_a = 3$), which define the angles of the triangles emanating from the normalized landmark $\hat{s}_{ija}$.

Let $s_{ij} = (s_{ij1}^T, \ldots, s_{ijp}^T)^T$ be the vector of landmark points in the $j$-th sample image ($j = 1, \ldots, n_i$) of AU $i$, where $s_{ijk} \in \mathbb{R}^2$ are the 2D image coordinates of the $k$-th landmark and $n_i$ is the number of sample images in which AU $i$ is present. In some embodiments, the facial landmarks may be obtained with computer vision algorithms. For example, a computer vision algorithm can be used to automatically detect any number of landmarks (e.g., 66 detected landmarks in a test image), as shown in FIG. 2(a); with 66 landmarks, $s_{ij} \in \mathbb{R}^{132}$.

The training images can be normalized to have the same inter-eye distance of $\tau$ pixels. Specifically, $\hat{s}_{ij} = c\, s_{ij}$, where $c = \tau / \lVert l - r \rVert_2$, $l$ and $r$ are the image coordinates of the centers of the left and right eyes, $\lVert \cdot \rVert_2$ defines the 2-norm of a vector, $\hat{s}_{ij} = (\hat{s}_{ij1}^T, \ldots, \hat{s}_{ijp}^T)^T$ and $\tau = 300$. The location of the center of each eye can be readily computed as the geometric midpoint between the landmarks defining the two corners of the eye.
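The normalization step can be sketched as follows. This is a minimal illustration that assumes landmarks are given as (x, y) tuples and that the caller knows which landmark indices mark the two corners of each eye; the function name and the index arguments are hypothetical, and only the scaling by τ divided by the inter-eye distance, with τ = 300, comes from the text.

```python
import math


def normalize_landmarks(landmarks, left_eye_corners, right_eye_corners, tau=300.0):
    """Scale landmarks so that the distance between the eye centers is tau pixels.

    landmarks: list of (x, y) tuples.
    left_eye_corners / right_eye_corners: pairs of indices into `landmarks`
    for the two corners of each eye (assumed to be known by the caller).
    """
    def midpoint(i, j):
        (x1, y1), (x2, y2) = landmarks[i], landmarks[j]
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    l = midpoint(*left_eye_corners)    # center of the left eye
    r = midpoint(*right_eye_corners)   # center of the right eye
    c = tau / math.dist(l, r)          # scale factor: tau / ||l - r||_2
    return [(c * x, c * y) for x, y in landmarks]


# Toy example: four eye-corner landmarks on a horizontal line.
pts = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
norm = normalize_landmarks(pts, (0, 1), (2, 3))
```

In the toy example the eye centers are 4 pixels apart, so every coordinate is scaled by 75 and the normalized eye centers end up 300 pixels apart.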

The shape feature vector of configural features can be defined as

$$x_{ij} = (d_{ij12}, \ldots, d_{ij\,p-1\,p}, \theta_1^T, \ldots, \theta_p^T)^T,$$

where $d_{ijab} = \lVert \hat{s}_{ija} - \hat{s}_{ijb} \rVert_2$ is the Euclidean distance between normalized landmarks $a = 1, \ldots, p-1$ and $b = a+1, \ldots, p$, and $\theta_a = (\theta_{a1}, \ldots, \theta_{aq_a})^T$ are the angles defined by each of the Delaunay triangles emanating from the normalized landmark $\hat{s}_{ija}$, with $q_a$ the number of Delaunay triangles originating at $\hat{s}_{ija}$ and $\sum_{k=1}^{q_a} \theta_{ak} \le 360^\circ$ (equality holds for non-boundary landmark points). Because each triangle is defined by three angles and this example has 107 triangles, the shape feature vector contains 321 angles in total. In general, $x_{ij} \in \mathbb{R}^{p(p-1)/2 + 3t}$, where $p$ is the number of landmarks and $t$ is the number of triangles in the Delaunay triangulation. In this example, $p = 66$ and $t = 107$, so $x_{ij} \in \mathbb{R}^{2466}$.
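The dimensionality stated above can be checked with a short sketch. The helper names are hypothetical, and the Delaunay triangulation itself is assumed to be computed elsewhere (the triangles are taken as given); the code only illustrates the two ingredients of the shape vector, pairwise distances and triangle angles, and the resulting count.

```python
import math
from itertools import combinations


def pairwise_distances(landmarks):
    """Euclidean distance for every landmark pair (a, b) with a < b."""
    return [math.dist(a, b) for a, b in combinations(landmarks, 2)]


def triangle_angles(a, b, c):
    """Interior angles, in degrees, at vertices a, b, c of one triangle."""
    def angle_at(u, v, w):
        v1 = (v[0] - u[0], v[1] - u[1])
        v2 = (w[0] - u[0], w[1] - u[1])
        cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)


def shape_feature_dim(p, t):
    """p(p-1)/2 pairwise distances plus 3 angles per triangle."""
    return p * (p - 1) // 2 + 3 * t


print(shape_feature_dim(66, 107))  # 2145 distances + 321 angles
```

With $p = 66$ and $t = 107$ the count is $2145 + 321 = 2466$, matching the dimension given in the text.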

The system can use Gabor filters centered at each of the normalized landmark points $\hat{s}_{ijk}$ to model shading changes due to the local deformation of the skin. When a facial muscle group deforms the skin of the face locally, the reflectance properties of the skin change (e.g., the skin's bidirectional reflectance distribution function is defined as a function of the skin wrinkles, because wrinkles change the way light penetrates and travels between the epidermis and the dermis and may also change the hemoglobin levels), as does the foreshortening of the light source as seen from a point on the surface of the skin.

Cells in the early visual cortex of humans can be modeled by the system using Gabor filters. Face perception can use Gabor-like modeling to gain invariance to shading changes such as those seen when expressing emotions. In its standard form, the filter can be defined as

$$g(\hat{s}_{ijk}; \lambda, \alpha, \phi, \gamma) = \exp\!\left(-\frac{s_1^2 + \gamma^2 s_2^2}{2\sigma^2}\right) \cos\!\left(2\pi \frac{s_1}{\lambda} + \phi\right),$$

with $\hat{s}_{ijk} = (\hat{s}_{ijk1}, \hat{s}_{ijk2})^T$, $s_1 = \hat{s}_{ijk1}\cos\alpha + \hat{s}_{ijk2}\sin\alpha$, and $s_2 = -\hat{s}_{ijk1}\sin\alpha + \hat{s}_{ijk2}\cos\alpha$, where $\lambda$ is the wavelength (i.e., pixels per cycle), $\alpha$ is the orientation (i.e., the angle of the normal vector of the sinusoidal function), $\phi$ is the phase (i.e., the offset of the sinusoidal function), $\gamma$ is the (spatial) aspect ratio, and $\sigma$ is the scale of the filter (i.e., the standard deviation of the Gaussian window).
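To make the role of each parameter concrete, the following sketch evaluates a real-valued Gabor function on a small grid. It assumes the standard form, a Gaussian envelope multiplied by a cosine carrier; the function names, the default `sigma = wavelength / 2`, and the grid size are illustrative assumptions and are not taken from the source.

```python
import math


def gabor(x, y, wavelength, alpha, phase, gamma=1.0, sigma=None):
    """Standard real Gabor function evaluated at offset (x, y) from its center."""
    if sigma is None:
        sigma = wavelength / 2.0  # hypothetical default, not from the source
    # Rotate coordinates by the orientation alpha.
    s1 = x * math.cos(alpha) + y * math.sin(alpha)
    s2 = -x * math.sin(alpha) + y * math.cos(alpha)
    envelope = math.exp(-(s1 * s1 + gamma * gamma * s2 * s2) / (2.0 * sigma * sigma))
    carrier = math.cos(2.0 * math.pi * s1 / wavelength + phase)
    return envelope * carrier


def gabor_kernel(half_size, **params):
    """Sample the filter on a (2*half_size + 1) square grid, row by row."""
    span = range(-half_size, half_size + 1)
    return [[gabor(x, y, **params) for x in span] for y in span]


kernel = gabor_kernel(4, wavelength=4.0, alpha=0.0, phase=0.0)
```

A filter bank in the sense of the text would repeat this for each combination of wavelength, orientation, scale, and phase, and convolve each kernel with the image around every landmark.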

In some embodiments, a Gabor filter bank can be used with $o$ orientations, $s$ spatial scales, and $r$ phases. In one example Gabor filter, the wavelength takes values from a fixed set $\lambda$ of five values and $\gamma = 1$; these values are suitable for representing facial expressions of emotion. The values of $o$, $s$, and $r$ are learned using cross-validation on the training set.

Let $I_{ij}$ be the $j$-th sample image with AU $i$ present, and define

$$g_{ijk} = \left(g(\hat{s}_{ijk}; \lambda_1, \alpha_1, \phi_1, \gamma) * I_{ij}, \ldots, g(\hat{s}_{ijk}; \lambda_5, \alpha_o, \phi_r, \gamma) * I_{ij}\right)^T$$

as the feature vector of Gabor responses at the $k$-th landmark point, where $*$ denotes the convolution of the filter $g(\cdot)$ with the image $I_{ij}$, and $\lambda_k$ is the $k$-th element of the set $\lambda$ defined above. The same applies to $\alpha_k$ and $\phi_k$, but not to $\gamma$, which is generally 1.

The feature vector of Gabor responses over all landmark points for the $j$-th sample image with AU $i$ active is defined as $g_{ij} = (g_{ij1}^T, \ldots, g_{ijp}^T)^T$. These feature vectors define the shading information of the local patches around the facial landmarks, and their dimensionality is $g_{ij} \in \mathbb{R}^{5 \times p \times o \times s \times r}$.

The final feature vector defining the shape and shading changes of AU $i$ in face space is defined as $z_{ij} = (x_{ij}^T, g_{ij}^T)^T$, $j = 1, \ldots, n_i$, i.e., the concatenation of the shape feature vector $x_{ij}$ and the Gabor feature vector $g_{ij}$.

Classification in Face Space for AU and Intensity Recognition

The system can define the training set of AU $i$ as

$$D_i = \{(z_{i1}, y_{i1}), \ldots, (z_{i n_i}, y_{i n_i}), (z_{i\,n_i+1}, y_{i\,n_i+1}), \ldots, (z_{i\,n_i+m_i}, y_{i\,n_i+m_i})\},$$

where $y_{ij} = 1$ for $j = 1, \ldots, n_i$, indicating that AU $i$ is present in the image, $y_{ij} = 0$ for $j = n_i + 1, \ldots, n_i + m_i$, indicating that AU $i$ is not present in the image, and $m_i$ is the number of sample images in which AU $i$ is not active.

The training set above is ordered as follows. The set

$$D_i(a) = \{(z_{i1}, y_{i1}), \ldots, (z_{i\,n_{ia}}, y_{i\,n_{ia}})\}$$

contains the $n_{ia}$ samples with AU $i$ active at intensity a (i.e., the lowest intensity of activation of an AU). The set

$$D_i(b) = \{(z_{i\,n_{ia}+1}, y_{i\,n_{ia}+1}), \ldots, (z_{i\,n_{ia}+n_{ib}}, y_{i\,n_{ia}+n_{ib}})\}$$

contains the $n_{ib}$ samples with AU $i$ active at intensity b (the second lowest intensity). The set

$$D_i(c) = \{(z_{i\,n_{ia}+n_{ib}+1}, y_{i\,n_{ia}+n_{ib}+1}), \ldots, (z_{i\,n_{ia}+n_{ib}+n_{ic}}, y_{i\,n_{ia}+n_{ib}+n_{ic}})\}$$

contains the $n_{ic}$ samples with AU $i$ active at intensity c (the next intensity). The set

$$D_i(d) = \{(z_{i\,n_{ia}+n_{ib}+n_{ic}+1}, y_{i\,n_{ia}+n_{ib}+n_{ic}+1}), \ldots, (z_{i\,n_i}, y_{i\,n_i})\}$$

contains the $n_{id}$ samples with AU $i$ active at intensity d (the highest intensity), and $n_{ia} + n_{ib} + n_{ic} + n_{id} = n_i$.

An AU can be active at five intensities, which can be labeled a, b, c, d, or e. In some embodiments, examples with intensity e are rare, so the other four intensities are sufficient. Otherwise, $D_i(e)$ defines the fifth intensity.

The four training sets defined above are subsets of $D_i$ and can be represented as different subclasses of the set of images with AU $i$ active. A subclass-based classifier can therefore be used, and the system uses Kernel Subclass Discriminant Analysis (KSDA) to derive the process. KSDA can be used because it can uncover complex nonlinear classification boundaries by optimizing the kernel matrix and the number of subclasses. KSDA optimizes a class discriminant criterion to separate the classes optimally. The criterion is formally given by $Q_i(\varphi_i, h_{i1}, h_{i2}) = Q_{i1}(\varphi_i, h_{i1}, h_{i2})\, Q_{i2}(\varphi_i, h_{i1}, h_{i2})$, where $Q_{i1}(\varphi_i, h_{i1}, h_{i2})$ is responsible for maximizing homoscedasticity. The goal of the kernel map is to find a kernel space $F$ in which the data is linearly separable; in some embodiments, the subclasses can be linearly separable in $F$, which is the case when the class distributions share the same variance. $Q_{i2}(\varphi_i, h_{i1}, h_{i2})$ maximizes the distance between all subclass means (i.e., it is used to find a Bayes classifier with minimal Bayes error).

To see this, recall that the Bayes classification boundary is given at the location in feature space where the probabilities of the two normal distributions are identical, i.e., $p(z \mid N(\mu_1, \Sigma_1)) = p(z \mid N(\mu_2, \Sigma_2))$, where $N(\mu_i, \Sigma_i)$ is a normal distribution with mean $\mu_i$ and covariance matrix $\Sigma_i$. Separating the means of the two normal distributions decreases the value at which this equality holds; that is, the equality $p(z \mid N(\mu_1, \Sigma_1)) = p(z \mid N(\mu_2, \Sigma_2))$ is attained at a lower probability value than before, and the Bayes error is therefore reduced.
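A one-dimensional worked example may help to see why separating the means reduces the Bayes error. The sketch below assumes two classes with equal priors and a shared standard deviation, a simplification chosen for the example; under that assumption the boundary is the midpoint of the means and the error is $\Phi(-|\mu_1 - \mu_2| / (2\sigma))$. The function names are hypothetical.

```python
import math


def normal_pdf(z, mu, sigma):
    """Density of a 1D normal distribution N(mu, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def bayes_boundary_and_error(mu1, mu2, sigma):
    """Boundary, density at the boundary, and Bayes error for equal priors."""
    boundary = (mu1 + mu2) / 2.0
    density = normal_pdf(boundary, mu1, sigma)  # equals normal_pdf(boundary, mu2, sigma)
    # Phi(-d) with d = |mu1 - mu2| / (2 sigma), written with the complementary error function.
    error = 0.5 * math.erfc(abs(mu1 - mu2) / (2.0 * sigma * math.sqrt(2.0)))
    return boundary, density, error


close = bayes_boundary_and_error(0.0, 1.0, 1.0)
far = bayes_boundary_and_error(0.0, 4.0, 1.0)
```

Comparing `close` and `far` shows both the density at the boundary and the error shrinking as the means move apart, which is the behavior the criterion exploits.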

Thus, the first component of the KSDA criterion above, $Q_{i1}(\varphi_i, h_{i1}, h_{i2})$, is expressed in terms of the subclass covariance matrices (i.e., the covariance matrix of the samples in subclass $l$) in the kernel space defined by the mapping function $\varphi_i(\cdot): \mathbb{R}^e \to F$. Here $h_{i1}$ is the number of subclasses representing AU $i$ present in the image, $h_{i2}$ is the number of subclasses representing AU $i$ not present in the image, and $e = 3t + p(p-1)/2 + 5 \times p \times o \times s \times r$ is the dimensionality of the feature vectors in the face space defined in the section on face space.

The second component of the KSDA criterion, $Q_{i2}(\varphi_i, h_{i1}, h_{i2})$, is expressed in terms of $p_{il} = n_l / n_i$, the prior of subclass $l$ in class $i$ (i.e., the class defining AU $i$), where $n_l$ is the number of samples in subclass $l$, and of the sample mean of subclass $l$ in class $i$ in the kernel space defined by the mapping function $\varphi_i(\cdot)$.

For example, the system can define the mapping function $\varphi_i(\cdot)$ using a radial basis function (RBF) kernel,

$$k(z_{ij_1}, z_{ij_2}) = \exp\!\left(-\frac{\lVert z_{ij_1} - z_{ij_2} \rVert_2^2}{\nu_i}\right),$$

where $\nu_i$ is the variance of the RBF and $j_1, j_2 = 1, \ldots, n_i + m_i$. The KSDA-based classifier is therefore given by the solution to

$$\nu_i^*, h_{i1}^*, h_{i2}^* = \arg\max_{\nu_i, h_{i1}, h_{i2}} Q_i(\nu_i, h_{i1}, h_{i2}).$$
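As a small illustration of the kernel, the sketch below builds an RBF kernel matrix for a handful of vectors. It assumes the form $\exp(-\lVert z_1 - z_2 \rVert^2 / \nu)$ with the variance parameter directly in the denominator; other conventions (for example $2\nu$) exist, so the scaling should be treated as an assumption, as should the function names.

```python
import math


def rbf_kernel(z1, z2, nu):
    """RBF kernel value between two feature vectors, with variance parameter nu."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-squared_distance / nu)


def kernel_matrix(samples, nu):
    """Symmetric matrix of kernel values between all pairs of samples."""
    return [[rbf_kernel(a, b, nu) for b in samples] for a in samples]


samples = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
K = kernel_matrix(samples, nu=2.0)
```

In a KSDA-style search, a matrix like `K` would be recomputed for each candidate value of $\nu_i$ while the criterion is evaluated for different numbers of subclasses.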

FIG. 3 illustrates the solution of the above equation, which yields a model of AU $i$. In this hypothetical model, the sample images with AU 4 active are first divided into four subclasses, each containing samples of AU 4 at the same intensity (a-e). The derived KSDA-based approach then uses a process of further subdividing each subclass into additional subclasses to find the kernel mapping that intrinsically maps the data into a kernel space in which the above normal distributions can be separated linearly and are as far apart from each other as possible.

To do this, the system divides the training set $D_i$ into five subclasses. The first subclass (i.e., $l = 1$) contains the sample feature vectors that correspond to the images with AU $i$ active at intensity a, that is, the $D_i(a)$ defined in S. Du, Y. Tao, and A. M. Martinez, "Compound facial expressions of emotion," Proceedings of the National Academy of Sciences, 111(15):E1454-E1462, 2014, which is incorporated herein by reference in its entirety. The second subclass ($l = 2$) contains the sample subset $D_i(b)$. Similarly, the third and fourth subclasses ($l = 3, 4$) contain the sample subsets $D_i(c)$ and $D_i(d)$, respectively. Finally, the fifth subclass ($l = 5$) contains the sample feature vectors corresponding to the images with AU $i$ not active, i.e., $D_i(\text{inactive}) = \{(z_{i\,n_i+1}, y_{i\,n_i+1}), \ldots, (z_{i\,n_i+m_i}, y_{i\,n_i+m_i})\}$.

Thus, initially, the number of subclasses used to define AU $i$ active/inactive is five (i.e., $h_{i1} = 4$ and $h_{i2} = 1$). In some embodiments, this number may be larger, for example when images at intensity e are considered.

Optimizing Equation 14 may yield additional subclasses. The derived approach optimizes the parameter of the kernel map $\nu_i$ as well as the numbers of subclasses $h_{i1}$ and $h_{i2}$. In this embodiment, the initial (five) subclasses can be further subdivided into additional subclasses. For example, when no kernel parameter $\nu_i$ can map the nonlinearly separable samples in $D_i(a)$ into a space where they are linearly separable from the other subsets, $D_i(a)$ is further divided into two subsets, $D_i(a) = \{D_i(a_1), D_i(a_2)\}$. This division is simply given by a nearest-neighbor clustering. Formally, let the sample $z_{i\,j+1}$ be the nearest neighbor of $z_{ij}$; the division of $D_i(a)$ is then readily given by splitting the samples, ordered in this way, into the two contiguous subsets $D_i(a_1)$ and $D_i(a_2)$.

The same applies to $D_i(b)$, $D_i(c)$, $D_i(d)$, $D_i(e)$, and $D_i(\text{inactive})$. Thus, optimizing Equation 14 can result in multiple subclasses that model the samples of each intensity of activation or non-activation of AU $i$. For example, if subclass one ($l = 1$) defines the samples in $D_i(a)$ and the system divides it into two subclasses (with $h_{i1} = 4$ at that point), then the first two new subclasses are used to define the samples in $D_i(a)$, with the first subclass ($l = 1$) containing the samples in $D_i(a_1)$ and the second subclass ($l = 2$) those in $D_i(a_2)$ (and $h_{i1}$ becomes 5). The subsequent subclasses define the samples in $D_i(b)$, $D_i(c)$, $D_i(d)$, $D_i(e)$, and $D_i(\text{inactive})$, as defined above. Thus, the order of the samples as given in $D_i$ never changes: subclasses 1 through $h_{i1}$ define the sample feature vectors associated with the images in which AU $i$ is active, and subclasses $h_{i1} + 1$ through $h_{i1} + h_{i2}$ those representing the images in which AU $i$ is not active. This end result is illustrated using the hypothetical example in FIG. 3.

In one example, every test image $I_{test}$ in a set of images can be classified as follows. First, the feature representation of $I_{test}$ in face space, the vector $z_{test}$, is computed as described above in the section on face space. Second, this vector is projected into the kernel space; call the result $z_{test}^{\varphi}$. To determine whether this image has AU $i$ active, the system computes the nearest subclass mean,

$$j^* = \arg\min_{j} \lVert z_{test}^{\varphi_i} - \mu_{ij}^{\varphi_i} \rVert_2, \quad j = 1, \ldots, h_{i1} + h_{i2},$$

where $\mu_{ij}^{\varphi_i}$ is the mean of subclass $j$ in the kernel space. If $j^* \le h_{i1}$, then $I_{test}$ is labeled as having AU $i$ active; otherwise, it is not.

The classification result also provides intensity recognition. If the samples represented by subclass $l$ are a subset of those in $D_i(a)$, the identified intensity is a. Similarly, if the samples of subclass $l$ are a subset of those in $D_i(b)$, $D_i(c)$, $D_i(d)$, or $D_i(e)$, the intensity of AU $i$ in the test image $I_{test}$ is b, c, d, or e, respectively. Of course, if $j^* > h_{i1}$, the image does not have AU $i$ present and there is no intensity (or, one could say that the intensity is zero).
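The decision rule and the intensity lookup can be sketched as a nearest-mean search. In this illustration the vectors are assumed to be already projected into the kernel space, and the subclass means, the intensity labels, and the function name `classify_au` are hypothetical values chosen for the example.

```python
import math


def classify_au(z, subclass_means, subclass_intensity, h_active):
    """Nearest-subclass-mean rule.

    subclass_means is ordered so that the first h_active entries model the
    AU-active subclasses and the remaining entries model AU-inactive ones.
    Returns (is_active, intensity_label_or_None).
    """
    distances = [math.dist(z, mu) for mu in subclass_means]
    j_star = min(range(len(distances)), key=distances.__getitem__) + 1  # 1-based index
    if j_star <= h_active:
        return True, subclass_intensity[j_star - 1]
    return False, None


# Hypothetical model: intensity a was split into two subclasses, so h_i1 = 5.
means = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0), (10.0, 10.0)]
intensity = ["a", "a", "b", "c", "d"]
h_i1 = 5

active_result = classify_au((3.8, 0.1), means, intensity, h_i1)
inactive_result = classify_au((9.0, 9.0), means, intensity, h_i1)
```

Because subclasses created by splitting keep the intensity label of their parent set, the intensity follows directly from the index of the winning subclass.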

FIG. 4 shows an exemplary block diagram of a system 400 for performing the functions described above with respect to FIGS. 1-3. System 400 includes an image database component 410 having a set of images. System 400 includes a detector 420 that removes non-face images from the image database, creating a subset of the image set that contains only images with faces. System 400 includes a training database 430, which is used by a classifier component 440 to classify images into emotion categories. System 400 includes a tagging component 450 that tags each image with at least one AU and emotion category. System 400 can store the tagged images in a processed image database 460.

Discriminant Function Learning of Color Features for Facial Action Unit Recognition

In another aspect, the system facilitates a comprehensive computer vision process for identifying AUs using facial color features. Color features can be used to recognize AUs and AU intensities. The functions that define the change in color as an AU goes from inactive to active, or vice versa, are consistent within AUs and differ between them. In addition, the system reveals how facial color changes can be exploited to identify the presence of AUs in videos filmed under a wide variety of image conditions and outside the image database.

The system receives the $i$-th sample video sequence $V_i = \{I_{i1}, \ldots, I_{ir_i}\}$, where $r_i$ is the number of frames and $I_{ik} \in \mathbb{R}^{3qw}$ is the vectorized $k$-th color image of $q \times w$ RGB pixels. $V_i$ can be described as a sample function $f_i(t)$.

The system identifies a set of physical facial landmarks on the face and obtains local face regions using the algorithm described herein. The system defines the landmark points in vector form as $s_{ik} = (s_{ik1}, \ldots, s_{ik66})$, where $i$ is the sample video index, $k$ is the frame number, and $s_{ikl} \in \mathbb{R}^2$ are the 2D image coordinates of the $l$-th landmark, $l = 1, \ldots, 66$. Specific example values (e.g., 66 landmarks, 107 image patches) are used for purposes of illustration.

The system defines the set $D_{ik} = \{d_{i1k}, \ldots, d_{i107k}\}$ as the set of 107 image patches $d_{ijk}$ obtained with the Delaunay triangulation described above, where $d_{ijk} \in \mathbb{R}^{3q_{ij}}$ is the vector describing the $j$-th triangular local region of $q_{ij}$ RGB pixels and, as above, $i$ specifies the sample video ($i = 1, \ldots, n$) and $k$ the frame ($k = 1, \ldots, r_i$).

In some embodiments, the size (i.e., the number of pixels, $q_{ij}$) of these local (triangular) regions not only differs between individuals but also varies within a video sequence of the same person. This is a result of the movement of the facial landmark points, a process that is necessary to produce a facial expression. The system therefore defines a feature space that is invariant to the number of pixels in each of these local regions, and computes statistics on the color of the pixels in each local region as follows.

The system computes the first and second (central) moments of the color of each local region,

$$\mu_{ijk} = \frac{1}{P} \sum_{p=1}^{P} d_{ijkp}, \qquad \sigma_{ijk} = \sqrt{\frac{1}{P} \sum_{p=1}^{P} (d_{ijkp} - \mu_{ijk})^2},$$

with $d_{ijk} = (d_{ijk1}, \ldots, d_{ijkP})^T$ and $\mu_{ijk}, \sigma_{ijk} \in \mathbb{R}^3$, where the square and the square root are taken per color channel. In some embodiments, additional moments are computed.
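The per-patch statistics can be sketched as follows. The code assumes each patch is given as a list of RGB triples and normalizes the second moment by the number of pixels $P$; that normalization and the function name are assumptions made for the example.

```python
import math


def color_moments(patch_pixels):
    """Per-channel mean and standard deviation of a list of (r, g, b) pixels."""
    count = len(patch_pixels)
    if count == 0:
        raise ValueError("patch contains no pixels")
    mean = tuple(sum(px[c] for px in patch_pixels) / count for c in range(3))
    std = tuple(
        math.sqrt(sum((px[c] - mean[c]) ** 2 for px in patch_pixels) / count)
        for c in range(3)
    )
    return mean, std


# Hypothetical patch: the same colors repeated twice give the same statistics,
# which is the sense in which the features do not depend on the pixel count.
patch = [(10, 0, 5), (20, 0, 5), (30, 0, 5)]
mean_small, std_small = color_moments(patch)
mean_large, std_large = color_moments(patch * 2)
```

Each patch then contributes six numbers per frame (three means and three standard deviations), regardless of how many pixels the triangle covers.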

The color feature vector of each local patch can be defined as

$$x_{ij} = (\mu_{ij1}^T, \sigma_{ij1}^T, \ldots, \mu_{ijr_i}^T, \sigma_{ijr_i}^T)^T,$$

where $i$ is the sample video index ($V_i$), $j$ is the local patch number, and $r_i$ is the number of frames in this video sequence. This feature representation defines the contribution of color in patch $j$. In some embodiments, other proven features, such as responses to filters or shape features, can be included to increase the richness of the feature representation.

Invariant Functional Representation of Color

The system can define the color information computed above as a function that is invariant to time. That is, the functional representation is consistent regardless of where in the video sequence an AU becomes active.

The color function $f(\cdot)$ defines the color changes of a video sequence $V$, and a template function $f_T(\cdot)$ models the color changes associated with the activation of an AU (i.e., the AU going from inactive to active). The system determines whether $f_T(\cdot)$ is present in $f(\cdot)$.

In some embodiments, the system determines this by placing the template function $f_T(\cdot)$ at each possible position in the time domain of $f(\cdot)$. This is usually called a sliding-window approach, because it involves sliding a window left and right until all possible positions of $f_T(\cdot)$ have been checked.
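A sliding-window search over a sampled color function can be sketched as below. The sum of squared differences is used here as a stand-in matching score; the source does not state which score is used, so that choice, the function name, and the toy signal are assumptions.

```python
def sliding_window_match(f, template):
    """Return (best_offset, best_score) over all placements of template in f.

    The score is the sum of squared differences; lower is better.
    """
    length = len(template)
    if length == 0 or length > len(f):
        raise ValueError("template must be non-empty and no longer than f")
    best_offset, best_score = 0, float("inf")
    for t in range(len(f) - length + 1):
        score = sum((f[t + k] - template[k]) ** 2 for k in range(length))
        if score < best_score:
            best_offset, best_score = t, score
    return best_offset, best_score


# Hypothetical mean-red-channel values over ten frames and an onset template.
signal = [0.1, 0.1, 0.1, 0.2, 0.5, 0.9, 0.9, 0.4, 0.1, 0.1]
template = [0.2, 0.5, 0.9]
match = sliding_window_match(signal, template)
```

The cost of this search grows with the number of frames times the template length, which is the motivation for the transform-based alternative described next.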

In other embodiments, the system derives a method that uses a Gabor transform. The Gabor transform is designed to determine the frequency and phase content of a local section of a function, which makes it possible to derive an algorithm that finds matches of $f_T(\cdot)$ in $f(\cdot)$ without a sliding-window search.

In this embodiment, without loss of generality, $f(t)$ can be one of the color descriptors, e.g., the mean of the red channel in the $j$-th triangle of video $i$, or the first channel of an opponent color representation. The Gabor transform $G(\cdot, \cdot)$ of this function is defined with a window $g(t)$, where $g(t)$ is a concave function. One possible pulse function $g(t)$ is defined in terms of a fixed time length $L$; other pulse functions can be used in other embodiments. Combining the two equations, and using the definition of the inner product on the interval $[0, L]$, $G(\cdot, \cdot)$ can be written as a functional inner product $\langle \cdot, \cdot \rangle$. In the absence of noise, the Gabor transform above is continuous in time and frequency.

To compute the color descriptor $f_{i_1}(t)$ of the $i$-th video, all functions are defined in a color space spanned by a set of $b$ basis functions $\phi(t) = \{\phi_0(t), \ldots, \phi_{b-1}(t)\}$, with $f_{i_1}(t) = \sum_{z=0}^{b-1} c_{i_1 z}\, \phi_z(t)$ and $c_{i_1} = (c_{i_1 0}, \ldots, c_{i_1\,b-1})^T$ the vector of coefficients. The functional inner product of two color descriptors can then be defined by

$$\langle f_{i_1}(t), f_{i_2}(t) \rangle = c_{i_1}^T\, \Phi\, c_{i_2},$$

where $\Phi$ is a $b \times b$ matrix with elements $\Phi_{ij} = \langle \phi_i(t), \phi_j(t) \rangle$.

In some embodiments, the model assumes that the statistical color properties change smoothly over time and that their effect in muscle activation has a maximum time span of $L$ seconds. Basis functions that fit this description are the first several components of the real part of the Fourier series, i.e., the normalized cosine bases. Other basis functions can be used in other embodiments.

The cosine bases can be defined as $\psi_z(t) = \cos(2\pi z t)$, $z = 0, \ldots, b - 1$. The corresponding normalized bases are defined as

$$\hat{\psi}_z(t) = \frac{\psi_z(t)}{\sqrt{\langle \psi_z(t), \psi_z(t) \rangle}}.$$

The normalized basis set allows $\Phi = \mathrm{Id}_b$, where $\mathrm{Id}_b$ denotes the $b \times b$ identity matrix rather than an arbitrary positive definite matrix.
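The claim that the normalized cosine bases give an identity Gram matrix can be checked numerically. The sketch below assumes the inner product is the integral of the product over the unit interval $[0, 1]$, approximated with a midpoint rule; the interval, the number of sample points, and the function names are assumptions made for the example.

```python
import math


def cosine_basis(z, t):
    """Unnormalized cosine basis function psi_z(t) = cos(2 pi z t)."""
    return math.cos(2.0 * math.pi * z * t)


def inner(f, g, n=2000):
    """Midpoint-rule approximation of the integral of f(t) g(t) over [0, 1]."""
    step = 1.0 / n
    return step * sum(f((k + 0.5) * step) * g((k + 0.5) * step) for k in range(n))


def normalized_cosine_basis(b):
    """First b cosine bases, each divided by its own norm."""
    basis = []
    for z in range(b):
        raw = lambda t, z=z: cosine_basis(z, t)
        norm = math.sqrt(inner(raw, raw))
        basis.append(lambda t, raw=raw, norm=norm: raw(t) / norm)
    return basis


def gram_matrix(basis):
    """Matrix of inner products between all pairs of basis functions."""
    return [[inner(f, g) for g in basis] for f in basis]


gram = gram_matrix(normalized_cosine_basis(4))
```

With an identity Gram matrix, the inner product of two functions reduces to the ordinary dot product of their coefficient vectors, which is what makes standard vector-space classifiers applicable.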

The derivation above with the cosine bases makes the frequency space implicitly discrete. The Gabor transform of the color function is then expressed through the function computed on the interval $[t - L, t]$, with $c_{i_1 z}$ its $z$-th coefficient.

It should be understood that, although the system derived above does not include the time domain, time-domain coefficients can be found and used where needed.

Functional Classification of Action Units

The system uses the Gabor transform derived above to define a feature space that is invariant to the timing and duration of an AU. In the resulting space, the system uses a linear or nonlinear classifier. In some embodiments, KSDA, a Support Vector Machine (SVM), or a Deep multilayer neural Network (DN) can be used as the classifier.

Functional Color Space

The system includes functions that describe the mean and standard deviation of the color information from distinct local patches, which uses the simultaneous modeling of multiple functions described below.

The system defines a multidimensional function $\Gamma_i(t)$ whose components are functions $\gamma_z(t)$, each of which is the mean or standard deviation of a color channel in a given patch. Using the basis expansion approach, each component is defined by a set of coefficients $c_{ie}$, and $\Gamma_i(t)$ is thus determined by the collection of these coefficient vectors.

The inner product of multidimensional functions is redefined using the normalized Fourier cosine bases, which gives

$$\langle \Gamma_i(t), \Gamma_j(t) \rangle = \sum_{e} \langle \gamma_{ie}(t), \gamma_{je}(t) \rangle = \sum_{e} c_{ie}^T c_{je}.$$

Other bases can be used in other embodiments.

The system uses a training set of video sequences to optimize each classifier. It is important to note that the system is invariant to the length (i.e., the number of frames) of a video. The system therefore does not need to align or crop videos for recognition.

In some embodiments, the system can be extended to identify AU intensity using the above approach and a multi-class classifier. The system can be trained to detect each of the five intensities a, b, c, d, and e of an AU, as well as the AU being inactive (not present). The system may also be trained to identify emotion categories in images of facial expressions using the same approach described above.

In some embodiments, the system can detect AUs and emotion categories in videos. In other embodiments, the system can identify AUs in still images. To identify AUs in still images, the system first learns to compute the functional color features defined above from a single image using regression. In this embodiment, the system regresses a function $h(x) = y$ that maps an input image $x$ to the required functional representation of color $y$.

Support Vector Machine

The training set is defined as $\{(\gamma_1(t), y_1), \ldots, (\gamma_n(t), y_n)\}$, where $\gamma_i(t) \in H^v$, $H^v$ is a Hilbert space of continuous functions with bounded derivatives up to order $v$, and $y_i \in \{-1, 1\}$ are the class labels, with $+1$ indicating that the AU is active and $-1$ that it is inactive.

When the samples of distinct classes are linearly separable, the function $w(t)$ that maximizes class separability is given by

$$\min_{w(t), v, \xi} \; \frac{1}{2} \langle w(t), w(t) \rangle + c \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( \langle w(t), \gamma_i(t) \rangle - v \right) \ge 1 - \xi_i, \; \xi_i \ge 0,$$

where $v$ is the bias, $\langle \cdot, \cdot \rangle$ denotes the functional inner product as above, $\xi = (\xi_1, \ldots, \xi_n)^T$ are the slack variables, and $c > 0$ is a penalty value found using cross-validation.

Applying the approach derived above to model $\Gamma_i$ using the normalized cosine coefficients, together with (28), transforms (29) into an equivalent criterion over coefficient vectors, in which each functional inner product is replaced by the inner product of the corresponding coefficient vectors and $c > 0$ is the penalty value found using cross-validation.

The system projects the original color space onto the first several (e.g., two) principal components of the data. The principal components are obtained by Principal Component Analysis (PCA). The resulting $p$ dimensions are labeled $\varphi_{PCA_k}$, $k = 1, 2, \ldots, p$.

Once trained, the system can detect AUs, AU intensities, and emotion categories in video in real time or faster than real time. In some embodiments, the system can detect AUs at more than 30 frames/second/CPU thread.

Deep Network Approach Using a Multilayer Perceptron

In some embodiments, the system can include a deep network to identify nonlinear classifiers in the color feature space.

The system can train a multilayer perceptron network (MPN) using the coefficients $c_i$. This deep neural network is composed of several (e.g., five) blocks of connected layers with batch normalization and some linear or nonlinear rectification, e.g., a rectified linear unit (ReLU). To train the network effectively, the system uses data augmentation by supersampling the minority class (AU active / AU intensity) or downsampling the majority class (AU inactive). The system can also use class weights and weight decay.
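Class rebalancing of this kind can be sketched with random oversampling, which is one common way to supersample a minority class. The source does not specify the sampling scheme, so drawing with replacement, the fixed seed, and the function name are assumptions made for the example.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class has as many
    samples as the largest class. All original samples are kept."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalanced set: 8 AU-inactive samples (label 0), 2 AU-active (label 1).
xs = list(range(10))
ys = [0] * 8 + [1] * 2
balanced_x, balanced_y = oversample_minority(xs, ys)
```

Downsampling the majority class is the mirror image of this procedure; class weights in the loss are an alternative that leaves the data untouched.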

This neural network is trained using gradient descent. The resulting algorithm runs in real time or faster than real time, at more than 30 frames/second/CPU thread.

AU Detection in Still Images

To apply the system to still images, the system specifies the color function $f_i$ of an image $I_i$. That is, the system defines the mapping $h(I_i) = f_i$, where $f_i$ is defined by its coefficients $c_i$. In some embodiments, the coefficients can be learned from training data using nonlinear regression.

システムは、ｍ個のビデオ{V₁,...,V_m}のトレーニングセットを利用する。上記のように、V_i={I_i1,...,I_iri}である。システムは、長さL(with L_i)、例えばW_i1={I_i1,...,I_iL}, W_i2={I_i2,...,I_i(L+1)},...,W_i(ri-L)={I_i(ri-L),...,I_iri}の連続フレームのすべてのサブセットを考慮する。システムは、上記のようにすべてのW_ikの色表現を計算する。これにより、各W_ik, k=1,..., r_i-Lについてx_ik=(x_i1k,...,x_i107k)^Tが得られる。次の（１９）では、
The system utilizes a training set of _m videos {V ₁ ,..., V _m }. As mentioned above, V _i = {I _i1 ,..., I _iri }. The system has a length L (with L _i ), for example W _i1 = {I _i1 ,..., I _iL }, W _i2 = {I _i2 ,..., I _{i (L + 1)} },. ., W _{i (ri-L)} = {I _{i (ri-L),} ..., I _iri } consider all subsets of consecutive frames. The system calculates the color representation of all W _ik as described above. Thus, each _{W ik, k = 1, ...} , x ik = about _{_{r i -L (x i1k, ...}} , x i107k) T is obtained. In the next (19),

iとkはビデオW_ikを指定し、j、j =1,...,107はパッチを指定する。 i and k designate video W _ik and j, j = 1,..., 107 designate patches.
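The enumeration of consecutive-frame subsets can be sketched as a sliding window. This is only an illustration: the helper name `consecutive_windows` is hypothetical, and the sketch returns every window of exactly L frames (r - L + 1 windows for a video of r frames), which is an assumed indexing convention that may differ by one from the indexing used in the text.

```python
def consecutive_windows(frames, L):
    """Return every run of L consecutive frames, in order of start position.

    frames: list of frames (any objects) from one video.
    L: window length; an empty list is returned when L exceeds the video length.
    """
    if L <= 0 or L > len(frames):
        return []
    return [frames[k:k + L] for k in range(len(frames) - L + 1)]
```

For a five-frame video and L = 3, the windows start at the first, second, and third frames.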

The system computes the functional color representation f_ijk of each W_ik for each patch, j = 1, ..., 107. This is done using the approach detailed above to yield f_ijk = (c_ijk1, ..., c_ijkQ)^T, where c_ijkq is the q-th coefficient of the j-th patch of video W_ik. The training set is then given by the pairs {x_ijk, f_ijk}. The training set is used to regress the function f_ijk = h(x_ijk). For example, consider a test image I and its color representation in patch j. Regression is used to estimate the mapping from image to functional color representation, as defined above. For example, kernel ridge regression is used to estimate the q-th coefficient of the test image from the color feature vector of its j-th patch, the vector of coefficients of the j-th patch in all training images, and the kernel matrix K. The system can use the radial basis function kernel. In some embodiments, the parameters η and λ are selected to maximize accuracy and minimize model complexity. This is the same as optimizing the bias-variance trade-off. The system uses a solution to the bias-variance problem as known in the art.
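For reference, the textbook form of kernel ridge regression with a radial basis function kernel is shown below. The regression equations themselves are not reproduced in this text, so the symbols here are assumptions: \hat{x}_j denotes the color feature vector of the j-th patch of the test image, \mathbf{c}_{jq} the vector of q-th coefficients of the j-th patch over all training windows, \kappa(\hat{x}_j) the vector of kernel evaluations between the training vectors and \hat{x}_j, and the placement of η inside the exponential is one common convention.

```latex
\hat{c}_{jq} = \mathbf{c}_{jq}^{T} \, (K + \lambda I)^{-1} \, \kappa(\hat{x}_j),
\qquad
K_{ab} = k(x_a, x_b),
\qquad
k(x_a, x_b) = \exp\!\left( -\eta \, \lVert x_a - x_b \rVert_2^2 \right)
```

In this form λ controls the amount of regularization and η the width of the kernel, which matches the roles the text assigns to the two parameters when it describes the bias-variance trade-off.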

As shown above, the system can use the regressor on previously unseen test images. If a test image has not been seen before, its functional representation is readily obtained by applying the learned mapping h to its color representation. This functional color representation can be used directly in the functional classifier derived above.

FIG. 5 shows a color variance system 500 for detecting AUs or emotions using color variance in video and/or still images. System 500 includes an image database component 510 having a set of videos and/or images. System 500 includes a landmark component 520 that detects landmarks in the images of image database 510. The landmark component 520 creates a subset of the set of images consisting of the images with defined landmarks. System 500 includes a statistics component 530 that computes the changes in color in a video sequence or the statistics in a still image of a face. From the statistics component 530, AUs or emotions are determined for each video or image in the database component 510 as described above. System 500 includes a tagging component 540 that tags the images with at least one AU or with no AU. System 500 can store the tagged images in a processed image database 550.

Facial color for recognizing emotion in images of facial expressions and for editing face images so that they appear to express a different emotion

In the methods above, the system identifies AUs using configural, shape, shading, and color features. This is because AUs define emotion categories, i.e., a unique combination of AUs specifies a unique emotion category. Nevertheless, facial color also conveys emotion. A face can express emotion information to an observer by changing the blood flow in the network of blood vessels closest to the surface of the skin. Consider, for example, the redness associated with anger or the paleness of fear. These color patterns are caused by variations in blood flow and can occur even in the absence of muscle activation. Because the system detects these color changes, it can identify emotion even without muscle movement (i.e., regardless of whether AUs are present in the image).

Face areas

The system represents each face color image of p × q pixels together with the r landmark points of each facial component of the face, given as the two-dimensional coordinates of the landmark points on the image. Here, i specifies the subject and j specifies the emotion category. In some embodiments, the system uses r = 66. These fiducial points define the contours of the internal and external components of the face, e.g., mouth, nose, eyes, brows, jawline, and crest. Delaunay triangulation can be used to create the triangular local areas defined by these facial landmark points. This triangulation yields a number of local areas (e.g., 142 areas when using 66 landmark points). Let this number be a.

The system can define D = {d_1, ..., d_a} as a set of functions that return the pixels of each of the a local regions. For example, d_k(I_ij) is a vector containing the l pixels inside the k-th Delaunay triangle in image I_ij, where each entry defines the values of the three color channels of one pixel.

Color space

The above derivation divides each face image into a set of local regions. The system can compute color statistics for each of these local regions in each image. Specifically, the system computes the first and second moments (i.e., the mean and variance) of the data.
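The per-region statistics can be sketched as follows. The defining equations are not reproduced in this text, so this is an assumed form: the population variance (dividing by the number of pixels) is used, and the helper names `region_color_stats` and `color_feature_vector`, as well as the ordering of entries in the feature vector, are hypothetical.

```python
def region_color_stats(pixels):
    """Per-channel mean and variance of one local region.

    pixels: non-empty list of 3-tuples, one tuple of color-channel values per pixel.
    Returns (mean, variance), each a 3-tuple. Population variance is assumed.
    """
    n = len(pixels)
    mean = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    var = tuple(sum((p[c] - mean[c]) ** 2 for p in pixels) / n for c in range(3))
    return mean, var


def color_feature_vector(regions):
    """Concatenate the mean and variance of every region into one flat list.

    regions: list of pixel lists, e.g. one list per Delaunay triangle.
    """
    features = []
    for pixels in regions:
        mean, var = region_color_stats(pixels)
        features.extend(mean)
        features.extend(var)
    return features
```

With a = 142 regions and three color channels, this layout would give a feature vector of 142 × 6 entries; a neutral-face vector computed the same way can then be subtracted to obtain the deviation from neutral.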

In other embodiments, additional moments of the image color are used. Every image I_ij is represented by a feature vector of these color statistics.

Using the same model, the system defines the color feature vector of each neutral face, where the subscript n indicates that the feature vector corresponds to a neutral expression rather than an emotion category. The average neutral face is computed over the m identities in the training set. The color representation of a facial expression of emotion is given by its deviation from this average neutral face.

Classification

The system uses a linear or non-linear classifier to classify emotion categories in the color space defined above. In some embodiments, linear discriminant analysis (LDA) is computed in the color space defined above. In some embodiments, the color space can be defined by the eigenvectors associated with the non-zero eigenvalues of a matrix built from the (regularized) covariance matrix, the class means, and the identity matrix, where δ = 0.01 is the regularizing parameter and C is the number of classes.

In other embodiments, the system can employ subclass discriminant analysis (SDA), KSDA, or deep neural networks.

Multi-way classification

The selected classifier (e.g., LDA) is used to compute the color space (or spaces) of the C emotion categories and neutral. In some embodiments, the system is trained to recognize 23 emotion categories, including basic and compound emotions.

The system divides the available samples into ten different sets S = {S_1, ..., S_10}, where each subset S_t has the same number of samples. This division is done such that the number of samples in each emotion category (plus neutral) is equal in every subset. The system repeats the following procedure for t = 1, ..., 10. All subsets except S_t are used to compute Σ_x and S_B. The samples in subset S_t, which were not used to compute the LDA subspace, are projected onto that subspace. Each test sample feature vector is assigned to the emotion category of the nearest category mean, as given by the Euclidean distance. The classification accuracy over all test samples is computed from the number of samples n_t in S_t, an oracle function y(t_j) that returns the true emotion category of sample t_j, and a 0-1 loss that equals 1 when the assigned category matches the true category and 0 otherwise. Thus, S_t serves as the testing subset used to determine the generalization of the color model. Since t = 1, ..., 10, the system can repeat this procedure ten times, each time leaving out one of the subsets S_t for testing, and then compute the mean classification accuracy and the standard deviation of the cross-validated classification accuracies. This process allows the system to identify the discriminant color features that generalize best, i.e., those that apply to images not included in the training set.
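The cross-validation procedure can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the LDA projection step is omitted (samples are compared directly in the input feature space), the standard deviation uses the population form, and the function names are hypothetical. Every fold is assumed to receive at least one sample.

```python
import math
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds so each class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds


def nearest_mean_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest in Euclidean distance."""
    sums, counts = {}, {}
    for v, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(v))
        for d, val in enumerate(v):
            acc[d] += val
        counts[y] = counts.get(y, 0) + 1
    best, best_dist = None, math.inf
    for y, acc in sums.items():
        mean = [s / counts[y] for s in acc]
        dist = math.dist(mean, x)
        if dist < best_dist:
            best, best_dist = y, dist
    return best


def cross_validated_accuracy(xs, ys, n_folds=10, seed=0):
    """Mean and standard deviation of the per-fold accuracy (fraction of correct labels)."""
    accuracies = []
    for test_idx in stratified_folds(ys, n_folds, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        hits = sum(nearest_mean_predict(train_x, train_y, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    mean = sum(accuracies) / len(accuracies)
    std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / len(accuracies))
    return mean, std
```

In a fuller pipeline the held-out samples would first be projected onto the subspace learned from the other nine subsets, and the nearest-mean rule would be applied in that subspace.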

In other embodiments, the system uses a two-way (one-versus-all) classifier.

One-versus-all classification

The system identifies the most discriminant color features of each emotion category by repeating the above approach C times, each time assigning the samples of one emotion category (e.g., emotion category c) to class 1 (i.e., the emotion under study) and the samples of all other emotion categories to class 2. A linear or non-linear classifier (e.g., KSDA) is used to discriminate the samples of class 1 from those of class 2.

10-fold cross-validation: the system uses the same 10-fold cross-validation procedure and nearest-mean classifier described above.

In some embodiments, to avoid biases due to the sample imbalance in this two-class problem, the system can apply downsampling to the larger class. In some cases, the system repeats this procedure multiple times, each time drawing a random sample from the larger class to match the number of samples in the smaller class.
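The one-versus-all relabeling and the repeated downsampling can be sketched as follows. The function names and the number of draws are assumptions for illustration, and the sketch assumes that class 2 (all other emotions) is the larger class, as is typical in a one-versus-all split.

```python
import random


def one_vs_all_labels(labels, target):
    """Relabel samples: class 1 for the emotion under study, class 2 for all others."""
    return [1 if y == target else 2 for y in labels]


def balanced_draws(binary_labels, n_draws=5, seed=0):
    """Return n_draws index lists, each holding all class-1 samples plus an
    equally sized random subset of class-2 samples (drawn without replacement)."""
    rng = random.Random(seed)
    class1 = [i for i, y in enumerate(binary_labels) if y == 1]
    class2 = [i for i, y in enumerate(binary_labels) if y == 2]
    size = min(len(class1), len(class2))
    return [sorted(class1 + rng.sample(class2, size)) for _ in range(n_draws)]
```

Each draw gives a balanced two-class problem; a classifier can be trained on every draw and the results averaged.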

Discriminant color model

When using LDA as the two-way classifier, the solution yields a set of discriminant vectors ordered from most to least discriminant. The first discriminant vector, v_1, defines the contribution of each color feature in discriminating the emotion category. The system keeps only v_1, since it is the only basis vector associated with a non-zero eigenvalue, λ_1 > 0. The color model of emotion j is therefore defined by this vector.

Similar results can be obtained using SDA, KSDA, deep networks, and other classifiers.

Modifying image color to change the expression shown by a face

A neutral expression image I_in can be modified by the system so that it appears to express an emotion. The results can be called modified images, where i specifies the image or the individual in the image and j specifies the emotion category. Each modified image corresponds to a modified color feature vector. In some embodiments, to create these images, the system modifies the k-th pixel of the neutral image using the color model of emotion j. The modification uses I_ink, the k-th pixel in the neutral image I_in; the mean and standard deviation of the color of the pixels in the g-th Delaunay triangle; and the mean and standard deviation of the color of the pixels in d_g as given by the new model y_ij.
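One common way to impose a target mean and standard deviation on the pixels of a region is to standardize each pixel with the region's current statistics and rescale it with the target statistics. The pixel-update equation is not reproduced in this text, so the sketch below is an assumption rather than the patented formula; the function name `recolor_pixel` and its parameter names are hypothetical.

```python
def recolor_pixel(pixel, src_mean, src_std, dst_mean, dst_std):
    """Map one pixel so that its region moves from (src_mean, src_std)
    to (dst_mean, dst_std), channel by channel.

    All arguments are 3-element sequences (one value per color channel).
    A channel with zero source deviation is only shifted, not rescaled.
    """
    out = []
    for value, mu, sigma, new_mu, new_sigma in zip(pixel, src_mean, src_std, dst_mean, dst_std):
        scale = new_sigma / sigma if sigma > 0 else 1.0
        out.append((value - mu) * scale + new_mu)
    return tuple(out)
```

Applied to every pixel of the g-th triangle, with the source statistics taken from the neutral image and the target statistics taken from the modified feature vector, this would change the color of the region while preserving its relative pixel-to-pixel structure.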

In some embodiments, the system smooths the modified images with a γ Gaussian filter with variance σ. Smoothing eliminates local shading and shape features, forcing viewers to focus on the color of the face and making the emotion category more apparent.

In some embodiments, the system modifies images of facial expressions of emotion to decrease or increase the appearance of the expressed emotion. To decrease the appearance of emotion j, the system can remove the color pattern associated with emotion j to obtain the resulting image. The image is computed as described above, using the correspondingly modified feature vector.

To increase the perception of the emotion, the system defines a new color feature vector and obtains the resulting image from it.

FIG. 6 shows a color variance system 600 for detecting AUs or emotions using color variance in video and/or still images. System 600 includes an image database component 610 having a set of videos and/or images. System 600 includes a landmark component 620 that detects landmarks in the images of image database 610. The landmark component 620 creates a subset of the set of images consisting of the images with defined landmarks. System 600 includes a statistics component 630 that computes the changes in color in a video sequence or the statistics in a still image of a face. From the statistics component 630, AUs or emotions are determined for each video or image in the database component 610 as described above. System 600 includes a tagging component 640 that tags the images with at least one AU or with no AU. System 600 can store the tagged images in a processed image database 650.

System 600 includes a modification component 660 that can change the emotion perceived in an image. In some embodiments, after system 600 determines that an image contains a neutral face, the modification component 660 modifies the color of the neutral face image to create or modify the appearance of a determined expression of an emotion or AU. For example, an image is determined to include a neutral expression. The modification component 660 can alter the color in the image so that the expression is perceived as a predetermined expression, such as happy or sad.

In other embodiments, after system 600 determines the emotion or AU of a face in an image, the modification component 660 modifies the color of the image to increase or decrease the intensity of the emotion or AU and thereby change how the emotion or AU is perceived. For example, an image is determined to include a sad expression. The modification component 660 can alter the color in the image so that the expression is perceived as less sad or more sad.

Global-local fitting in DNNs for fast and accurate detection and recognition of facial landmark points and action units

In another aspect, a global-local loss function for deep neural networks (DNNs) can be used efficiently in the fine-grained detection of similar object landmark points of interest (e.g., facial landmark points) as well as in the fine-grained recognition of object attributes such as AUs. The derived local-plus-global loss yields accurate local results without the need for patch-based approaches and results in fast and desirable convergence. The global-local loss function can be used for the recognition of AUs or for detecting the faces and facial landmark points needed for the recognition of AUs and facial expressions.

Global-local loss

The derived global-local (GL) loss can be used efficiently in deep networks for detection and recognition in images. The system uses this loss to train a deep DNN to recognize AUs. The system uses one part of the DNN to detect facial landmark points. These detections are concatenated with the output of the fully connected layer of the other component of the network to detect AUs.

Local fit

The system defines the image samples and their corresponding output variables as the set {(I_1, y_1), ..., (I_n, y_n)}, where I_i ∈ R^{l×m} is an l × m-pixel image of a face, y_i is the true (desired) output, and n is the number of samples.

In some embodiments, the output variable y_i can take various forms. For example, in the detection of 2D object landmark points in images, y_i is a vector of p 2D image coordinates, y_i = (u_i1, v_i1, ..., u_ip, v_ip)^T, where (u_ij, v_ij)^T is the j-th landmark point. In the recognition of AUs, the output variable corresponds to an indicator vector y_i = (y_i1, ..., y_iq)^T, where y_ij is 1 if AU j is present in image I_i and -1 if AU j is not present in that image.

The system identifies a vector of mapping functions f(I_i, w) = (f_1(I_i, w_1), ..., f_r(I_i, w_r))^T that converts the input image I_i into an output vector y_i of detections or attributes, where w = (w_1, ..., w_r)^T is the vector of parameters of these mapping functions. In detection, r = p and each f_j(I_i, w_j) provides the estimates of the 2D image coordinates u_ij and v_ij. Similarly, in the recognition of AUs, r = q and each f_j(I_i, w_j) is the estimate of whether AU j is present (1) or not present (-1) in image I_i, where q is the number of AUs.

For a fixed mapping function f(I_i, w) (e.g., a DNN), the system optimizes w by minimizing a loss function over the training samples. A classical choice for this loss function is the L2-loss, in which y_ij is the j-th element of y_i; that is, y_ij ∈ R^2 in the detection of face landmark points and y_ij ∈ {-1, +1} in the recognition of AUs.

Without loss of generality, the system uses f_i in lieu of f(I_i, w) and f_ij instead of f_j(I_i, w_j). The functions f_ij are the same for all i but may differ for distinct values of j.

The above derivation corresponds to a local fit. That is, (33) and (34) attempt to optimize the fit of each output independently and then take the average fit over all outputs.
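For orientation, a standard form of the optimization and of an averaged L2 local loss, written with the symbols defined above, is given below. Equations (33) and (34) are not reproduced in this text, so the normalization by r and the name \mathcal{L}_{\text{local}} are assumptions.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{\text{local}}\bigl(f(I_i, w), y_i\bigr),
\qquad
\mathcal{L}_{\text{local}}(f_i, y_i) = \frac{1}{r} \sum_{j=1}^{r} \lVert f_{ij} - y_{ij} \rVert_2^2
```

Each term of the inner sum depends on a single output j, which is why the fit is described as local.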

The above derivation admits several solutions, even for a fixed fitting error. For example, the error can be distributed equally across all outputs, as measured by the 2-norm of a vector. Alternatively, most of the error can be concentrated in one (or a few) of the estimates.

In some embodiments, an additional constraint, in which a ≥ 1, is added to the function being minimized. The system adds a global criterion that facilitates convergence.

Adding global structure

The system defines a set of constraints that add global structure, extending the global descriptor. The constraint in (34) is local because it measures the fit of each element of y_i (i.e., y_ij) independently. Nevertheless, the same criterion can be used to measure the fit of pairs of points. In the formal definition, g(x, z) is a function that computes the similarity between its two entries, and h(.) scales the (unconstrained) output of the network into the appropriate value range. In landmark detection, h(f_ij) = f_ij ∈ R^2 and g(x, z) is the b-norm of x - z (e.g., the 2-norm), where x and z are 2D vectors defining the image coordinates of two landmarks.

In AU recognition, h(f_ij) = sign(f_ij) ∈ {-1, +1}, where sign(.) returns -1 if the input number is negative and +1 if the number is positive or zero. Here x_ij is 1 if AU j is present in image I_i and -1 if it is not present in that image. Hence, the function h(.): R → {-1, +1}.

In some embodiments, the system takes into account the global structure of each pair of elements, i.e., each pair of landmark points in detection and each pair of AUs in recognition. That is, in detection the system uses information about the distances between all landmark points, and in recognition it determines which pairs of AUs co-occur (meaning that the two are simultaneously present or simultaneously absent in the sample image).
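A pairwise global term for landmark detection can be sketched as below. The pairwise equation is not reproduced in this text, so the squared difference of distances and the averaging over pairs are assumptions; the function name `pairwise_global_loss` is hypothetical, and the 2-norm is used for g.

```python
import math
from itertools import combinations


def pairwise_global_loss(pred, true):
    """Mean squared difference between predicted and true inter-landmark distances.

    pred, true: equally long lists of (x, y) landmark coordinates, at least two points.
    g is taken to be the Euclidean distance between two landmarks.
    """
    pairs = list(combinations(range(len(true)), 2))
    total = 0.0
    for j, k in pairs:
        diff = math.dist(pred[j], pred[k]) - math.dist(true[j], true[k])
        total += diff ** 2
    return total / len(pairs)
```

Such a term is zero whenever the predicted landmarks have the correct relative configuration, even if the whole set is shifted, so it complements the local term, which penalizes the position of each landmark on its own.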

In some embodiments, the global criterion can be extended to triplets, using a function g(x, z, u) that computes the similarity between its three entries.

In detection, this means that the system can compute the b-norm over the three points as well as the area of the triangle defined by each triplet, provided that the three landmark points are not collinear.

In some embodiments, the equation can be extended to four or more points. For example, the equation can be extended to convex quadrilaterals.

In the most general case, for t landmark points, the system computes the area of the polygon envelope, i.e., the non-self-intersecting polygon contained by the t landmark points {x_i1, ..., x_it}. The polygon is obtained as follows.

The system computes the Delaunay triangulation of the facial landmark points. The polygon envelope is obtained by connecting the lines of the set of t landmark points in counter-clockwise order. The area of the resulting ordered set of landmark points is denoted g_a(.), where the subscript a stands for area.

In some embodiments, the result of the above equation is obtained using Green's theorem, as known in the art. The ordered points can be either the t outputs of the DNN or the true values.
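The area of a non-self-intersecting polygon given its ordered vertices can be computed with the shoelace formula, which is the discrete form that follows from Green's theorem. The sketch below is illustrative; the function name `polygon_area` is hypothetical, and the absolute value is taken so that the result does not depend on whether the vertices are listed clockwise or counter-clockwise.

```python
def polygon_area(points):
    """Area of a simple polygon from its vertices listed in boundary order.

    points: list of (x, y) tuples, at least three, with no self-intersection.
    """
    twice_area = 0.0
    n = len(points)
    for k in range(n):
        x0, y0 = points[k]
        x1, y1 = points[(k + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0
```

The same function covers the triangle and quadrilateral cases, since they are polygons with three and four vertices.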

The system can compute the global b-norm, g_n(.), for the general case of t landmark points.

The above derivation defines the extension of g(.) to three or more points in detection problems. The same approach can be used to recognize AUs in images.

The system computes the co-occurrence of three or more AUs in image I_i, formally defined over a set of t AUs.

GL-loss

The final global-local (GL) loss function combines the local loss with the global loss. In the global loss, g(.) is g_a(.), g_n(.), or both in detection and g_AU(.) in recognition, and the α_t are normalizing constants learned using cross-validation on a training set.

Backpropagation

To optimize the parameters w of the DNN, the system computes the partial derivatives of the GL loss with respect to w.

The partial derivatives of the local loss follow directly from its definition.

The definition of the global loss uses a mapping function h(.). In some embodiments, when performing landmark detection, h(f_ij) = f_ij and the partial derivatives of the global loss have the same form as those of the local loss. In other embodiments, when performing AU recognition, the system uses h(f_ij) = sign(f_ij). This function is not differentiable; however, the system redefines it, for a small constant, as a differentiable function whose partial derivative is then used in backpropagation.
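A common differentiable surrogate for the sign function is f / sqrt(f² + ε) with a small ε > 0. The redefinition used in this text is not reproduced here, so the sketch below is an assumption offered only to show what such a smooth replacement and its derivative look like; the function names and the default value of `eps` are hypothetical.

```python
import math


def smooth_sign(f, eps=1e-3):
    """Smooth approximation of sign(f): close to -1 or +1 away from zero, 0 at zero."""
    return f / math.sqrt(f * f + eps)


def smooth_sign_grad(f, eps=1e-3):
    """Derivative of smooth_sign with respect to f: eps / (f^2 + eps)^(3/2)."""
    return eps / (f * f + eps) ** 1.5
```

The derivative is largest near f = 0 and decays quickly as |f| grows, so gradients flow mainly through outputs that are still close to the decision boundary.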

Deep DNN

The system includes a deep neural network for recognizing AUs. The DNN includes two parts. The first part of the DNN is used to detect a large number of facial landmark points. The landmark points allow the system to compute the GL loss as described above.

The system can compute normalized landmark points. The system can concatenate them with the output of the first fully connected layer of the second part of the DNN to embed the landmark location information into the DNN used to recognize AUs. This facilitates the detection of local shape changes typically observed in expressions of emotion. This is done in the definition of the GL loss above.

In some embodiments, the DNN includes multiple layers. In an exemplary embodiment, nine layers are dedicated to the detection of facial landmark points, and the other layers are used to recognize AUs in a set of images.

The layers dedicated to the detection of facial landmark points are detailed as follows.

顔のランドマーク点の検出 Facial landmark point detection

例示的な実施形態において、ＤＮＮは、３つの畳み込みレイヤと、２つの最大プールレイヤと、２つの完全な接続レイヤを含む。システムは、各畳み込みレイヤの終わりにおいて、正規化、ドロップアウト、および整流線形単位（ＲｅＬＵ）を適用する。 In the exemplary embodiment, the DNN includes three convolutional layers, two largest pool layers, and two complete connection layers. The system applies normalization, dropout, and rectified linear units (ReLU) at the end of each convolutional layer.

The weights of these layers are optimized using backpropagation with the derived GL loss. The equations for the global loss and for backpropagation are provided above.

In one example, the system uses this part of the DNN to detect a total of 66 facial landmark points. One advantage of the proposed GL loss is that it can be trained efficiently on very large datasets. In some embodiments, the system includes a facial landmark detector that employs data augmentation to be invariant to affine transformations and partial occlusions.

The facial landmark detector generates additional images by applying two-dimensional affine transformations (scale, reflection, translation, and rotation) to the existing training set. In an exemplary embodiment, the scale is between 0.5 and 2, the rotation is between −10° and 10°, and translation and reflection can be generated randomly. To make the DNN more robust to partial occlusions, the system adds randomly generated occlusion boxes of d × d pixels, where d is 0.2 to 0.4 times the inter-eye distance.
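The augmentation ranges described above can be illustrated with a minimal sketch. The function names `sample_augmentation` and `apply_affine`, the 50% reflection probability, and the translation range of ±10% of the image size are assumptions made for illustration; the text only states that translation and reflection are generated randomly.

```python
import math
import random


def sample_augmentation(inter_eye_px, width, height, rng=random):
    """Draw one random 2D affine transform plus one occlusion box."""
    scale = rng.uniform(0.5, 2.0)                    # scale range from the text
    angle = math.radians(rng.uniform(-10.0, 10.0))   # rotation range from the text
    reflect = rng.random() < 0.5                     # assumption: mirror half the time
    # Assumption: shifts of up to 10% of the image size in each direction.
    tx = rng.uniform(-0.1, 0.1) * width
    ty = rng.uniform(-0.1, 0.1) * height

    # Compose rotation * scale * optional horizontal reflection as a 2x3 matrix.
    flip = -1.0 if reflect else 1.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    matrix = [
        [scale * flip * cos_a, -scale * sin_a, tx],
        [scale * flip * sin_a, scale * cos_a, ty],
    ]

    # Occlusion box side d is 0.2 to 0.4 times the inter-eye distance.
    d = max(1, round(rng.uniform(0.2, 0.4) * inter_eye_px))
    box_x = rng.randint(0, max(0, width - d))
    box_y = rng.randint(0, max(0, height - d))
    return {"matrix": matrix, "scale": scale, "angle": angle,
            "reflect": reflect, "box": (box_x, box_y, d)}


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to a list of (x, y) landmark points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]
```

The same matrix would be applied to both the image and its landmark annotations so that the labels stay aligned with the transformed pixels; the image warping itself is omitted here.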

AU recognition

The second part of the DNN combines facial appearance features with the landmark locations given by the first part of the DNN. Specifically, at the output of the first fully connected layer of the second part of the DNN, the appearance image features are concatenated with the normalized, automatically detected landmark points.

Formally, let the vector of landmark points of the i-th sample image (i = 1, ..., n) be

$$\mathbf{s}_i = \left(\mathbf{s}_{i1}^{T}, \ldots, \mathbf{s}_{ip}^{T}\right)^{T},$$

where s_ik ∈ R² are the 2D image coordinates of the k-th landmark, p = 66 is the number of landmarks, and n is the number of sample images. Hence s_i ∈ R¹³². Next, all images are normalized to have the same inter-eye distance of τ pixels. That is,

$$\hat{\mathbf{s}}_i = c\,\mathbf{s}_i, \qquad c = \frac{\tau}{\lVert \mathbf{l} - \mathbf{r} \rVert_2},$$

where l and r are the image coordinates of the centers of the left and right eyes, and ||·||₂ denotes the 2-norm of a vector. A value of τ = 200 can be used.

The system normalizes the landmark points as follows.
The system also multiplies the landmark points by a rotation matrix R so that the outer corners of the left and right eyes lie on a horizontal line. The system then rescales and shifts the normalized landmark values to move the outer corners of the left and right eyes in the image to the predetermined positions (.5, 0) and (−.5, 0), respectively.
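The normalization steps can be sketched as follows, as an illustration rather than the exact formula of the disclosure. The function name `normalize_landmarks` and the index arguments identifying the outer eye corners are hypothetical; the steps follow the text: scale to an inter-eye distance of τ pixels, rotate so the outer eye corners are horizontal, then rescale and shift so they land at (.5, 0) and (−.5, 0).

```python
import math


def normalize_landmarks(points, left_center, right_center,
                        left_outer_idx, right_outer_idx, tau=200.0):
    """Normalize 2D landmarks; the index arguments are illustrative assumptions."""
    # Step 1: scale so that the distance between the eye centers is tau pixels.
    c = tau / math.dist(left_center, right_center)
    scaled = [(c * x, c * y) for x, y in points]

    # Step 2: rotate so the outer eye corners lie on a horizontal line,
    # with the left outer corner on the positive-x side.
    lx, ly = scaled[left_outer_idx]
    rx, ry = scaled[right_outer_idx]
    theta = math.atan2(ly - ry, lx - rx)
    cos_t, sin_t = math.cos(-theta), math.sin(-theta)
    rotated = [(cos_t * x - sin_t * y, sin_t * x + cos_t * y) for x, y in scaled]

    # Step 3: rescale and shift so the outer corners map to (.5, 0) and (-.5, 0).
    lx, ly = rotated[left_outer_idx]
    rx, ry = rotated[right_outer_idx]
    width = lx - rx                      # positive after the rotation above
    mid_x, mid_y = (lx + rx) / 2.0, (ly + ry) / 2.0
    return [((x - mid_x) / width, (y - mid_y) / width) for x, y in rotated]
```

For the landmark coordinates alone, the final rescaling overrides the τ scaling of step 1; step 1 is kept because the text applies it to the images themselves. The sketch assumes the two eye centers and the two outer corners are distinct points, since otherwise the divisions are undefined.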

In one embodiment, the DNN is similar to GoogLeNet, with the major difference that the GL loss defined herein is used. The input of the DNN can be a face image. The system changes the size of the filters in the first layer to fit the input and randomly initializes the weights of these filters. To embed the landmarks in the DNN, the number of filters in the first fully connected layer can be changed, and the number of output filters can be set to the number of AUs. The system can use a single DNN to detect all AUs in images of facial expressions.
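The landmark embedding can be pictured with a small sketch: the appearance features from the first fully connected layer are concatenated with the 132 normalized landmark coordinates, and an output layer with one unit per AU follows. The function names, the single linear output layer, and the sigmoid activation are assumptions for illustration only; the disclosure does not specify them.

```python
import math


def embed_landmarks(appearance_features, normalized_landmarks):
    """Concatenate appearance features with flattened (x, y) landmark points."""
    flat = [coord for point in normalized_landmarks for coord in point]
    return list(appearance_features) + flat


def au_probabilities(features, weights, biases):
    """One output unit per AU; the sigmoid is an assumed multi-label activation."""
    probs = []
    for row, bias in zip(weights, biases):
        activation = sum(w * f for w, f in zip(row, features)) + bias
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs
```

With 66 landmarks, the concatenated vector is 132 entries longer than the appearance feature vector, which is why the size of the layers after the concatenation has to be adjusted.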

The weights of the second part of the DNN can be optimized using backpropagation and the global loss defined above.

In some embodiments, data augmentation can be performed by adding random noise to the 2D landmark points and applying the affine transformations described above.

In some embodiments, the system can be trained, using a training database as described above, to initialize the recognition of AUs in the wild.

FIG. 7 shows a network system 700 for detecting AUs and emotion categories in video and/or still images using a deep neural network (DNN). System 700 includes an image database component 710 containing a set of videos and/or images. System 700 includes a DNN 720 that determines AUs in the image set of image database 710. DNN 720 includes a first portion 730 that defines landmarks in the set of images as described above. DNN 720 includes a second portion 740 that determines AUs from the landmarks of the image set in database component 710 as described above. System 700 includes a tagging component 750 that tags each image with at least one AU or with no AU. System 700 can store the tagged images in a processed image database 760.
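The data flow of FIG. 7 can be sketched as a simple loop over the image database. The names `TaggedImage` and `tag_images`, the callable interfaces for the two DNN portions, and the 0.5 decision threshold are hypothetical and are used only to show how the components connect.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TaggedImage:
    image_id: str
    aus: List[int]          # empty list means "tagged with no AU"


def tag_images(images: Dict[str, Any],
               detect_landmarks: Callable[[Any], list],
               recognize_aus: Callable[[Any, list], Dict[int, float]],
               threshold: float = 0.5) -> Dict[str, TaggedImage]:
    """Run both DNN portions on every image and collect the tags."""
    processed = {}
    for image_id, image in images.items():            # image database 710
        landmarks = detect_landmarks(image)           # first portion 730
        scores = recognize_aus(image, landmarks)      # second portion 740
        active = sorted(au for au, s in scores.items() if s >= threshold)
        processed[image_id] = TaggedImage(image_id, active)   # tagging 750
    return processed                                  # processed database 760
```

An image whose scores all fall below the assumed threshold is still stored, with an empty AU list, which mirrors the option of tagging an image with no AU.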

Exemplary computer device

FIG. 8 shows an exemplary computer that can be used to configure hardware devices in an industrial automation system. In various aspects, the computer of FIG. 8 may comprise all or a portion of the development workspace 100, as described herein. As used herein, "computer" may include a plurality of computers. The computer may include one or more hardware components such as, for example, a processor 821, a random access memory (RAM) module 822, a read-only memory (ROM) module 823, a storage 824, a database 825, one or more input/output (I/O) devices 826, and an interface 827. Alternatively and/or additionally, controller 820 may include one or more software components such as, for example, a computer-readable medium including computer-executable instructions for performing a method associated with the exemplary embodiments. It is contemplated that one or more of the hardware components listed above may be implemented using software. For example, storage 824 may include a software partition associated with one or more other hardware components. It is understood that the components listed above are exemplary only and not intended to be limiting.

Processor 821 may include one or more processors, each configured to execute instructions and process data to perform one or more functions associated with a computer for indexing images. Processor 821 may be communicatively coupled to RAM 822, ROM 823, storage 824, database 825, I/O devices 826, and interface 827. Processor 821 may be configured to execute sequences of computer program instructions to perform various processes. The computer program instructions may be loaded into RAM 822 for execution by processor 821. As used herein, a processor refers to a physical hardware device that executes encoded instructions for performing functions on inputs and creating outputs.

RAM 822 and ROM 823 may each include one or more devices for storing information associated with the operation of processor 821. For example, ROM 823 may include a memory device configured to access and store information associated with controller 820, including information for identifying, initializing, and monitoring the operation of one or more components and subsystems. RAM 822 may include a memory device for storing data associated with one or more operations of processor 821. For example, ROM 823 may load instructions into RAM 822 for execution by processor 821.

Storage 824 may include any type of mass storage device configured to store information that processor 821 may need to perform processes consistent with the disclosed embodiments. For example, storage 824 may include one or more magnetic and/or optical disk devices, such as hard drives, CD-ROMs, DVD-ROMs, or any other type of mass media device.

Database 825 may include one or more software and/or hardware components that cooperate to store, organize, sort, filter, and/or arrange data used by controller 820 and/or processor 821. For example, database 825 may store hardware and/or software configuration data associated with input/output hardware devices and controllers, as described herein. It is contemplated that database 825 may store additional and/or different information than that listed above.

I/O devices 826 may include one or more components configured to communicate information with a user associated with controller 820. For example, the I/O devices may include a console with an integrated keyboard and mouse to allow a user to maintain a database of images, update associations, and access digital content. I/O devices 826 may also include a display including a graphical user interface (GUI) for outputting information on a monitor. I/O devices 826 may also include peripheral devices such as, for example, a printer for printing information associated with controller 820, a user-accessible disk drive (e.g., a USB port, a floppy, CD-ROM, or DVD-ROM drive) to allow a user to input data stored on a portable media device, a microphone, a speaker system, or any other suitable type of interface device.

Interface 827 may include one or more components configured to transmit and receive data via a communication network, such as the Internet, a local area network, a workstation peer-to-peer network, a direct link network, a wireless network, or any other suitable communication platform. For example, interface 827 may include one or more modulators, demodulators, multiplexers, demultiplexers, network communication devices, wireless devices, antennas, modems, and any other type of device configured to enable data communication via a communication network.

While the methods and systems have been described in connection with preferred embodiments and specific examples, the scope is not intended to be limited to the particular embodiments set forth, as the embodiments herein are intended in all respects to be illustrative rather than restrictive.

Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order.
Accordingly, where a method claim does not actually recite an order to be followed by its steps, or it is not otherwise specifically stated in the claims or description that the steps are to be limited to a specific order, it is in no way intended that an order be inferred in any respect. This holds for any possible non-express basis for interpretation, including matters of logic with respect to the arrangement of steps or operational flow, plain meaning derived from grammatical organization or punctuation, and the number or type of embodiments described in the specification. Throughout this application, various publications may be referenced. The disclosures of these publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art to which the methods and systems pertain. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the scope or spirit. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit being indicated by the following claims.

Claims

A computer-implemented method for analyzing images to determine AUs and AU intensities, the method comprising:
maintaining a plurality of kernel spaces of morphological, shape, and shading features, wherein each kernel space is non-linearly separable from the other kernel spaces, and each kernel space is associated with one or more action units (AUs) and one or more AU intensity values;
receiving a plurality of images to be analyzed; and
for each received image:
determining face space data of morphological, shape, and shading features of a face in the image, wherein the face space includes a shape feature vector, a morphological feature vector, and a shading feature vector associated with shading changes in the face; and
determining zero, one, or more AU values for the image by comparing the determined face space data of the morphological features to the plurality of kernel spaces to determine the presence of the determined face space data of morphological, shape, and shading features.

The method according to claim 1, further comprising:
processing, in real time, a video stream comprising the plurality of images to determine an AU value and an AU intensity value for each of the plurality of images.

The method according to claim 1, wherein
the face space data includes a shape feature vector, a morphological feature vector, and a shading feature vector associated with the face.

The method according to claim 3, wherein
the determined face space data of the morphological features includes distance and angle values between normalized landmarks in Delaunay triangles formed from the image, and the angles defined by each of the Delaunay triangles corresponding to the normalized landmarks.

The method according to claim 3, wherein
the shading feature vector associated with shading changes in the face
is determined by applying Gabor filters to the normalized landmark points determined from the face.

The method according to claim 3, wherein the image features include landmark points derived using a deep neural network comprising a global-local (GL) loss function, and image features for identifying AUs, AU intensities, emotion categories, and their intensities, derived using a deep neural network comprising a global-local (GL) loss function configured to backpropagate both the local and the global fit of landmark points projected onto the image.

The method of claim 1, wherein the AU value and the AU intensity value together define emotion and emotion intensity.

The method according to claim 1, wherein the image comprises a picture.

The method according to claim 1, wherein the image comprises a frame of a video sequence.

The method of claim 1 wherein the system uses video sequences from a controlled or non-controlled environment.

The method according to claim 1, wherein the system uses a black and white image or a color image.

The method according to claim 1, further comprising:
receiving an image; and
processing the received image to determine AU values and AU intensity values of a face in the received image.

The method according to claim 1, further comprising:
receiving a first plurality of images from a first database;
receiving a second plurality of images from a second database; and
processing the received first plurality of images and second plurality of images to determine, for each image, an AU value and an AU intensity value of a face in that image,
wherein the first plurality of images has a first acquisition form, the second plurality of images has a second acquisition form, and the first acquisition form is different from the second acquisition form.

The method according to claim 1, further comprising:
performing kernel subclass discriminant analysis (KSDA) on the face space; and
recognizing AUs, AU intensities, emotion categories, and emotion intensities based on the KSDA.

A computer-implemented method for analyzing an image to determine AUs and AU intensities using color features in the image, the method comprising:
identifying a change defining a transition of an AU from inactive to active, wherein the change is selected from the group consisting of chromaticity, hue and saturation, and luminance; and
applying a Gabor transform to the identified color change to obtain invariance to the timing of this change during the expression.

The method according to claim 15, further comprising:
maintaining, in memory, a plurality of color feature data associated with AUs and/or AU intensities;
receiving an image to be analyzed; and
for each received image:
determining color features of a face in the image resulting from the action of facial muscles; and
determining zero, one, or more AU values for the image by comparing the determined color features to the plurality of trained color feature data to determine the presence of the determined color features in one or more of the plurality of trained color feature data.

The method according to claim 15, further comprising:
analyzing a plurality of faces in images or video frames to determine kernel or face spaces for a plurality of AU values and AU intensity values, wherein each kernel or face space is associated with at least one AU value and at least one AU intensity value, each kernel or face space is linearly or non-linearly separable from the other kernel and face spaces, and the kernel or face space comprises functional color space feature data.

The method according to claim 17, wherein
the functional color space is determined by performing a discriminant function learning analysis on color images each derived from a given one of the plurality of images.

A method, comprising: identifying a change indicative of a face color transition in a video sequence resulting from blood flow in the face from non-existence to presence.

A computer-implemented method for analyzing an image to determine AUs and AU intensities in a face image, the method comprising:
training a deep neural network using a set of training images containing faces, wherein the deep neural network is trained to identify AUs in the face image; and
identifying local and global losses using the deep neural network on the face image to determine AUs in the image.

The method according to claim 20, wherein the deep neural network performs:
detecting a plurality of landmark points using a first part of the deep neural network, the landmark points facilitating calculation of the global loss; and
detecting local image changes using a second part of the deep neural network, wherein the landmark points are normalized and concatenated to embed location information of the landmark points into the deep neural network.

The method of claim 21, wherein the deep neural network comprises a first plurality of layers for identifying landmarks in the image and a second plurality of layers for recognizing AUs in the face image.

Determining a neutral face in an image with color; and
modifying the color of the image of the neutral face to produce the appearance of a determined expression of emotion or AU.

Determining a facial emotion or AU in an image with color; and
modifying the color of the image to increase or decrease the intensity of the emotion or AU so as to change the perception of the emotion or AU.