JP6788264B2

JP6788264B2 - Facial expression recognition method, facial expression recognition device, computer program and advertisement management system

Info

Publication number: JP6788264B2
Application number: JP2016191819A
Authority: JP
Inventors: 金輝陳; 兆傑羅; 康雄有木
Original assignee: Kobe University NUC
Current assignee: Kobe University NUC
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2020-11-25
Anticipated expiration: 2036-09-29
Also published as: JP2018055470A

Description

The present invention relates to a facial expression recognition method, a facial expression recognition device, a computer program, and an advertisement management system. Specifically, it relates to an image processing technique that improves the accuracy of facial expression recognition using a hierarchical convolutional neural network.

In recent years, the performance of image recognition based on deep learning has improved dramatically. Deep learning is a general term for machine learning that uses multi-layered hierarchical neural networks. A convolutional neural network (hereinafter also referred to as "CNN") is one example of such a network.
A CNN has a multi-layer stacked structure in which convolutional layers operating on local regions alternate with pooling layers, and this stacked structure is said to improve image recognition performance.

As shown in Non-Patent Document 1, deep learning with a convolutional neural network has already been used to recognize universal facial expression classes such as happiness, surprise, fear, sadness, anger, and disgust.

Non-Patent Document 1: 西銘大喜 (Daiki Nishime) and 4 others, "Acquisition of facial expression representations using a convolutional neural network" (畳み込みニューラルネットワークを用いた表情表現の獲得), 2016 Annual Conference of the Japanese Society for Artificial Intelligence, 4L1-5in1, general presentation, June 9, 2016.

In facial expression recognition using a convolutional neural network, either the pixel values (RGB values) of the original face image are input to the network as they are, without preprocessing, or principal component analysis (Principal Component Analysis) is applied to the pixel values.
For example, in Non-Patent Document 1, GCN (Global Contrast Normalization) is performed as preprocessing on the original face image.

Thus, conventional approaches either use the pixel values (raw data) of the original image as they are, or apply only preprocessing that extracts a single feature factor from the original image. This is considered to be one of the factors that limit further improvement in the accuracy of facial expression recognition.
In view of this conventional problem, an object of the present invention is to improve the accuracy of facial expression recognition using a hierarchical neural network.

(1) The facial expression recognition method of the present invention is a method for recognizing a facial expression included in a captured image, and includes: a learning step of causing a hierarchical neural network to learn parameters, using as input data a learning image group having at least three types of information including facial unevenness information, texture information, and contour information; and a recognition step of extracting features related to at least the three types of information from the captured image to generate a plurality of input images, and causing the trained hierarchical neural network to recognize the facial expression included in the captured image, using the generated plurality of input images as input data.

According to the facial expression recognition method of the present invention, in the learning step, the input data used for learning the parameters of the hierarchical neural network consists of a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information.
In the recognition step, the input data used for facial expression recognition by the trained hierarchical neural network consists of a plurality of input images generated by extracting features related to at least these three types of information from the captured image.

Therefore, the accuracy of facial expression recognition using a hierarchical neural network can be improved compared with conventional techniques that apply no preprocessing before input to the network, or that apply preprocessing extracting only a single feature factor (see FIG. 8).

(2) In the facial expression recognition method of the present invention, specifically, the hierarchical neural network is a convolutional neural network.
The reason is that a convolutional neural network can achieve high performance in image recognition, including facial expression recognition.

(3) In the facial expression recognition method of the present invention, it is preferable that the unevenness information is the direction angle of the gradient vector of the pixel values at each pixel point, the texture information is the norm of the gradient vector of the pixel values at each pixel point, and the contour information is the position information of pixel points at which the pixel value changes sharply.
The reason is that the direction angle (A), the norm of the gradient vector (G), and the contour information (E) can all be computed with existing open-source software, so adopting these parameters makes the present invention relatively easy to implement.
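As an illustration of the three feature types, the following is a minimal NumPy sketch, not the implementation described in the patent. The function name `age_features`, the parameter `edge_ratio`, and the use of a simple gradient-magnitude threshold as the contour detector are all assumptions made for this example; the text only states that A, G, and E can be computed with existing open-source software.

```python
import numpy as np

def age_features(gray, edge_ratio=0.5):
    """Return an (H, W, 3) array holding A, G, and E for a 2-D grayscale image.

    A: direction angle of the pixel-value gradient (radians, in [-pi, pi])
    G: norm of the pixel-value gradient
    E: binary map of pixels where the value changes sharply
       (hypothetical stand-in: gradient norm above a fraction of its maximum)
    """
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)        # derivatives along rows (y) and columns (x)
    angle = np.arctan2(gy, gx)        # A
    norm = np.hypot(gx, gy)           # G
    edges = (norm > edge_ratio * norm.max()).astype(np.float64)  # E (assumed rule)
    return np.stack([angle, norm, edges], axis=-1)

# Toy image with a vertical step between columns 3 and 4.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
feat = age_features(img)
```

The three channels can then be fed to the network in place of the three RGB channels. A dedicated edge detector could be substituted for the threshold rule without changing the interface.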

(4) In the facial expression recognition method of the present invention, the learning step specifically includes: a generation step of generating the learning image group by extracting features related to at least the three types of information from each of a plurality of sample images; and an update step of updating the parameters of the hierarchical neural network based on the recognition result that the network outputs when the generated learning image group is used as input data.

(5) In this case, the generation step preferably includes a process of applying horizontal reflection to the face image extracted from the sample image.
This doubles the number of learning images obtained from the same number of sample images, which reduces the effort of collecting labeled sample images. A deeper reason is that the deep learning classifiers commonly used for image processing are not invariant to reflection. As a result, when the same object is photographed from different directions, the features extracted from it differ, which lowers recognition accuracy. Adding horizontal reflection to the face images used for learning can therefore improve recognition accuracy.
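The doubling of the learning images by horizontal reflection can be sketched as follows. This is an illustrative example only; the function name `augment_with_horizontal_flip` and the array layout (sample, height, width) are assumptions, not details taken from the patent.

```python
import numpy as np

def augment_with_horizontal_flip(faces, labels):
    """Append a left-right mirrored copy of every face image.

    faces:  array of shape (num, H, W) or (num, H, W, C)
    labels: array of shape (num,); a mirrored face keeps its expression label
    """
    faces = np.asarray(faces)
    labels = np.asarray(labels)
    flipped = faces[:, :, ::-1]  # reverse the width axis
    return (np.concatenate([faces, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

faces = np.arange(12).reshape(2, 2, 3)   # two toy 2x3 "images"
labels = np.array([0, 1])
aug_faces, aug_labels = augment_with_horizontal_flip(faces, labels)
```

Because the expression label does not change under mirroring, the labels are simply repeated for the flipped copies.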

(6) The facial expression recognition device of the present invention is a device for recognizing a facial expression included in a captured image, and includes: a processing unit having a hierarchical neural network whose parameters have been learned using, as input data, a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information; an image generation unit that extracts features related to at least the three types of information from the captured image to generate a plurality of input images and inputs the generated plurality of input images to the processing unit; and an output unit that outputs to the outside, as the facial expression of the face in the captured image, the recognition result output by the trained hierarchical neural network when the plurality of input images are used as input data.

According to the facial expression recognition device of the present invention, the input data used for learning the parameters of the hierarchical neural network in the processing unit consists of a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information.
In addition, the input data that the image generation unit generates for facial expression recognition by the trained hierarchical neural network consists of a plurality of input images generated by extracting features related to at least these three types of information from the captured image.

(7) The computer program of the present invention is a computer program for causing a computer device capable of image processing to execute a process of recognizing a facial expression included in a captured image, and includes the same steps as the facial expression recognition methods (1) to (5) described above.
Therefore, the computer program of the present invention provides the same effects as the facial expression recognition methods (1) to (5) described above.

(8) The advertisement management system of the present invention includes an advertisement display device, a photographing device that photographs a viewer of the advertisement image displayed by the advertisement display device, and a control device having the facial expression recognition device described above. The control device executes a recognition process of recognizing the viewer's facial expression from the captured image, including the viewer, taken by the photographing device; an aggregation process of aggregating the facial expression recognition results; and a presentation process of presenting the aggregated results to the manager of the advertisement.

According to the advertisement management system of the present invention, the control device executes the recognition process of recognizing the viewer's facial expression from the captured image, including the viewer, taken by the photographing device, the aggregation process of aggregating the facial expression recognition results, and the presentation process of presenting the aggregated results to the manager of the advertisement. The manager can therefore examine how effective the current advertisement image is based on the presented results.
This allows the manager to decide whether to stop or continue the advertisement using the current advertisement image, or to modify the current advertisement image.

The present invention can be realized not only as a system and a device having the characteristic configurations described above, but also as a computer program that causes a computer to execute those characteristic configurations.
The present invention can also be realized as one or more semiconductor integrated circuits that implement part or all of the system and the device.

According to the present invention, the accuracy of facial expression recognition using a hierarchical neural network can be improved.

FIG. 1 is a block diagram of an image processing device according to an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit.
FIG. 3 is a conceptual diagram of the processing performed by a convolutional layer.
FIG. 4 is a conceptual diagram of the structure of a receptive field.
FIG. 5 is an explanatory diagram of the processing performed by the image generation unit.
FIG. 6 is an explanatory diagram showing a specific example of a facial expression recognition method using the image processing device.
FIG. 7 is a structural diagram of the deep CNN constructed in the CNN processing unit.
FIG. 8 is a graph comparing the error rate when the input image is an AGE image with the error rate when the input image is an RGB image.
FIG. 9 is an overall configuration diagram of the advertisement management system of the present embodiment.
FIG. 10 is a bar graph showing an example of aggregated facial expression recognition results.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings. At least some of the embodiments described below may be combined as desired.

[Overall configuration of the image processing device]
FIG. 1 is a block diagram of an image processing device 1 according to an embodiment of the present invention.
As shown in FIG. 1, the image processing device 1 of the present embodiment includes, for example, a GPU (Graphics Processing Unit) mounted on a PC (Personal Computer), which is not shown. The image processing device 1 includes an image generation unit 2, a CNN processing unit 3, a learning unit 4, and an output unit 5 as functional units realized by a computer program stored in the memory of the PC.

The image generation unit 2 generates input images (hereinafter also referred to as "input data") for the CNN processing unit 3 by, for example, extracting predetermined features from a labeled sample image 7 or from a captured image 8 to be recognized. The image generation unit 2 inputs the input images to the CNN processing unit 3.
The CNN processing unit 3 executes recognition processing using the CNN (in the present embodiment, recognition of the expression of a face image) on the input data, and inputs the recognition result (specifically, the probability of each classification class, for example) to the learning unit 4 or the output unit 5.

Specifically, when the CNN is trained using the labeled sample images 7, the CNN processing unit 3 inputs the recognition result to the learning unit 4.
On the other hand, when the trained CNN processing unit 3 identifies the classification class of a face image included in the captured image 8 (in the present embodiment, the type of expression of the face image), that is, when the image processing device 1 operates as a facial expression classifier, the CNN processing unit 3 inputs the recognition result to the output unit 5.

The learning unit 4 updates the parameters (weights and biases) held by the CNN processing unit 3 based on the input recognition result, and stores the updated parameters in the CNN processing unit 3.
The output unit 5 identifies the classification class of the input image based on the input recognition result. Specifically, the classification class with the highest probability among those input from the CNN processing unit 3 is taken as the classification class of the input image. The classification result output by the output unit 5 is reported to the operator of the PC by, for example, being displayed on the PC's display.

[Processing content of the CNN processing unit]
(CNN configuration example)
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit 3.
As shown in FIG. 2, the CNN constructed in the CNN processing unit 3 includes three types of arithmetic processing layers, namely convolutional layers (also referred to as "downsampling layers") C1 and C2, pooling layers P1 and P2, and a fully connected layer F, as well as a final layer E, which is the output layer of the CNN.

The pooling layers P1 and P2 are placed after the convolutional layers C1 and C2, respectively, and the fully connected layer F is placed after the last pooling layer P2. The final layer E of the CNN contains as many final nodes as the preset number of classification classes (10 in FIG. 2).
FIG. 2 illustrates a case with two convolutional layers C1 and C2 and two corresponding pooling layers P1 and P2. However, there may be three or more convolutional layers and pooling layers. At least one fully connected layer F is provided.

The $j$-th node in a given layer C1, P1, C2, or P2 receives inputs $x_i$ ($i = 1, 2, \dots, m$) from the $m$ nodes of the immediately preceding layer, and computes an intermediate variable $u_j$ by adding a bias to their weighted sum. That is, $u_j$ is calculated by the following equation, where $w_{ij}$ is a weight and $b_j$ is a bias.

$$u_j = \sum_{i=1}^{m} w_{ij} x_i + b_j$$

The response $y_j$ obtained by applying the activation function $a(\cdot)$, a nonlinear function, to the intermediate variable $u_j$, that is, $y_j = a(u_j)$, is the output of the node in this layer, and this output is input to the next layer.
The activation function $a$ is, for example, the sigmoid function or $a(x_j) = \max(x_j, 0)$. The latter is called "ReLU (Rectified Linear Unit)". ReLU has been widely used in recent years because it contributes to good convergence and faster learning.
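The node computation described above (weighted sum, bias, then activation) can be sketched in NumPy as follows. The function names and the toy numbers are assumptions chosen for illustration and do not come from the patent.

```python
import numpy as np

def relu(u):
    """ReLU activation: a(u) = max(u, 0), applied element-wise."""
    return np.maximum(u, 0.0)

def sigmoid(u):
    """Sigmoid activation: a(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def layer_forward(x, w, b, activation=relu):
    """Compute y_j = a(u_j) with u_j = sum_i w[i, j] * x[i] + b[j].

    x: (m,) inputs from the previous layer
    w: (m, n) weights, w[i, j] connects input i to node j
    b: (n,) biases
    """
    u = x @ w + b
    return activation(u)

x = np.array([1.0, 2.0])
w = np.array([[1.0, -1.0],
              [0.5, -2.0]])
b = np.array([0.0, 1.0])
y = layer_forward(x, w, b)   # u = [2.0, -4.0], so ReLU gives [2.0, 0.0]
```

Passing `activation=sigmoid` switches to the other activation function mentioned in the text.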

Near the output layer of the CNN, one or more fully connected layers F, in which all nodes of adjacent layers are connected to each other, are provided. The final layer E, which gives the output of the CNN, is designed in the same way as in an ordinary neural network.
When the purpose is to classify the input image, as many nodes as there are classification classes are placed in the final layer E, and the softmax function is used as the activation function $a$ of the final layer E.

Specifically, the following equation is calculated from the inputs $u_j$ ($j = 1, 2, \dots, n$) to the $n$ nodes. At recognition time, the index $j = \operatorname{argmax}_j \, p_j$ of the node for which $p_j$ takes the maximum value is selected as the estimated class.

$$p_j = \frac{\exp(u_j)}{\sum_{k=1}^{n} \exp(u_k)}$$
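A short sketch of the softmax normalization and the class selection follows. It is illustrative only: the function name `softmax` and the input values are assumptions, and the subtraction of the maximum is a common numerical-stability device that is not mentioned in the text (it does not change the result).

```python
import numpy as np

def softmax(u):
    """Return p_j = exp(u_j) / sum_k exp(u_k) for a 1-D array of node inputs."""
    u = np.asarray(u, dtype=np.float64)
    e = np.exp(u - u.max())   # shifting by the max avoids overflow
    return e / e.sum()

u = [1.0, 3.0, 0.5]                 # toy inputs to the final-layer nodes
p = softmax(u)                      # class probabilities
predicted_class = int(np.argmax(p)) # index of the most probable class
```

In this toy case the second node has the largest input, so `predicted_class` is 1 (zero-based).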

(Processing content of the convolutional layers)
FIG. 3 is a conceptual diagram of the processing performed by the convolutional layers C1 and C2.
As shown in FIG. 3, the input to the convolutional layers C1 and C2 takes the form of N images (N channels), each S×S pixels in height and width.
Hereinafter, an image in this format is written as S×S×N, and an S×S×N input is written as $x_{ijk}$, where $(i, j, k) \in [0, S-1] \times [0, S-1] \times [1, N]$.

In the CNN, the number of channels of the first input layer (convolutional layer C1) is N = 1 if the input image is grayscale, and N = 3 (the three RGB channels) if it is color.
In the convolutional layers C1 and C2, a filter (also called a "kernel") is convolved with the input $x_{ijk}$.

This calculation is basically the same as filter convolution in general image processing, for example, blurring an image by two-dimensionally convolving a small image with the input image (Gaussian filter) or emphasizing edges (sharpening filter).
Specifically, a two-dimensional filter of size L×L is convolved with the S×S pixels of the input $x_{ijk}$ of each channel $k$ ($k = 1$ to $N$), and the results are summed over all channels $k = 1$ to $N$. The result of this calculation is a one-channel image $u_{ij}$.

If the filter is defined as $w_{ijk}$, where $(i, j, k) \in [0, L-1] \times [0, L-1] \times [1, N]$, then $u_{ij}$ is calculated by the following equation.

$$u_{ij} = \sum_{k=1}^{N} \sum_{(p,q) \in P_{ij}} x_{pqk}\, w_{p-i,\,q-j,\,k} + b_k$$

Here, $P_{ij}$ is the square region of L×L pixels that has the pixel $(i, j)$ of the image as its vertex. That is, $P_{ij}$ is defined by the following equation.

$$P_{ij} = \{\, (i + i',\, j + j') \mid i' = 0, \dots, L-1,\; j' = 0, \dots, L-1 \,\}$$

$b_k$ is a bias. In the present embodiment, the bias is shared by all output nodes of each channel. That is, $b_{ijk} = b_k$.
The filter may also be applied at intervals of several pixels rather than at every pixel. That is, for a given number of pixels $s$, $P_{ij}$ may be defined as in the following equation, and $u_{ij}$ may be calculated with $w_{p-i,\,q-j,\,k}$ replaced by $w_{p-si,\,q-sj,\,k}$. This pixel interval $s$ is called the "stride".

$$P_{ij} = \{\, (si + i',\, sj + j') \mid i' = 0, \dots, L-1,\; j' = 0, \dots, L-1 \,\}$$

The $u_{ij}$ calculated as described above is then passed through the activation function $a(\cdot)$ to become the output $y_{ij}$ of the convolutional layers C1 and C2. That is, $y_{ij} = a(u_{ij})$.
As a result, for each filter $w_{ijk}$, a one-channel S×S output $y_{ij}$ with the same height and width as the input $x_{ijk}$ is obtained.

If N' such filters are prepared and the above calculation is performed independently for each, an S×S output for N' channels, that is, an output $y_{ijk}$ of size S×S×N' (where $(i, j, k) \in [0, S-1] \times [0, S-1] \times [1, N']$), is obtained.
This N'-channel output $y_{ijk}$ becomes the input $x_{ijk}$ to the next layer. FIG. 3 shows the calculation for one of the N' filters.
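The convolutional-layer calculation can be sketched with plain NumPy loops as below. This is a simplified illustration under stated assumptions, not the patent's implementation: the function name `conv_layer` and the array layouts are hypothetical, ReLU is assumed as the activation, and no border padding is applied, so the output is smaller than S×S, whereas the text describes an output of the same size as the input (how the borders are handled is not specified in this section).

```python
import numpy as np

def conv_layer(x, filters, biases, stride=1):
    """Apply N' filters to an S x S x N input and return ReLU responses.

    x:       (S, S, N) input
    filters: (N_out, L, L, N), one L x L x N filter per output channel
    biases:  (N_out,), one bias shared by all nodes of an output channel
    stride:  pixel interval s at which each filter is applied
    """
    S = x.shape[0]
    n_out, L = filters.shape[0], filters.shape[1]
    s_out = (S - L) // stride + 1            # output size without padding
    u = np.zeros((s_out, s_out, n_out))
    for k_out in range(n_out):
        for i in range(s_out):
            for j in range(s_out):
                # L x L x N region whose top-left corner is (s*i, s*j)
                patch = x[i * stride:i * stride + L,
                          j * stride:j * stride + L, :]
                # sum over the region and over all input channels, plus bias
                u[i, j, k_out] = np.sum(patch * filters[k_out]) + biases[k_out]
    return np.maximum(u, 0.0)                # y = a(u) with ReLU

x = np.ones((4, 4, 2))
filters = np.ones((3, 2, 2, 2))
biases = np.zeros(3)
y = conv_layer(x, filters, biases)            # every response is 2*2*2 = 8
y_strided = conv_layer(x, filters, biases, stride=2)
```

The explicit loops mirror the summation over the region and the channels; a production implementation would normally rely on a vectorized library routine instead.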

The above calculation can be expressed as a single-layer network whose nodes are connected between layers in a special way, as shown for example in FIG. 4. FIG. 4 is a conceptual diagram of the structure of a receptive field. In the left part of the figure the receptive field is represented as a rectangle, and in the right part it is represented as nodes.
Specifically, each node in the upper layer is connected to only some of the nodes in the lower layer (this is called a "local receptive field"). In addition, the connection weights are shared among the nodes (this is called "weight sharing").

(Processing content of the pooling layers)
As shown in FIG. 2, the pooling layers P1 and P2 are paired with the convolutional layers C1 and C2. The outputs of the convolutional layers C1 and C2 are therefore the inputs to the pooling layers P1 and P2, and these inputs have the form S×S×N.
The purpose of the pooling layers P1 and P2 is to discard part of the information about where in the image the filter response was strong, thereby making the response invariant to slight changes in the position of a feature.

As in the convolutional layers C1 and C2, a node $(i, j)$ of the pooling layers P1 and P2 has a local receptive field $P_{ij}$ in the layer on its input side. The node $(i, j)$ aggregates the outputs $y_{pq}$ of the nodes $(p, q) \in P_{ij}$ inside the local receptive field $P_{ij}$ into a single value.
The size of the local receptive field $P_{ij}$ of the pooling layers P1 and P2 is set independently of that of the convolutional layers C1 and C2 (the filter size).

When the input has multiple channels, the above processing is performed for each channel. That is, the convolutional layers C1 and C2 and the pooling layers P1 and P2 have the same number of output channels.
Pooling is performed by subsampling in the vertical and horizontal directions $(i, j)$ of the image. That is, a stride $s$ of 2 or more is set. For example, when $s = 2$, the height and width of the output are half those of the input, and the number of output nodes of the pooling layer is $1/s^2$ times the number of input nodes.

Methods for aggregating the inputs from the nodes inside the receptive field $P_{ij}$ into a single value include "average pooling" and "max pooling".
Average pooling outputs the average of the inputs $x_{pqk}$ from the nodes belonging to $P_{ij}$, as in the following equation.

$$y_{ijk} = \frac{1}{|P_{ij}|} \sum_{(p,q) \in P_{ij}} x_{pqk}$$

Max pooling outputs the maximum of the inputs $x_{pqk}$ from the nodes belonging to $P_{ij}$, as in the following equation. Average pooling was the mainstream in early CNN research, but max pooling is now generally adopted.

$$y_{ijk} = \max_{(p,q) \in P_{ij}} x_{pqk}$$

Unlike the convolutional layers C1 and C2, the pooling layers P1 and P2 have no weights that change through learning, and no activation function is applied.
Either average pooling or max pooling may be adopted in the CNN of the present embodiment, but the CNN implementation example shown in FIG. 7 uses max pooling.
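The two pooling methods can be sketched as below. The function name `pool`, its parameters, and the toy input are assumptions made for this illustration; only the averaging and maximum operations themselves follow the description above.

```python
import numpy as np

def pool(x, size=2, stride=2, mode="max"):
    """Pool an S x S x N input channel by channel.

    size:   height and width of the local receptive field
    stride: interval s between neighboring receptive fields
    mode:   "max" for max pooling, anything else for average pooling
    """
    S, _, N = x.shape
    s_out = (S - size) // stride + 1
    out = np.empty((s_out, s_out, N))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(s_out):
        for j in range(s_out):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size, :]
            out[i, j, :] = reduce_fn(window, axis=(0, 1))  # per-channel reduction
    return out

x = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
y_max = pool(x, mode="max")    # [[5, 7], [13, 15]] in the single channel
y_avg = pool(x, mode="avg")    # [[2.5, 4.5], [10.5, 12.5]]
```

With `stride=2` the 4×4 input becomes 2×2, so the number of nodes per channel drops to one quarter, matching the $1/s^2$ factor.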

[Processing content of the learning unit]
CNN training is basically "supervised learning". In the present embodiment as well, the learning unit 4 performs supervised learning.
Specifically, the learning unit 4 performs learning on a set containing a large number of labeled sample images serving as training data, by minimizing the classification error of each sample image. This processing is described below.

Each node of the final layer E of the CNN processing unit 3 outputs the probability $p_j$ ($j = 1, 2, \dots, n$) of the corresponding class through normalization by the softmax function ([Equation 2] described above). This probability $p_j$ is input to the learning unit 4.
The learning unit 4 updates the parameters, such as the weights, set in the CNN processing unit 3 so as to minimize the classification error calculated from the input probabilities $p_j$.

Specifically, the learning unit 4 calculates the discrepancy between the ideal outputs d_1, d_2, ..., d_n (the label) for an input sample and the actual outputs p_1, p_2, ..., p_n using the cross entropy C of the following equation. This cross entropy C is the classification error.

The target outputs d_1, d_2, ..., d_n are set so that d_j = 1 only for the correct class j and d_k = 0 for all other k (≠ j).
The learning unit 4 adjusts the filter coefficients w_ijk and the node biases b_k of the convolutional layers C1 and C2, as well as the weights and biases of the fully connected layer F arranged on the output-layer side of the CNN, so that the above cross entropy C decreases.
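As a non-limiting illustration, the calculation of the classification error may be sketched as follows. The sketch assumes the commonly used form C = −Σ_j d_j log p_j for the cross entropy; the function names and the four-class example are likewise assumptions introduced for illustration.

```python
import math


def softmax(u):
    """Normalize the raw outputs u into probabilities p_j that sum to 1."""
    shift = max(u)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - shift) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(correct_class, n):
    """Target outputs: d_j = 1 for the correct class, d_k = 0 otherwise."""
    return [1.0 if k == correct_class else 0.0 for k in range(n)]


def cross_entropy(d, p):
    """Assumed form C = -sum_j d_j * log(p_j); terms with d_j = 0 are skipped."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0)


# Hypothetical example: four classes, all raw outputs equal.
p = softmax([0.0, 0.0, 0.0, 0.0])
C = cross_entropy(one_hot(2, 4), p)
```

Because the target is one-hot, only the probability assigned to the correct class contributes to C, so C decreases as that probability approaches 1.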

Stochastic gradient descent is used to minimize the classification error C. The learning unit 4 calculates the error gradients with respect to the weights and biases (∂C/∂w_ij) by the error backpropagation method (BP method). The calculation by the BP method is the same as for an ordinary neural network.
However, when the CNN processing unit 3 adopts max pooling, the forward pass for each training sample records which node in each pooling region was selected, and the backward pass connects only to that node (with a weight of 1).
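As a non-limiting illustration, the handling of max pooling during backpropagation may be sketched as follows. The function names, the non-overlapping 2 × 2 windows, and the sample values are assumptions introduced for illustration.

```python
def max_pool_forward(x, size):
    """Non-overlapping max pooling that also records the selected node.

    Returns the pooled map and, for each pooling region, the (row, col)
    position of the node whose value was selected.
    """
    pooled, winners = [], []
    for i in range(0, len(x) - size + 1, size):
        pooled_row, winner_row = [], []
        for j in range(0, len(x[0]) - size + 1, size):
            window = [(x[p][q], (p, q))
                      for p in range(i, i + size) for q in range(j, j + size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            winner_row.append(position)
        pooled.append(pooled_row)
        winners.append(winner_row)
    return pooled, winners


def max_pool_backward(grad_pooled, winners, shape):
    """Route each output gradient to the recorded node only (weight 1)."""
    rows, cols = shape
    grad_input = [[0.0] * cols for _ in range(rows)]
    for i, winner_row in enumerate(winners):
        for j, (p, q) in enumerate(winner_row):
            grad_input[p][q] += grad_pooled[i][j]
    return grad_input


x = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [0, 2, 9, 1],
    [4, 6, 3, 5],
]
pooled, winners = max_pool_forward(x, size=2)
grad_input = max_pool_backward([[1.0, 2.0], [3.0, 4.0]], winners, shape=(4, 4))
```

All nodes that were not selected in the forward pass receive a gradient of zero.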

The learning unit 4 may evaluate the classification error C, and update the parameters (weights and the like) based on it, over all training samples. However, from the viewpoint of convergence and computation speed, it is preferable to do so for each set (mini-batch) of several to several hundred samples. In this case, the update amount Δw_ij of the weight w_ij is determined by the following equation.
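The equation is not reproduced in this text. A form consistent with the three terms described in the surrounding paragraphs (gradient descent, momentum, weight decay) would be the following; the symbol λ for the weight decay coefficient, and the scaling of the third term by ε, are assumptions.

```latex
\Delta w_{ij}^{(t)} = -\varepsilon \frac{\partial C}{\partial w_{ij}}
  + \alpha \, \Delta w_{ij}^{(t-1)}
  - \varepsilon \lambda \, w_{ij}
```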

In the above equation, Δw_ij^(t) is the current weight update amount, and Δw_ij^(t-1) is the previous weight update amount. The first term is the correction to w_ij that reduces the error by gradient descent, and ε is the learning rate.
The second term is the momentum. Adding α (approximately 0.9) times the previous update amount suppresses the weight bias caused by the selection of the mini-batch. The third term is the weight decay, which prevents the weights from becoming excessively large. The bias b_k is updated in the same manner.
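As a non-limiting illustration, one update of a single weight with the three terms described above may be sketched as follows. The function name, the default values, and the scaling of the weight decay term by the learning rate are assumptions introduced for illustration.

```python
def momentum_update(w, grad, prev_delta, lr=0.01, alpha=0.9, weight_decay=0.0005):
    """Return the updated weight and the update amount applied to it.

    delta = -lr * grad                (first term: gradient descent)
            + alpha * prev_delta      (second term: momentum)
            - lr * weight_decay * w   (third term: weight decay, assumed form)
    """
    delta = -lr * grad + alpha * prev_delta - lr * weight_decay * w
    return w + delta, delta


# Hypothetical values; weight decay is disabled to keep the arithmetic simple.
w_new, delta = momentum_update(w=1.0, grad=2.0, prev_delta=0.5,
                               lr=0.1, alpha=0.9, weight_decay=0.0)
```

The returned delta is passed back in as prev_delta at the next mini-batch, so that each update carries over a fraction α of the previous one.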

[Processing content of the image generation unit]
FIG. 5 is an explanatory diagram of the processing performed by the image generation unit 2.
As shown in FIG. 5, the image processing executed by the image generation unit 2 includes two processes: a "face extraction process" and a "feature extraction process".

The face extraction process crops a rectangular image consisting mostly of a human face (the original face image) from a source image such as a sample image 7 or a captured image 8 (see FIG. 1).
The feature extraction process generates the input images to be supplied to the CNN processing unit 3 by emphasizing predetermined features in the rectangular image obtained by the face extraction process.

In the feature extraction process of the present embodiment, three types of input images are generated by extracting three types of features from the rectangular image: "angle", "gradient", and "contour" (edge). Hereinafter, these three types of input images are also referred to as "AGE images".
Here, let the two-dimensional coordinates of a pixel point of the rectangular image be (x, y), and let the pixel value (for example, the RGB value) at each pixel point (x, y) be I. The mathematical meanings of the angle A, the gradient G, and the contour E are then as follows.

Angle A: the direction angle of the gradient vector ∇I = (∂I/∂x, ∂I/∂y) of the pixel value I at each pixel point (x, y).
Gradient G: the norm (length) of the gradient vector ∇I = (∂I/∂x, ∂I/∂y) of the pixel value I at each pixel point (x, y).
Contour E: position information (edge image) of the pixel points (x, y) at which the pixel value I changes sharply.

The angle A represents geometrical information, such as the surface relief inside the face contained in the rectangular image (hereinafter referred to as "unevenness information").
The gradient G represents information on the texture of the skin, hair, and the like inside the face contained in the rectangular image (hereinafter referred to as "texture information").
The contour E represents the outlines of parts of the face contained in the rectangular image, such as the head, eyes, mouth, and nose (hereinafter referred to as "contour information").

Let v1, v2, and v3 be the values of the three features (angle A, gradient G, and contour E) at each pixel point (x, y), and let Da, Dg, and De be the filters that generate v1, v2, and v3, respectively, from the pixel value I of the rectangular image at each pixel point. Then the following equation holds.

In this case, the filters Da and Dg are calculated as follows.
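As a non-limiting illustration, the angle A and the gradient G may be computed from the definitions given above (direction angle and norm of the gradient vector) as follows. The use of central differences, the border handling, the single-channel image, and the function names are assumptions introduced for illustration; the formulas of the embodiment are not reproduced in this text.

```python
import math


def central_differences(img):
    """Estimate Ix = dI/dx and Iy = dI/dy for a 2-D list of intensities.

    Border pixels reuse the nearest valid neighbour (an assumption).
    """
    h, w = len(img), len(img[0])
    ix = [[(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
           for x in range(w)] for y in range(h)]
    iy = [[(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
           for x in range(w)] for y in range(h)]
    return ix, iy


def angle_and_gradient(img):
    """Return (A, G): direction angle in radians and norm of the gradient."""
    ix, iy = central_differences(img)
    angle = [[math.atan2(gy, gx) for gx, gy in zip(row_x, row_y)]
             for row_x, row_y in zip(ix, iy)]
    grad = [[math.hypot(gx, gy) for gx, gy in zip(row_x, row_y)]
            for row_x, row_y in zip(ix, iy)]
    return angle, grad


# Hypothetical 3 x 3 images: intensity increasing along x, then along y.
angle_h, grad_h = angle_and_gradient([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
angle_v, grad_v = angle_and_gradient([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
```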

The filter De is calculated, for example, as follows.

To clearly distinguish a contour from the information around it, it is desirable to adjust the gray-level (black-and-white) value of the points in each direction around a contour point (x, y) according to the angle value θ_e of that contour point, which is given by the following equation.

For example, when θ_e(x, y) = 0, the point (x, y) lies on a vertical contour, so the contour of the object can be expressed clearly by making the left side of the contour information extracted at that point darker than the right side.
As described above, the filters Da, Dg, and De yield AGE images in which the features of the angle A, the gradient G, and the contour E at each pixel point (x, y) are respectively extracted. In the feature extraction process, all pixel points (x, y) of the rectangular image are scanned once with the above filters, so that three input images containing the information of the angle A, the gradient G, and the contour E are generated from a single rectangular image.

Other image processing executed by the image generation unit 2 may include resizing the rectangular image cropped by the face extraction process and generating a horizontal reflection (mirror image) of the rectangular image.

The above image processing executed by the image generation unit 2 can be performed with open source software such as "VLfeat", "OpenCV", "ImageStone", "GIMP", and "CxImage".
The filters Da and Dg can be calculated from Ix and Iy obtained with the partial-derivative filters of VLfeat, OpenCV, or the like. For the filter De, the Sobel filter, Laplacian filter, Canny filter, or the like of OpenCV can be used.
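As a non-limiting illustration of a Sobel-based filter De, the following sketch marks the pixel points at which the Sobel gradient magnitude reaches a threshold. It is written without any library so that it is self-contained and is not the OpenCV implementation; the threshold value, the treatment of border pixels, and the function name are assumptions introduced for illustration.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edges(img, threshold):
    """Return a binary edge image (1 = contour point) for a 2-D list img.

    Border pixels are left at 0 because the 3 x 3 kernel does not fit there.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[y + a - 1][x + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[y + a - 1][x + b - 1]
                     for a in range(3) for b in range(3))
            if math.hypot(gx, gy) >= threshold:
                edges[y][x] = 1
    return edges


# Hypothetical 5 x 5 image with a vertical step between columns 1 and 2.
step_image = [[0, 0, 10, 10, 10] for _ in range(5)]
edge_image = sobel_edges(step_image, threshold=20)
```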

[Specific example of the facial expression recognition method]
FIG. 6 is an explanatory diagram showing a specific example of the facial expression recognition method using the image processing device 1.
As shown in FIG. 6, the facial expression recognition method of the present embodiment is broadly divided into two steps: a "learning step" and a "recognition step".
The learning step trains the CNN of the image processing device 1 using a plurality of sample images 7. The recognition step causes the CNN processing unit 3, which contains the trained CNN, to recognize the facial expression of the face image contained in the captured image 8.

In the learning step, the plurality of sample images 7 (labeled raw images) are converted (cropped) into face images of size 64 × 64. N in FIG. 6 is the number of images used for training the CNN.
Next, to increase the number of images from N to G, horizontal reflection (mirroring) is applied to the N images of size 64 × 64, and a patch of size 56 × 56 is extracted from each image. Here, G = 2 × N.
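As a non-limiting illustration, the increase from N images to G = 2 × N patches may be sketched as follows. The embodiment does not state where in the 64 × 64 image the 56 × 56 patch is taken from; the central position used here, together with the function names, is an assumption introduced for illustration.

```python
def horizontal_flip(img):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in img]


def center_patch(img, size):
    """Extract the central size x size patch (the position is an assumption)."""
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def augment(faces, patch_size=56):
    """Return two patches per face: one from the original, one from its mirror."""
    patches = []
    for face in faces:
        patches.append(center_patch(face, patch_size))
        patches.append(center_patch(horizontal_flip(face), patch_size))
    return patches


# Hypothetical 64 x 64 face whose pixel value equals its column index.
face = [[x for x in range(64)] for _ in range(64)]
patches = augment([face, face, face])  # N = 3, so G = 6
```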

Next, three types of input data (in the present embodiment, the AGE image group) are generated by extracting the facial unevenness information, texture information, and contour information from the G patches of size 56 × 56. That is, 3 × G AGE images of size 56 × 56 are generated from the G patches. The above processing is executed by the image generation unit 2 of the image processing device 1.

The 3 × G AGE images of size 56 × 56 (the training image groups input to the CNN) are input to the CNN processing unit 3 of the image processing device 1 in order to train the convolutional network.
In this training, the learning unit 4 adjusts parameters such as the weights and biases of the CNN processing unit 3.

In the recognition step, the captured image 8 (an unlabeled raw image) that is the target of facial expression recognition is converted (cropped) into a face image of size 56 × 56.
Next, three types of input data (in the present embodiment, AGE images) are generated by extracting the facial unevenness information, texture information, and contour information from the single face image of size 56 × 56. The above processing is executed by the image generation unit 2 of the image processing device 1.

The three AGE images of size 56 × 56 are input to the CNN processing unit 3 of the image processing device 1 for facial expression recognition of the face image.
In this facial expression recognition, the CNN processing unit 3 uses the CNN with the learned parameters to identify which of the preset facial expression classes the input AGE images belong to. The identified class is input to the output unit 5, which displays it on a PC display or the like.

[Recommended CNN structure example]
FIG. 7 is a structural diagram of the deep CNN constructed in the CNN processing unit 3.
As shown in FIG. 7, the CNN architecture for human facial expression recognition recommended by the present inventors consists of a stack of convolutional layers C1 to C4, which convert an input volume into an output volume, and fully connected layers A1 to A3.

Each of the layers C1 to C4 and A1 to A3 of the CNN has neurons arranged in three dimensions: width, height, and depth.
The width, height, and depth of the first input layer C1 are preferably 56 × 56 × 3. The neurons in the convolutional layers C2 to C4 and the fully connected layer A1 are connected only to the nodes in a small region of the preceding layer called the receptive field.

The spatial size of the output volume can be calculated by the following equation.
W2 = 1 + (W1 − K + 2P) / S
In the above equation, W1 is the size of the input volume. K is the field size of the kernel (node) of the neurons in the convolutional layer. S is the stride, that is, the distance between the centers of the receptive fields of adjacent neurons in the kernel map. P is the amount of zero padding used at the border.

In the CNN of FIG. 7, W1 = 56, K = 5, S = 2, and P = 2 for the first convolutional layer C1. Therefore, the spatial size of the output volume of the second convolutional layer C2 is W2 = 1 + (56 − 5 + 2 × 2) / 2 = 28.5, which is rounded down to 28.
The network of FIG. 7 contains seven layers with weights. The first four are the convolutional layers C1 to C4, and the remaining three are the fully connected layers A1 to A3. Dropout is applied in the fully connected layers A1 to A3.
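As a non-limiting illustration, the output size formula W2 = 1 + (W1 − K + 2P) / S with rounding down may be written as follows. The function name is an assumption introduced for illustration.

```python
def conv_output_size(w1, k, s, p):
    """Spatial output size W2 = 1 + (W1 - K + 2P) / S, rounded down."""
    return 1 + (w1 - k + 2 * p) // s


# Values given for the first convolutional layer C1 in FIG. 7.
w2 = conv_output_size(w1=56, k=5, s=2, p=2)
```

Integer division is applied to the quotient (55 // 2 = 27) before 1 is added, which gives the same value, 28, as rounding 28.5 down.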

The output of the last fully connected layer A3 is fed to the final layer fully connected to it, a 7-way softmax that produces a distribution over the seven class labels.
The neurons of the convolutional layers C2 to C4 and the fully connected layer A1 are connected to the receptive fields of the preceding layer, whereas the neurons of the fully connected layers A2 and A3 are connected to all neurons of the preceding layer.

Each of the convolutional layers C1 and C2 is followed by a batch normalization layer. Each batch normalization layer is in turn followed by a pooling layer that performs the max pooling described above.
The nonlinear mapping function for the convolutional layers C1 to C4 and the fully connected layers A1 to A3 is the rectified linear unit (ReLU).

The first convolutional layer C1 filters the 56 × 56 × 3 input image (AGE image) with 64 kernels of size 5 × 5 × 3 at a stride of 2 pixels.
The stride is the distance between the centers of the receptive fields of adjacent neurons in a kernel map. The stride is set to 1 pixel in all convolutional layers.

The input of the second convolutional layer C2 is the batch-normalized and max-pooled output of the first convolutional layer C1. The second convolutional layer C2 filters its input with 128 kernels of size 3 × 3 × 64.
The third convolutional layer C3 has 128 kernels of size 3 × 3 × 64, which are connected to the (batch-normalized and max-pooled) output of the second layer C2.

The fourth convolutional layer C4 has 128 kernels of size 3 × 3 × 128. The fully connected layers A1 to A3 each have 1024 neurons.
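As a non-limiting illustration, the number of kernel weights in each convolutional layer follows directly from the kernel counts and kernel sizes stated above. The dictionary layout and variable names are assumptions introduced for illustration, and biases are not counted.

```python
# (number of kernels, kernel height, kernel width, kernel depth) as stated in the text.
conv_layers = {
    "C1": (64, 5, 5, 3),
    "C2": (128, 3, 3, 64),
    "C3": (128, 3, 3, 64),
    "C4": (128, 3, 3, 128),
}

# Each kernel holds height * width * depth weights.
kernel_weights = {
    name: count * height * width * depth
    for name, (count, height, width, depth) in conv_layers.items()
}
total_kernel_weights = sum(kernel_weights.values())
```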

[Recommended learning example]
The present inventors actually trained the deep CNN having the structure of FIG. 7. The training was carried out on a PC equipped with an NVIDIA GTX745 4GB GPU, using the numerical analysis software "MATLAB".
The learning step of a CNN involves important settings, including parameters such as weight decay, momentum, batch size, learning rate, and the number of learning cycles. These are described below.

In the training by the present inventors, asynchronous stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005 was adopted. The following equation is the update rule for the weight w adopted here.
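The equation is not reproduced in this text. A widely used update rule that matches the stated momentum of 0.9, the stated weight decay of 0.0005, and the symbols i, m, ε, L, and D_i would be the following; it is offered as an assumed reconstruction, not as the equation of the embodiment.

```latex
\begin{aligned}
m_{i+1} &= 0.9\, m_i \;-\; 0.0005\, \varepsilon\, w_i \;-\; \varepsilon
  \left\langle \left. \frac{\partial L}{\partial w} \right|_{w_i} \right\rangle_{D_i} \\
w_{i+1} &= w_i + m_{i+1}
\end{aligned}
```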

In the above equation, i is the iteration index, and m is the momentum variable. ε is the learning rate. The third term on the right-hand side is the average, over the i-th batch D_i, of the correction to the weight w for reducing the error L, evaluated at w_i.
Increasing the batch size yields a more reliable gradient estimate and can shorten the training time, but it does not increase the maximum stable learning rate ε. It is therefore necessary to select a batch size suited to the CNN model.

Here, the results of training with batch sizes of 64, 128, 256, and 512, respectively, were compared for the convolutional layers C1 to C4. As a result, a batch size of 256 was found to be optimal for the CNN of FIG. 7.
The same learning rate was used for all layers and was adjusted manually throughout training. The learning rate was initialized to 0.1 and was divided by 10 whenever the error rate stopped improving at the current learning rate. For training, input images consisting of AGE images were supplied, and the network was trained for about 20 cycles.
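As a non-limiting illustration, the learning rate adjustment described above may be sketched as follows. In the embodiment the adjustment was made manually; the automatic criterion used here (an improvement smaller than min_improvement over the best error so far), the function name, and the sample error rates are assumptions introduced for illustration.

```python
def learning_rates_per_cycle(error_rates, initial_lr=0.1, factor=10.0,
                             min_improvement=1e-3):
    """Return the learning rate used in each training cycle.

    After a cycle whose error rate fails to improve on the best value so far
    by at least min_improvement, the learning rate is divided by factor.
    """
    lr = initial_lr
    best_error = float("inf")
    used = []
    for error in error_rates:
        used.append(lr)
        if best_error - error < min_improvement:
            lr /= factor
        best_error = min(best_error, error)
    return used


# Hypothetical error rates: no improvement is observed in the third cycle.
rates = learning_rates_per_cycle([0.9, 0.8, 0.8, 0.7])
```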

[Experimental example: effect of using AGE images as input images]
The present inventors conducted an experiment on the CNN of FIG. 7, using the SFEW (Static Facial Expression in the Wild) database, to confirm the accuracy of facial expression recognition when AGE images are used as the input images.

Each SFEW input image is assigned one of seven emotion labels: "calm", "joy", "anger", "surprise", "discomfort", "sadness", and "dislike".
Accordingly, the emotion label output by the trained CNN is also one of these seven types.

FIG. 8 is a graph comparing the error rate when the input images are AGE images with the error rate when the input images are RGB images. In FIG. 8, the horizontal axis is the number of training cycles, and the vertical axis is the error rate at each cycle.
The error rate is the probability that facial expression recognition fails. For example, an error rate of 0.6 means that, when the facial expressions of 10 people are recognized, 6 recognitions fail and 4 succeed. Existing facial expression recognition using deep CNNs achieves error rates of only about 0.6.
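As a non-limiting illustration, the error rate defined above may be computed as follows. The function name and the sample labels are assumptions introduced for illustration.

```python
def error_rate(predicted, actual):
    """Fraction of samples whose predicted class differs from the true class."""
    failures = sum(1 for p, a in zip(predicted, actual) if p != a)
    return failures / len(actual)


# Hypothetical example: 10 people, of whom 4 are recognized correctly.
actual = ["joy"] * 10
predicted = ["joy"] * 4 + ["anger"] * 6
rate = error_rate(predicted, actual)
```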

As shown in FIG. 8, when the input images are RGB images, the error rate is about 0.65 at 20 cycles. When the input images are AGE images, the error rate falls below 0.6 from 10 cycles onward.
As is clear from the graph of FIG. 8, adopting AGE images as the input images in facial expression recognition using a deep CNN improves the discriminative power of the recognition and significantly improves its performance compared with using conventional raw data (RGB images) as the input images.

[Application example of the image processing device]
FIG. 9 is an overall configuration diagram of the advertisement management system 10 of the present embodiment.
The advertisement management system 10 of the present embodiment is a management system that uses the image processing device 1 (see FIG. 1), which can recognize the facial expression of a face image contained in a captured image, for the evaluation of advertisements.

As shown in FIG. 9, the advertisement management system 10 includes an advertisement display device 11, a photographing device 12, an advertisement control device 13, and a management device 14.
The advertisement display device 11 is, for example, an LED display board or a liquid crystal display. The advertisement display device 11 displays a predetermined advertisement image received from the advertisement control device 13 on its display surface. The advertisement image may be either a still image or a moving image. The advertisement display device 11 may also be an advertising signboard on which an advertising poster is posted.

The photographing device 12 is, for example, a digital camera that generates digital images using a CCD (charge-coupled device). The photographing device 12 is attached to, for example, the upper end of the advertisement display device 11 and photographs a person who stands in front of the advertisement display device 11 and looks at the advertisement (hereinafter referred to as a "viewer").
The photographing device 12 transmits a captured image, which is a digital image containing the viewer's face, to the advertisement control device 13. The captured image may be either a still image or a moving image.

The advertisement control device 13 is a computer device that controls the advertisement display device 11 and the photographing device 12. The advertisement control device 13 includes a first communication unit 16, a second communication unit 17, a control unit 18, and a storage unit 19.
The first communication unit 16 is a communication device that communicates with the advertisement display device 11 and the photographing device 12 in accordance with a predetermined I/O interface standard. The communication between the first communication unit 16 and the advertisement display device 11 and the photographing device 12 may be either wired or wireless.

The second communication unit 17 is a communication device that communicates with the management device 14 in accordance with a predetermined communication standard such as wired or wireless LAN.
The second communication unit 17 may communicate with the management device 14 via a public communication network such as the Internet (the case shown in FIG. 9), via a private local network only, or directly. The communication between the second communication unit 17 and the management device 14 may be either wired or wireless.

The control unit 18 is a control device that includes one or more CPUs (Central Processing Units) and the GPU of the present embodiment described above (the image processing device 1 of FIG. 1).
The storage unit 19 is a storage device that includes one or more memories such as RAM (Random Access Memory) and ROM (Read Only Memory). The storage unit 19 functions as a transitory or non-transitory recording medium for the various computer programs to be executed by the control unit 18 and for the various data received from the management device 14 and the like.

As described above, the advertisement control device 13 is configured to include a computer. Each function of the advertisement control device 13 is therefore realized when a computer program stored in the storage device of the computer is executed by the CPU and GPU of the computer.
Such a computer program can be stored in a transitory or non-transitory recording medium such as a CD-ROM or a USB memory.

By reading and executing the computer programs stored in the storage unit 19, the control unit 18 can perform communication control of the first and second communication units 16 and 17 and can realize various applications useful to the administrator who operates the management device 14.
For example, when the second communication unit 17 receives an advertisement image that the management device 14 has transmitted to the advertisement control device 13, the control unit 18 controls the first communication unit 16 so as to transmit the received advertisement image to the advertisement display device 11. The advertisement display device 11 then displays the received advertisement image on its display surface.

When the first communication unit 16 receives a captured image transmitted by the photographing device 12, the control unit 18 performs facial expression recognition on the face image contained in the received captured image and controls the second communication unit 17 so as to transmit the classification result of the facial expression to the management device 14.
The storage unit 19 stores a CNN of a predetermined structure capable of performing facial expression recognition on a face image (for example, that of FIG. 7), together with the learned weights and biases for that CNN. The GPU of the control unit 18 uses the trained CNN stored in the storage unit 19 to perform facial expression recognition on the viewer's face image contained in the captured image.

The management device 14 is, for example, a server computer operated by the administrator of the advertisements. In the example of FIG. 9, only one advertisement control device 13 is connected to the management device 14, but a plurality of advertisement control devices 13 may be connected to it.
The management device 14 can execute distribution of advertisement images to one or more advertisement control devices 13. Specifically, the management device 14 transmits an advertisement image input by the administrator to a predetermined advertisement display device 11. The advertisement image shown on the advertisement display device 11 can therefore be switched.

The management device 14 can aggregate the classification results received from one or more advertisement control devices 13. Specifically, the management device 14 aggregates the many classification results received from the advertisement control devices 13 during a predetermined period. The management device 14 presents the aggregation result to the administrator by displaying it on a display as a graph or a table.
The administrator can therefore evaluate the effectiveness of the advertisement image and decide whether to continue or stop displaying it, or to modify it, based on the aggregation result presented by the management device 14.

FIG. 10 is a bar graph showing an example of the aggregated results of facial expression recognition.
In the bar graphs of FIG. 10, the horizontal axis shows the seven classification classes of human facial expression ("calm", "joy", "anger", "surprise", "discomfort", "sadness", and "dislike"). The vertical axis shows the occurrence rate of each of the seven classes.

The upper bar graph of FIG. 10 shows an aggregated result that leads to "continuation" of the advertisement image.
In this result, the proportion of "joy" is higher than that of the other classes, so it can be inferred that many viewers feel joy on seeing the current advertisement image.
Therefore, when an aggregated result like the upper bar graph of FIG. 10 is obtained, the manager should decide to continue advertising with the current advertisement image.

The lower bar graph of FIG. 10 shows an aggregated result that leads to "cancellation" or "modification" of the advertisement image.
In this result, the proportions of "discomfort" and "dislike" are higher than those of the other classes, so it can be inferred that many viewers feel discomfort on seeing the current advertisement image.
Therefore, when an aggregated result like the lower bar graph of FIG. 10 is obtained, the manager should decide either to stop advertising with the current advertisement image or to modify it.

In the example of FIG. 9, the advertisement control device 13 executes facial expression recognition and transmits the classification results to the management device 14, which aggregates them. Alternatively, the advertisement control device 13 may aggregate the classification results itself and transmit the aggregated result to the management device 14.
The advertisement control device 13 may also transfer the captured image to the management device 14, and the management device 14 may then perform facial expression recognition on the face images in the captured image and classify and aggregate the results. Furthermore, the advertisement control device 13 and the management device 14 may be implemented as a single control device consisting of one computer device in one housing.

The steps of the advertisement management method implemented by the advertisement management system 10 described above are as follows.
Step 1) While a predetermined advertisement image (still image or moving image) is displayed on the advertisement display device 11, the camera 12 photographs the viewers in front of the advertisement display device 11.

Step 2) Facial expression recognition of the face images included in the captured image is executed by a computer device such as the advertisement control device 13 or the management device 14, and the classification results are aggregated. Specifically, the proportion of each classified facial expression is calculated with the total number of recognized face images as the denominator.
Step 3) Based on the aggregated result, the manager decides whether to continue, stop, or modify the current advertisement image.
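The aggregation in Step 2 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function name `aggregate`, the list `CLASSES`, and the English class labels are assumptions made for illustration, and the input is taken to be one class label per recognized face image.

```python
from collections import Counter

# English renderings of the seven classes of FIG. 10 (label strings are assumed).
CLASSES = ["calm", "joy", "anger", "surprise", "discomfort", "sadness", "dislike"]

def aggregate(results):
    """Return the proportion of each class among the recognized face images.

    `results` holds one class label per recognized face image, so its length
    is the denominator described in Step 2.
    """
    total = len(results)
    if total == 0:
        # No faces were recognized in the period; report all-zero proportions.
        return {c: 0.0 for c in CLASSES}
    counts = Counter(results)
    return {c: counts[c] / total for c in CLASSES}

# Hypothetical example: four recognized faces, two of them classified as "joy".
ratios = aggregate(["joy", "joy", "calm", "surprise"])
```

The resulting dictionary maps each class to its occurrence rate, which is the quantity plotted on the vertical axis of FIG. 10.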

[Other modifications]
The embodiments disclosed here (including the modifications) are illustrative in all respects and are not restrictive. The scope of rights of the present invention is not limited to the embodiments described above, and includes all changes within the range equivalent to the configurations described in the claims.
For example, in the embodiment described above, three types of input images (AGE images) are generated by extracting three types of features from the original image. However, four or more types of features that include those three may be extracted to generate four or more types of input images.
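As a rough illustration of generating three input images from one original image, the sketch below computes, for each pixel, the direction angle of the gradient vector, the norm of the gradient vector, and a binary map of pixels whose value changes sharply, following the definitions given in the claims. It is only a sketch under assumptions: the function name `age_images`, the central-difference gradient, the clamped borders, and the fixed `edge_threshold` on the gradient norm are choices made here, and the embodiment may extract the contour information with a different edge detector.

```python
import math

def age_images(img, edge_threshold=50.0):
    """Build three feature images from a grayscale image given as a 2-D list.

    Returns (angle, norm, edge):
      angle - direction angle of the gradient vector at each pixel (radians)
      norm  - norm of the gradient vector at each pixel
      edge  - 1 where the gradient norm reaches `edge_threshold`, else 0
              (a simple stand-in for contour extraction; assumed here)
    """
    h, w = len(img), len(img[0])
    angle = [[0.0] * w for _ in range(h)]
    norm = [[0.0] * w for _ in range(h)]
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, with neighbor indices clamped at the border.
            gx = (img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
            gy = (img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
            angle[y][x] = math.atan2(gy, gx)
            norm[y][x] = math.hypot(gx, gy)
            edge[y][x] = 1 if norm[y][x] >= edge_threshold else 0
    return angle, norm, edge
```

A fourth input image, as contemplated in the modification above, would simply be one more per-pixel feature map returned alongside these three.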

In the embodiment described above, the neural network is a convolutional neural network (CNN), but it may be a hierarchical neural network of another structure that has no convolutional layer.
In the embodiment described above, the control unit 18 of the advertisement control device 13 may execute facial expression recognition using an algorithm other than a deep CNN, provided that the algorithm can recognize the facial expression of a face image accurately.

1 Image processing device
2 Image generation unit
3 CNN processing unit (processing unit)
4 Learning unit
5 Output unit
7 Sample image
8 Captured image
10 Advertisement management system
11 Advertisement display device
12 Photographing device
13 Advertisement control device (control device)
14 Management device (control device)
16 First communication unit
17 Second communication unit
18 Control unit
19 Storage unit

Claims

A facial expression recognition method for recognizing a facial expression included in a captured image, the method comprising:
a learning step in which a hierarchical neural network learns parameters using, as input data, a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information; and
a recognition step in which a plurality of input images are generated by extracting features related to at least the three types of information from the captured image, and the trained hierarchical neural network is caused to recognize the facial expression included in the captured image using the generated plurality of input images as input data.

The facial expression recognition method according to claim 1, wherein the hierarchical neural network is a convolutional neural network.

The facial expression recognition method according to claim 1 or 2, wherein
the unevenness information is the direction angle of the gradient vector of the pixel value at each pixel point,
the texture information is the norm of the gradient vector of the pixel value at each pixel point, and
the contour information is position information of pixel points at which the pixel value changes sharply.

The facial expression recognition method according to any one of claims 1 to 3, wherein the learning step includes:
a generation step of generating the learning image group by extracting features related to at least the three types of information from a plurality of sample images; and
an update step of updating the parameters of the hierarchical neural network based on the recognition result that the network outputs when the generated learning image group is used as input data.

The facial expression recognition method according to claim 4, wherein the generation step includes a process of applying horizontal reflection to a face image extracted from the sample image.

A facial expression recognition device that recognizes a facial expression included in a captured image, the device comprising:
a processing unit having a hierarchical neural network whose parameters have been learned using, as input data, a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information;
an image generation unit that extracts features related to at least the three types of information from the captured image to generate a plurality of input images, and inputs the generated plurality of input images to the processing unit; and
an output unit that outputs, to the outside as the facial expression of the captured image, the recognition result that the trained hierarchical neural network outputs when the plurality of input images are used as input data.

A computer program for causing a computer device capable of image processing to execute processing for recognizing a facial expression included in a captured image, the computer program including:
a learning step in which a hierarchical neural network learns parameters using, as input data, a learning image group having features related to at least three types of information including facial unevenness information, texture information, and contour information; and
a recognition step in which a plurality of input images are generated by extracting features related to at least the three types of information from the captured image, and the trained hierarchical neural network is caused to recognize the facial expression included in the captured image using the generated plurality of input images as input data.

An advertisement management system comprising:
an advertisement display device;
a photographing device that photographs a viewer of an advertisement image displayed by the advertisement display device; and
a control device having the facial expression recognition device according to claim 6, wherein
the control device executes a recognition process of recognizing the facial expression of the viewer from a captured image of the viewer taken by the photographing device, an aggregation process of aggregating the facial expression recognition results, and a presentation process of presenting the aggregated result to the advertisement manager.