JP2022109742A

JP2022109742A - Program, device, and method for estimating label by using input intermediate layer different for every area image of object

Info

Publication number: JP2022109742A
Application number: JP2021005217A
Authority: JP
Inventors: 剣明呉; Jiangming Wu; 博楊; Hiroshi Yo; 元服部; Hajime Hattori
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2022-07-28
Anticipated expiration: 2041-01-15
Also published as: JP7474553B2

Abstract

PROBLEM TO BE SOLVED: To provide a program, a device, and a method for analyzing each area image of an object in an object image and estimating a label that comprehensively evaluates the area images. SOLUTION: A program causes a facial expression recognition device (computer) to function as: area division means that divides an object image into area images of different area types; input intermediate layers, one provided for each area type, that receive the area images and output intermediate feature quantities by using a model trained in advance; an intermediate feature quantity fusion layer that fuses the intermediate feature quantities output from the plurality of input intermediate layers and outputs a fused intermediate feature quantity; and an output layer that receives the fused intermediate feature quantity and estimates a label by using the model trained in advance. The intermediate feature quantity fusion layer fuses the plurality of intermediate feature quantities with a weighting that differs for each input intermediate layer. The object image is an image capturing the face of a person who wears a wearable item, and the label is the facial expression of the person. The program further has face area detection means that receives the object image and detects the person's face area image. SELECTED DRAWING: Figure 1

Description

The present invention relates to machine learning engine technology for estimating a comprehensively evaluated label from an image in which multiple objects appear. It is particularly suited to estimating a facial expression from a face image of a person wearing an item such as a mask.

Machine learning engine technology for recognizing people and objects in captured images has been advancing. In particular, the accuracy of face recognition, which identifies a person from a face image, has improved rapidly with the development of deep learning. For example, Facebook announced that DeepFace (registered trademark), its deep-learning-based face recognition technology, reached an accuracy of 97.35% (see, for example, Non-Patent Document 1).
Training the learning model of a machine learning engine requires a large number of teacher images. For example, Affectiva has realized emotion recognition technology using about 7 billion emotion feature values collected from more than 87 countries (see, for example, Non-Patent Document 2).

Conventionally, there is a technique that learns in advance the features of a large number of face images for each emotion and recognizes emotions from a face image (see, for example, Patent Document 1). Specific examples include the Ekman seven-category facial expression model (neutral, joy, disgust, anger, surprise, sadness, and fear) and the three-category emotion model (positive, negative, and neutral).

There is also a technique that provides a recognizer for each of a plurality of recognition modes based on the state of the target person and applies one of those recognizers according to the recognition mode during face recognition (see, for example, Patent Document 2). The state of the target person's face includes whether the person is wearing a mask, glasses, sunglasses, a hat, or the like. In this technique, a recognition mode is selected based on the occluded region of the target person's face, and the recognizer for that mode determines whether authentication succeeds. That is, each recognizer is trained on teacher images with a different occluded region.

Furthermore, there is a face authentication engine "specialized for mask wearing" that extracts and matches feature points around the eyes, which are not covered by the mask (see, for example, Non-Patent Document 3). This technology reportedly achieved an authentication rate of 99.9% or higher in 1:1 authentication of mask wearers.

There is also a facial expression recognition AI (Artificial Intelligence) technology developed by the applicant of the present application (see, for example, Non-Patent Document 4). Face recognition technology is generally used for unlocking upon successful authentication. In contrast, facial expression recognition technology is used for, for example, automatic photo capture triggered by smile detection and marketing surveys of receptivity based on analysis of television viewers' facial expressions.

Patent Document 1: JP 2011-150381 A
Patent Document 2: JP 2018-165983 A

Non-Patent Document 1: Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
Non-Patent Document 2: affectiva, [online], [searched on January 4, 2021], Internet <URL: https://affectiva.jp/reason.html>
Non-Patent Document 3: "NEC Develops Face Recognition Engine Specializing in Mask Wearing--Recognition Rate is 99.9% or More", [online], [searched on January 4, 2021], Internet <URL: https://japan.cnet.com/article/35160036/>
Non-Patent Document 4: "Angle-free facial expression recognition AI", [online], [searched on January 4, 2021], Internet <URL: https://www.kddi-research.jp/newsrelease/2018/080201.html>
Non-Patent Document 5: "Understanding object detection and segmentation with Mask R-CNN", [online], [searched on January 4, 2021], Internet <URL: https://qiita.com/shtmr/items/4283c851bc3d9721ed96>

Since the outbreak of the novel coronavirus infection, it has become common to wear masks and goggles on the face. Such items can cover up to 70% of the face, which has made it difficult to recognize faces and facial expressions adequately. General face recognition algorithms need to capture as many features as possible from a face image, such as the eyes, nose, mouth, cheeks, and facial muscles.
For example, in the technology described in Non-Patent Document 3, a face recognition machine learning engine is trained on a large number of teacher images of faces and expressions with masks or goggles worn.

However, in facial expression recognition, wearing a mask, for example, makes it impossible to extract features from most of the nose and mouth and from most of the cheeks and facial muscles. As a result, the recognition accuracy of facial expressions drops significantly. In addition, unlike face recognition, which authenticates an individual on a 1:1 basis, facial expression recognition cannot build a general-purpose learning model applicable to everyone from feature points around the eyes alone.

In response, the inventors of the present application considered whether a facial expression could be estimated by separately analyzing the face-exposed region and the remaining regions of a person's face image and then evaluating them comprehensively. They concluded that this requires a technique that analyzes each region image of an object in a target image separately and estimates a label that comprehensively evaluates those analyses.

Accordingly, an object of the present invention is to provide a program, a device, and a method that analyze each region image of an object in a target image separately and estimate a label that comprehensively evaluates them.

According to the present invention, a program that causes a computer to estimate a label from a target image causes the computer to function as:
region dividing means for dividing the target image into region images of different region types;
a plurality of input intermediate layers, one provided for each region type, each receiving a region image and outputting an intermediate feature using a pre-trained model;
an intermediate feature fusion layer that fuses the intermediate features output from the plurality of input intermediate layers and outputs a fused intermediate feature; and
an output layer that receives the fused intermediate feature and estimates the label using a pre-trained model.

According to another embodiment of the program of the present invention, it is also preferable that the intermediate feature fusion layer fuses the plurality of intermediate features with a weighting that differs for each input intermediate layer.

According to another embodiment of the program of the present invention, it is also preferable that:
the target image is an image showing the face of a person wearing a wearable item;
the label is the person's facial expression;
the program further causes the computer to function as face region detection means for receiving the target image and detecting a face region image of the person; and
the region dividing means receives the face region image and divides it into a face-exposed region image and a wearable-item region image as the region images of different region types.

According to another embodiment of the program of the present invention, it is also preferable that the wearable item is a mask, glasses, goggles, or sunglasses.

According to another embodiment of the program of the present invention, it is also preferable that, using teacher images to which teacher labels are assigned:
the region dividing means divides each teacher image into region images of different region types;
each input intermediate layer consists of the input layer and intermediate layers of a neural network and has a model trained so that, when each region image based on the teacher image is input, the teacher label is output from the output layer;
the intermediate feature fusion layer derives a weighting that differs for each input intermediate layer, trained so that, when the intermediate features output from the input intermediate layers are input, the teacher label is output from the output layer; and
the output layer has a model trained to receive the fused intermediate feature based on the teacher image and output the teacher label.

According to another embodiment of the program of the present invention, it is also preferable that the region dividing means estimates a region type for each pixel of the input image and detects the region images based on the boundaries (segmentation) between them.

According to the present invention, an estimation device that estimates a label from a target image has:
region dividing means for dividing the target image into region images of different region types;
a plurality of input intermediate layers, one provided for each region type, each receiving a region image and outputting an intermediate feature using a pre-trained model;
an intermediate feature fusion layer that fuses the intermediate features output from the plurality of input intermediate layers and outputs a fused intermediate feature; and
an output layer that receives the fused intermediate feature and estimates the label using a pre-trained model.

According to the present invention, in an estimation method by which a device estimates a label from a target image, the device executes:
a first step of dividing the target image into region images of different region types;
a second step of inputting each region image to the input intermediate layer provided for its region type and outputting an intermediate feature using a pre-trained model;
a third step of fusing the intermediate features output from the plurality of input intermediate layers and outputting a fused intermediate feature; and
a fourth step of receiving the fused intermediate feature and estimating the label using a pre-trained model.
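The four steps of the method can be sketched as a single function that wires the stages together. This is only an illustrative outline: the function name `estimate_label` and all four callables are hypothetical stand-ins, and the patent text does not prescribe any particular programming interface.

```python
from typing import Callable, Dict, List

Vector = List[float]


def estimate_label(
    target_image: object,
    split_regions: Callable[[object], Dict[str, object]],
    input_layers: Dict[str, Callable[[object], Vector]],
    fuse: Callable[[Dict[str, Vector]], Vector],
    output_layer: Callable[[Vector], str],
) -> str:
    """Run the four steps in order; every callable is a hypothetical stand-in."""
    # Step 1: divide the target image into region images keyed by region type.
    regions = split_regions(target_image)
    # Step 2: pass each region image through the layer provided for its type.
    features = {kind: input_layers[kind](image) for kind, image in regions.items()}
    # Step 3: fuse the per-region intermediate features into one feature.
    fused = fuse(features)
    # Step 4: estimate the label from the fused intermediate feature.
    return output_layer(fused)
```

Keeping the per-region layers in a dictionary keyed by region type mirrors the statement that one input intermediate layer is provided for each region type, so a third region type would only add one more entry.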

According to the program, device, and method of the present invention, each region image of an object in a target image can be analyzed separately, and a label that comprehensively evaluates them can be estimated.

FIG. 1 is a functional configuration diagram of the training stage of the facial expression recognition device according to the present invention.
FIG. 2 is an explanatory diagram showing the processing of the face region detection unit and the region division unit.
FIG. 3 is a functional configuration diagram of the estimation stage of the facial expression recognition device according to the present invention.
FIG. 4 is a functional configuration diagram of the basic training stage in the program of the present invention.
FIG. 5 is a functional configuration diagram of the basic estimation stage in the program of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

To recognize facial expressions with high accuracy from, for example, the face image of a person wearing a mask, the inventors of the present application considered that it would be better to use not only the features of the face-exposed region but also the features of the mask-wearing region.
They also considered that the correlation between the expression recognized from the face-exposed region and the expression recognized from the mask-wearing region should be taken into account.
In the face-exposed region, expressions tend to appear around the eyes, which are not covered by the mask, and especially in wrinkles between the eyebrows. In the mask-wearing region, meanwhile, changes in the muscles of the whole face, particularly the nose, mouth, and cheeks, wrinkle and deform the mask itself. The inventors therefore concluded that expressions should be recognized by estimating the features of the face-exposed region and those of the mask-wearing region separately while taking the correlation between the two into account.

In the following, FIGS. 1 to 3 are used to describe a technique suited to recognizing a facial expression from a face image in which an item is worn on the face.
FIGS. 4 and 5 are then used to describe a technique suited to general object recognition, not limited to facial expression recognition.

FIG. 1 is a functional configuration diagram of the training stage of the facial expression recognition device according to the present invention.

The facial expression recognition device 1 estimates an expression (label) from an input face image of a person (target image).
As shown in FIG. 1, in the <training stage> the facial expression recognition device 1 has a teacher image storage unit 2, a face region detection unit 10, a region division unit 11, a face-exposed region input intermediate layer (first input intermediate layer) 121, a mask-wearing region input intermediate layer (second input intermediate layer) 122, an intermediate feature fusion layer 13, and an output layer 14. These functional components are realized by executing a program that causes a computer mounted in the facial expression recognition device 1 to function. The processing flow of these functional components can also be understood as an estimation method of the device.

[Teacher image storage unit 2]
The teacher image storage unit 2 stores in advance a large amount of teacher data, each item associating a face image showing a person's face (teacher image) with that person's expression label (teacher label).
face image (teacher image) <-> expression label (teacher label)
The face image is assumed to show a person wearing an item such as a mask on the face. Depending on the application, the worn item may of course be glasses, goggles, or sunglasses.
The expression labels may be, for example, positive, negative, and neutral. Of course, four or more expressions may be provided.

The teacher image storage unit 2 outputs the teacher data one set of face image and expression label at a time. The face image is input to the face region detection unit 10, and the learning model of each functional component is trained so that the expression label is output from the output layer 14. Specifically, using the face images of the teacher data, the face-exposed region input intermediate layer 121, the mask-wearing region input intermediate layer 122, the intermediate feature fusion layer 13, and the output layer 14 are trained so that the output layer 14 outputs the corresponding expression label.
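As a rough illustration of the teacher data, each record can be pictured as an image reference paired with one label from a fixed label set. The class name `TeacherSample`, the `image_path` field, and the helper `label_index` are hypothetical and are not taken from the patent; only the three example labels come from the text.

```python
from dataclasses import dataclass

# Three-class label set mentioned in the text; four or more classes are also allowed.
EXPRESSION_LABELS = ("positive", "negative", "neutral")


@dataclass(frozen=True)
class TeacherSample:
    """One teacher-data record: a face image paired with its expression label."""

    image_path: str  # hypothetical field: where the face image is stored
    label: str       # teacher label, one of EXPRESSION_LABELS

    def __post_init__(self) -> None:
        if self.label not in EXPRESSION_LABELS:
            raise ValueError(f"unknown expression label: {self.label!r}")


def label_index(sample: TeacherSample) -> int:
    """Map the teacher label to the class index an output layer would predict."""
    return EXPRESSION_LABELS.index(sample.label)
```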

FIG. 2 is an explanatory diagram showing the processing of the face region detection unit and the region division unit.

[Face region detection unit 10]
The face region detection unit 10 detects a person's face region (for example, a bounding box) from the input face image. The face image input in the <training stage> is a teacher image.
The detected face region is output to the region division unit 11.

Specifically, R-CNN (Regions with Convolutional Neural Networks) or SSD (Single Shot Multibox Detector) is used for the face region detection unit 10.
R-CNN combines rectangular face regions with convolutional neural network features to detect a subset of face regions (region proposals). CNN features are then extracted from the region proposals. Finally, the bounding boxes of the region proposals are adjusted by a support vector machine trained in advance on the CNN features.
SSD is a general object detection algorithm based on machine learning that determines rectangular bounding boxes called default boxes. A large number of default boxes of different sizes are superimposed on a single image, and a prediction is computed for each box. For each default box, the algorithm predicts how far the box is from the object and how much it differs in size.
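The idea that each default box predicts "how far away" and "how different in size" it is from the object can be made concrete with the center-and-size offset encoding commonly used with SSD-style detectors. The sketch below assumes boxes given as (center x, center y, width, height) and omits the variance scaling that many implementations add; the function names are illustrative and not part of the patent.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


def encode_offsets(default_box: Box, target_box: Box) -> Box:
    """Express target_box relative to default_box (position shift, log size ratio)."""
    dcx, dcy, dw, dh = default_box
    tcx, tcy, tw, th = target_box
    return (
        (tcx - dcx) / dw,   # horizontal distance in units of the default width
        (tcy - dcy) / dh,   # vertical distance in units of the default height
        math.log(tw / dw),  # log ratio of widths
        math.log(th / dh),  # log ratio of heights
    )


def decode_offsets(default_box: Box, offsets: Box) -> Box:
    """Invert encode_offsets: recover the predicted box from a default box."""
    dcx, dcy, dw, dh = default_box
    ox, oy, ow, oh = offsets
    return (dcx + ox * dw, dcy + oy * dh, dw * math.exp(ow), dh * math.exp(oh))
```

A default box that coincides with the face yields all-zero offsets, so the magnitude of the offsets directly reflects the distance and size mismatch described above.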

[Region division unit 11]
The region division unit 11 receives the face region image detected by the face region detection unit 10 and divides it into region images of different region types.
The region division unit 11 estimates a region type for each pixel of the input image and detects the region images based on the boundaries (segmentation) between them. Here, the region images of different region types are specifically as follows.
First region type: face-exposed region image
Second region type: mask-wearing region image (wearable-item region image)
The face-exposed region image is then output to the face-exposed region input intermediate layer 121, and the mask-wearing region image is output to the mask-wearing region input intermediate layer 122.
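Given a per-pixel region type, the split into two region images can be sketched as follows. The patent does not state how a region image is represented, so this example assumes that pixels outside the region are simply blanked out; the class ids and function names are hypothetical, and a small grayscale grid stands in for a real image.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, for brevity

BACKGROUND, FACE_EXPOSED, MASK_WORN = 0, 1, 2  # hypothetical class ids


def keep_region(image: Image, class_map: Image, region_type: int, fill: int = 0) -> Image:
    """Keep pixels whose class equals region_type and blank out the rest."""
    return [
        [pixel if cls == region_type else fill for pixel, cls in zip(img_row, cls_row)]
        for img_row, cls_row in zip(image, class_map)
    ]


def split_regions(image: Image, class_map: Image) -> Tuple[Image, Image]:
    """Return (face-exposed region image, mask-wearing region image)."""
    return (
        keep_region(image, class_map, FACE_EXPOSED),
        keep_region(image, class_map, MASK_WORN),
    )
```

Both outputs keep the original image size, which is one simple way to let each input intermediate layer receive a fixed-shape input.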

The region division unit 11 is a machine learning engine trained in advance on a large dataset of object images and classes.
Specifically, existing technologies such as Mask R-CNN (registered trademark), YOLACT (registered trademark), and BlendMAS (registered trademark) can be applied. Mask R-CNN classifies each pixel and detects class-based boundary regions across the whole image. It first detects a large number of "object-like regions" in the image. From these, it narrows down to regions whose "face-likeness" is at or above a threshold and regions whose "mask-likeness" is at or above a threshold, finally yielding the "face-exposed region" and the "mask-wearing region".
The network structure of Mask R-CNN is, for example, an improvement based on Faster R-CNN (see, for example, Non-Patent Document 5).
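The narrowing-down step, from many object-like candidates to one face-exposed region and one mask-wearing region, can be illustrated with simple score thresholding. This sketch does not use any real Mask R-CNN API: the `Candidate` structure, the score keys "face" and "mask", and the default thresholds of 0.5 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical object-like region with a per-class likeness score."""

    name: str
    scores: Dict[str, float]  # e.g. {"face": 0.95, "mask": 0.02}


def best_region(candidates: List[Candidate], kind: str, threshold: float) -> Optional[Candidate]:
    """Return the highest-scoring candidate for `kind` at or above the threshold."""
    kept = [c for c in candidates if c.scores.get(kind, 0.0) >= threshold]
    return max(kept, key=lambda c: c.scores[kind], default=None)


def select_regions(
    candidates: List[Candidate],
    face_threshold: float = 0.5,
    mask_threshold: float = 0.5,
) -> Tuple[Optional[Candidate], Optional[Candidate]]:
    """Pick the face-exposed and mask-wearing regions from all candidates."""
    return (
        best_region(candidates, "face", face_threshold),
        best_region(candidates, "mask", mask_threshold),
    )
```

Returning `None` when no candidate clears a threshold covers target images in which no mask is worn.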

[Face-exposed region input intermediate layer 121 and mask-wearing region input intermediate layer 122]
The input intermediate layer 12 consists of an "input layer" and "intermediate layers" and, together with the output layer, is based on a general neural network. In particular, the intermediate layers are trained repeatedly to learn which parts of an image to treat as features so that the correct teacher label is obtained from the output layer.

According to the present invention, the input intermediate layer 12 has a model trained so that, when each region image based on a teacher image is input, the teacher label is output from the output layer 14.
The input intermediate layer 12 functions to extract region features from the input image. In particular, visualizing (as a heat map) the (N−1)-th layer, the final stage of the intermediate layers, shows which features of each region image have been recognized.

A plurality of input intermediate layers 12 are provided in advance, one for each region type. In FIG. 1, the region division unit 11 divides the image into two region images, and each region image is input to its own input intermediate layer 12. Here, the following two are provided.
Face-exposed region input intermediate layer 121, trained on images of the face-exposed region type
Mask-wearing region input intermediate layer 122, trained on images of the mask-wearing region type
Of course, the region division unit 11 may divide the image into three region images, with three input intermediate layers provided.
The face-exposed region input intermediate layer 121 and the mask-wearing region input intermediate layer 122 each output their intermediate features to the intermediate feature fusion layer 13.

[Intermediate feature fusion layer 13]
The intermediate feature fusion layer 13 fuses the intermediate features output from the plurality of input intermediate layers 12 and outputs the fused intermediate feature to the output layer 14.
In doing so, the intermediate feature fusion layer 13 derives a different weight for each input intermediate layer 12 so that the teacher label is output from the output layer 14.

The weights of the intermediate feature fusion layer 13 quantify the importance of the features from each input intermediate layer 12 as seen from the output layer 14 of the neural network.
In a general neural network, each neuron in a later stage trains a weight for each of the neurons in the preceding stage. Such a weight represents the strength of a synaptic connection, which determines how easily information passes between two neurons.
In contrast, the intermediate feature fusion layer 13 of the present invention trains weights per input intermediate layer 12 in the preceding stage. The weights of the present invention represent the strength of the connection between the output layer 14 and each of the face-exposed region input intermediate layer 121 and the mask-wearing region input intermediate layer 122.
Weight for the face-exposed region input intermediate layer 121: β
Weight for the mask-wearing region input intermediate layer 122: 1−β
When recognizing facial expressions, β > (1−β) is expected. For example, a weight of β = 0.9 may be given to expression recognition from the face-exposed region and a weight of 1−β = 0.1 to expression estimation from the mask-wearing region.
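One simple way to read the β and 1−β weighting is as a convex combination of the two intermediate feature vectors. The patent does not specify the exact fusion operation, so the sketch below assumes equal-length feature vectors that are summed element by element after weighting; the function name `fuse_features` is hypothetical.

```python
from typing import List, Sequence


def fuse_features(
    face_features: Sequence[float],
    mask_features: Sequence[float],
    beta: float = 0.9,
) -> List[float]:
    """Weight the face-exposed features by beta and the mask features by 1 - beta."""
    if len(face_features) != len(mask_features):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [
        beta * f + (1.0 - beta) * m
        for f, m in zip(face_features, mask_features)
    ]
```

With β = 0.9, a feature that is active only in the face-exposed region keeps 90% of its magnitude in the fused vector, while one active only in the mask-wearing region keeps 10%.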

In the present invention, the ease of information transfer β between the first input intermediate layer and the output layer and the ease of information transfer 1−β between the second input intermediate layer and the output layer are trained based on the accuracy of the label estimated by the output layer.
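How a single weight β might be learned from the output error can be shown with a toy example. The patent states only that β is trained from the accuracy of the estimated label; the scalar features, the squared-error loss, the sigmoid parameterization that keeps β between 0 and 1, and the plain gradient descent loop below are all assumptions chosen to keep the sketch small.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (face feature, mask feature, target value)


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def train_beta(samples: List[Sample], steps: int = 2000, lr: float = 1.0) -> float:
    """Fit beta in (0, 1) by gradient descent on mean squared error."""
    a = 0.0  # unconstrained parameter; beta = sigmoid(a) starts at 0.5
    for _ in range(steps):
        beta = sigmoid(a)
        # d(loss)/d(beta) for the prediction beta * f + (1 - beta) * m
        grad_beta = sum(
            2.0 * (beta * f + (1.0 - beta) * m - y) * (f - m) for f, m, y in samples
        ) / len(samples)
        # Chain rule: d(beta)/d(a) = beta * (1 - beta)
        a -= lr * grad_beta * beta * (1.0 - beta)
    return sigmoid(a)
```

When the targets are generated with a true weight of 0.9 on the face feature, the learned β moves from its starting value of 0.5 toward 0.9, which matches the expectation that β > 1−β for expression recognition.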

[Output layer 14]
The output layer 14 has a model trained to receive the fused intermediate feature based on a teacher image and output the teacher label. It has the same function as the output layer of a general neural network.
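Since the output layer is described as that of a general neural network, a common realization is a linear layer followed by softmax over the expression labels. The sketch below assumes that form; the weights and biases are placeholder parameters that would come from training, and the function names are illustrative rather than taken from the patent.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(
    fused: Sequence[float],
    weights: Sequence[Sequence[float]],  # one weight row per label
    biases: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Return the most probable label and the full probability vector."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs
```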

FIG. 3 is a functional configuration diagram of the estimation stage of the facial expression recognition device according to the present invention.

The configuration in FIG. 3 is basically the same as the functional configuration in FIG. 1.
In the <estimation stage>, the facial expression recognition apparatus 1 can estimate a facial expression from a target image in which a person's face appears. The target image is assumed to be a face image of a person wearing a mask as the wearable object, but face images without a wearable object may of course be mixed in.
The face area detection unit 10 receives the target image and detects the person's face area image.
The region dividing unit 11 divides the person's face area image in the target image into a face-exposed region image and a mask-wearing region image as region images of different region types. It then outputs the face-exposed region image to the face-exposed region input intermediate layer 121 and the mask-wearing region image to the mask-wearing region input intermediate layer 122.
The face-exposed region input intermediate layer 121 and the mask-wearing region input intermediate layer 122 each receive a region image and output an intermediate feature amount using a pre-trained model. The intermediate feature amount of the face-exposed region image and that of the mask-wearing region image are thus extracted separately.
The intermediate feature amount fusion layer 13 fuses the intermediate feature amounts output from the plurality of input intermediate layers and outputs a fused intermediate feature amount. This allows fusion that preserves the tendencies of the intermediate feature amounts of both the face-exposed region (for example, wrinkles between the eyebrows) and the mask-wearing region (for example, wrinkles in the mask caused by changes in the facial muscles). In particular, by making the weight β of the intermediate feature amount of the face-exposed region heavier than the weight 1−β of the intermediate feature amount of the mask-wearing region, the result of expression recognition from the face-exposed region can be strongly reflected.
Finally, the output layer 14 receives the fused intermediate feature amount and estimates the expression label using a pre-trained model.
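The order of processing in the estimation stage can be sketched as a simple chain of callables. Everything below is illustrative: the function `estimate_expression`, the toy stand-ins for each component, the split of the face into an upper and a lower half, and the labels "smile" and "neutral" are assumptions, not details taken from this description.

```python
def estimate_expression(target_image, detect_face, split_regions,
                        face_layer, mask_layer, fuse, output_layer):
    """Wire the estimation-stage components together in the order of FIG. 3."""
    face_region = detect_face(target_image)        # face area detection unit 10
    exposed, masked = split_regions(face_region)   # region dividing unit 11
    face_features = face_layer(exposed)            # input intermediate layer 121
    mask_features = mask_layer(masked)             # input intermediate layer 122
    fused = fuse(face_features, mask_features)     # feature amount fusion layer 13
    return output_layer(fused)                     # output layer 14


def mean_pixel(rows):
    """Toy feature extractor: the mean brightness as a one-element vector."""
    values = [v for row in rows for v in row]
    return [sum(values) / len(values)]


# A 4x2 toy "image"; the upper half stands in for the exposed face region.
image = [[0.9, 0.8], [0.7, 0.8], [0.1, 0.2], [0.2, 0.1]]
label = estimate_expression(
    image,
    detect_face=lambda img: img,
    split_regions=lambda face: (face[: len(face) // 2], face[len(face) // 2:]),
    face_layer=mean_pixel,
    mask_layer=mean_pixel,
    fuse=lambda f, m: [0.9 * a + 0.1 * b for a, b in zip(f, m)],
    output_layer=lambda fused: "smile" if fused[0] > 0.5 else "neutral",
)
```

Passing each component as a callable keeps the two input intermediate layers independent, so either one could be replaced by a differently trained model without touching the rest of the chain.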

In another embodiment of the present invention, the region dividing unit 11 may divide the region image by mask type, for example cloth mask, nonwoven mask, flat mask, pleated mask, or three-dimensional mask. In that case, a mask-wearing region input intermediate layer 122 is provided for each mask type. In the training stage, the input intermediate layer corresponding to the mask worn on the face in the teacher image is trained. In the estimation stage, the input intermediate layer corresponding to the mask worn on the face in the target image performs the estimation.
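Selecting an input intermediate layer per mask type amounts to a lookup keyed by the mask type. The sketch below assumes the mask type has already been determined by some earlier step, which this description does not specify. The class `MaskTypeRouter` and the placeholder extractors are hypothetical.

```python
class MaskTypeRouter:
    """Select the mask-wearing region input layer that matches the mask type."""

    def __init__(self, layers_by_type):
        # Maps a mask type name to the callable acting as its input layer.
        self._layers = dict(layers_by_type)

    def extract(self, mask_type, region_image):
        if mask_type not in self._layers:
            raise KeyError(f"no input layer registered for mask type {mask_type!r}")
        return self._layers[mask_type](region_image)


# Placeholder extractors standing in for separately trained models.
router = MaskTypeRouter({
    "cloth": lambda image: ["cloth-features", len(image)],
    "pleated": lambda image: ["pleated-features", len(image)],
})
features = router.extract("pleated", [[0, 1], [1, 0]])
```

The same lookup would serve both stages: in training it decides which layer receives the teacher image's region, and in estimation it decides which layer produces the intermediate feature amount.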

FIG. 4 is a functional configuration diagram of the basic training stage in the program of the present invention.
FIG. 5 is a functional configuration diagram of the basic estimation stage in the program of the present invention.

FIGS. 4 and 5 are basic functional configuration diagrams that are not limited to recognizing facial expressions from a person's face image.
The region dividing unit 11 divides the target image into region images of different region types.
A plurality of input intermediate layers 12 are provided, one for each region type; each receives a region image and outputs an intermediate feature amount using a pre-trained model.
The intermediate feature amount fusion layer 13 fuses the intermediate feature amounts output from the plurality of input intermediate layers 12 and outputs a fused intermediate feature amount. Here, it is also preferable that a different weight is given to each input intermediate layer 12.
The output layer 14 receives the fused intermediate feature amount and estimates a label using a pre-trained model.
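For the general configuration, fusion with a different weight per input intermediate layer can be sketched for any number of region types. The sketch assumes that every region type yields a vector of the same length and that the weights are normalized to sum to 1; neither point is stated in this description. The function name `fuse_by_region` and the region type names are hypothetical.

```python
def fuse_by_region(features_by_type, weights_by_type):
    """Weighted fusion of intermediate feature vectors over region types.

    Assumptions for illustration: all vectors have the same length, and the
    weights are normalized here so that they sum to 1.
    """
    if not features_by_type:
        raise ValueError("at least one region type is required")
    lengths = {len(vector) for vector in features_by_type.values()}
    if len(lengths) != 1:
        raise ValueError("all feature vectors must have the same length")
    total = sum(weights_by_type[t] for t in features_by_type)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * lengths.pop()
    for region_type, vector in features_by_type.items():
        weight = weights_by_type[region_type] / total
        for i, value in enumerate(vector):
            fused[i] += weight * value
    return fused


fused_general = fuse_by_region(
    {"type_a": [1.0, 0.0, 0.0], "type_b": [0.0, 1.0, 0.0], "type_c": [0.0, 0.0, 1.0]},
    {"type_a": 2.0, "type_b": 1.0, "type_c": 1.0},
)
```

Here `type_a` receives half of the total weight and the other two a quarter each, so the two-region case with β and 1−β is the special case of two entries.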

The functional configurations in FIGS. 4 and 5 can be used for various purposes. For example, for an image captured by an indoor camera, extracting intermediate feature amounts separately from the region image of a person and the region image of furniture may make it possible to estimate a label that comprehensively evaluates the entire room. Likewise, for an image captured by an in-vehicle camera, extracting intermediate feature amounts separately from the region image of the road and the region image of the roadside may make it possible to estimate a label that comprehensively evaluates the overall traffic situation.

As described in detail above, according to the program, apparatus, and method of the present invention, each region image of an object in the target image can be analyzed separately, and a label that comprehensively evaluates them can be estimated.
In particular, the present invention can be applied to facial expression recognition, and can estimate the facial expression even from a face image of a person wearing a wearable object on the face.

As a result, even during the COVID-19 pandemic, "the facial expression of a user wearing a mask on the face can be estimated without the user removing the mask," which makes it possible to contribute to Goal 3 of the United Nations-led Sustainable Development Goals (SDGs), "Ensure healthy lives and promote well-being for all at all ages."

Various changes, modifications, and omissions to the embodiments of the present invention described above, within the scope of its technical idea and viewpoint, can readily be made by those skilled in the art. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

REFERENCE SIGNS LIST
1 facial expression recognition apparatus
10 face area detection unit
11 region dividing unit
12 input intermediate layer
121 face-exposed region input intermediate layer, first input intermediate layer
122 mask-wearing region input intermediate layer, second input intermediate layer
13 intermediate feature amount fusion layer
14 output layer
2 teacher image storage unit

Claims

1. A program that causes a computer to estimate a label from a target image, the program causing the computer to function as:
area dividing means for dividing the target image into area images of different area types;
a plurality of input intermediate layers, one provided for each area type, each receiving an area image and outputting an intermediate feature amount using a pre-trained model;
an intermediate feature amount fusion layer that fuses the intermediate feature amounts output from the plurality of input intermediate layers and outputs a fused intermediate feature amount; and
an output layer that receives the fused intermediate feature amount and estimates the label using a pre-trained model.

2. The program according to claim 1, wherein the intermediate feature amount fusion layer fuses the plurality of intermediate feature amounts with a different weight for each input intermediate layer.

3. The program according to claim 1 or 2, wherein
the target image is an image showing the face of a person wearing a wearable object,
the label is the person's facial expression,
the program further causes the computer to function as face area detection means for receiving the target image and detecting the person's face area image, and
the area dividing means receives the person's face area image and divides it into a face-exposed region image and a wearable object region image as area images of different area types.

4. The program according to claim 3, wherein the wearable object is a mask, glasses, goggles, or sunglasses.

5. The program according to any one of claims 1 to 4, wherein, using a teacher image to which a teacher label is assigned,
the area dividing means divides the teacher image into area images of different area types,
each input intermediate layer consists of an input layer and an intermediate layer of a neural network and has a model trained so that the area image based on the teacher image is input and the teacher label is output from the output layer,
the intermediate feature amount fusion layer is trained so that the intermediate feature amounts output from the input intermediate layers are input and the teacher label is output from the output layer, thereby deriving a different weight for each input intermediate layer, and
the output layer has a model trained so that the fused intermediate feature amount based on the teacher image is input and the teacher label is output.

6. The program according to any one of claims 1 to 3, wherein the area dividing means estimates the area type for each pixel of the input image and detects each area image based on the boundary lines of the area images (segmentation).

7. An estimation device that estimates a label from a target image, comprising:
area dividing means for dividing the target image into area images of different area types;
a plurality of input intermediate layers, one provided for each area type, each receiving an area image and outputting an intermediate feature amount using a pre-trained model;
an intermediate feature amount fusion layer that fuses the intermediate feature amounts output from the plurality of input intermediate layers and outputs a fused intermediate feature amount; and
an output layer that receives the fused intermediate feature amount and estimates the label using a pre-trained model.

8. An estimation method for a device that estimates a label from a target image, wherein
the device executes:
a first step of dividing the target image into area images of different area types;
a second step of inputting each area image to the input intermediate layer provided for its area type and outputting an intermediate feature amount using a pre-trained model;
a third step of fusing the intermediate feature amounts output from the plurality of input intermediate layers and outputting a fused intermediate feature amount; and
a fourth step of inputting the fused intermediate feature amount and estimating the label using a pre-trained model.