JP2020107244A

JP2020107244A - Posture estimating apparatus, learning apparatus, and program

Info

Publication number: JP2020107244A
Application number: JP2018247875A
Authority: JP
Inventors: 俊枝三須; Toshie Misu; 秀樹三ツ峰; Hideki Mitsumine
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2020-07-09

Abstract

PROBLEM TO BE SOLVED: To estimate the posture of a subject simply and with high accuracy. SOLUTION: An identification unit 10 of a posture estimation device 1 receives an image I. Using a preset parameter p in, for example, a convolutional neural network, it performs identification processing for N discrete angles θ_n and calculates an accuracy value w(θ_n) for each. A weighted synthesis unit 20 performs weighted synthesis of the N angles θ_n, weighting each angle by its accuracy value w(θ_n), to generate continuous posture information θ. SELECTED DRAWING: Figure 1

Description

The present invention relates to a posture estimation device that estimates the posture of a subject from an input image, a learning device that learns the relationship between input images and subject postures, and a program.

Posture estimation devices that estimate the posture of a subject from an input image containing that subject are conventionally known. For example, one disclosed technique estimates face posture by matching templates against the input image (see, for example, Patent Document 1). Specifically, this technique uses templates of facial organs such as the eyes to estimate the head posture from the input image. It then calculates the positions of those organs and fits a head model to determine the rotational and translational displacement of the head.

Another disclosed technique estimates face posture by integrating an identification result based on a color histogram of the input image with an identification result based on a feature other than the color histogram, for example a gradient histogram (see, for example, Patent Document 2).

Patent Document 1: Japanese Patent No. 5016175
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2018-22416

However, the template-matching technique of Patent Document 1 needs a template prepared in advance for each facial organ, such as the eyes, before it can estimate face posture. Preparing a template for every organ of the face is therefore laborious.

The technique of Patent Document 2 estimates face posture with a low processing load, using the color histogram of the input image and, for example, its gradient histogram.

However, an input image may contain information useful for estimating face posture beyond the color and gradient histograms. Phase information in the frequency domain and the appearance of characteristic patterns (tilt, position, size, aspect ratio, and so on) may also be useful, for example, but the technique of Patent Document 2 does not make effective use of such information. Processing that is limited to information such as color histograms therefore estimates face posture with insufficient accuracy.

The present invention has been made to solve the above problems. Its object is to provide a posture estimation device, a learning device, and a program that can estimate the posture of a subject simply and with high accuracy.

To solve the above problems, the posture estimation device according to claim 1 estimates the posture of a subject included in an input image. It comprises an identification unit and a weighted synthesis unit. The identification unit identifies the angle of the subject based on the input image and obtains an accuracy value for each of a plurality of preset angles. The weighted synthesis unit performs weighted synthesis of the plurality of angles to obtain posture information, weighting each angle according to the accuracy value that the identification unit obtained for it.

According to the invention of claim 1, the identification unit only needs to obtain accuracy values for discrete angles, so the circuit scale can be reduced compared with obtaining accuracy values for continuous angles. The weighted synthesis unit then uses the accuracy values for the discrete angles to obtain posture information, which is continuous angle information. Its processing requires only product-sum operations and therefore imposes a low load. Continuous posture information can thus be obtained with a low-load, small-scale circuit.

The posture estimation device according to claim 2 is the posture estimation device according to claim 1, wherein the identification unit is configured by a neural network.

According to the invention of claim 2, a network can be constructed that extracts, from the many features of the input image, those suited to estimating posture. What it extracts depends on the configuration and type of the neural network and on the setting of its parameters, the connection weight coefficients. As a result, estimation accuracy can be improved over conventional methods that estimate posture from specific, predetermined features. In addition, there is no need to explicitly provide a template for each part of the subject, such as each facial organ.

The posture estimation device according to claim 3 is the posture estimation device according to claim 1 or 2, wherein the posture information is the angle of the subject. Alternatively, where the norm of the posture information expressed as a vector or complex value is taken as a reliability, the posture information consists of the angle of the subject and the reliability of that angle.

According to the invention of claim 3, expressing the posture information as the angle of the subject together with a reliability quantifies how reliable the estimated angle is. When other processing uses the posture information obtained by the posture estimation device, posture information with low reliability can be excluded from that processing. This improves reliability across the entire system, including the posture estimation device and any device that uses the posture information.

The learning device according to claim 4 receives, as learning data, an image containing a subject and posture information of the subject. It trains a model using the learning data and optimizes the parameters of the model. The learning device comprises an accuracy generation unit and a learning identification unit. The accuracy generation unit obtains, based on the posture information, a learning accuracy value for each of a plurality of preset angles. The learning identification unit trains the model for identifying the angle of the subject, based on the image and on the learning accuracy values obtained by the accuracy generation unit for the plurality of angles. It then updates the parameters used to estimate the posture of the subject.

According to the invention of claim 4, learning accuracy values for each of the plurality of angles can be obtained from the posture information. The model can then be trained using the image and these learning accuracy values, and optimized parameters can be obtained.

The learning device according to claim 5 is the learning device according to claim 4, wherein the accuracy generation unit calculates the angle formed between the vector of the posture information and the vector of each of the plurality of angles. It then applies to each formed angle a function that is monotonically non-increasing (monotonically decreasing in the broad sense) and non-constant, obtaining the learning accuracy value for each of the plurality of angles.

According to the invention of claim 5, the closer an angle is to the posture information, the larger its learning accuracy value. Using such learning accuracy values, the learning identification unit can update the parameters so that the model estimates a posture at an angle close to the one given by the posture information. Using these parameters in the posture estimation device allows the posture of the subject to be estimated appropriately.
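The claim 5 construction can be sketched as follows. This is a minimal illustration, assuming the embodiment's representative angles θ_n = 2πn/N. It uses a Gaussian f(d) = exp(−d²/(2σ²)) as the monotonically non-increasing, non-constant function of the formed angle d. The Gaussian, the helper names, and the default σ = π/8 are hypothetical choices and do not come from the source.

```python
import math


def angle_between(a, b):
    """Angle in [0, pi] formed by the unit vectors at angles a and b (radians)."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def soft_targets(phi, n_angles, sigma=math.pi / 8):
    """Learning accuracy values for training posture phi (claim 5 style).

    Hypothetical choice: f(d) = exp(-d**2 / (2 * sigma**2)), which is strictly
    decreasing on [0, pi] and therefore non-increasing and non-constant.
    """
    targets = []
    for n in range(n_angles):
        theta_n = 2.0 * math.pi * n / n_angles  # representative angle theta_n
        d = angle_between(phi, theta_n)
        targets.append(math.exp(-d ** 2 / (2.0 * sigma ** 2)))
    return targets


# Posture at 45 degrees with N = 8: the peak falls on theta_1 = pi/4, and
# theta_0 and theta_2 (each pi/4 away) receive equal, smaller values.
print([round(t, 3) for t in soft_targets(math.pi / 4, 8)])
```

Angles nearer the training posture receive larger targets, which is the property the claim relies on. The width σ controls how much credit neighboring angles receive.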

The learning device according to claim 6 is the learning device according to claim 4, wherein the accuracy generation unit calculates the angle formed between the vector of the posture information and the vector of each of the plurality of angles. It sets a predetermined value as the learning accuracy value for the angle at which the formed angle is smallest. For every angle at which the formed angle is not smallest, it sets a value smaller than the predetermined value.

According to the invention of claim 6, the learning accuracy values are binary, so it suffices to perform binary classification learning for each angle to which a learning accuracy value corresponds. As a result, the circuit scale of the learning identification unit can be reduced and learning efficiency can be improved.
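For claim 6, the learning accuracy values reduce to a one-hot vector over the representative angles. This is a minimal sketch that assumes θ_n = 2πn/N, a predetermined value of 1, and a smaller value of 0. The function name and defaults are hypothetical. Ties are resolved to the lowest index, which the source does not specify.

```python
import math


def hard_targets(phi, n_angles, high=1.0, low=0.0):
    """Binary learning accuracy values (claim 6 style) for training posture phi."""
    if not low < high:
        raise ValueError("low must be smaller than the predetermined value high")
    dists = []
    for n in range(n_angles):
        theta_n = 2.0 * math.pi * n / n_angles
        d = (phi - theta_n) % (2.0 * math.pi)
        dists.append(min(d, 2.0 * math.pi - d))  # formed angle, in [0, pi]
    nearest = dists.index(min(dists))  # ties go to the lowest index
    return [high if n == nearest else low for n in range(n_angles)]


# 355 degrees is closest to theta_0 = 0 once the wrap-around at 2*pi is handled.
print(hard_targets(math.radians(355), 8))
```

Each output then serves as an independent two-class target. This is what allows the per-angle binary classification described above.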

The program according to claim 7 causes a computer to function as the posture estimation device according to any one of claims 1 to 3.

The program according to claim 8 causes a computer to function as the learning device according to any one of claims 4 to 6.

As described above, the present invention allows the posture of a subject to be estimated simply and with high accuracy.

FIG. 1 is a block diagram showing an example of the configuration of a posture estimation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the processing of the posture estimation device.
FIG. 3 is a block diagram showing an example of the configuration of the identification unit.
FIG. 4 is a block diagram showing an example of the configuration of a learning device according to an embodiment of the present invention.
FIG. 5 is a flowchart showing an example of the processing of the learning device.
FIG. 6 is a block diagram showing an example of the configuration of the learning identification unit.

Embodiments of the present invention are described in detail below with reference to the drawings. The posture estimation device according to an embodiment of the present invention receives an image I that is subject to posture estimation. Using a preset parameter p, it estimates and outputs posture information θ based on the image I and, as needed, the reliability r of that estimate.

The image I is, for example, an image of a person's head. The posture information θ is, for example, an angle value of at least 0 and less than 2π radians. It is 0 when the person's head directly faces the camera that captured the image I, and a predetermined rotation direction (for example, counterclockwise as viewed from above the person) is taken as positive.

The size, shape, and resolution of the image I are preferably fixed. For example, the image I is a rectangular image of W horizontal pixels by H vertical pixels, where W and H are natural numbers. The image I may be either a color image or a monochrome image.

The posture information θ may be an angle value, or it may be the vector value or complex value (phasor) of a direction vector. When θ is expressed as an angle value, its range may be limited to a predetermined range; for example, 0 ≤ θ < 2π when θ is expressed in radians. When θ is expressed as a vector or complex value, its norm (for example, the Euclidean norm) is 1. The reliability r is, for example, a real value of at least 0 and at most 1.

The posture information θ and the reliability r may also be expressed together as a single posture information θ in the form of a vector or complex value. In this case, the argument of θ corresponds to the posture angle and its norm corresponds to the reliability r.
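The sketch below converts between these representations of θ: the angle, the unit direction vector, and the phasor whose norm carries r. The function names are hypothetical. The argument is mapped into [0, 2π), following the range given above.

```python
import cmath
import math


def angle_to_vector(theta):
    """Unit direction vector (Euclidean norm 1) for angle theta in radians."""
    return (math.cos(theta), math.sin(theta))


def pack(theta, r=1.0):
    """Combine angle theta and reliability r into one complex value r*e^(j*theta)."""
    return r * cmath.exp(1j * theta)


def unpack(z):
    """Recover (theta, r): the argument mapped into [0, 2*pi) and the norm |z|."""
    theta = cmath.phase(z) % (2.0 * math.pi)  # cmath.phase returns [-pi, pi]
    return theta, abs(z)


theta, r = unpack(pack(3 * math.pi / 2, 0.6))
print(round(math.degrees(theta), 6), round(r, 6))
```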

The learning device according to an embodiment of the present invention uses images J_k and posture information φ_k as learning data. It learns the relationship between the images J_k and the posture information φ_k and obtains the optimal parameter p for use in the posture estimation device.

[Posture estimation device]
Next, the posture estimation device according to an embodiment of the present invention is described in detail. FIG. 1 is a block diagram showing an example of its configuration, and FIG. 2 is a flowchart showing an example of its processing. The posture estimation device 1 estimates the posture of a subject included in an input image and includes an identification unit 10 and a weighted synthesis unit 20.

(Identification unit 10)
The identification unit 10 receives the image I that is subject to posture estimation. Using the preset parameter p, it performs identification processing on the subject in the image I for N representative angles θ_n and obtains an accuracy value w(θ_n) for each. N is a natural number of 2 or more, and n is an integer from 0 to N−1. The N representative angles θ_n are set in advance. The identification unit 10 then outputs the accuracy value sequence (w(θ_n))_{n∈{0,1,…,N−1}} to the weighted synthesis unit 20. This sequence consists of the accuracy values w(θ_n) for the N angles θ_n.

That is, the identification unit 10 is a discriminator whose input signal is the image I and whose output signals are the accuracy values w(θ_n) for the N representative angles θ_n. In FIG. 3, described later, the input image I is a color image with three components per pixel, and N = 8.

The representative angles θ_n are set in advance so as to divide 2π radians into N equal parts, for example as in the following equation.

$$\theta_n = \frac{2\pi n}{N}, \qquad n = 0, 1, \ldots, N-1 \qquad (1)$$
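Equation (1), θ_n = 2πn/N for n = 0, 1, …, N−1, can be evaluated directly. With N = 8 it yields the eight angles 0, π/4, …, 7π/4 used in the FIG. 3 example. The function name below is hypothetical.

```python
import math


def representative_angles(n_angles):
    """Representative angles theta_n = 2*pi*n/N dividing 2*pi radians into N equal parts."""
    if n_angles < 2:
        raise ValueError("N must be a natural number of 2 or more")
    return [2.0 * math.pi * n / n_angles for n in range(n_angles)]


print([round(math.degrees(t), 6) for t in representative_angles(8)])
```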

The identification unit 10 is configured by, for example, a neural network. FIG. 3 is a block diagram showing an example of the configuration of the identification unit 10 when it is a convolutional neural network. In that case, the network consists of convolutional layers (convolution units) 11, 12, 13, and 14 and fully connected layers (fully connected units) 15 and 16.

The configuration of the neural network is arbitrary. This includes the number of layers, the number of elements, the activation function, whether convolutional layers are used, the convolution kernel size, and whether fully connected layers are used. It also includes whether striding (subsampling) is used and with what step, whether pooling layers are used and of what type, and whether dropout is used. The identification unit 10 may also be a neural network other than a convolutional neural network.

In FIG. 3, for example, the image I input to the identification unit 10 is a 20×20×3 third-order tensor with W = 20 horizontal pixels, H = 20 vertical pixels, and three color components. For convenience of description, sizes written as (horizontal pixels) × (vertical pixels) × (components) are hereinafter expressed in "pixel components".

Referring to FIGS. 2 and 3, the convolutional layer 11 receives the image I, a third-order tensor of 20×20×3 pixel components (step S201). It applies 12 convolution filters of 3×3×3 pixel components to the image I with a 2×2 stride and performs convolution processing using the preset parameter p. This generates an image T₁, a third-order tensor of 10×10×12 pixel components (step S202).

Convolution processing by the convolutional layers 11, 12, 13, and 14 of a convolutional neural network is well known, so a detailed description is omitted here.

The convolutional layer 11 outputs the image T₁, a third-order tensor of 10×10×12 pixel components, to the convolutional layer 12. The image T₁ has W = 10 horizontal pixels, H = 10 vertical pixels, and 12 components.

The convolutional layer 12 receives the image T₁ (10×10×12 pixel components) from the convolutional layer 11. It applies 24 convolution filters of 3×3×3 pixel components to the image T₁ with a 2×2 stride and performs convolution processing using the preset parameter p. This generates an image T₂, a third-order tensor of 5×5×24 pixel components (step S203).

The convolutional layer 12 outputs the image T₂ (5×5×24 pixel components) to the convolutional layer 13. The image T₂ has W = 5 horizontal pixels, H = 5 vertical pixels, and 24 components.

The convolutional layer 13 receives the image T₂ (5×5×24 pixel components) from the convolutional layer 12. It applies 32 convolution filters of 3×3×3 pixel components to the image T₂ with a 1×1 stride and performs convolution processing using the preset parameter p. This generates an image T₃, a third-order tensor of 3×3×32 pixel components (step S204).

The convolutional layer 13 outputs the image T₃ (3×3×32 pixel components) to the convolutional layer 14. The image T₃ has W = 3 horizontal pixels, H = 3 vertical pixels, and 32 components.

The convolutional layer 14 receives the image T₃ (3×3×32 pixel components) from the convolutional layer 13. It applies 64 convolution filters of 3×3×3 pixel components to the image T₃ with a 1×1 stride and performs convolution processing using the preset parameter p. This generates a third-order tensor of 1×1×64 pixel components, that is, a 64-component vector V₁ (step S205). The convolutional layer 14 outputs V₁ to the fully connected layer 15.

The fully connected layer 15 receives the 64-component vector V₁ from the convolutional layer 14. Using the preset parameter p, it performs fully connected processing that combines all components of V₁ to generate a 16-component vector V₂ (step S206). It then outputs V₂ to the fully connected layer 16. In other words, the fully connected layer 15 is a network that connects every element of the input 64-component vector V₁ to every element of the output 16-component vector V₂.

Fully connected processing by the fully connected layers 15 and 16 of a convolutional neural network is well known, so a detailed description is omitted here.

The fully connected layer 16 receives the 16-component vector V₂ from the fully connected layer 15. Using the preset parameter p, it performs fully connected processing that combines all components of V₂, generating an 8-component vector V₃ whose elements are the accuracy values w(θ_n), n = 0, 1, …, 7 (step S207).

The fully connected layer 16 outputs the 8-component vector V₃ to the weighted synthesis unit 20. V₃ is the accuracy value sequence (w(θ_n))_{n∈{0,1,…,7}}, consisting of the accuracy values w(θ_n) for the eight angles θ_n. In this case, θ₀ = 0·2π/8 = 0, θ₁ = 1·2π/8 = π/4, θ₂ = 2·2π/8 = π/2, …, θ₇ = 7·2π/8 = 7π/4. In other words, the fully connected layer 16 is a network that connects every element of the input 16-component vector V₂ to every element of the output 8-component vector V₃.
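The tensor sizes stated for layers 11 to 16 can be checked with the standard convolution output-size formula ⌊(S + 2P − K)/s⌋ + 1. Here S is the input size, K the kernel size, s the stride, and P the zero padding. The source does not state P. The sketch assumes P = 1 for the two stride-2 layers and P = 0 for the two stride-1 layers, which is one choice that reproduces 20 → 10 → 5 → 3 → 1. The names are hypothetical.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


# (output components, stride, assumed zero padding) for convolutional layers 11-14.
CONV_LAYERS = [(12, 2, 1), (24, 2, 1), (32, 1, 0), (64, 1, 0)]
# Output sizes of fully connected layers 15 and 16.
FC_LAYERS = [16, 8]


def trace_shapes(width=20, height=20, components=3, kernel=3):
    """Return the W x H x components shape after each layer, then the FC vector sizes."""
    shapes = [(width, height, components)]
    for out_components, stride, pad in CONV_LAYERS:
        width = conv_out(width, kernel, stride, pad)
        height = conv_out(height, kernel, stride, pad)
        shapes.append((width, height, out_components))
    # A 1 x 1 x 64 tensor is read as the 64-component vector V1.
    shapes.extend((size,) for size in FC_LAYERS)
    return shapes


for shape in trace_shapes():
    print(shape)
```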

Because the identification unit 10 only needs to obtain accuracy values w(θ_n) for the discrete angles θ_n, its processing is simpler and its circuit scale smaller than when accuracy values are obtained for continuous angles.

A bias value may be set for the elements (neurons) constituting the convolutional layers 11, 12, 13, and 14 and the fully connected layers 15 and 16. The activation function applied to these elements is arbitrary. For example, a half-wave rectification function (ReLU: Rectified Linear Unit), a sigmoid function, or a hyperbolic tangent function may be used.

The parameter p used in the convolutional layers 11, 12, 13, and 14 and the fully connected layers 15 and 16 specifies the identification method of the identification unit 10 in the posture estimation device 1 shown in FIG. 1. When the identification unit 10 is a neural network, p consists of connection weight coefficients such as weights, bias values, and filter coefficients. Values obtained in advance by the learning device 2, described later, are used as p. The parameter p may be stored in a ROM (Read Only Memory) or similar memory in the posture estimation device 1. Alternatively, it may be stored in a RAM (Random Access Memory) or flash ROM so that it can be updated externally.

(Weighted synthesis unit 20)
Returning to FIGS. 1 and 2, the weighted synthesis unit 20 receives the accuracy value sequence (w(θ_n))_{n∈{0,1,…,N−1}} from the identification unit 10. This sequence consists of the accuracy values w(θ_n) for the N angles θ_n (N = 8 in the example of FIG. 3).

The weighted synthesis unit 20 performs weighted synthesis of the N angles θ_n, weighting each according to its accuracy value w(θ_n), to estimate the posture information θ (step S208). It then outputs the posture information θ (step S209).

For example, the weighted synthesis unit 20 uses the accuracy values w(θ_n) as weights to form a weighted sum of the complex numbers with absolute value 1 and argument θ_n. The result of this weighted synthesis is the complex number ζ, as in the following equation.

$$\zeta = \sum_{n=0}^{N-1} w(\theta_n)\, e^{j\theta_n} \qquad (2)$$

Here, j is the imaginary unit.

Instead of the complex number ζ, the calculation of equation (2) may use a two-dimensional vector whose components are the real and imaginary parts of ζ.

The posture information θ and the reliability r may be expressed together as a single posture information θ in vector or complex form. In that case, the weighted synthesis unit 20 outputs the complex number ζ obtained by equation (2) directly as θ, as in the following equation.

$$\theta = \zeta \qquad (3)$$

When the posture information θ and the reliability r are output separately, the weighted synthesis unit 20 performs the calculation using the following equation.

$$\theta = \arg \zeta, \qquad r = \lVert \zeta \rVert \qquad (4)$$

In equation (4), the norm ‖ζ‖ is, for example, the Euclidean norm.

In this way, the weighted synthesis unit 20 estimates the posture information θ as continuous angle information. It does so by weighted synthesis of the angles θ_n, using the accuracy values w(θ_n) for the discrete angles θ_n as weights. Because weighted synthesis consists only of product-sum operations, the computational load is low, and posture information θ that takes continuous values can be estimated with a small-scale circuit.
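The weighted synthesis of equations (2) to (4) is ζ = Σ_{n=0}^{N−1} w(θ_n) e^{jθ_n}, θ = arg ζ, and r = ‖ζ‖. As the following sketch shows, it amounts to a single product-sum over unit phasors. The sketch assumes θ_n = 2πn/N, and the function name is hypothetical. The source does not say how the accuracy values are normalized. If they are non-negative and sum to 1 (for example, a softmax output, which is an assumption here), the triangle inequality guarantees 0 ≤ r ≤ 1.

```python
import cmath
import math


def weighted_synthesis(weights):
    """Combine accuracy values w(theta_n) into (zeta, theta, r).

    zeta  = sum_n w(theta_n) * exp(j * theta_n)   (product-sum, equation (2))
    theta = arg(zeta) mapped into [0, 2*pi)       (equation (4))
    r     = |zeta|, the Euclidean norm            (equation (4))
    """
    n_angles = len(weights)
    zeta = sum(w * cmath.exp(1j * 2.0 * math.pi * n / n_angles)
               for n, w in enumerate(weights))
    theta = cmath.phase(zeta) % (2.0 * math.pi)
    return zeta, theta, abs(zeta)


# Equal weight on theta_1 = 45 deg and theta_2 = 90 deg gives a continuous 67.5 deg.
_, theta, r = weighted_synthesis([0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
print(round(math.degrees(theta), 3), round(r, 3))
```

Spreading the weight evenly over all eight angles drives r toward 0. This flags an unreliable estimate that downstream processing could discard, as described for claim 3.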

As described above, in the posture estimation device 1 according to the embodiment of the present invention, the identification unit 10 receives the image I and, using the preset parameter p (for example, in a convolutional neural network), performs identification over N representative angles θ_n and obtains the accuracy value w(θ_n) for each.

The weighted synthesis unit 20 performs weighted synthesis of the N angles θ_n, weighting each angle by its accuracy value w(θ_n), and generates the posture information θ.

Thereby, the posture information θ can be estimated using the preset parameter p, and since p can be obtained by the learning device 2 described later, there is no need to prepare a template for each facial organ as described in Patent Document 1. In other words, this approach requires less effort than the technique of Patent Document 1.

Further, since the estimation of the posture information θ does not rely on specific features alone, its accuracy can be improved compared with the technique of Patent Document 2, which uses only specific features. Therefore, the posture of the subject can be estimated simply and with high accuracy.

[Learning device]
Next, the learning device according to the embodiment of the present invention will be described in detail. FIG. 4 is a block diagram showing an example of the configuration of the learning device according to the embodiment of the present invention, and FIG. 5 is a flowchart showing an example of the processing of the learning device. The learning device 2 includes an accuracy generation unit 30 and a learning identification unit 40.

The learning device 2 receives K sets of images J_k and posture information φ_k as learning data (step S501). Using this learning data, the learning device 2 learns a model for identifying the angle of the subject contained in the image J_k. The learning device 2 optimizes the parameter p of the model, that is, the parameter p used to estimate the posture of the subject, which defines the operation of the identification unit 10 of the posture estimation device 1 shown in FIG. 1, and outputs the optimized parameter p. The parameter p is set in the identification unit 10 of the posture estimation device 1 shown in FIG. 1. K is a natural number, and k is an integer satisfying 0 ≤ k < K.

(Accuracy generation unit 30)
The accuracy generation unit 30 receives the posture information φ_k of the learning data and, based on φ_k, generates a learning accuracy value t_k(θ_n) for each of the N angles θ_n (step S502). For each piece of posture information φ_k, the accuracy generation unit 30 then outputs the learning accuracy value sequence (t_k(θ_n))_{n∈{0,1,…,N−1}}, consisting of the learning accuracy values for the N angles θ_n, to the learning identification unit 40.

Like the posture information θ shown in FIG. 1, the posture information φ_k is an angle value (for example, in radians).

Specifically, the accuracy generation unit 30 calculates the angle α(φ_k, θ_n) between the vector of (the angle indicated by) the posture information φ_k and the vector of each angle θ_n, and generates the learning accuracy value t_k(θ_n) according to α(φ_k, θ_n). Here, α(φ_k, θ_n) is a function that computes the angle between these two vectors.
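The angle α(φ_k, θ_n) can be sketched as below, assuming planar (single-axis) angles as in the rest of this section. The function names are hypothetical. The first form follows the vector definition literally (arccos of the dot product of unit vectors); the second is an equivalent wrapped-difference form that avoids building the vectors.

```python
import numpy as np

def included_angle(phi, theta):
    """Angle in [0, pi] between the unit vectors at angles phi and theta (radians)."""
    u = np.array([np.cos(phi), np.sin(phi)])
    v = np.array([np.cos(theta), np.sin(theta)])
    # Clip guards against the dot product drifting slightly outside [-1, 1] by rounding
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def included_angle_wrapped(phi, theta):
    """Same quantity via the wrapped difference |angle(exp(j(phi - theta)))|."""
    return np.abs(np.angle(np.exp(1j * (phi - theta))))

# 350 degrees and 10 degrees are 20 degrees apart, not 340
a = included_angle(np.radians(350), np.radians(10))
b = included_angle_wrapped(np.radians(350), np.radians(10))
print(np.degrees(a), np.degrees(b))
```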

For example, as in the following equation, the accuracy generation unit 30 sets the learning accuracy value t_k(θ_n) = A (A is a predetermined real number, for example A = 1) for the angle θ_n at which α(φ_k, θ_n) is smallest, and sets t_k(θ_n) = B (B is a predetermined real number smaller than A, for example B = 0) for every angle θ_n at which α(φ_k, θ_n) is not smallest:

$$t_k(\theta_n) = \begin{cases} A & \text{if } \alpha(\phi_k,\theta_n) = \min_{m\in\{0,1,\dots,N-1\}} \alpha(\phi_k,\theta_m) \\ B & \text{otherwise} \end{cases}$$
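A sketch of this hard target rule, with the example values A = 1 and B = 0, follows. Uniform spacing θ_n = 2πn/N and the function name `hard_targets` are assumptions. If φ_k lies exactly midway between two θ_n, `np.argmin` picks the lower index; this section does not specify tie handling.

```python
import numpy as np

def hard_targets(phi, thetas, A=1.0, B=0.0):
    """Learning accuracy values t(theta_n): A at the nearest angle, B elsewhere."""
    # alpha(phi, theta_n): included angle between the two directions, in [0, pi]
    alpha = np.abs(np.angle(np.exp(1j * (phi - thetas))))
    t = np.full(len(thetas), B, dtype=float)
    t[np.argmin(alpha)] = A
    return t

N = 8
thetas = 2 * np.pi * np.arange(N) / N
print(hard_targets(np.radians(100), thetas))  # 1 at index 2 (90 degrees), 0 elsewhere
```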

Alternatively, the accuracy generation unit 30 may calculate the learning accuracy value t_k(θ_n) by applying a predetermined function f to the angle α(φ_k, θ_n), as in the following equation:

$$t_k(\theta_n) = f\bigl(\alpha(\phi_k,\theta_n)\bigr)$$

The function f is monotonically decreasing in the broad sense (that is, non-increasing) and is not constant. For example, a Gaussian function of α with a positive real constant λ as its parameter is used as f.
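The exact Gaussian is not reproduced in this text. The sketch below assumes f(α) = exp(−α² / (2λ²)), which is non-increasing for α ≥ 0 and non-constant as required. The value λ = 0.3 rad and the function name `soft_targets` are illustrative assumptions.

```python
import numpy as np

def soft_targets(phi, thetas, lam=0.3):
    """Learning accuracy values t(theta_n) = f(alpha), with f assumed Gaussian."""
    alpha = np.abs(np.angle(np.exp(1j * (phi - thetas))))  # included angle in [0, pi]
    return np.exp(-alpha**2 / (2.0 * lam**2))

N = 8
thetas = 2 * np.pi * np.arange(N) / N
t = soft_targets(np.radians(100), thetas)
print(np.round(t, 3))  # largest at 90 degrees, then 135 degrees, then 45 degrees
```

Unlike the hard rule, neighboring angles keep partial credit, so the targets change smoothly as φ_k moves between representative angles.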

In this way, the accuracy generation unit 30 generates learning accuracy values t_k(θ_n) for the discrete angles θ_n from the posture information φ_k, which is continuous angle information. These learning accuracy values t_k(θ_n) therefore correspond to the accuracy values w(θ_n) generated by the identification unit 10 of the posture estimation device 1 shown in FIG. 1, so the learning identification unit 40, which corresponds to the identification unit 10, can use them as training data.

(Learning identification unit 40)
FIG. 6 is a block diagram showing an example of the configuration of the learning identification unit 40. The learning identification unit 40 includes convolutional layers 11, 12, 13, and 14, fully connected layers 15 and 16, and so on.

The learning identification unit 40 receives the image J_k of the learning data and, from the accuracy generation unit 30, the learning accuracy value sequence (t_k(θ_n))_{n∈{0,1,…,N−1}} for the N angles θ_n. It then performs learning corresponding to the identification unit 10 shown in FIG. 1: using the K images J_k and the learning accuracy value sequences (t_k(θ_n))_{n∈{0,1,…,N−1}}, it finds the optimal parameter p that the identification unit 10 should use, and outputs that parameter p.

When the identification unit 10 is configured as a neural network, the learning identification unit 40 is likewise configured as a neural network, with its connection weight coefficients, namely the parameter p, kept in an updatable state.

The learning identification unit 40 receives the image J_k of the learning data. It applies convolution by the convolutional layers 11, 12, 13, and 14 and full connection by the fully connected layers 15 and 16 to the image J_k, and obtains the accuracy value w_k(θ_n) for each of the N angles θ_n (step S503). This yields the accuracy value sequence (w_k(θ_n))_{n∈{0,1,…,N−1}}.
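The forward pass of step S503 can be sketched in NumPy as below. Only the layer count (four convolutional layers followed by two fully connected layers) and the 20 × 20 × 3 input mentioned for FIG. 3 come from this document. The 3 × 3 kernels without padding, channel widths, ReLU activations, softmax output, random weights, N = 8, and the names `conv3x3`, `init_params`, and `forward` are assumptions made so the sketch runs. It needs NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
N = 8  # number of representative angles (assumed)

def conv3x3(x, k, b):
    """Valid 3x3 cross-correlation: x (H, W, C), k (3, 3, C, O) -> (H-2, W-2, O)."""
    win = sliding_window_view(x, (3, 3), axis=(0, 1))  # (H-2, W-2, C, 3, 3)
    return np.einsum("hwcij,ijco->hwo", win, k) + b

def init_params(chans=(3, 8, 8, 16, 16), hidden=64):
    """Random stand-in for the parameter p (learned weights would replace these)."""
    convs = [(rng.normal(0, np.sqrt(2 / (9 * ci)), (3, 3, ci, co)), np.zeros(co))
             for ci, co in zip(chans[:-1], chans[1:])]
    flat = (20 - 2 * 4) ** 2 * chans[-1]  # 12 * 12 * 16 after four valid convolutions
    fc1 = (rng.normal(0, np.sqrt(2 / flat), (flat, hidden)), np.zeros(hidden))
    fc2 = (rng.normal(0, np.sqrt(1 / hidden), (hidden, N)), np.zeros(N))
    return convs, fc1, fc2

def forward(image, p):
    convs, (W1, b1), (W2, b2) = p
    h = image
    for k, b in convs:                              # convolutional layers 11 to 14
        h = np.maximum(conv3x3(h, k, b), 0.0)
    h = np.maximum(h.reshape(-1) @ W1 + b1, 0.0)    # fully connected layer 15
    logits = h @ W2 + b2                            # fully connected layer 16
    e = np.exp(logits - logits.max())               # softmax gives accuracy values
    return e / e.sum()

p = init_params()
w = forward(rng.random((20, 20, 3)), p)
print(w.shape, w.sum())
```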

An error calculation unit (not shown) in the learning identification unit 40 calculates the error between the accuracy value sequence (w_k(θ_n))_{n∈{0,1,…,N−1}} and the learning accuracy value sequence (t_k(θ_n))_{n∈{0,1,…,N−1}}, and takes it as the error value sequence (d_k(θ_n))_{n∈{0,1,…,N−1}}, consisting of the error values d_k(θ_n) for the N angles θ_n (step S504).
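The error formula itself is not reproduced in this text. A common choice, assumed in this sketch, is the per-angle difference d_k(θ_n) = w_k(θ_n) − t_k(θ_n), which is the gradient of the squared error ½ Σ_n (w_k(θ_n) − t_k(θ_n))² with respect to the outputs. The check below confirms that numerically; the function names are hypothetical.

```python
import numpy as np

def error_values(w, t):
    """Error value sequence d(theta_n) = w(theta_n) - t(theta_n) (assumed form)."""
    return np.asarray(w, dtype=float) - np.asarray(t, dtype=float)

def squared_error(w, t):
    return 0.5 * np.sum((np.asarray(w) - np.asarray(t)) ** 2)

w = np.array([0.1, 0.6, 0.2, 0.1])   # network outputs for one image
t = np.array([0.0, 1.0, 0.0, 0.0])   # learning accuracy values for that image
d = error_values(w, t)

# Central finite differences of the squared error should agree with d
eps = 1e-6
num = np.array([(squared_error(w + eps * e, t) - squared_error(w - eps * e, t)) / (2 * eps)
                for e in np.eye(len(w))])
print(d, num)
```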

A backpropagation unit (not shown) in the learning identification unit 40 propagates the error value sequence (d_k(θ_n))_{n∈{0,1,…,N−1}} through the convolutional layers 11, 12, 13, 14 and the fully connected layers 15, 16 in the reverse of the forward order (backpropagation) (step S505). By this error backpropagation, the backpropagation unit updates the parameter p used in each of the convolutional layers 11, 12, 13, 14 and the fully connected layers 15, 16 (step S506).

The learning identification unit 40 determines whether steps S502 to S506 have been completed for all K images J_k and learning accuracy value sequences (t_k(θ_n))_{n∈{0,1,…,N−1}} (step S507). If it determines that processing is not complete (step S507: N), it sets the next index k (step S508) and returns to step S502.

If, on the other hand, the learning identification unit 40 determines in step S507 that processing is complete (step S507: Y), it outputs the parameter p updated in step S506 as the optimal parameter (step S509). The output parameter p is used by the identification unit 10 shown in FIG. 1.

As shown in steps S502 to S508 of FIG. 5, the learning device 2 may execute the error backpropagation processing on the K images J_k and posture information φ_k (learning accuracy value sequences (t_k(θ_n))_{n∈{0,1,…,N−1}}) as appropriate. In this case, the learning device 2 may process all K images J_k and posture information φ_k sequentially, or may randomly select a predetermined number of them out of the K and process those. The learning device 2 may also select one or more of the K samples and group the selected images J_k and posture information φ_k (learning accuracy value sequences) into a so-called mini-batch for processing.
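To make the mini-batch loop of FIG. 5 concrete, the toy below replaces the convolutional network with a single linear layer and softmax output, fed with hand-made features (cos φ, sin φ, 1) instead of images. This stand-in, the hard 1/0 targets, the cross-entropy loss (whose gradient with respect to the logits is exactly w − t for one-hot targets), and all sizes and rates are assumptions chosen so the sketch runs in a fraction of a second. It illustrates the loop mechanics only, not the network of FIG. 6.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, BATCH, LR, STEPS = 8, 2000, 64, 0.5, 2000
thetas = 2 * np.pi * np.arange(N) / N

# Learning data: posture angles phi_k and stand-in "images" x_k
phi = rng.uniform(-np.pi, np.pi, K)
X = np.stack([np.cos(phi), np.sin(phi), np.ones(K)], axis=1)      # (K, 3)

# Accuracy generation (S502): 1 at the nearest representative angle, else 0
alpha = np.abs(np.angle(np.exp(1j * (phi[:, None] - thetas[None, :]))))
T = (alpha == alpha.min(axis=1, keepdims=True)).astype(float)      # (K, N)

def predict(X, P):
    """Softmax outputs w_k(theta_n) of the linear stand-in model."""
    z = X @ P
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P = np.zeros((3, N))                                                # parameter p
for step in range(STEPS):
    idx = rng.choice(K, BATCH, replace=False)                       # random mini-batch
    w_batch = predict(X[idx], P)                                    # S503
    D = w_batch - T[idx]                                            # S504: error values
    P -= LR * X[idx].T @ D / BATCH                                  # S505-S506: update p

# Estimate postures by weighted synthesis of the trained outputs
zeta = predict(X, P) @ np.exp(1j * thetas)
err = np.abs(np.angle(np.exp(1j * (np.angle(zeta) - phi))))
print(f"mean angular error: {np.degrees(err.mean()):.2f} degrees")
```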

In this way, the learning device 2 learns and optimizes the parameter p used by the identification unit 10 shown in FIG. 1, by means of the accuracy generation unit 30 and the learning identification unit 40 that corresponds to the identification unit 10. The identification unit 10 can then be operated with the optimized parameter p.

As described above, in the learning device 2 according to the embodiment of the present invention, the accuracy generation unit 30 calculates the angle α(φ_k, θ_n) between the vector of (the angle indicated by) the posture information φ_k and the vector of each angle θ_n, and generates the learning accuracy value t_k(θ_n) according to α(φ_k, θ_n).

When the learning identification unit 40 is configured as a neural network like the identification unit 10 shown in FIG. 1, it processes the learning image J_k through the convolutional layers 11, 12, 13, 14 and the fully connected layers 15, 16, and obtains the accuracy value w_k(θ_n) for each of the N angles θ_n.

The learning identification unit 40 calculates the error value d_k(θ_n) between the accuracy value w_k(θ_n) and the learning accuracy value t_k(θ_n), backpropagates d_k(θ_n) through the convolutional layers 11, 12, 13, 14 and the fully connected layers 15, 16, and updates the parameter p.

The accuracy generation unit 30 processes the K pieces of posture information φ_k and generates K sets of learning accuracy values t_k(θ_n). The learning identification unit 40 then processes the K images J_k and learning accuracy values t_k(θ_n) and generates the optimal parameter p.

The optimal parameter p generated in this way is used in the posture estimation device 1 shown in FIG. 1 to operate its identification unit 10.

As a result, the posture estimation device 1 can estimate the posture information θ using the parameter p, so there is no need to prepare a template for each facial organ as described in Patent Document 1, and no such effort is required.

Further, since the estimation of the posture information θ does not rely on specific features alone, its accuracy can be improved compared with the technique of Patent Document 2, which uses only specific features.

Therefore, by using the parameter p generated by the learning device 2, the posture estimation device 1 can estimate the posture of the subject simply and with high accuracy.

Although the present invention has been described with reference to an embodiment, the present invention is not limited to that embodiment and can be modified in various ways without departing from its technical idea. For example, in FIG. 3 the identification unit 10 receives an image I of W = 20 horizontal pixels, H = 20 vertical pixels, and three color components, but the present invention does not limit the number of pixels or the number of color components.

Also, in FIG. 3, the identification unit 10 is configured as a neural network. The present invention does not limit the identification unit 10 to a neural network; a component other than a neural network may be used. That is, the identification unit 10 may be any component that receives the image I, uses the parameter p to perform identification over N representative angles θ_n for the subject included in the image I, and obtains and outputs the accuracy value w(θ_n) for each. The same applies to the learning identification unit 40 shown in FIG. 6, which corresponds to the identification unit 10.

A general-purpose computer can be used as the hardware of the posture estimation device 1 and the learning device 2 according to the embodiment of the present invention. Each is configured by a computer including a CPU, a volatile storage medium such as RAM, a nonvolatile storage medium such as ROM, and an interface.

The functions of the identification unit 10 and the weighted synthesis unit 20 of the posture estimation device 1 are each realized by causing the CPU to execute a program describing those functions. Likewise, the functions of the accuracy generation unit 30 and the learning identification unit 40 of the learning device 2 are each realized by causing the CPU to execute a program describing those functions.

These programs are stored in the storage medium and are read and executed by the CPU. They can also be stored and distributed on storage media such as magnetic disks (floppy (registered trademark) disks, hard disks, etc.), optical disks (CD-ROM, DVD, etc.), and semiconductor memory, or transmitted and received over a network.

1 Posture estimation device
2 Learning device
10 Identification unit
11, 12, 13, 14 Convolutional layers (convolution units)
15, 16 Fully connected layers (fully connected units)
20 Weighted synthesis unit
30 Accuracy generation unit
40 Learning identification unit

Claims

A posture estimation device that estimates the posture of a subject included in an input image, the posture estimation device comprising:
an identification unit that identifies the angle of the subject based on the input image and obtains a certainty value corresponding to each of a plurality of preset angles; and
a weighted combination unit that obtains posture information by weighted combination of the plurality of angles, weighting each angle according to the certainty value obtained for it by the identification unit.

The posture estimation device according to claim 1, wherein the identification unit is configured by a neural network.

The posture estimation device according to claim 1, wherein the posture information is the angle of the subject, or, where the norm of the posture information expressed as a vector value or a complex value is taken as a reliability, the posture information is the angle of the subject together with the reliability of that angle.

A learning device that receives, as learning data, an image including a subject and posture information of the subject, learns a model using the learning data, and optimizes parameters of the model, the learning device comprising:
an accuracy generation unit that obtains, based on the posture information, a learning accuracy value corresponding to each of a plurality of preset angles; and
a learning identification unit that, based on the image and the learning accuracy values obtained by the accuracy generation unit for the plurality of angles, learns the model for identifying the angle of the subject and updates the parameters used to estimate the posture of the subject.

The learning device according to claim 4, wherein
the accuracy generation unit calculates the angle between the vector of the posture information and the vector of each of the plurality of angles, and obtains the learning accuracy value corresponding to each of the plurality of angles by applying to that angle a function that is monotonically decreasing in the broad sense and non-constant.

The learning device according to claim 4, wherein
the accuracy generation unit calculates the angle between the vector of the posture information and the vector of each of the plurality of angles, sets a predetermined value as the learning accuracy value for the angle, among the plurality of angles, for which the calculated angle is smallest, and sets a value smaller than the predetermined value as the learning accuracy value for each angle, among the plurality of angles, for which the calculated angle is not smallest.

A program for causing a computer to function as the posture estimation device according to any one of claims 1 to 3.

A program for causing a computer to function as the learning device according to any one of claims 4 to 6.