JP2020135637A

JP2020135637A - Attitude estimation device, learning device, method, and program

Info

Publication number: JP2020135637A
Application number: JP2019030703A
Authority: JP
Inventors: 豪入江; Takeshi Irie; 正志西山; Masashi Nishiyama; 儀雄岩井; Yoshio Iwai; 高貴上野; Koki Ueno
Original assignee: Nippon Telegraph and Telephone Corp; Tottori University NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tottori University NUC
Priority date: 2019-02-22
Filing date: 2019-02-22
Publication date: 2020-08-31
Anticipated expiration: 2039-02-22
Also published as: JP7156643B2

Abstract

To provide an attitude estimation device capable of accurately estimating the posture of a subject captured in an image based only on learning data that can be collected at low cost. SOLUTION: An attitude estimation device 150 estimates the posture, relative to a reference posture, of a predetermined subject captured in a first image. The attitude estimation device 150 has an estimation unit 64 that estimates the relative posture by performing a predetermined process on the first image. The predetermined process is based on part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject is captured in the reference posture. SELECTED DRAWING: Figure 6

Description

The present invention relates to a posture estimation device, a learning device, a method, and a program, and more particularly to a posture estimation device, a learning device, a method, and a program for estimating the posture of a subject captured in an image.

Image recognition technology has made remarkable progress. In addition to further advances in technologies already in practical use, such as personal authentication by face or fingerprint and the recognition and tracking of people and faces, character recognition and object recognition technologies that run on general-purpose devices such as smartphones have become widespread and are now being used in new services such as O2O services and navigation services.

One application that has attracted particular attention recently is use as the "eyes" of a robot. In manufacturing, factory automation using robots equipped with image recognition functions has long been underway. With recent progress in robotics and AI technology, deployment is now expected in fields that demand more advanced recognition, such as conveyance and inventory management at retail and logistics sites, and shipping and haulage.

A typical image recognition technique assigns a name label (hereinafter simply called a label) to an image itself or to a subject appearing in it. For example, when an image showing an apple is input, the desirable behavior is either to output the label "apple" or to assign the label "apple" to the region of the image in which the apple appears, that is, to the corresponding set of pixels.

However, for the image recognition technology with which such robots can be equipped, outputting a label in this way is often insufficient. As an example of robot use at a retailer, consider a scene in which a product on a storage shelf is grasped, carried, and moved to a separate display shelf. To complete this task, the robot must, roughly speaking, be able to (1) identify the product to be moved from among the various products on the storage shelf, (2) grasp the product, (3) move and carry it to the target display shelf, and (4) place it so that the desired layout is obtained. The image recognition technology must therefore be able not only to recognize the storage shelf, the product, and the display shelf but also, particularly for (2) and (4), to accurately recognize the posture of the object, that is, its position, angle, and size, in order to grasp and display it. Typical image recognition techniques such as those described above have no function for estimating the posture of an object, so a separate technique for estimating object posture is required.

In view of this problem, image recognition techniques for estimating the posture of an object have been invented and disclosed.

Non-Patent Document 1 discloses a posture estimation method based on Scale Invariant Feature Transform (SIFT) features and the generalized Hough transform. This method analyzes the luminance values of an image to extract many partial regions that show a pronounced luminance change, and represents the luminance change of each partial region as a feature vector that is invariant to scale and rotation (the SIFT feature). Next, for the partial regions contained in two different images of the same object, the Euclidean distance between SIFT features is measured, and pairs of partial regions in the two images for which this distance is small are taken as correspondence candidates. Then, on the assumption that, for partial regions obtained from the same object, the change in posture between corresponding partial regions on the object is consistent regardless of the shooting viewpoint, the "deviation" in position, posture, and size between the candidate partial regions is computed. Under the assumption that these deviations are consistent for the set of corresponding partial regions obtained from the same object, a histogram of the deviations is expected to concentrate in a very small number of bins. Therefore, only the correspondence candidates that fall in high-frequency bins are regarded as truly valid correspondences, and the deviation corresponding to those bins is estimated as the posture parameters. This posture represents the relative change in posture between the objects shown in the two compared images, so if the posture of one object is taken as the reference posture, the posture of the other can be determined.
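The histogram voting step described for Non-Patent Document 1 can be illustrated with a minimal sketch. It assumes that the SIFT matching has already produced one deviation vector (dx, dy, log scale ratio, rotation in degrees) per correspondence candidate; the function name vote_pose and the bin widths are hypothetical, and the sketch simplifies the published method by voting into a single bin per candidate and ignoring angle wrap-around.

```python
from collections import Counter

import numpy as np


def vote_pose(deviations, bin_widths):
    """Keep the candidates in the most populated histogram bin.

    deviations: (N, 4) array-like of (dx, dy, log_scale, rotation_deg),
                one row per correspondence candidate (assumed given).
    bin_widths: width of the histogram bin along each of the 4 axes
                (hypothetical values chosen by the caller).
    Returns the mean deviation of the winning bin and a boolean inlier mask.
    """
    deviations = np.asarray(deviations, dtype=float)
    widths = np.asarray(bin_widths, dtype=float)
    # Quantize every deviation vector into an integer bin index.
    bins = np.floor(deviations / widths).astype(int)
    keys = [tuple(b) for b in bins]
    # The bin with the highest count is treated as the consistent pose.
    best_bin, _ = Counter(keys).most_common(1)[0]
    inliers = np.array([k == best_bin for k in keys])
    return deviations[inliers].mean(axis=0), inliers
```

Candidates outside the winning bin are discarded as invalid correspondences, and the averaged deviation plays the role of the estimated posture parameters.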

Patent Document 1 discloses a technique that improves on Non-Patent Document 1. As in Non-Patent Document 1, correspondence candidates for partial regions are obtained from SIFT features, and the deviations in their position, posture, and size are calculated to judge whether each correspondence is valid; however, a three-dimensional rotation angle is taken into account when evaluating the deviations. As a result, finer posture estimation is possible than with the technique of Non-Patent Document 1.

Unlike the two documents above, Non-Patent Document 2 discloses a technique based on machine learning using a convolutional neural network (CNN). A large amount of learning image data (for example, 120,000 images) is prepared in advance, in which the eight vertices of a cuboid enclosing the object appearing in each image are recorded. Based on this data, a CNN is trained by error backpropagation to predict, when an image is input, the eight vertices of the cuboid enclosing the object in that image, which makes posture estimation possible.

[Patent Document 1] JP-A-2015-95156

[Non-Patent Document 1] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, pp. 91-110, 2004.
[Non-Patent Document 2] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects. Proceedings of Conference on Robot Learning, pp. 306-316, 2018.

Broadly speaking, existing inventions take one of two approaches: they either estimate the relative posture by analyzing the correspondence between two different images of the same object (Non-Patent Document 1, Patent Document 1), or construct a posture estimator by training a CNN on a large amount of learning data in which images are recorded together with information representing the posture of the objects in them, such as the eight vertices of a cuboid (Non-Patent Document 2).

However, the techniques described in Patent Document 1 and Non-Patent Document 1 require that an image of each type of object whose posture is to be estimated be held in advance. For example, if there are 10 types of target objects, the image whose posture is to be estimated must be compared with each of at least 10 images, one per object type captured in its reference posture, to estimate the relative posture. In principle, therefore, these techniques cannot be applied unless an image showing the object in its reference posture is held, which means they cannot be applied to unknown objects.

Furthermore, because the posture is estimated from the relative change in posture between two images, the time required for posture estimation increases with the number of object types.

The method described in Non-Patent Document 2, in turn, requires a large amount of learning data annotated with posture information in order to train the CNN. The posture information (in Non-Patent Document 2, the eight vertices of a cuboid) must be precise and accurate at the pixel level of the image, and annotating such information is costly. Since CNNs generally require a large amount of learning data, the cost of building the learning data is expected to be very high.

In other words, no image recognition technique has so far been invented that can estimate posture with high accuracy, even for unknown objects, based only on learning data that can be collected at low cost.

The present invention has been made to solve the above problems, and an object thereof is to provide a posture estimation device, method, and program capable of accurately estimating the posture of a subject captured in an image based only on learning data that can be collected at low cost.

Another object is to provide a learning device capable of performing learning for accurately estimating the posture of a subject captured in an image based only on learning data that can be collected at low cost.

To achieve the above object, a posture estimation device according to a first aspect is a posture estimation device that estimates the posture, relative to a reference posture, of a predetermined subject captured in a first image. The device has an estimation unit that estimates the relative posture by performing a predetermined process on the first image, and the predetermined process is based on part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject is captured in the reference posture.

A learning device according to a second aspect is a learning device that learns a process of estimating the posture, relative to a reference posture, of a predetermined subject captured in a first image. The device includes an estimation unit that estimates the posture of the predetermined subject captured in the first image relative to the reference posture and, using the estimated relative posture, geometrically transforms the first image into a second image in which the predetermined subject is captured in the reference posture; and a learning unit that learns the estimation of the relative posture and the geometric transformation so that the result of the geometric transformation by the estimation unit matches the second image.

A posture estimation method according to a third aspect is a posture estimation method for estimating the posture, relative to a reference posture, of a predetermined subject captured in a first image. The method includes an estimation unit estimating the relative posture by performing a predetermined process on the first image, and the predetermined process is based on part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject is captured in the reference posture.

A program according to a fourth aspect is a program for estimating the posture, relative to a reference posture, of a predetermined subject captured in a first image. The program causes a computer to estimate the relative posture by performing a predetermined process on the first image, and the predetermined process is based on part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject is captured in the reference posture.

The posture estimation device, method, and program according to one aspect of the present invention have the effect that the posture of a subject captured in an image can be accurately estimated based only on learning data that can be collected at low cost.

The learning device according to one aspect of the present invention has the effect that learning for accurately estimating the posture of a subject captured in an image can be performed based only on learning data that can be collected at low cost.

FIG. 1 is a block diagram showing the configuration of an image converter with a geometric transformation layer.
FIG. 2 is a block diagram showing the configuration of the geometric transformation layer.
FIG. 3 is a block diagram showing the configuration of an image converter without a geometric transformation layer.
FIG. 4 is a flowchart showing the flow of the learning process according to the embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the learning device according to the embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the posture estimation device according to the embodiment of the present invention.
FIG. 7 is a schematic block diagram of an example of a computer that functions as the learning device or the posture estimation device.
FIG. 8 is a diagram showing experimental results.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<< Processing overview >>
The posture estimation method of the present embodiment uses, as an image converter, a neural network that receives an image as input and outputs posture parameters, a posture-transformed image, and a feature amount. The posture parameters are values representing the posture of an object; more specifically, they are, for example, the elements of an affine transformation matrix or a projective transformation matrix, that is, a matrix that gives the position, size, and angle of the object. The posture is a value indicating what rigid-body motion has been applied to the object relative to the reference posture. The posture-transformed image is the image obtained by applying the posture transformation given by the posture parameters to the input image. The feature amount is a vector representation of the features of the object appearing in the input image and is used to identify the name of the object.
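As an illustration of how position, size, and angle can be packed into the elements of an affine transformation matrix, the following minimal sketch builds a 2x3 matrix from a translation, a uniform scale, and a rotation angle. The function name affine_from_pose and the particular parameterization (uniform scale, rotation about the origin) are assumptions made for the example and are not taken from the embodiment.

```python
import numpy as np


def affine_from_pose(tx, ty, scale, angle_deg):
    """Build a 2x3 affine matrix from position, size, and angle.

    The six matrix elements are what the text calls posture parameters.
    Uniform scaling and rotation about the origin are assumed here.
    """
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([
        [scale * c, -scale * s, tx],
        [scale * s,  scale * c, ty],
    ])
```

Multiplying this matrix by a point in homogeneous coordinates (x, y, 1) rotates and scales the point and then translates it by (tx, ty).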

The image converter may have any configuration that realizes the above input/output relationship; one example is described below.

FIG. 1 shows an example of the configuration of the image converter 10. This configuration is based on a CNN and combines four types of layers: convolution layers, pooling layers, fully connected layers, and a geometric transformation layer. For example, it includes a geometric transformation layer 11, a convolution layer 12, a pooling layer 13, a convolution layer 14, a pooling layer 15, and fully connected layers 16 and 17. The convolution, pooling, and fully connected layers are widely known, and those described in Reference 1, for example, may be used.

[Reference 1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of Neural Information Processing Systems, 2012.

The geometric transformation layer 11 may be any layer that, on receiving an image as input, can output the posture parameters and the posture-transformed image obtained by applying the image transformation based on those posture parameters to the input image. For example, the following known layer can be used.

[Reference 2] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. Proceedings of Neural Information Processing Systems, 2015.

For example, as shown in FIG. 2, the geometric transformation layer 11 consists of a posture parameter estimation layer 11A, which estimates from the image U input to the geometric transformation layer 11 the posture parameters θ expressed as an affine transformation matrix or a projective transformation matrix, and an image transformation layer 11B, which obtains and outputs the posture-transformed image V by applying the estimated posture parameters θ to the image U. The posture parameter estimation layer 11A is a neural network that receives the image U as input and outputs the elements of the affine or projective transformation matrix of the object appearing in the image U. A geometric transformation layer 11 configured in this way is suitable because it can directly estimate how far the object in the input image deviates from the reference posture in terms of position, size, and angle.
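The role of the image transformation layer 11B can be illustrated with a minimal NumPy sketch that warps a single-channel image U by a given 2x3 affine matrix θ using bilinear sampling, in the manner of the spatial transformer of Reference 2. The sketch assumes θ is already available (the estimation layer 11A is not modeled), that coordinates are normalized to [-1, 1], and that θ maps output coordinates to input coordinates; the function name affine_warp is hypothetical.

```python
import numpy as np


def affine_warp(image, theta):
    """Warp a 2-D image with a 2x3 affine matrix using bilinear sampling.

    theta maps normalized output coordinates (x, y, 1) in [-1, 1] to the
    normalized input coordinates to sample from (assumed convention).
    Samples that fall outside the input image contribute zero.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.asarray(theta, dtype=float) @ grid
    # Convert normalized source coordinates back to pixel coordinates.
    sx = (src[0] + 1) * (w - 1) / 2
    sy = (src[1] + 1) * (h - 1) / 2
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    wx = sx - x0
    wy = sy - y0
    corners = (
        (0, 0, (1 - wx) * (1 - wy)),
        (0, 1, wx * (1 - wy)),
        (1, 0, (1 - wx) * wy),
        (1, 1, wx * wy),
    )
    out = np.zeros(h * w)
    for dy, dx, weight in corners:
        xi = x0 + dx
        yi = y0 + dy
        valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
        out[valid] += weight[valid] * image[yi[valid], xi[valid]]
    return out.reshape(h, w)
```

Because every operation is a weighted sum of pixel values, the same computation written in a deep learning framework is differentiable with respect to θ, which is what allows the posture parameters to be learned from image-level losses.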

Of course, the neural network configuration of the image converter 10 in the present invention is not limited to this, and any configuration may be adopted as long as it satisfies the input/output requirements described above. Preferably, an L2 normalization layer is added as the final layer, which is suitable because it makes the feature amount more robust.

In addition, the configuration shown in FIG. 1 repeats the combination of a convolution layer and a pooling layer twice and then stacks two fully connected layers; instead of these two fully connected layers, a global pooling layer such as those disclosed in Reference 3 or Reference 4 may be used.

[Reference 3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. Proceedings of Conference on Computer Vision and Pattern Recognition, 2016.

[Reference 4] Giorgos Tolias, Ronan Sicre, and Herve Jegou. Particular Object Retrieval with Integral Max-pooling of CNN Activations. ArXiv Preprint: https://arxiv.org/abs/1511.05879, 2015.

A configuration without fully connected layers is suitable because a feature amount with the same number of dimensions can always be obtained regardless of the size of the input image.
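The size independence of global pooling can be checked with a short sketch. It assumes a feature map laid out as (channels, height, width) and uses global max pooling followed by L2 normalization; the function names and the choice of max rather than average pooling are assumptions for illustration only.

```python
import numpy as np


def global_max_pool(feature_map):
    """Reduce a (C, H, W) feature map to a C-dimensional vector."""
    channels = feature_map.shape[0]
    return feature_map.reshape(channels, -1).max(axis=1)


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps avoids division by zero)."""
    return vector / max(np.linalg.norm(vector), eps)
```

Feature maps of different spatial sizes but the same channel count yield vectors of the same length, whereas a fully connected layer would require a fixed input size.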

Hereinafter, for simplicity, the description continues using the image converter 10 shown in FIG. 1 as an example.

The posture estimation method of the present embodiment executes two broadly different processes: a posture estimation process that obtains the feature amount and the posture using the image converter 10, and a learning process that trains the image converter 10.

The learning process is described in detail first, followed by the posture estimation process.

<< Learning process >>
The learning process must be performed at least once before posture estimation. More specifically, it determines the weights of the neural network, which are the parameters of the image converter 10, appropriately based on the learning data.

In executing this learning process, an image converter 10A, obtained by removing the geometric transformation layer from the image converter 10 (neural network) described above, is also used. As an example, FIG. 3 shows the configuration of the neural network obtained by removing the geometric transformation layer 11 from the image converter 10 of FIG. 1. Hereinafter, the image converter 10 configured as a neural network having the geometric transformation layer 11 is called the model 10 with a geometric transformation layer, and the image converter 10A without it is called the model 10A without a geometric transformation layer. The two models share all parameters other than those of the geometric transformation layer 11. That is, in the examples of FIGS. 1 and 3, the parameters of the corresponding convolution layers and fully connected layers all have the same values.

To execute the learning process of the present embodiment, learning image data must be prepared in advance. The learning image data consists of a set of image triplets. Each triplet consists of three images: an image I^a showing an object in the reference posture, an image I^p showing the same subject as the image I^a in a different posture, and an image I^n showing a subject different from that of the image I^a. The learning image data is a set of such triplets, each made up of a different combination of images.

Such triplet learning image data can be constructed automatically from any image set that satisfies the following requirements (a) to (d).

(a) The set contains one or more images, and each image shows at least one object.

(b) The learning image data as a whole contains two or more types of objects, and there are at least two images showing objects of the same type.

(c) For every type of object, an object image in the reference posture exists and can be identified.

(d) Images showing objects of the same type can be distinguished from one another.

Given an image set that satisfies requirements (a) to (d), one triplet can be constructed by, for example, the following steps (1) to (3).

(1) Randomly select one reference-posture image and take it as the image I^a.

(2) Randomly select an image that shows the same subject as the image I^a but is not in the reference posture, and take it as the image I^p.

(3) Randomly select an image that shows a subject different from that of the images I^a and I^p, and take it as the image I^n.

Repeating steps (1) to (3) until the desired number of triplets is obtained yields the learning image data.
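Steps (1) to (3) can be sketched as follows. The sketch assumes each image is described by a dictionary with the hypothetical keys "id", "label" (object type), and "is_reference" (whether it shows the reference posture); neither these keys nor the function name build_triplets come from the embodiment.

```python
import random


def build_triplets(images, num_triplets, seed=0):
    """Sample (I^a, I^p, I^n) triplets of image ids.

    images: list of dicts with hypothetical keys 'id', 'label', 'is_reference'.
    """
    rng = random.Random(seed)
    candidates = []
    for anchor in (im for im in images if im["is_reference"]):
        # Same object type as the anchor, but not the reference posture.
        positives = [im for im in images
                     if im["label"] == anchor["label"] and not im["is_reference"]]
        # Any image of a different object type.
        negatives = [im for im in images if im["label"] != anchor["label"]]
        if positives and negatives:
            candidates.append((anchor, positives, negatives))
    if not candidates:
        raise ValueError("image set does not satisfy requirements (a) to (d)")
    triplets = []
    for _ in range(num_triplets):
        anchor, positives, negatives = rng.choice(candidates)  # step (1)
        positive = rng.choice(positives)                        # step (2)
        negative = rng.choice(negatives)                        # step (3)
        triplets.append((anchor["id"], positive["id"], negative["id"]))
    return triplets
```

Only object-type labels and a reference-posture flag are needed; no pixel-level posture annotation enters the procedure.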

In practice, the means of preparing learning image data that satisfies the above requirements is unrelated to the gist of the present invention, and any means may be used; the data may be prepared manually or automatically.

Given such learning image data, the learning process in the embodiment of the present invention is executed by the following steps. FIG. 4 shows an outline of the learning process in the present embodiment.

First, in step S301, the model 10 with a geometric transformation layer is applied to the image I^p to obtain the image I^pt, which is the image I^p after geometric transformation, and the feature amount f^pt.

Next, in step S302, the model 10A without a geometric transformation layer is applied to each of the images I^a and I^n to obtain the corresponding feature amounts f^a and f^n.

Next, in step S303, the appearance loss is obtained from the image I^pt and the image I^a.

Next, in step S304, the feature amount loss is obtained from the feature amounts f^a, f^pt, and f^n.

Next, in step S305, the parameters of the model 10 with a geometric transformation layer are updated so as to reduce the weighted sum of the appearance loss and the feature amount loss, and the parameters are copied to the model 10A without a geometric transformation layer.

Steps S301 to S305 are repeated until a predetermined end condition is satisfied, at which point the learning process ends.

ステップＳ３０１、Ｓ３０２については、先に説明した幾何変換層付きモデル１０、幾何変換層無しモデル１０Ａを各画像に対して適用することで直ちに実行することができるものである。以降、ステップＳ３０３〜Ｓ３０５について、その処理詳細を説明する。 Steps S301 and S302 can be immediately executed by applying the model 10 with the geometric transformation layer and the model 10A without the geometric transformation layer described above to each image. Hereinafter, the processing details of steps S303 to S305 will be described.

＜各処理の処理詳細＞ <Processing details of each process>
以降、各処理の詳細処理について、本実施形態における一例を説明する。 Hereinafter, an example of the detailed processing of each process in the present embodiment will be described.

［ステップＳ３０３：アピアランス損失計算処理］ [Step S303: Appearance loss calculation process]

この処理では、同一の被写体を含む画像Ｉ^ａ、並びに、Ｉ^ｐｔに基づいて、その画像としての見え方の差異であるところのアピアランス損失を求める。 In this process, the appearance loss, which is the difference in how the two images look, is obtained based on images I^a and I^pt, which contain the same subject.

ステップＳ３０１を通して、学習用画像データに含まれる画像Ｉ^ｐに対して、幾何変換層１１による姿勢変換を施した姿勢変換画像Ｉ^ｐｔが得られている。画像Ｉ^ａ、Ｉ^ｐは同一の被写体を写しており、異なっているのはそれぞれ撮影された姿勢であることから、理想的には、姿勢変換後の画像Ｉ^ｐｔは、画像Ｉ^ａと同じ見えになっていることが好ましい。 Through step S301, a posture-converted image I^pt has been obtained by applying the posture conversion of the geometric transformation layer 11 to image I^p included in the learning image data. Images I^a and I^p show the same subject and differ only in the posture in which each was captured, so ideally the posture-converted image I^pt should look the same as image I^a.

この考え方から、画像Ｉ^ａと画像Ｉ^ｐｔの画素値の距離の総和を求め、これをアピアランス損失Ｌ^ａとして用いる。画素値の距離は、例えばＬ１距離によって下記のように求めることができる。 Based on this idea, the sum of the distances between the pixel values of image I^a and image I^pt is obtained and used as the appearance loss L_a. The distance between pixel values can be obtained, for example, with the L1 distance as follows.

（１） (1)

ここで、Ｉ^ａ _ｉ（ｘ、ｙ、ｃ）またはＩ^ｐｔ _ｉ（ｘ、ｙ、ｃ）は、それぞれ画像Ｉ^ａ _ｉまたはＩ^ｐｔ _ｉのｘ、ｙ位置におけるチャネルｃの画素値を表し、Ｘ_ｉ、Ｙ_ｉ、Ｃ_ｉはそれぞれＩ^ａ _ｉ、Ｉ^ｐｔ _ｉの組に対するｘの定義域、ｙの定義域、ｃの定義域を表す。Ｎは学習用画像データの中に含まれる三つ組みの数、あるいはそれ以下の定数である。 Here, I^a_i(x, y, c) and I^pt_i(x, y, c) represent the pixel value of channel c at position (x, y) of image I^a_i and image I^pt_i, respectively, and X_i, Y_i, and C_i represent the domains of x, y, and c for the pair I^a_i, I^pt_i. N is the number of triplets included in the learning image data, or a constant no greater than that number.
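The image for equation (1) is not reproduced in this text. Under the definitions given above, an L1 appearance loss would take a form like the following. This is a reconstruction from the surrounding description, not the original equation; in particular, whether the original includes a normalization factor such as 1/N is not known from the text, so none is shown.

```latex
L_a = \sum_{i=1}^{N} \sum_{x \in X_i} \sum_{y \in Y_i} \sum_{c \in C_i}
      \left| I^{a}_{i}(x, y, c) - I^{pt}_{i}(x, y, c) \right|
```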

このアピアランス損失Ｌ_ａは、姿勢変換後の画像Ｉ^ｐｔ _ｉと画像Ｉ^ａ _ｉが同じ見えになっているほど小さい値を取り、Ｉ^ｐｔ _ｉ＝Ｉ^ａ _ｉの場合に０となる。すなわち、様々な画像Ｉ^ｐ _ｉに対してこの値が小さくするように幾何変換層１１のパラメータを学習することによって、どんな姿勢で撮影された被写体であっても、常に基準姿勢時の被写体を撮影した画像Ｉ^ａ _ｉに近づけるような姿勢変換を行うことができるようになり、結果として、姿勢パラメータは基準姿勢との差異を推定することができるようになるのである。 This appearance loss L_a takes a smaller value the more closely the posture-converted image I^pt_i resembles image I^a_i, and becomes 0 when I^pt_i = I^a_i. That is, by learning the parameters of the geometric transformation layer 11 so that this value becomes small for a variety of images I^p_i, the layer becomes able to apply, to a subject captured in any posture, a posture conversion that brings the image close to image I^a_i of the subject in the reference posture. As a result, the posture parameters come to estimate the difference from the reference posture.

アピアランス損失は必ずしも上記式（１）の形でなくとも構わず、画像Ｉ^ｐｔ _ｉと画像Ｉ^ａ _ｉが同じ見えになっているほど小さい値を取り、Ｉ^ｐｔ _ｉ＝Ｉ^ａ _ｉの場合に０となるようなものであれば任意の形態をとって構わない。例えば、利用する距離はＬ１距離でなくともよく、例えばＬ２距離などを用いても構わない。また、Ｘ_ｉ、Ｙ_ｉ、Ｃ_ｉは必ずしも画像全体を指示するものでなくとも構わない。もし予め物体の写る領域が何らかの形で与えられるのであれば、その範囲に限定して計算しても構わない。通常、画像には、背景など、対象とする被写体以外のものが写り込んでいる場合も多いが、物体の姿勢変化を求めるという観点からは、被写体と関係のない画素の情報を含まない方が望ましく、そのような観点では可能な限り被写体が写る領域に限定して定義域が設定されていることが好ましい。 The appearance loss does not necessarily have to take the form of equation (1) above; it may take any form as long as it takes a smaller value the more closely image I^pt_i resembles image I^a_i and becomes 0 when I^pt_i = I^a_i. For example, the distance used does not have to be the L1 distance; the L2 distance, for example, may be used instead. Further, X_i, Y_i, and C_i do not necessarily have to cover the entire image. If the region in which the object appears is given in some form in advance, the calculation may be limited to that region. Images often contain things other than the target subject, such as the background, but for the purpose of finding the change in the object's posture it is better not to include information from pixels unrelated to the subject, so the domains are preferably restricted as far as possible to the region in which the subject appears.
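As a minimal sketch of the appearance loss for a single image pair, including the optional restriction to the subject region, one could write the following. The nested-list image layout `[y][x][c]`, the `mask` argument, and the choice of squared differences for the `"l2"` option are assumptions made for illustration; the text does not fix these details.

```python
def appearance_loss(img_a, img_pt, mask=None, norm="l1"):
    """Sum of per-pixel distances between two images of equal shape.

    Images are nested lists indexed [y][x][c]. `mask`, if given, is a
    [y][x] grid of booleans restricting the sum to the subject region.
    norm="l1" sums absolute differences; any other value sums squared
    differences (an assumed reading of "L2 distance").
    """
    total = 0.0
    for y, (row_a, row_pt) in enumerate(zip(img_a, img_pt)):
        for x, (px_a, px_pt) in enumerate(zip(row_a, row_pt)):
            if mask is not None and not mask[y][x]:
                continue  # skip pixels outside the subject region
            for va, vp in zip(px_a, px_pt):
                d = va - vp
                total += abs(d) if norm == "l1" else d * d
    return total
```

The loss is 0 when the two images are identical on the selected region, which matches the property the text requires of any admissible appearance loss.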

以上がステップＳ３０３で行う処理である。 The above is the process performed in step S303.

［ステップＳ３０４：特徴量損失計算処理］ [Step S304: Feature amount loss calculation process]
この処理では、特徴量を学習するために必要な特徴量損失を求める。画像認識の観点から言えば、同一の被写体を含む画像Ｉ^ａ、Ｉ^ｐｔの特徴量ｆ^ａ、ｆ^ｐｔはベクトルとして互いに近く、また、逆に、異なる被写体を含む画像Ｉ^ａ、Ｉ^ｎの特徴量ｆ^ａ、ｆ^ｎはベクトルとして互いに遠くなっている方が好ましい。このような特徴量を得ることにより、特徴量の近さに基づいて同一被写体を検索したり、あるいは、例えばK近傍法などの識別法により被写体の名称を推定することもできるようになる。 In this process, the feature amount loss required for learning the feature amounts is obtained. From the viewpoint of image recognition, it is preferable that the feature amounts f^a and f^pt of images I^a and I^pt, which contain the same subject, be close to each other as vectors and, conversely, that the feature amounts f^a and f^n of images I^a and I^n, which contain different subjects, be far from each other as vectors. By obtaining such feature amounts, it becomes possible to search for the same subject based on the closeness of feature amounts, or to estimate the name of the subject by a classification method such as the k-nearest neighbor method.

このような特徴量を学習するための損失は様々な形態のものが有り得るが、例えば下記のような損失を用いることができる。 The loss for learning such a feature amount can be in various forms, and for example, the following loss can be used.

（２）
(2)

ここで、｜｜ｘ｜｜^２ _２はｘのＬ２ノルム、ｍは０以上の実数値であり、任意の値に設定してよいが、例えば０．３などとすればよい。 Here, || x || ² ₂ is the L2 norm of x, m is a real value of 0 or more, and may be set to any value, but may be set to 0.3, for example.

この特徴量損失Ｌｆは、特徴量ｆ^ｐｔ _ｉが特徴量ｆ^ｎ _ｉよりもＬ２距離の意味で特徴量ｆ^ａ _ｉにｍ以上近しければ最小値である０を取り、そうでなければ、０よりも大きな値を取る。したがって、様々な三つ組みに対してこの値が小さくするように幾何変換層付きモデル１０のパラメータを学習することによって、同一の被写体を撮影した画像ほど近い特徴量を出力できるようになるのである。 This feature amount loss L_f takes its minimum value of 0 if feature amount f^pt_i is closer to feature amount f^a_i than feature amount f^n_i is, in the L2-distance sense, by at least m; otherwise it takes a value greater than 0. Therefore, by learning the parameters of the model 10 with the geometric transformation layer so that this value becomes small for a variety of triplets, the model becomes able to output closer feature amounts for images of the same subject.
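The image for equation (2) is not reproduced in this text. The behaviour described above (zero once f^pt is closer to f^a than f^n is by at least m, positive otherwise) matches a standard margin-based triplet loss, so the sketch below assumes that form. The function names and the list-of-floats feature representation are hypothetical.

```python
def squared_l2(u, v):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_feature_loss(triplets, margin=0.3):
    """Assumed form: sum_i max(0, ||f_a - f_pt||^2 - ||f_a - f_n||^2 + m).

    `triplets` is an iterable of (f_a, f_pt, f_n) feature vectors,
    each a list of floats; `margin` plays the role of m in the text.
    """
    return sum(
        max(0.0, squared_l2(f_a, f_pt) - squared_l2(f_a, f_n) + margin)
        for f_a, f_pt, f_n in triplets
    )
```

With the default margin of 0.3 taken from the text, a triplet contributes nothing once the negative is at least 0.3 farther (in squared L2 distance) from the anchor than the positive is.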

あるいは、下記のような特徴量損失を用いても構わない。 Alternatively, the following feature amount loss may be used.

（３）
(3)

このように定められた特徴量損失Ｌｆは、特徴量ｆ^ｐｔ _ｉとｆ^ａ _ｉが同じＬ２距離の意味で近いほど小さい値となり、また、特徴量ｆ^ｎ _ｉとｆ^ａ _ｉがＬ２距離の意味で遠いほど小さい値を取る。そうでなければ、０よりも大きな値を取る。 The feature amount loss L_f defined in this way takes a smaller value the closer feature amounts f^pt_i and f^a_i are in the L2-distance sense, and also a smaller value the farther feature amounts f^n_i and f^a_i are in the L2-distance sense. Otherwise, it takes a value greater than 0.

特徴量損失は必ずしも上記式（２）あるいは式（３）の形でなくとも構わない。例えば、ノルムはＬ２ノルムでなくともよく、例えばＬ１ノルムなどを用いても構わない。 The feature amount loss does not necessarily have to be in the form of the above equation (2) or equation (3). For example, the norm does not have to be the L2 norm, and for example, the L1 norm may be used.

以上がステップＳ３０４で行う処理である。 The above is the process performed in step S304.

［ステップＳ３０５：モデルパラメータ更新］ [Step S305: Model parameter update]

この処理では、ステップＳ３０３、Ｓ３０４で求めたアピアランス損失Ｌ_ａ、及び特徴量損失Ｌ_ｆを基に、これらの重み付き和Ｌを小さくするように幾何変換層付きモデル１０のパラメータを更新する。 In this process, based on the appearance loss L_a and the feature amount loss L_f obtained in steps S303 and S304, the parameters of the model 10 with the geometric transformation layer are updated so as to reduce their weighted sum L.

具体的には、Ｌは下記のように定義される。 Specifically, L is defined as follows.

（４）
(4)

λは二つの損失のバランスを取るパラメータであり、任意の実数値を設定してよい。例えばλ＝１００などとすればよい。 λ is a parameter that balances the two losses, and any real value may be set. For example, λ = 100 may be set.

Ｌ_ａ、Ｌ_ｆ共に、ニューラルネットワークのパラメータに対して区分的に微分可能であることを鑑みれば、勾配法により学習可能である。例えば、確率的勾配降下法に基づいて学習する場合、幾何変換層付きモデル１０のあるパラメータをｗとおくと、１ステップあたり、 Since both L_a and L_f are piecewise differentiable with respect to the parameters of the neural network, the model can be trained by a gradient method. For example, when training with stochastic gradient descent, let w denote a parameter of the model 10 with the geometric transformation layer; one step of the update is as follows.

（５）
(5)

に基づいてパラメータｗを更新していけばよい。任意のパラメータｗに対するＬの微分値は、誤差逆伝搬法により計算することができる。もちろん、モーメンタム項を利用する、重み減衰を利用するなど、一般的な確率的勾配降下法の改善法を導入しても構わないし、あるいは別の勾配降下法を利用しても構わない。 The parameter w is updated at each step according to equation (5). The derivative of L with respect to any parameter w can be computed by error backpropagation. Of course, common improvements to stochastic gradient descent, such as a momentum term or weight decay, may be introduced, or another gradient descent method may be used.
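The images for equations (4) and (5) are not reproduced in this text, so the following is only a sketch under assumptions: it takes the weighted sum to be L = L_a + λ·L_f (the text does not say which term λ multiplies) and the update to be plain gradient descent with a learning rate `lr`, a symbol that does not appear in the source. Momentum and weight decay are included as the optional improvements the text mentions.

```python
def total_loss(l_a, l_f, lam=100.0):
    """Assumed form of equation (4): L = L_a + lam * L_f."""
    return l_a + lam * l_f


def sgd_step(w, grad, lr=0.01, momentum=0.0, weight_decay=0.0, velocity=0.0):
    """One update of a scalar parameter w given dL/dw.

    Returns (new_w, new_velocity). With momentum=0 and weight_decay=0
    this reduces to w <- w - lr * grad, the assumed form of equation (5).
    """
    g = grad + weight_decay * w              # L2 weight decay term
    velocity = momentum * velocity - lr * g  # momentum accumulates past steps
    return w + velocity, velocity
```

In a real model the gradient for each parameter would come from backpropagation through both the geometric transformation layer and the feature extractor; here it is simply passed in.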

上記に基づいて全てのパラメータの値を更新した後、幾何変換層１１を除く全てのパラメータを幾何変換層無しモデル１０Ａへとコピーする。 After updating the values of all the parameters based on the above, all the parameters except the geometric transformation layer 11 are copied to the model 10A without the geometric transformation layer.
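The parameter copy at the end of step S305 can be illustrated as below. This is a sketch only: parameters are modelled as a dict from name to a list of floats, and the `geo.` name prefix used to identify the geometric transformation layer 11 is a hypothetical naming scheme.

```python
def copy_shared_parameters(params_with_stn, params_without_stn,
                           excluded_prefix="geo."):
    """Copy every parameter except those of the geometric transformation layer.

    Both arguments are dicts mapping parameter names to lists of floats.
    Parameters whose name starts with `excluded_prefix` are skipped.
    """
    for name, value in params_with_stn.items():
        if not name.startswith(excluded_prefix):
            params_without_stn[name] = list(value)  # copy, do not alias
    return params_without_stn
```

Copying rather than aliasing keeps the two models independent between updates; an implementation that shares the underlying weights directly would make the copy step unnecessary.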

以上がステップＳ３０５で行う処理である。 The above is the process performed in step S305.

以上、Ｓ３０１〜Ｓ３０５までの処理を、終了条件を満たすまで繰り返す。終了条件は任意のものを定めて構わないが、例えば「所定の回数（例えば１００回など）繰り返したら終了」、「誤差の減少が一定の繰り返し回数の間、一定の範囲内に収まっていたら終了」などとすればよい。 The processes from S301 to S305 are repeated until the end condition is satisfied. Any end condition may be chosen; for example, "end after a predetermined number of iterations (for example, 100)" or "end when the decrease in error stays within a certain range for a certain number of iterations".
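The two example end conditions can be combined in a small helper such as the one below. The window length and tolerance are illustrative values chosen for this sketch; the text only gives 100 iterations as an example count.

```python
def should_stop(loss_history, max_iters=100, window=5, tol=1e-4):
    """Return True when training should end.

    Stops after `max_iters` recorded iterations, or once the loss has
    varied by at most `tol` over the last `window` iterations.
    """
    if len(loss_history) >= max_iters:
        return True
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        return max(recent) - min(recent) <= tol
    return False
```

The helper would be called once per pass through S301 to S305, after appending the latest value of the weighted loss to `loss_history`.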

以上、本実施形態の一例における学習処理の詳細を説明した。 The details of the learning process in the example of this embodiment have been described above.

＜＜姿勢推定処理＞＞ << Posture estimation processing >>
続いて、本実施形態の一例における姿勢推定方法の姿勢推定処理について説明する。 Subsequently, the posture estimation process of the posture estimation method in the example of the present embodiment will be described.

学習処理が済んだ幾何変換層付きモデル１０を用いれば、姿勢推定処理は非常に単純である。具体的には、幾何変換層付きモデル１０に対して画像を入力し、幾何変換層１１が出力した姿勢パラメータを得ればよい。なお、本実施形態の一例において、姿勢パラメータはアフィン変換行列、あるいは、射影変換行列を想定しているが、両者ともに、当該行列から物体の大きさ、位置（平行移動）、及び角度は容易に求めることが可能である。 With the trained model 10 with the geometric transformation layer, the posture estimation process is very simple. Specifically, an image is input to the model 10 with the geometric transformation layer, and the posture parameters output by the geometric transformation layer 11 are obtained. In this example of the embodiment, the posture parameters are assumed to be an affine transformation matrix or a projective transformation matrix; in either case, the size, position (translation), and angle of the object can easily be obtained from the matrix.
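As an illustration of reading size, position, and angle from an affine matrix, the sketch below decomposes a 2×3 matrix under the assumption that it is composed only of rotation, scaling, and translation (no shear or reflection). Whether the matrix maps input coordinates to output coordinates or the reverse depends on the convention of the geometric transformation layer, which this text does not specify.

```python
import math


def decompose_affine(m):
    """Decompose m = [[a, b, tx], [c, d, ty]] into scale, translation, angle.

    Assumes rotation + scale + translation only (no shear, no reflection).
    """
    (a, b, tx), (c, d, ty) = m
    scale_x = math.hypot(a, c)   # length of the first column
    scale_y = math.hypot(b, d)   # length of the second column
    angle_deg = math.degrees(math.atan2(c, a))
    return {
        "scale": (scale_x, scale_y),
        "translation": (tx, ty),
        "angle_deg": angle_deg,
    }
```

For a projective transformation matrix a similar reading is possible only after normalizing by the bottom-right element and accepting that the perspective terms are ignored.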

また、本実施形態の一例において得られる幾何変換層付きモデル１０は、特徴量も同時に出力することができるが、前述の通り、この特徴量を用いれば、例えばＫ近傍法などの公知の方法を用いて、被写体の名称を認識することも可能である。 Further, the model 10 with the geometric transformation layer obtained in this example of the embodiment can also output a feature amount at the same time; as described above, this feature amount can be used to recognize the name of the subject with a known method such as the k-nearest neighbor method.

以上が、本実施形態の一例における姿勢推定方法の姿勢推定処理である。 The above is the posture estimation process of the posture estimation method in the example of the present embodiment.

＜＜装置構成＞＞ << Device configuration >>
以下、図面を参照して、本発明の一実施形態による学習装置及び姿勢推定装置について説明する。 Hereinafter, the learning device and the posture estimation device according to the embodiment of the present invention will be described with reference to the drawings.

図５は、学習装置１００の構成を示すブロック図である。同図に示す学習装置１００は、上述した学習処理に従って、入力された第一の画像に撮像された所定の被写体の、基準姿勢に対する相対的な姿勢を表す姿勢パラメータを推定する画像変換器１０を学習する。学習装置１００は、入力部２０及び演算部３０を備える。演算部３０は、モデル記憶部３２、推定部３４、及び学習部３８を備える。なお、モデル記憶部３２は、学習装置１００の内部にあっても外部にあっても構わないが、本実施形態においては内部に配する構成を採る。 FIG. 5 is a block diagram showing the configuration of the learning device 100. The learning device 100 shown in the figure trains, according to the learning process described above, the image converter 10 that estimates posture parameters representing the posture, relative to the reference posture, of a predetermined subject captured in an input first image. The learning device 100 includes an input unit 20 and a calculation unit 30. The calculation unit 30 includes a model storage unit 32, an estimation unit 34, and a learning unit 38. The model storage unit 32 may be inside or outside the learning device 100; in the present embodiment it is arranged inside.

入力部２０は、所定の被写体が撮影された２枚以上の第一の画像、及び所定の被写体とは異なる被写体が撮像された画像である２枚以上の第三の画像を受け付ける。 The input unit 20 receives two or more first images in which a predetermined subject is captured, and two or more third images in which a subject different from the predetermined subject is captured.

モデル記憶部３２は、画像変換器１０及び画像変換器１０Ａの各々のパラメータを記憶する。 The model storage unit 32 stores the parameters of the image converter 10 and the image converter 10A.

推定部３４は、画像変換器１０により、第一の画像に撮像された所定の被写体の、基準姿勢に対する相対的な姿勢を表す姿勢パラメータを推定し、推定された姿勢パラメータを用いて、第一の画像を幾何変換することにより、基準姿勢の所定の被写体が撮像されている第二の画像を求めると共に、第二の画像の特徴量を抽出する。 The estimation unit 34 uses the image converter 10 to estimate posture parameters representing the posture, relative to the reference posture, of the predetermined subject captured in the first image, geometrically transforms the first image using the estimated posture parameters to obtain a second image in which the predetermined subject is captured in the reference posture, and extracts the feature amount of the second image.

また、推定部３４は、画像変換器１０Ａにより、第一の画像の特徴量を抽出する。また、推定部３４は、画像変換器１０Ａにより、第一の画像の被写体とは異なる被写体が撮像された画像である第三の画像の特徴量を抽出する。 Further, the estimation unit 34 extracts the feature amount of the first image by the image converter 10A. Further, the estimation unit 34 extracts the feature amount of the third image, which is an image in which a subject different from the subject of the first image is captured, by the image converter 10A.

学習部３８は、同一被写体を撮影した少なくとも２枚以上の第一の画像を、少なくとも１枚以上の画像を含む２つのグループに分け、一方のグループの画像に画像変換器１０の幾何変換層１１を適用して求めた第二の画像の画素値と、他方のグループの画像の画素値の距離の総和が最小となり、かつ、第一の画像の被写体とは異なる被写体が撮像された画像である第三の画像に対して画像変換器１０Ａにより得られた特徴量と、第一の画像に対して画像変換器１０により得られた特徴量との距離が遠くなり、かつ、第一の画像に対して画像変換器１０により得られた特徴量と、第一の画像に対して画像変換器１０Ａにより得られた特徴量との距離が近くなるように、上記式（１）〜式（５）に従って、画像変換器１０、１０Ａを学習し、モデル記憶部３２を更新する。 The learning unit 38 divides two or more first images of the same subject into two groups, each containing at least one image, then trains the image converters 10 and 10A according to equations (1) to (5) above and updates the model storage unit 32. The training is performed so that three conditions hold. First, the sum of the distances between the pixel values of the second image, obtained by applying the geometric transformation layer 11 of the image converter 10 to the images of one group, and the pixel values of the images of the other group is minimized. Second, the distance between the feature amount obtained by the image converter 10A for the third image, in which a subject different from that of the first image is captured, and the feature amount obtained by the image converter 10 for the first image becomes large. Third, the distance between the feature amount obtained by the image converter 10 for the first image and the feature amount obtained by the image converter 10A for the first image becomes small.

上記の推定部３４、及び学習部３８の各処理は、予め定められた終了条件を満たすまで繰り返し行われる。 Each process of the estimation unit 34 and the learning unit 38 is repeated until a predetermined end condition is satisfied.

図６は、姿勢推定装置１５０の構成を示すブロック図である。同図に示す姿勢推定装置１５０は、上述した姿勢推定処理に従って、入力された第一の画像に撮像された所定の被写体の、基準姿勢に対する相対的な姿勢を表す姿勢パラメータを推定すると共に、第一の画像の被写体を認識する。姿勢推定装置１５０は、入力部５０、演算部６０、及び出力部７０を備える。演算部６０は、モデル記憶部６２、推定部６４、参照画像データベース６６、及び照合部６８を備える。なお、モデル記憶部６２及び参照画像データベース６６は、姿勢推定装置１５０の内部にあっても外部にあっても構わないが、本実施形態においては内部に配する構成を採る。 FIG. 6 is a block diagram showing the configuration of the posture estimation device 150. The posture estimation device 150 shown in the figure estimates, according to the posture estimation process described above, posture parameters representing the posture, relative to the reference posture, of a predetermined subject captured in an input first image, and also recognizes the subject of the first image. The posture estimation device 150 includes an input unit 50, a calculation unit 60, and an output unit 70. The calculation unit 60 includes a model storage unit 62, an estimation unit 64, a reference image database 66, and a collation unit 68. The model storage unit 62 and the reference image database 66 may be inside or outside the posture estimation device 150; in the present embodiment they are arranged inside.

入力部５０は、推定対象となる、所定の被写体が撮影された第一の画像を受け付ける。 The input unit 50 receives a first image in which a predetermined subject is captured, which is an estimation target.

モデル記憶部６２は、学習装置１００によって学習された、画像変換器１０のパラメータを記憶する。 The model storage unit 62 stores the parameters of the image converter 10 learned by the learning device 100.

推定部６４は、画像変換器１０により、第一の画像に撮像された所定の被写体の、基準姿勢に対する相対的な姿勢を表す姿勢パラメータを推定し、推定された姿勢パラメータを用いて、第一の画像を幾何変換することにより、基準姿勢の所定の被写体が撮像されている第二の画像を求めると共に、第二の画像の特徴量を抽出する。 The estimation unit 64 uses the image converter 10 to estimate posture parameters representing the posture, relative to the reference posture, of the predetermined subject captured in the first image, geometrically transforms the first image using the estimated posture parameters to obtain a second image in which the predetermined subject is captured in the reference posture, and extracts the feature amount of the second image.

参照画像データベース６６には、被写体情報である被写体の名称が既知の、参照画像の各々についての特徴量を記憶している。参照画像の特徴量は、予め画像変換器１０又は画像変換器１０Ａによって抽出しておくものとする。 The reference image database 66 stores a feature amount for each of the reference images in which the name of the subject, which is the subject information, is known. The feature amount of the reference image shall be extracted in advance by the image converter 10 or the image converter 10A.

また、照合部６８は、推定部６４によって抽出された特徴量と、参照画像の特徴量とを照合することにより、第一の画像の被写体の名称を認識する。 Further, the collation unit 68 recognizes the name of the subject of the first image by collating the feature amount extracted by the estimation unit 64 with the feature amount of the reference image.
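The text does not specify how the collation unit 68 compares feature amounts. As one possible reading, consistent with the k-nearest neighbor method mentioned earlier in the description, the sketch below returns the majority name among the k reference features closest to the query. The function name and the `(feature_vector, name)` layout of the reference entries are hypothetical.

```python
from collections import Counter


def recognize_name(query, references, k=1):
    """Return the majority subject name among the k nearest references.

    `references` is a list of (feature_vector, name) pairs, with feature
    vectors given as lists of floats; distance is squared L2.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(references, key=lambda r: sq_dist(query, r[0]))[:k]
    return Counter(name for _, name in nearest).most_common(1)[0][0]
```

For a large reference image database an approximate nearest-neighbor index would normally replace the full sort used here.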

出力部７０は、推定された姿勢パラメータ、第二の画像、及び第一の画像の被写体の名称を出力する。 The output unit 70 outputs the estimated posture parameters, the second image, and the names of the subjects in the first image.

学習装置１００及び姿勢推定装置１５０の各々は、一例として、図７に示すコンピュータ８４によって実現される。コンピュータ８４は、ＣＰＵ８６、メモリ８８、プログラム９０を記憶した記憶部９２、モニタを含む表示部９４、及びキーボードやマウスを含む入力部９６を含んでいる。ＣＰＵ８６、メモリ８８、記憶部９２、表示部９４、及び入力部９６はバス９８を介して互いに接続されている。 Each of the learning device 100 and the posture estimation device 150 is realized by the computer 84 shown in FIG. 7 as an example. The computer 84 includes a CPU 86, a memory 88, a storage unit 92 that stores the program 90, a display unit 94 that includes a monitor, and an input unit 96 that includes a keyboard and a mouse. The CPU 86, the memory 88, the storage unit 92, the display unit 94, and the input unit 96 are connected to each other via the bus 98.

記憶部９２はＨＤＤ、ＳＳＤ、フラッシュメモリ等によって実現される。記憶部９２には、コンピュータ８４を学習装置１００又は姿勢推定装置１５０として機能させるためのプログラム９０が記憶されている。ＣＰＵ８６は、プログラム９０を記憶部９２から読み出してメモリ８８に展開し、プログラム９０を実行する。なお、プログラム９０をコンピュータ可読媒体に格納して提供してもよい。 The storage unit 92 is realized by an HDD, SSD, flash memory, or the like. The storage unit 92 stores a program 90 for causing the computer 84 to function as the learning device 100 or the posture estimation device 150. The CPU 86 reads the program 90 from the storage unit 92, expands the program 90 into the memory 88, and executes the program 90. The program 90 may be stored and provided on a computer-readable medium.

＜＜実験結果＞＞ << Experimental results >>
これまで説明した本技術の実施形態の一例により構築した学習処理方法及び姿勢推定処理方法を用いて学習処理及び姿勢推定処理を行った実験結果を示す。 The following shows the results of an experiment in which the learning process and the posture estimation process were performed using the learning processing method and the posture estimation processing method constructed according to the example of the embodiment of the present technique described above.

本実験は、１０種類の文字がさまざまな姿勢で撮影された１２９３９組の三つ組みを含む画像群を用い、認識精度、及び、姿勢推定誤差を評価したものである。ここで、三つ組みは、文字が基準姿勢で撮影されたシード画像と、当該シード画像を、予め求めておいたアフィン変換行列により幾何変換したアフィン変換画像と、当該シード画像とは異なる文字が撮影されたシード画像とからなる。 In this experiment, recognition accuracy and posture estimation error were evaluated using an image group containing 12939 triplets in which 10 kinds of characters were captured in various postures. Here, each triplet consists of a seed image in which a character is captured in the reference posture, an affine-transformed image obtained by geometrically transforming that seed image with a predetermined affine transformation matrix, and a seed image in which a character different from that of the first seed image is captured.

なお、学習用画像データとして、上記三つ組みの集合の内１１５５９組を用いて学習処理を実施し、評価用データとして、残りの１３８０組を用いて姿勢推定処理を実施した。 The learning process was performed using 11559 of the above triplets as learning image data, and the posture estimation process was performed using the remaining 1380 triplets as evaluation data.

また、比較例となる従来技術として、参考文献２に記載のＣＮＮを用いて構成した画像変換器を用いた技術についても、同様に、学習処理及び姿勢推定処理を実施した。 Further, as a conventional technique as a comparative example, a learning process and a posture estimation process were similarly performed on a technique using an image converter configured by using the CNN described in Reference 2.

図８に、認識精度及び姿勢推定誤差の結果を示す。ここで、認識精度は、文字の種類を正しく認識できた画像の割合であり、０〜１の値を取り、高ければ高いほどよいことを表す。姿勢推定誤差は、真の姿勢（アフィン変換行列）に対する推定誤差であり、０以上の値を取り、小さければ小さいほどよいことを表す。本図から明らかな通り、本技術によれば、従来技術に対して極めて高精度な認識が可能である。 FIG. 8 shows the results for recognition accuracy and posture estimation error. Here, recognition accuracy is the proportion of images whose character type was correctly recognized; it takes a value from 0 to 1, and higher is better. The posture estimation error is the estimation error with respect to the true posture (affine transformation matrix); it takes a value of 0 or more, and smaller is better. As is clear from the figure, the present technique achieves far more accurate recognition than the conventional technique.

以上説明したように、本発明の実施の形態に係る姿勢推定装置によれば、基準姿勢の被写体が撮像されている第二の画像に、第一の画像を幾何変換するよう最適化することにより、低コストで収集可能な学習用データのみに基づいて、画像に撮像された被写体の姿勢を精度良く推定することができる。 As described above, according to the posture estimation device according to the embodiment of the present invention, by optimizing the first image to be geometrically transformed into the second image in which the subject in the reference posture is captured. It is possible to accurately estimate the posture of the subject captured in the image based only on the learning data that can be collected at low cost.

本発明の実施の形態に係る学習装置によれば、基準姿勢の被写体が撮像されている第二の画像に、第一の画像を幾何変換するよう最適化することにより、低コストで収集可能な学習用データのみに基づいて、画像に撮像された被写体の姿勢を精度良く推定するための学習を行うことができる。 According to the learning device according to the embodiment of the present invention, it is possible to collect at low cost by optimizing the first image to be geometrically transformed into the second image in which the subject in the reference posture is captured. Learning for accurately estimating the posture of the subject captured in the image can be performed based only on the learning data.

また、画像変換器は、幾何変換層を備えたニューラルネットワークにより構成され、当該幾何変換層の学習は、同一被写体を撮影した異なる２枚の画像Ａと画像Ｂを入力し、一方の画像Ｂにのみ当該幾何変換層を適用して姿勢変換画像Ｂを求め、この姿勢変換画像Ｂと、もう一方の幾何変換層を適用しなかった画像Ａとの画素値が一致するように行われる。このように学習された幾何変換層を用いると、画像Ｂに写る物体がどのような姿勢のものであっても、常に画像Ａと見た目が一致するように変換することができる。この原理により、仮に画像Ａが基準姿勢で写る物体を含むようなものである場合、任意の姿勢で写る同一被写体画像を基準姿勢に近づけることができるようになるのである。同時に、この幾何変換層は、その変換に用いた姿勢パラメータを出力するように構成されており、その姿勢パラメータを持って、画像Ｂに写る物体が、基準姿勢からどの程度ずれているかを推定することが可能である。 Further, the image converter is composed of a neural network provided with a geometric transformation layer, and learning of the geometric transformation layer involves inputting two different images A and B in which the same subject is photographed, and inputting the same subject to one image B. Only the attitude-transformed image B is obtained by applying the geometric transformation layer, and the pixel values of the attitude-transformed image B and the image A to which the other geometry-transforming layer is not applied are matched. By using the geometric transformation layer learned in this way, it is possible to convert the object in the image B so that the appearance always matches the image A regardless of the posture. According to this principle, if the image A includes an object captured in the reference posture, the same subject image captured in an arbitrary posture can be brought closer to the reference posture. At the same time, this geometric transformation layer is configured to output the attitude parameters used for the transformation, and has the attitude parameters to estimate how much the object in the image B deviates from the reference attitude. It is possible.

また、同一被写体を写した画像さえあれば、常にこれが基準姿勢を写した画像となるように画像を変換する幾何変換層を学習するという原理により、明示的な姿勢情報が付与された学習用データが無くとも、画像の見た目が一致するか否かという基準のみで、姿勢推定を行うことができるようになるのである。これは、明示的な姿勢情報が付与された学習用データを収集するよりも、遥かに低コストで収集可能な学習用データによって学習できることを意味しており、非特許文献２のような技術よりも極めて低いコストで学習が可能である。 In addition, based on the principle of learning a geometric transformation layer that, given only images of the same subject, always transforms an image into one showing the reference posture, posture estimation becomes possible using only the criterion of whether the images look the same, even without learning data annotated with explicit posture information. This means that learning can use data that is far cheaper to collect than learning data annotated with explicit posture information, so learning is possible at a much lower cost than with techniques such as that of Non-Patent Document 2.

さらに、当該幾何変換層を、入力画像に対して適用して、アフィン変換行列、または、射影変換行列を推定し、これを適用して幾何変換画像を求めるニューラルネットワークとして画像変換器を構成することにより、位置・大きさ・角度を正確に求めることが可能である。これは、アフィン変換行列及び射影変換行列は、位置・大きさ・角度を変換する行列であることに由来する。このように構成することで、画像Ｂが、基準姿勢を持つ画像Ａに対して、位置・大きさ・角度の観点でどの程度ずれているかを直接推定することができるのである。 Further, the geometric transformation layer is applied to an input image to estimate an affine transformation matrix or a projective transformation matrix, and the geometric transformation layer is applied to form an image converter as a neural network for obtaining a geometric transformation image. Therefore, it is possible to accurately obtain the position, size, and angle. This is because the affine transformation matrix and the projective transformation matrix are matrices that transform the position, size, and angle. With this configuration, it is possible to directly estimate how much the image B is deviated from the image A having the reference posture in terms of position, size, and angle.

また、本発明の実施の形態による学習装置は、同一被写体画像のみを要求し、その物体がどんな種類のものであるかという情報を利用しない。すなわち、非特許文献１あるいは特許文献１とは異なり、未知の物体に対しても高精度に姿勢を推定できる。 Further, the learning device according to the embodiment of the present invention requires only the same subject image and does not use information on what kind of object the object is. That is, unlike Non-Patent Document 1 or Patent Document 1, the posture can be estimated with high accuracy even for an unknown object.

以上の通り、本発明の実施の形態により、低コストで収集可能な学習用データのみに基づき、未知の物体に対しても高精度に姿勢を推定できる姿勢推定装置、方法、及びプログラムを提供することができる。 As described above, according to the embodiment of the present invention, a posture estimation device, a method, and a program capable of estimating a posture with high accuracy even for an unknown object based only on learning data that can be collected at low cost are provided. be able to.

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、姿勢推定装置は、入力画像の複数の領域で、それぞれ姿勢パラメータを推定するようにしてもよい。例えば、入力画像を領域分割して画像変換器に入力し、領域毎に姿勢パラメータを推定するようにしてもよい。 For example, the posture estimation device may estimate the posture parameters in each of a plurality of regions of the input image. For example, the input image may be divided into areas and input to the image converter, and the posture parameters may be estimated for each area.

また、学習装置と姿勢推定装置とを別々の装置として構成する場合を例に説明したが、一つの装置として構成してもよい。 Further, although the case where the learning device and the posture estimation device are configured as separate devices has been described as an example, they may be configured as one device.

また、本発明の実施形態における姿勢推定方法及び学習方法を、汎用演算処理装置、記憶装置等を備えたコンピュータやサーバ等により構成して、各処理がプログラムによって実行されるものとしてもよい。このプログラムは記憶装置に記憶されており、磁気ディスク、光ディスク、半導体メモリ等の記録媒体に記録することも、ネットワークを通して提供することも可能である。もちろん、その他いかなる構成要素についても、単一のコンピュータやサーバによって実現しなければならないものではなく、ネットワークによって接続された複数のコンピュータに分散して実現しても構わない。 Further, the posture estimation method and the learning method in the embodiment of the present invention may be configured by a computer, a server, or the like equipped with a general-purpose arithmetic processing unit, a storage device, or the like, and each processing may be executed by a program. This program is stored in a storage device, can be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or can be provided through a network. Of course, any other component does not have to be realized by a single computer or server, but may be distributed and realized by a plurality of computers connected by a network.

以上、図面を参照して本発明の実施の形態を説明してきたが、上記実施の形態は本発明の例示に過ぎず、本発明が上記実施の形態に限定されるものではないことは明らかである。したがって、本発明の技術思想及び範囲を逸脱しない範囲で構成要素の追加、省略、置換、その他の変更を行ってもよい。 Although the embodiments of the present invention have been described above with reference to the drawings, it is clear that the above embodiments are merely examples of the present invention and that the present invention is not limited to them. Therefore, components may be added, omitted, replaced, or otherwise modified without departing from the technical idea and scope of the present invention.

１０画像変換器、幾何変換層付きモデル
１０Ａ画像変換器、幾何変換層無しモデル
１１幾何変換層
１１Ａ姿勢パラメータ推定層
１１Ｂ画像変換層
２０、５０入力部
３０、６０演算部
３２、６２モデル記憶部
３４、６４推定部
３８学習部
６６参照画像データベース
６８照合部
７０出力部
１００学習装置
１５０姿勢推定装置

10 Image converter, model with geometric transformation layer
10A Image converter, model without geometric transformation layer
11 Geometric transformation layer
11A Posture parameter estimation layer
11B Image transformation layer
20, 50 Input unit
30, 60 Calculation unit
32, 62 Model storage unit
34, 64 Estimation unit
38 Learning unit
66 Reference image database
68 Collating unit
70 Output unit
100 Learning device
150 Posture estimation device

Claims

A posture estimation device that estimates the posture, relative to a reference posture, of a predetermined subject captured in a first image.
The device has an estimation unit that estimates the relative posture by performing a predetermined process on the first image.
The predetermined process is based on a part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject in the reference posture is captured.

The posture estimation device according to claim 1, wherein the process optimized for the geometric transformation is optimized to geometrically transform the first image into the second image and includes a process of estimating the relative posture.

The posture estimation device according to claim 1 or 2, wherein the process optimized for the geometric transformation is an STN (Spatial Transformer Network).

The estimation unit further performs a process of extracting the feature amount of the image geometrically transformed by the process optimized for the geometric transformation.
The process of extracting the feature amount of the image is optimized so that
the distance between the feature amount obtained by performing the process of extracting the feature amount on a third image, which is an image in which a subject different from the predetermined subject is captured, and the feature amount obtained by performing, on the first image, the process optimized for the geometric transformation and the process of extracting the feature amount of the image becomes long, and
the distance between the feature amount obtained by performing the process of extracting the feature amount on the first image and the feature amount obtained by performing, on the first image, the process optimized for the geometric transformation and the process of extracting the feature amount of the image becomes short. The posture estimation device according to any one of claims 1 to 3.

The posture estimation device according to claim 4, further comprising a collating unit that recognizes the predetermined subject captured in the first image by collating the feature amount obtained by the estimation unit with the feature amount obtained by performing the process of extracting the feature amount of the image on a reference image to which subject information is added.

A learning device that learns a process of estimating the posture, relative to a reference posture, of a predetermined subject captured in a first image, the learning device including:
an estimation unit that estimates the relative posture of the predetermined subject captured in the first image with respect to the reference posture and, using the estimated relative posture, geometrically transforms the first image into a second image in which the predetermined subject in the reference posture is captured; and
a learning unit that learns the estimation of the relative posture and the geometric transformation so that the result of the geometric transformation by the estimation unit matches the second image.

A posture estimation method that estimates the posture, relative to a reference posture, of a predetermined subject captured in a first image.
The method includes an estimation unit estimating the relative posture by performing a predetermined process on the first image.
The predetermined process is based on a part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject in the reference posture is captured.

A program for estimating the posture, relative to a reference posture, of a predetermined subject captured in a first image.
The program causes a computer to execute
estimating the relative posture by performing a predetermined process on the first image.
The predetermined process is based on a part of a process optimized to geometrically transform the first image into a second image in which the predetermined subject in the reference posture is captured.