JP2021070122A

JP2021070122A - Learning data generation method

Info

Publication number: JP2021070122A
Application number: JP2019199418A
Authority: JP
Inventors: 芳宏中野; Yoshihiro Nakano
Original assignee: MinebeaMitsumi Inc
Current assignee: MinebeaMitsumi Inc
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2021-05-06
Also published as: WO2021085561A1

Abstract

To readily generate learning data that is used when a machine learns a gripping operation. SOLUTION: A learning data generation method includes a step of generating, with image generation software, an image of workpieces loaded in bulk and using the image as learning data for machine learning. SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning data generation method.

A known technique for gripping a plurality of bulk-stacked objects (workpieces) with a robot arm or the like stacks a plurality of workpiece models in a virtual workspace, that is, a virtual space in which the three-dimensional shapes of the workpieces are modeled, and runs a simulation to verify the bulk picking operation.

Japanese Unexamined Patent Application Publication No. 2018-144152 (JP-A-2018-144152)

However, building three-dimensional models of a large number of real bulk-stacked workpieces and then simulating the picking operation takes a long time.

The present invention addresses the above problem as one example. Its object is to provide a learning data generation method that uses image generation software to create learning data for machine learning of a gripping operation performed by a robot arm or the like, a machine learning method that uses this learning data, and a gripping device that uses this machine learning method.

In a learning data generation method according to one aspect of the present invention, an image of bulk-stacked workpieces is generated by image generation software and used as learning data for machine learning.

According to one aspect of the present invention, learning data used for machine learning of a gripping operation can be generated easily.

FIG. 1 is a diagram showing an example of an object gripping system in which the image processing device according to the first embodiment is implemented.
FIG. 2 is a block diagram showing an example of the configuration of the object gripping system according to the first embodiment.
FIG. 3 is a flowchart showing an example of the learning process.
FIG. 4 is a diagram showing an example of three-dimensional data of an object.
FIG. 5 is a diagram showing an example of a captured image of a virtual space in which a plurality of objects are arranged.
FIG. 6 is a diagram showing an example of processing related to control of the robot arm.
FIG. 7 is a diagram showing another example of processing related to control of the robot arm.
FIG. 8 is a diagram showing an example of a detection model according to the first embodiment.
FIG. 9 is a diagram showing an example of a feature map output by the feature detection layer (u1) according to the first embodiment.
FIG. 10 is a diagram showing an example of an estimation result of the position and orientation of the object according to the first embodiment.
FIG. 11 is a diagram showing another example of the estimation result of the gripping position of the object according to the first embodiment.
FIG. 12 is a diagram showing an example of bulk-stacked images captured by the stereo camera according to the first embodiment.
FIG. 13 is a diagram showing an example of the relationship between the bulk-stacked images and the matching maps according to the first embodiment.
FIG. 14 is a flowchart showing an example of the estimation process according to the first embodiment.
FIG. 15 is a diagram showing an example of the estimation process according to the first embodiment.
FIG. 16 is a diagram showing an example of a bulk-stacked image including a tray according to a modified example.
FIG. 17 is a diagram showing an example of a positional deviation estimation model according to a modified example.
FIG. 18 is a diagram showing another example of the positional deviation estimation model according to the modified example.

Hereinafter, an image processing device and an image processing method according to an embodiment will be described with reference to the drawings. The present invention is not limited by this embodiment. The dimensional relationships and ratios of the elements in the drawings may differ from reality, and the drawings may also differ from one another in their dimensional relationships and ratios. In principle, the contents described for one embodiment or modified example apply equally to the other embodiments and modified examples.

(First Embodiment)
The image processing device according to the first embodiment is used, for example, in an object gripping system 1. FIG. 1 is a diagram showing an example of an object gripping system in which the image processing device according to the first embodiment is implemented. The object gripping system 1 shown in FIG. 1 includes an image processing device 10 (not shown), a camera 20, and a robot arm 30. The camera 20 is installed, for example, at a position from which it can capture both the robot arm 30 and the bulk-stacked workpieces 41, 42, etc. that the robot arm 30 is to grip. The camera 20 captures, for example, images of the robot arm 30 and the workpieces 41 and 42, and outputs them to the image processing device 10. The robot arm 30 and the bulk-stacked workpieces 41, 42, etc. may instead be captured by separate cameras. As shown in FIG. 1, the camera 20 in the first embodiment is a camera capable of capturing a plurality of images, such as a known stereo camera. The image processing device 10 estimates the positions and orientations of the workpieces 41, 42, etc. using the images output from the camera 20. Based on the estimated positions and orientations, the image processing device 10 outputs a signal that controls the operation of the robot arm 30. The robot arm 30 grips the workpieces 41, 42, etc. based on the signal output from the image processing device 10. Although FIG. 1 shows a plurality of different types of workpieces 41, 42, etc., there may be only one type of workpiece. The first embodiment describes the case of a single type of workpiece. The workpieces 41, 42, etc. are arranged with irregular positions and orientations. As shown in FIG. 1, for example, a plurality of workpieces may overlap one another when viewed from above. The workpieces 41 and 42 are examples of objects.

FIG. 2 is a block diagram showing an example of the configuration of the object gripping system according to the first embodiment. As shown in FIG. 2, the image processing device 10 is communicably connected to the camera 20 and the robot arm 30 through a network NW. The image processing device 10 includes a communication I/F (interface) 11, an input I/F 12, a display 13, a storage circuit 14, and a processing circuit 15.

The communication I/F 11 controls data input/output communication with external devices through the network NW. For example, the communication I/F 11 is implemented by a network card, a network adapter, a NIC (Network Interface Controller), or the like. It receives the image data output from the camera 20 and transmits the signals to be output to the robot arm 30.

The input I/F 12 is connected to the processing circuit 15. It converts input operations received from an administrator (not shown) of the image processing device 10 into electric signals and outputs the signals to the processing circuit 15. For example, the input I/F 12 is a switch button, a mouse, a keyboard, a touch panel, or the like.

The display 13 is connected to the processing circuit 15 and displays various information and image data output from the processing circuit 15. For example, the display 13 is implemented by a liquid crystal monitor, a CRT (Cathode Ray Tube) monitor, a touch panel, or the like.

The storage circuit 14 is implemented by, for example, a storage device such as a memory. The storage circuit 14 stores the various programs executed by the processing circuit 15, and temporarily stores the various data used when the processing circuit 15 executes those programs. The storage circuit 14 holds a machine (deep) learning model 141, which comprises a neural network structure 141a and learning parameters 141b. The neural network structure 141a applies a known network such as the convolutional neural network b1 of FIG. 8, and is the network structure shown in FIG. 15, described later. The learning parameters 141b are, for example, the weights of the convolution filters of a convolutional neural network, and are parameters that are learned and optimized in order to estimate the position and orientation of an object. The neural network structure 141a may instead be provided in the estimation unit 152. The machine (deep) learning model 141 is described here taking a trained model as an example, but is not limited to this. In the following, the machine (deep) learning model 141 may be referred to simply as the "learning model 141".

The learning model 141 is used in the process of estimating the positions and orientations of workpieces from the images output from the camera 20. The learning model 141 is generated, for example, by learning with the positions and orientations of a plurality of workpieces and images of those workpieces as teacher data. In the first embodiment, the learning model 141 is generated by, for example, the processing circuit 15, but it may instead be generated by an external computer. The following describes an embodiment in which the learning model 141 is generated and updated by a learning device (not shown).

In the first embodiment, the large number of images used to generate the learning model 141 may be produced, for example, by arranging a plurality of workpieces in a virtual space and capturing images of that virtual space. FIG. 3 is a flowchart showing an example of the learning process. As shown in FIG. 3, the learning device acquires three-dimensional data of the object (step S101). The three-dimensional data can be acquired by a known technique such as 3D scanning. FIG. 4 is a diagram showing an example of three-dimensional data of an object. Once the three-dimensional data has been acquired, the workpiece can be placed in the virtual space with its orientation changed arbitrarily.

Next, the learning device sets the various conditions for arranging objects in the virtual space (step S102). Objects can be placed in the virtual space using, for example, known image generation software. Conditions such as the number, positions, and orientations of the objects to be placed can be set so that the image generation software generates the objects randomly, but they may also be set arbitrarily by the administrator of the image processing device 10. Next, the learning device arranges the objects in the virtual space according to the set conditions (step S103). Next, the learning device acquires the image, positions, and orientations of the arranged objects, for example by capturing the virtual space in which the plurality of objects are arranged (step S104). In the first embodiment, the position of an object is represented by, for example, three-dimensional coordinates (x, y, z), and the orientation of an object is represented by a quaternion (qx, qy, qz, qw), a four-component number that expresses the orientation or rotational state of a body. FIG. 5 is a diagram showing an example of a captured image of a virtual space in which a plurality of objects are arranged. As shown in FIG. 5, a plurality of objects W1a and W1b are each placed in the virtual space at a random position and orientation. In the following, an image of randomly arranged objects may be referred to as a "bulk-stacked image". Next, the learning device stores the acquired image and the positions and orientations of the arranged objects in the storage circuit 14 (step S105). The learning device then repeats steps S102 to S105 a predetermined number of times (step S106). The combination of an image acquired through these steps and the positions and orientations at which the objects were placed, as stored in the storage circuit 14, may be referred to as "teacher data". Repeating steps S102 to S105 a predetermined number of times generates enough teacher data to carry out the learning process repeatedly.
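
The loop of steps S102 to S106 can be sketched as follows. This is a minimal illustration, not the actual implementation of the method: `render_scene` is a hypothetical stand-in for the capture function of the image generation software, and the placement ranges, the uniform quaternion sampling, and all function names are assumptions made for the example.

```python
import math
import random


def random_unit_quaternion(rng):
    """Draw a uniformly distributed unit quaternion (qx, qy, qz, qw)."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (a * math.sin(2 * math.pi * u2),
            a * math.cos(2 * math.pi * u2),
            b * math.sin(2 * math.pi * u3),
            b * math.cos(2 * math.pi * u3))


def render_scene(poses):
    """Hypothetical placeholder for capturing the virtual space.

    A real pipeline would return pixel data from the image generation
    software; here only a small summary is returned.
    """
    return {"num_objects": len(poses)}


def generate_teacher_data(num_scenes, max_objects, seed=0):
    """Generate (image, poses) pairs, one pair per virtual scene."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_scenes):                       # S106: repeat
        count = rng.randint(1, max_objects)           # S102: set conditions
        poses = []
        for _ in range(count):                        # S103: place objects
            position = (rng.uniform(-0.1, 0.1),       # x (assumed range)
                        rng.uniform(-0.1, 0.1),       # y (assumed range)
                        rng.uniform(0.0, 0.05))       # z (assumed range)
            poses.append({"position": position,
                          "quaternion": random_unit_quaternion(rng)})
        image = render_scene(poses)                   # S104: capture
        dataset.append({"image": image, "poses": poses})  # S105: store
    return dataset
```

Because the poses are chosen by the generator itself, the ground-truth labels come for free with every captured image, which is the point the paragraph above makes about generating teacher data easily.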

The learning device then performs the learning process a predetermined number of times using the generated teacher data, thereby generating or updating the learning parameters 141b used as weights in the neural network structure 141a (step S107). In this way, by placing an object for which three-dimensional data has been acquired in a virtual space, teacher data containing images of the object together with the corresponding positions and orientations can be generated easily for use in the learning process.

Returning to FIG. 2, the processing circuit 15 is implemented by a processor such as a CPU (Central Processing Unit). The processing circuit 15 controls the image processing device 10 as a whole. The processing circuit 15 reads the various programs stored in the storage circuit 14 and executes them to carry out various processes. By doing so, the processing circuit 15 comes to have, for example, an image acquisition unit 151, an estimation unit 152, and a robot control unit 153.

The image acquisition unit 151 acquires bulk-stacked images, for example through the communication I/F 11, and outputs them to the estimation unit 152. The image acquisition unit 151 is an example of an acquisition unit.

The estimation unit 152 estimates the position and orientation of an object using the bulk-stacked images it receives. For example, the estimation unit 152 uses the learning model 141 to perform estimation on the image of the object and outputs the estimation result to the robot control unit 153. The estimation unit 152 may additionally estimate the position and orientation of, for example, a tray on which the objects are placed. The configuration for estimating the position and orientation of the tray is described later.

The robot control unit 153 generates a signal for controlling the robot arm 30 based on the estimated position and orientation of the object, and outputs the signal to the robot arm 30 through the communication I/F 11. The robot control unit 153 acquires, for example, information on the current position and orientation of the robot arm 30. It then generates the trajectory along which the robot arm 30 moves to grip the object, according to the current position and orientation of the robot arm 30 and the estimated position and orientation of the object. The robot control unit 153 may also correct the trajectory of the robot arm 30 based on the position and orientation of the tray or the like.

FIG. 6 is a diagram showing an example of processing related to control of the robot arm. As shown in FIG. 6, the estimation unit 152 estimates the position and orientation of the target object from the bulk-stacked image. Similarly, the estimation unit 152 may estimate, from the bulk-stacked image, the position and orientation of the tray or the like on which the objects are placed. Based on the estimated models of the object and the tray or the like, the robot control unit 153 calculates the coordinates and orientation of the hand of the robot arm 30 and generates the trajectory of the robot arm 30.

After the robot arm 30 has gripped an object, the robot control unit 153 may further output a signal that controls the operation of the robot arm 30 for aligning the gripped object. FIG. 7 is a diagram showing another example of processing related to control of the robot arm. As shown in FIG. 7, the image acquisition unit 151 acquires an image, captured by the camera 20, of the object gripped by the robot arm 30. The estimation unit 152 estimates the position and orientation of the target object gripped by the robot arm 30 and outputs them to the robot control unit 153. The image acquisition unit 151 may further acquire an image, captured by the camera 20, of the alignment-destination tray or the like to which the gripped object is to be moved. In that case, the image acquisition unit 151 also acquires an image of the objects already aligned on the alignment-destination tray or the like (an aligned image). From the image of the alignment destination or the aligned image, the estimation unit 152 estimates the position and orientation of the alignment-destination tray or the like and the positions and orientations of the objects that have already been aligned. The robot control unit 153 then calculates the coordinates and orientation of the hand of the robot arm 30, and generates the trajectory of the robot arm 30 for aligning the object, based on the estimated position and orientation of the object gripped by the robot arm 30, the position and orientation of the alignment-destination tray or the like, and the positions and orientations of the already aligned objects.

Next, the estimation process in the estimation unit 152 will be described. The estimation unit 152 extracts the features of an object using, for example, a model that applies a known object detection model having downsampling, upsampling, and skip connections. FIG. 8 is a diagram showing an example of a detection model according to the first embodiment. In the object detection model shown in FIG. 8, the d1 layer divides, for example, a bulk-stacked image P1 (320 × 320 pixels) into a 40 × 40 grid by downsampling through a convolutional neural network b1, and calculates a plurality of features (for example, 256 kinds) for each grid. The d2 layer, which lies below the d1 layer, divides the grid of the d1 layer more coarsely (for example, into a 20 × 20 grid) and calculates the features of each grid. Likewise, the d3 and d4 layers, which lie below the d1 and d2 layers, each divide the grid of the d2 layer more coarsely still. The d4 layer is upsampled to calculate features at a finer division and is at the same time merged with the features of the d3 layer through skip connection s3 to generate the u3 layer. A skip connection may be a simple addition or a concatenation of features, and a transformation such as a convolutional neural network may be applied to the features of the d3 layer. In the same way, the features calculated by upsampling the u3 layer are merged with the features of the d2 layer through skip connection s2 to generate the u2 layer, and the u1 layer is generated in the same manner. As a result, the u1 layer, like the d1 layer, holds the features of each grid of a 40 × 40 division.
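
The resolution bookkeeping of this encoder-decoder can be traced with a small sketch. It is only an illustration under strong simplifications: the learned convolutions (including network b1) are replaced by plain block averaging, a single feature channel is used instead of 256, the skip connections use simple addition, and the 10 × 10 and 5 × 5 sizes for d3 and d4 are assumptions, since the text only says that these layers are coarser.

```python
def pool(grid, k):
    """Downsample a square grid by averaging non-overlapping k x k blocks."""
    n = len(grid) // k
    return [[sum(grid[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k)) / (k * k)
             for j in range(n)]
            for i in range(n)]


def upsample(grid):
    """Nearest-neighbour upsampling by a factor of 2 in each direction."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[i // 2][j // 2] for j in range(2 * cols)]
            for i in range(2 * rows)]


def skip_add(upsampled, skipped):
    """Skip connection realised as element-wise addition."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(upsampled, skipped)]


def feature_pyramid(image):
    """Trace the d1..d4 / u3..u1 layers for a 320 x 320 single-channel image."""
    d1 = pool(image, 8)   # 40 x 40 (stand-in for CNN b1)
    d2 = pool(d1, 2)      # 20 x 20
    d3 = pool(d2, 2)      # 10 x 10 (assumed size)
    d4 = pool(d3, 2)      # 5 x 5   (assumed size)
    u3 = skip_add(upsample(d4), d3)  # skip connection s3
    u2 = skip_add(upsample(u3), d2)  # skip connection s2
    u1 = skip_add(upsample(u2), d1)  # back to 40 x 40
    return {"d1": d1, "d2": d2, "d3": d3, "d4": d4,
            "u3": u3, "u2": u2, "u1": u1}
```

The u1 output has the same 40 × 40 layout as d1 but combines information from every coarser level, which is why it can serve as the fine-grained feature map used for estimation.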

FIG. 9 is a diagram showing an example of a feature map output by the feature extraction layer (u1) according to the first embodiment. The horizontal direction of the feature map shown in FIG. 9 corresponds to the grids in the horizontal direction of the bulk-stacked image P1 divided into a 40 × 40 grid, and the vertical direction corresponds to the grids in the vertical direction. The depth direction of the feature map shown in FIG. 9 corresponds to the elements of the feature vector in each grid.

FIG. 10 is a diagram showing an example of an estimation result of the position and orientation of the object according to the first embodiment. As shown in FIG. 10, the estimation unit 152 outputs two-dimensional coordinates (Δx, Δy) indicating the position of the object, a quaternion (qx, qy, qz, qw) indicating the orientation of the object, and class scores (C0, C1, ..., Cn). In the first embodiment, the estimation result does not include the depth value, that is, the coordinate of the object's position that indicates the distance from the camera 20 to the object. The configuration for calculating the depth value is described later. Depth here means the distance from the z-coordinate of the camera to the z-coordinate of the object along the z-axis, which is parallel to the optical axis of the camera. The class scores are output for each grid and represent the probability that the grid contains the center point of an object. For example, when there are n types of objects, n + 1 class scores are output: one per type plus the "probability that no object center point is contained". When there is only one type of workpiece, two class scores are output. When a plurality of objects exist in the same grid, the probability for the object stacked higher is output.
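
As an illustration of the n + 1 class scores, the sketch below converts raw per-grid outputs into probabilities with a softmax and checks whether a grid is judged to contain an object center. The use of softmax, the choice of index 0 for the "no center point" class, and the function names are assumptions; the text does not specify them.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_grid(logits, background_index=0):
    """Return (object type index or None, class scores) for one grid.

    None means the "no object center point" class has the highest score.
    background_index is an assumed convention, not taken from the text.
    """
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    label = None if best == background_index else best
    return label, scores
```

With a single workpiece type, `logits` has two entries, matching the two class scores mentioned above.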

In FIG. 10, the point C indicates the center of the grid Gx, and the point ΔC at coordinates (Δx, Δy) indicates, for example, the center point of the detected object. That is, in the example shown in FIG. 10, the center of the object is offset from the center point C of the grid Gx by Δx in the x-axis direction and by Δy in the y-axis direction.
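
The offset representation can be turned back into image coordinates as in the following sketch. It assumes that (Δx, Δy) are expressed in pixels and that grids are indexed by column and row from the top-left corner; neither convention is stated in the text, and the function name is illustrative.

```python
def decode_center(col, row, dx, dy, image_size=320, grid_count=40):
    """Return the object center (x, y) in pixels.

    col, row : indices of the grid Gx whose center is the point C
    dx, dy   : predicted offsets from C (assumed to be in pixels)
    """
    cell = image_size / grid_count   # 8 pixels per grid for 320 / 40
    center_x = (col + 0.5) * cell    # x coordinate of the point C
    center_y = (row + 0.5) * cell    # y coordinate of the point C
    return (center_x + dx, center_y + dy)
```

For example, `decode_center(10, 5, 1.5, -2.0)` places the object center 1.5 pixels to the right of and 2 pixels above the center of the grid in column 10, row 5.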

Instead of the output of FIG. 10, arbitrary points a, b, and c other than the center of the object may be set as shown in FIG. 11, and the coordinates of the points a, b, and c relative to the center point C of the grid Gx (Δx1, Δy1, Δz1, Δx2, Δy2, Δz2, Δx3, Δy3, Δz3) may be output. Such points may be set at any position on the object, and there may be a single point or a plurality of points.

If the grid division is coarse relative to the size of the objects, a plurality of objects may fall into a single grid, and their features may mix and cause erroneous detection. For this reason, the first embodiment uses only the feature map output by the feature extraction layer (u1), in which the finally generated fine-grained (40 × 40 grid) features are calculated.

In the first embodiment, the distance from the camera 20 to the object is determined by capturing two images, left and right, for example with a stereo camera. FIG. 12 is a diagram showing an example of bulk-stacked images captured by the stereo camera according to the first embodiment. As shown in FIG. 12, the image acquisition unit 151 acquires two bulk-stacked images, a left image P1L and a right image P1R. The estimation unit 152 performs estimation using the learning model 141 on both the left image P1L and the right image P1R. When doing so, some or all of the learning parameters 141b used for the left image P1L may be shared as the weights for the right image P1R. Instead of a stereo camera, a single camera may be used and moved so that images corresponding to the left and right images are captured at two positions.

Accordingly, the estimation unit 152 in the first embodiment suppresses misrecognition of objects by using a matching map that combines the features of the left image P1L with the features of the right image P1R. In the first embodiment, the matching map indicates, for each feature, how strongly the features of the right image P1R and the left image P1L are correlated. In other words, the matching map makes it possible to match the left image P1L against the right image P1R by focusing on the features in each image.
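
One way to picture a matching map is as a table of similarity scores between grid features of the two images. The sketch below computes one row of such a table. Cosine similarity as the correlation measure, and the restriction of the search to the same grid row (which presumes rectified stereo images), are assumptions made for illustration; the text does not state how the correlation is computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_scores(left_feature, right_row):
    """Similarity of one left-image grid feature to every grid in a right-image row."""
    return [cosine_similarity(left_feature, f) for f in right_row]


def best_match(left_feature, right_row):
    """Index of the right-image grid whose feature correlates most strongly."""
    scores = matching_scores(left_feature, right_row)
    return max(range(len(scores)), key=scores.__getitem__)
```

The grid returned by `best_match` plays the role of the most strongly correlated grid in the matching map for the chosen left-image grid.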

FIG. 13 is a diagram showing an example of the relationship between the bulk-loaded images and the matching maps according to the first embodiment. As shown in FIG. 13, the matching map ML takes the left image P1L as the reference and establishes correspondence with the right image P1R. In the matching map ML, the grid MLa is highlighted: it is the grid with the largest correlation between the features of the grid containing the center point of the object W1L in the left image P1L and the features contained in the right image P1R. Similarly, the matching map MR takes the right image P1R as the reference and establishes correspondence with the left image P1L. In the matching map MR, the grid MRa is highlighted: it is the grid with the largest correlation between the features of the grid containing the center point of the object W1R in the right image P1R and the features contained in the left image P1L. The grid MLa with the largest correlation in the matching map ML corresponds to the grid in which the object W1L is located in the left image P1L, and the grid MRa with the largest correlation in the matching map MR corresponds to the grid in which the object W1R is located in the right image P1R. This makes it possible to identify that the grid in which the object W1L is located in the left image P1L and the grid in which the object W1R is located in the right image P1R correspond to each other. Specifically, in FIG. 12, the matching grids are the grid G1L of the left image P1L and the grid G1R of the right image P1R. The parallax for the object W1 can therefore be determined from the X coordinate of the object W1L in the left image P1L and the X coordinate of the object W1R in the right image P1R, and hence the depth z from the camera 20 to the object W1 can be determined.

FIG. 14 is a flowchart showing an example of the estimation processing according to the first embodiment. FIG. 15 is a diagram showing an example of the estimation processing according to the first embodiment. The following description refers to FIGS. 12 to 15. First, the image acquisition unit 151 acquires left and right images of the object, such as the left image P1L and the right image P1R shown in FIG. 12 (step S201). Next, the estimation unit 152 calculates features for each grid in the horizontal direction of each of the left and right images. As described above, when each image is divided into 40 × 40 grids and 256 features are calculated for each grid, a matrix of 40 rows and 40 columns, as shown in the first and second terms on the left side of Equation (1), is obtained for the horizontal direction of each image.

Next, the estimation unit 152 executes the process m shown in FIG. 15. First, the estimation unit 152 computes, for example according to Equation (1), the matrix product of the features of a specific column extracted from the left image P1L and the transpose of the features of the same column extracted from the right image P1R. In the first term on the left side of Equation (1), the features l11 to l1n of the first grid in the horizontal direction of the specific column of the left image P1L are arranged in the row direction. In the second term on the left side of Equation (1), the features r11 to r1n of the first grid in the horizontal direction of the specific column of the right image P1R are arranged in the column direction. That is, the matrix in the second term on the left side is the transpose of a matrix in which the features r11 to r1m of the grids in the horizontal direction of the specific column of the right image P1R are arranged in the row direction. The right side of Equation (1) is the matrix product of the matrix in the first term on the left side and the matrix in the second term on the left side. The first column of the right side of Equation (1) represents the correlation between the features of the first grid extracted from the right image P1R and the features of each horizontal grid of the specific column extracted from the left image P1L. The first row represents the correlation between the features of the first grid extracted from the left image P1L and the features of each horizontal grid of the specific column extracted from the right image P1R. That is, the right side of Equation (1) is a correlation map between the features of each grid of the left image P1L and the features of each grid of the right image P1R. In Equation (1), the subscript "m" indicates the horizontal grid position in each image, and the subscript "n" indicates the feature number within each grid. That is, m ranges from 1 to 40 and n ranges from 1 to 256.
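The matrix product described for Equation (1) can be illustrated with a minimal sketch. It assumes that the features of one row of grids are held as a list of m vectors of length n (m = 40, n = 256 in the text), and it uses random numbers in place of real network features. The function name `correlation_map` is hypothetical and does not come from the source.

```python
import random

def correlation_map(left_feats, right_feats):
    """Return C with C[i][j] = dot(left_feats[i], right_feats[j]).

    Each argument is a list of m feature vectors (one per horizontal grid
    cell), each of length n. The result equals the product L @ R^T, so
    row i relates left cell i to every right cell.
    """
    return [[sum(l * r for l, r in zip(lrow, rrow)) for rrow in right_feats]
            for lrow in left_feats]

# Sizes taken from the text; the feature values are placeholders.
m, n = 40, 256
rng = random.Random(0)
left = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
right = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]

corr = correlation_map(left, right)
print(len(corr), len(corr[0]))  # 40 40
```

The result has one row per left-image grid and one column per right-image grid, which matches the row and column reading given for the right side of Equation (1).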

Next, the estimation unit 152 uses the calculated correlation map to calculate the matching map ML of the right image P1R with respect to the left image P1L, as shown in matrix (1). The matching map ML is calculated, for example, by applying the Softmax function along the row direction of the correlation map. This normalizes the correlation values in the horizontal direction, that is, the values are converted so that the values in each row sum to 1.
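The row-wise normalization can be sketched as follows. This is a plain Softmax applied to each row of a small made-up correlation map. The helper name `softmax_rows` is hypothetical, and subtracting the row maximum is a common numerical-stability step that the source does not mention.

```python
import math

def softmax_rows(corr):
    """Apply Softmax to every row so that each row sums to 1."""
    out = []
    for row in corr:
        peak = max(row)                          # subtract the max for stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy correlation map: 2 left cells x 3 right cells (placeholder values).
corr = [[2.0, 0.0, 0.0],
        [0.0, 1.0, 3.0]]
match_ml = softmax_rows(corr)
for row in match_ml:
    print([round(v, 3) for v in row])
```

Each row of `match_ml` can then be read as a distribution over right-image grids for one left-image grid, with the largest entry marking the best match.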

Next, the estimation unit 152 convolves the features extracted from the right image P1R with the calculated matching map ML, for example according to Equation (2). The first term on the left side of Equation (2) is the transpose of matrix (1), and the second term on the left side is the matrix of the first term on the left side of Equation (1). In the present invention, the same features are used both for computing the correlation and for convolution with the matching map. However, separate features for computing the correlation and for convolution may instead be newly generated from the extracted features, for example by a convolutional neural network.

Next, the estimation unit 152 concatenates the features obtained by Equation (2) with the features extracted from the left image P1L and generates new features, for example by a convolutional neural network. Integrating the features of the left and right images in this way improves the accuracy of the position and orientation estimates. The process m in FIG. 15 may be repeated multiple times.
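The fusion step can be sketched roughly as below. The sketch makes two assumptions: that applying the matching map to the right-image features is a matrix product ML · R, and that concatenation is done per grid along the feature axis. The source describes the operands of Equation (2) in terms of matrix (1) and the terms of Equation (1), so the exact operand order here is an interpretation. The convolutional network that follows is omitted, and the names `matmul` and `fuse_features` are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def fuse_features(left_feats, right_feats, match_ml):
    """Gather right-image features per left cell, then concatenate.

    match_ml[i][j] weights right cell j for left cell i (rows sum to 1).
    The output has one vector of length 2n per left cell.
    """
    warped = matmul(match_ml, right_feats)            # m x n
    return [l + w for l, w in zip(left_feats, warped)]  # m x 2n

# Toy example: 2 cells, 2 features; left cell 0 matches right cell 1.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[5.0, 6.0], [7.0, 8.0]]
match_ml = [[0.0, 1.0], [1.0, 0.0]]
fused = fuse_features(left, right, match_ml)
print(fused)  # [[1.0, 0.0, 7.0, 8.0], [0.0, 1.0, 5.0, 6.0]]
```

With a hard one-to-one matching map, as in the toy example, each left grid simply receives the features of its matched right grid. With a soft map it receives a weighted average.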

Next, the estimation unit 152 estimates the position, orientation, and class from the features obtained here, for example by a convolutional neural network. In addition, the estimation unit 152 uses the calculated correlation map to calculate the matching map MR of the left image P1L with respect to the right image P1R, as shown in matrix (2) (step S202). Like the matching map ML of the right image P1R with respect to the left image P1L, the matching map MR is calculated, for example, by applying the Softmax function along the row direction of the correlation map.

Next, the estimation unit 152 convolves the features of the left image P1L with the calculated matching map, for example according to Equation (3). The first term on the left side of Equation (3) is matrix (2), and the second term on the left side is the matrix of the second term on the left side of Equation (1) before transposition.

Next, the estimation unit 152 selects the grid with the highest class-classification estimate for the target (object) estimated from the left image P1L and compares that estimate with a preset threshold (step S203). If the threshold is not exceeded, the processing ends on the assumption that there is no target. If the threshold is exceeded, the grid with the largest value is selected from the matching map ML with the right image P1R for that grid (step S204).

Next, for the selected grid, the class-classification estimate of the target in the right image P1R is compared with a preset threshold (step S208). If the threshold is exceeded, the grid with the largest value is selected from the matching map ML with the left image P1L for that grid (step S209). If the threshold is not exceeded, the classification score of the grid selected from the estimation result of the left image P1L is set to 0, and the processing returns to step S203 (step S207).

Next, the grid of the matching map ML selected in step S209 is compared with the grid selected from the estimation result of the left image P1L in step S204 to check whether they are the same (step S210). If the grids differ, the classification score of the grid selected from the estimation result of the left image P1L in step S204 is set to 0, and the processing returns to the grid selection in step S203 (step S207). Finally, the parallax is calculated from the detected position information (for example, the value in the horizontal direction x in FIG. 1) of the grids selected in the left image P1L and the right image P1R (step S211).
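The selection and cross-check loop of steps S203 to S211 can be sketched for a single row of grids as follows. This is a simplified illustration, not the patented implementation. It assumes per-grid class scores for each image and two matching maps, and it uses the right-referenced map MR for the reverse lookup, although the text names the map ML at that step. The function names and the toy values are hypothetical.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def match_target(left_scores, right_scores, match_ml, match_mr, threshold):
    """Return (left_cell, right_cell, disparity_in_cells) or None.

    left_scores / right_scores: class scores per horizontal grid cell.
    match_ml[i][j]: match strength of left cell i to right cell j.
    match_mr[j][i]: match strength of right cell j to left cell i.
    """
    scores = list(left_scores)              # work on a copy
    for _ in range(len(scores)):            # bounded: one rejection per pass
        i = argmax(scores)                  # S203: best remaining left cell
        if scores[i] <= threshold:
            return None                     # no target remains
        j = argmax(match_ml[i])             # S204: best match in the right image
        if right_scores[j] > threshold:     # S208: right image also sees a target
            if argmax(match_mr[j]) == i:    # S209/S210: reverse lookup agrees
                return i, j, i - j          # S211: parallax in grid cells
        scores[i] = 0.0                     # S207: reject this cell and retry
    return None

left_scores = [0.1, 0.9, 0.2, 0.8]
right_scores = [0.9, 0.1, 0.7, 0.1]
match_ml = [[0.25] * 4, [0.7, 0.1, 0.1, 0.1], [0.25] * 4, [0.1, 0.1, 0.7, 0.1]]
match_mr = [[0.1, 0.7, 0.1, 0.1], [0.25] * 4, [0.1, 0.1, 0.1, 0.7], [0.25] * 4]
print(match_target(left_scores, right_scores, match_ml, match_mr, 0.5))  # (1, 0, 1)
```

Zeroing the score of a rejected grid is what lets the loop move on to the next-best candidate, as the description of step S207 indicates.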

Next, the depth of the target is calculated from the parallax obtained in step S211 (step S212). To calculate the depth for multiple targets, after step S211 the classification scores of the grids selected from the estimation results of the left image P1L and the right image P1R are set to 0, the processing returns to step S203, and the steps up to step S212 are then repeated.
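The source does not state how depth is obtained from parallax. For a rectified pinhole stereo pair, the standard relation is z = f · B / d, where f is the focal length in pixels, B is the baseline between the two camera positions, and d is the disparity in pixels. The sketch below assumes that model. The camera values and both function names are placeholders, not parameters from the patent.

```python
def cells_to_pixels(disparity_cells, image_width_px, cells_per_row=40):
    """Convert a disparity measured in grid cells into pixels."""
    return disparity_cells * image_width_px / cells_per_row

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth for a rectified stereo pair: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Placeholder camera: 640 px wide image, 800 px focal length, 0.1 m baseline.
d_px = cells_to_pixels(2, image_width_px=640)    # 2 cells of 16 px each
z = depth_from_disparity(800.0, 0.1, d_px)
print(d_px, z)  # 32.0 px of disparity, about 2.5 m of depth
```

Because one grid cell spans many pixels at a 40 × 40 resolution, a disparity measured in whole cells gives only a coarse depth, which is consistent with the modification that refines the estimate on finer grids or per pixel.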

As described above, the image processing device 10 in the first embodiment includes an acquisition unit and an estimation unit. The acquisition unit acquires a first image and a second image of bulk-loaded workpieces. The estimation unit generates a matching map between the features of the first image and the features of the second image, and estimates, for each of the first image and the second image, the position, orientation, and classification score of each target workpiece. It then estimates the workpiece position based on the matching result obtained with the aforementioned attention map (that is, the matching map) and on the position estimation result, and thereby calculates the depth from the stereo camera to the workpiece. This suppresses erroneous detection in object recognition.

(Modifications)
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment, and various modifications can be made without departing from its spirit. For example, the first embodiment describes a case with one type of object (workpiece), but the image processing device 10 may instead be configured to detect multiple types of workpieces. The image processing device 10 may also detect not only the object but also the position and orientation of a tray or the like on which the object is placed. FIG. 16 is a diagram showing an example of a bulk-loaded image including a tray according to a modification. In the example shown in FIG. 16, the image processing device 10 identifies the position and orientation of the tray on which the object is placed, and can thereby set a trajectory along which the robot arm 30 does not collide with the tray. The tray to be detected is an example of an obstacle. The image processing device 10 may be configured to detect obstacles other than the tray.

An example has been described in which the image processing device 10 divides the bulk-loaded image into, for example, 40 × 40 grids, but the division is not limited to this. The object may be detected using a finer or coarser grid, or the estimation processing may be performed pixel by pixel. This allows the image processing device 10 to calculate the distance between the camera and the object more accurately. FIG. 17 is a diagram showing an example of a positional deviation estimation model according to a modification. As shown in FIG. 17, the image processing device 10 may cut out, from the left image P1L and the right image P1R, portions around the estimated position that are smaller than a grid, and combine them. Estimation processing may then be performed in the same manner as in the first embodiment, and the positional deviation may be estimated based on the processing result.

When the estimation processing is performed in units of a finer or coarser grid or in units of pixels, it may be performed separately on the left image P1L and the right image P1R, as in the first embodiment. FIG. 18 is a diagram showing another example of the positional deviation estimation model according to a modification. In the example shown in FIG. 18, the image processing device 10 performs the estimation processing separately on the left image P1L and the right image P1R. In this case as well, as in the first embodiment, the image processing device 10 may share the weights for the left image P1L with the weights for the right image P1R when performing each estimation processing.

The estimation processing described above may also be applied, instead of to images of the bulk-loaded workpieces 41 and 42, to the robot arm 30, to the workpieces 41 and 42 held by the robot arm 30, or to the workpieces 41 and 42 aligned at the alignment destination.

The present invention is not limited by the above embodiment. Configurations obtained by appropriately combining the components described above are also included in the present invention. Further effects and modifications can be readily derived by those skilled in the art. The broader aspects of the present invention are therefore not limited to the above embodiment, and various changes are possible.

1 Object gripping system
10 Image processing device
20 Camera
30 Robot arm
41, 42 Workpiece

Claims

A learning data generation method comprising generating, with image generation software, an image of bulk-loaded workpieces for use as learning data for machine learning.

The learning data generation method according to claim 1, wherein the image of the bulk-loaded workpieces is generated by:
acquiring a three-dimensional model of a workpiece to be gripped;
determining a position and an orientation of the three-dimensional model;
arranging a plurality of the three-dimensional models in a virtual space based on the determined positions and orientations; and
acquiring an image of the virtual space in which the three-dimensional models are arranged.

The learning data generation method according to claim 1 or 2, further comprising performing machine learning processing using teacher data that includes the coordinates and orientations of the bulk-loaded workpieces and the images of the bulk-loaded workpieces.