JP6522488B2

JP6522488B2 - Machine learning device, robot system, and machine learning method for learning a workpiece pick-up operation

Info

Publication number: JP6522488B2
Application number: JP2015233857A
Authority: JP
Inventors: 岳山▲崎▼; 拓未尾山; 峻陶山; 一隆中山; 英俊組谷; 中川　浩; 大輔岡野原; 遼介奥田; 叡一松元; 圭悟河合
Original assignee: Preferred Networks Inc
Current assignee: Preferred Networks Inc
Priority date: 2015-07-31
Filing date: 2015-11-30
Publication date: 2019-05-29
Anticipated expiration: 2035-11-30
Also published as: CN106393102B; CN106393102A; JP7100426B2; JP2017064910A; JP2022145915A; JP2020168719A; JP2017030135A; DE102016015873B3; CN113199483A

Description

The present invention relates to a machine learning device, a robot system, and a machine learning method for learning the operation of picking up randomly placed workpieces, including workpieces piled in bulk.

Robot systems are known in which workpieces piled in bulk in, for example, a basket-shaped box are gripped by the hand unit of a robot and conveyed (see, for example, Patent Documents 1 and 2). In such a robot system, position information on the plurality of workpieces is acquired using, for example, a three-dimensional measuring device installed above the box, and the workpieces are picked up one at a time by the robot's hand unit on the basis of that position information.

Patent Document 1: Japanese Patent No. 5642738
Patent Document 2: Japanese Patent No. 5670397

However, in the conventional robot system described above, it is necessary to set in advance, for example, how the workpiece to be picked up is extracted from the distance image of the plurality of workpieces measured by the three-dimensional measuring device, and which workpiece position is to be picked. How the robot's hand unit is to be operated when picking up a workpiece must also be programmed in advance. Specifically, a human must, for example, teach the robot the pick-up operation using a teaching pendant.

Consequently, if the settings for extracting the workpiece to be picked up from the distance image of the plurality of workpieces are inappropriate, or if the robot's operation program is not created properly, the success rate with which the robot picks up and conveys workpieces falls. To raise that success rate, a human must refine the workpiece detection settings and the robot's operation program while searching by trial and error for the robot's optimal operation.

In view of these circumstances, an object of the present invention is to provide a machine learning device, a robot system, and a machine learning method that can learn, without human intervention, the optimal operation of a robot when picking up randomly placed workpieces, including workpieces piled in bulk.

According to a first configuration example of a first embodiment of the present invention, there is provided a machine learning device that learns the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The device comprises: a state quantity observation unit that observes the output data of a three-dimensional measuring device, which measures the plurality of workpieces piled in bulk and outputs three-dimensional position information on their surfaces, together with state quantities that set the position, posture, and pick-up direction of the hand unit for the pick-up operation; an operation result acquisition unit that acquires the result of the robot's operation of picking up the workpiece with the hand unit; and a learning unit that receives the state quantities observed by the state quantity observation unit and the success-or-failure determination of the pick-up operation acquired by the operation result acquisition unit, and learns the pick-up operation. The learning unit comprises a reward calculation unit that calculates a reward based on the success-or-failure determination from the operation result acquisition unit, and a value function update unit that has a value function defining the value of the pick-up operation and updates the value function according to the reward. The device performs learning that updates, based on the value function, the state variables that set the optimal position, posture, and pick-up direction of the hand unit obtained by the learning unit, and operates the drive axes of the hand unit and the robot based on the updated state variables.

According to a second configuration example of the first embodiment of the present invention, there is provided a machine learning device that learns the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The device comprises the same state quantity observation unit, operation result acquisition unit, and learning unit as in the first configuration example. In this example the learning unit implements a learning model that learns the pick-up operation, and comprises an error calculation unit that calculates an error based on the label output from the operation result acquisition unit and the output of the learning model, and a learning model update unit that updates the learning model according to the error. The device operates the drive axes of the hand unit and the robot based on the state variables that set the optimal position, posture, and pick-up direction of the hand unit obtained by the learning unit.

Preferably, the machine learning device further comprises a decision-making unit that determines, based on the state variables that set the optimal position, posture, and pick-up direction of the hand unit obtained by the learning unit, the command data that commands the robot to perform the pick-up operation.

According to a second embodiment of the present invention, there is provided a machine learning device that learns the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The device comprises: a state quantity observation unit that observes state quantities of the robot, including the output data of a three-dimensional measuring device that measures a three-dimensional map for each workpiece; an operation result acquisition unit that acquires the result of the robot's operation of picking up the workpiece with the hand unit; and a learning unit that receives the output of the state quantity observation unit and the output of the operation result acquisition unit, and learns a manipulated variable, including measurement parameters of the three-dimensional measuring device, in association with the state quantities of the robot and the result of the pick-up operation. Preferably, the machine learning device further comprises a decision-making unit that determines the measurement parameters of the three-dimensional measuring device by referring to the manipulated variable learned by the learning unit.

The state quantity observation unit may also observe state quantities of the robot that include the output data of a coordinate calculation unit, which calculates the three-dimensional position of each workpiece based on the output of the three-dimensional measuring device. The coordinate calculation unit may further calculate the posture of each workpiece and output the calculated three-dimensional position and posture data for each workpiece. The operation result acquisition unit may use the output data of the three-dimensional measuring device. Preferably, the machine learning device further comprises a preprocessing unit that processes the output data of the three-dimensional measuring device before it is input to the state quantity observation unit, and the state quantity observation unit receives the output data of the preprocessing unit as a state quantity of the robot. The preprocessing unit may align the direction and height of each workpiece in the output data of the three-dimensional measuring device to fixed values. The operation result acquisition unit may acquire at least one of: whether the workpiece was picked up successfully, the state of damage to the workpiece, and the degree of achievement when the picked-up workpiece is passed to a subsequent process.

The learning unit may comprise a reward calculation unit that calculates a reward based on the output of the operation result acquisition unit, and a value function update unit that has a value function defining the value of the pick-up operation and updates the value function according to the reward. Alternatively, the learning unit may have a learning model that learns the pick-up operation, and may comprise an error calculation unit that calculates an error based on the output of the operation result acquisition unit and the output of the learning model, and a learning model update unit that updates the learning model according to the error. The machine learning device preferably has a neural network.
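The reward calculation and value function update described above can be sketched as follows. This is only an illustrative sketch: the passage does not fix the update rule, so a tabular Q-learning update is assumed here, and the names `compute_reward` and `ValueFunction`, the reward values of +1 and -1, and the parameters `alpha` and `gamma` are all hypothetical.

```python
from collections import defaultdict


def compute_reward(success: bool) -> float:
    """Hypothetical reward: +1 for a successful pick, -1 for a failed one."""
    return 1.0 if success else -1.0


class ValueFunction:
    """Tabular action-value function Q(state, action), assumed for illustration."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.9):
        self.alpha = alpha            # learning rate (assumed value)
        self.gamma = gamma            # discount factor (assumed value)
        self.q = defaultdict(float)   # unseen (state, action) pairs start at 0.0

    def update(self, state, action, reward, next_state, next_actions):
        """Apply one Q-learning step and return the updated value."""
        # Best value reachable from the next state (0.0 if no actions remain).
        best_next = max(
            (self.q[(next_state, a)] for a in next_actions), default=0.0
        )
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        return self.q[(state, action)]


vf = ValueFunction(alpha=0.5, gamma=0.9)
first = vf.update("s0", "grasp_A", compute_reward(True), "s1", ["grasp_A", "grasp_B"])
second = vf.update("s0", "grasp_A", compute_reward(False), "s1", ["grasp_A", "grasp_B"])
```

In this sketch a successful pick raises the value of the chosen grasp and a failed pick lowers it, which is the sense in which the value function is updated according to the reward.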

According to a third embodiment of the present invention, there is provided a robot system comprising a machine learning device that learns the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The machine learning device comprises: a state quantity observation unit that observes state quantities of the robot, including the output data of a three-dimensional measuring device that measures a three-dimensional map for each workpiece; an operation result acquisition unit that acquires the result of the robot's operation of picking up the workpiece with the hand unit; and a learning unit that receives the output of the state quantity observation unit and the output of the operation result acquisition unit, and learns a manipulated variable, including command data that commands the robot to perform the pick-up operation, in association with the state quantities of the robot and the result of the pick-up operation. The robot system comprises the robot, the three-dimensional measuring device, and a control device that controls the robot and the three-dimensional measuring device.

According to a fourth embodiment of the present invention, there is provided a robot system comprising a machine learning device that learns the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The machine learning device comprises: a state quantity observation unit that observes state quantities of the robot, including the output data of a three-dimensional measuring device that measures a three-dimensional map for each workpiece; an operation result acquisition unit that acquires the result of the robot's operation of picking up the workpiece with the hand unit; and a learning unit that receives the output of the state quantity observation unit and the output of the operation result acquisition unit, and learns a manipulated variable, including measurement parameters of the three-dimensional measuring device, in association with the state quantities of the robot and the result of the pick-up operation. The robot system comprises the robot, the three-dimensional measuring device, and a control device that controls the robot and the three-dimensional measuring device.

Preferably, the robot system comprises a plurality of the robots, a machine learning device is provided for each robot, and the machine learning devices provided for the robots share or exchange data with one another via a communication medium. The machine learning device may reside on a cloud server.

According to a fifth embodiment of the present invention, there is provided a machine learning method for learning the operation of a robot that picks up a workpiece with a hand unit from a plurality of randomly placed workpieces, including workpieces piled in bulk. The method observes state quantities of the robot, including the output data of a three-dimensional measuring device that measures a three-dimensional map for each workpiece; acquires the result of the robot's operation of picking up the workpiece with the hand unit; and, receiving the output from the state quantity observation unit and the output from the operation result acquisition unit, learns a manipulated variable, including command data that commands the robot to perform the pick-up operation, in association with the state quantities of the robot and the result of the pick-up operation.

The machine learning device, robot system, and machine learning method according to the present invention have the effect that the optimal operation of a robot when picking up randomly placed workpieces, including workpieces piled in bulk, can be learned without human intervention.

FIG. 1 is a block diagram showing the conceptual configuration of a robot system according to one embodiment of the present invention.
FIG. 2 is a diagram schematically showing a model of a neuron.
FIG. 3 is a diagram schematically showing a three-layer neural network formed by combining the neurons shown in FIG. 2.
FIG. 4 is a flowchart showing an example of the operation of the machine learning device shown in FIG. 1.
FIG. 5 is a block diagram showing the conceptual configuration of a robot system according to another embodiment of the present invention.
FIG. 6 is a diagram for explaining an example of the processing performed by the preprocessing unit in the robot system shown in FIG. 5.
FIG. 7 is a block diagram showing a modification of the robot system shown in FIG. 1.

Embodiments of the machine learning device, robot system, and machine learning method according to the present invention are described in detail below with reference to the accompanying drawings. In the drawings, the same members are given the same reference numerals, and components given the same reference numeral in different drawings have the same function. To make the drawings easier to understand, their scale has been changed as appropriate.

FIG. 1 is a block diagram showing the conceptual configuration of a robot system according to one embodiment of the present invention. The robot system 10 of this embodiment comprises: a robot 14 fitted with a hand unit 13 that grips workpieces 12 piled in bulk in a basket-shaped box 11; a three-dimensional measuring device 15 that measures a three-dimensional map of the surfaces of the workpieces 12; a control device 16 that controls the robot 14 and the three-dimensional measuring device 15; a coordinate calculation unit 19; and a machine learning device 20.

The machine learning device 20 comprises a state quantity observation unit 21, an operation result acquisition unit 26, a learning unit 22, and a decision-making unit 25. As described in detail later, the machine learning device 20 learns and outputs manipulated variables such as command data that commands the robot 14 to pick up a workpiece 12, or measurement parameters of the three-dimensional measuring device 15.

The robot 14 is, for example, a six-axis articulated robot, and the drive axes of the robot 14 and of the hand unit 13 are controlled by the control device 16. The robot 14 is used to pick up the workpieces 12 one at a time from the box 11, which is placed at a predetermined position, and move them in turn to a designated location such as a conveyor or a workbench (not shown).

When a workpiece 12 is picked up from the bulk pile in the box 11, the hand unit 13 or the workpiece 12 may collide with or touch a wall of the box 11, or the hand unit 13 or the workpiece 12 may catch on another workpiece 12. A function for detecting the force acting on the hand unit 13 is needed so that the resulting overload on the robot 14 can be avoided immediately. For this reason, a six-axis force sensor 17 is provided between the tip of the arm of the robot 14 and the hand unit 13. The robot system 10 of this embodiment also has a function for estimating the force acting on the hand unit 13 from the current values of the motors (not shown) that drive the drive axes of the joints of the robot 14.

Because the force sensor 17 can detect the force acting on the hand unit 13, it can also be used to determine whether the hand unit 13 is actually gripping a workpiece 12. When the hand unit 13 grips a workpiece 12, the weight of the workpiece 12 acts on the hand unit 13, so if the value detected by the force sensor 17 after the pick-up operation exceeds a predetermined threshold, it can be determined that the hand unit 13 is gripping the workpiece 12. Whether the hand unit 13 is gripping a workpiece 12 can also be determined from, for example, the image data of the camera used in the three-dimensional measuring device 15, or the output of a photoelectric sensor or the like (not shown) attached to the hand unit 13. The determination may also be based on the data from the pressure gauge of a suction-type hand, described later.
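The threshold test described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `is_gripping`, the use of a single vertical force reading in newtons, and the `baseline_n` term (the reading with an empty hand) are all assumptions introduced for the example.

```python
def is_gripping(force_n: float, baseline_n: float, threshold_n: float) -> bool:
    """Return True if the extra load on the hand exceeds the threshold.

    force_n     -- force sensor reading after the pick-up operation (assumed: newtons)
    baseline_n  -- reading with an empty hand (hypothetical tare value)
    threshold_n -- predetermined threshold for deciding that a workpiece is held
    """
    extra_load = force_n - baseline_n
    return extra_load > threshold_n


held = is_gripping(force_n=12.0, baseline_n=10.0, threshold_n=1.5)
empty = is_gripping(force_n=10.5, baseline_n=10.0, threshold_n=1.5)
```

The same boolean result could feed the operation result acquisition unit as the success-or-failure determination, although how the result is routed is not specified in this passage.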

The hand unit 13 may take various forms as long as it can hold a workpiece 12. For example, the hand unit 13 may grip the workpiece 12 by opening and closing two or more claws, or it may be equipped with an electromagnet or a negative-pressure generator that produces an attractive force on the workpiece 12. In FIG. 1 the hand unit 13 is drawn as gripping the workpiece with two claws, but it is of course not limited to this form.

To measure the plurality of workpieces 12, the three-dimensional measuring device 15 is mounted by a support 18 at a predetermined position above the workpieces 12. The three-dimensional measuring device 15 may be, for example, a three-dimensional vision sensor that acquires three-dimensional position information by image processing of image data of the workpieces 12 captured by two cameras (not shown). Specifically, the three-dimensional map (the positions of the surfaces of the plurality of workpieces 12 piled in bulk) is measured by applying triangulation, the light-section method, the time-of-flight method, the depth-from-defocus method, or a combination of these.
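As an illustration of the triangulation case with two cameras, the standard relation for a rectified stereo pair gives depth as focal length times baseline divided by disparity. The sketch below assumes an idealized, rectified pair; the function name and parameter names are hypothetical and are not taken from the patent.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair.

    focal_px     -- focal length in pixels (assumed equal for both cameras)
    baseline_m   -- distance between the two camera centers, in meters
    disparity_px -- horizontal pixel shift of the same surface point between images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# A surface point seen with 50 px disparity by cameras 0.1 m apart (f = 1000 px).
z = depth_from_disparity(focal_px=1000.0, baseline_m=0.1, disparity_px=50.0)
```

Repeating this calculation for every matched pixel would yield a depth value per image location, which corresponds to the kind of surface map the measuring device outputs.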

The coordinate calculation unit 19 takes the three-dimensional map obtained by the three-dimensional measuring device 15 as input and calculates (measures) the positions of the surfaces of the plurality of workpieces 12 piled in bulk. That is, using the output of the three-dimensional measuring device 15, it can obtain three-dimensional position data (x, y, z), or three-dimensional position data (x, y, z) and posture data (w, p, r), for each workpiece 12. Here the state quantity observation unit 21 receives both the three-dimensional map from the three-dimensional measuring device 15 and the position data (posture data) from the coordinate calculation unit 19 to observe the state quantities of the robot 14, but it may, for example, receive only the three-dimensional map from the three-dimensional measuring device 15. As described later with reference to FIG. 5, a preprocessing unit 50 may also be added so that the three-dimensional map from the three-dimensional measuring device 15 is processed (preprocessed) before it is input to the state quantity observation unit 21.
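The per-workpiece data described above, position (x, y, z) with optional posture (w, p, r), can be represented with a small record type. This is a sketch only: the class name `WorkpiecePose` and the method `as_vector` are hypothetical, and the units and angle convention for w, p, and r are left open because the passage does not define them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkpiecePose:
    """Position of one workpiece, with posture angles when they are available."""
    x: float
    y: float
    z: float
    w: Optional[float] = None  # posture angles; convention not defined in the text
    p: Optional[float] = None
    r: Optional[float] = None

    def has_posture(self) -> bool:
        """True only when all three posture angles were calculated."""
        return None not in (self.w, self.p, self.r)

    def as_vector(self) -> Tuple[float, ...]:
        """Return (x, y, z) or (x, y, z, w, p, r) depending on what is known."""
        if self.has_posture():
            return (self.x, self.y, self.z, self.w, self.p, self.r)
        return (self.x, self.y, self.z)


position_only = WorkpiecePose(0.10, 0.25, 0.05)
full_pose = WorkpiecePose(0.10, 0.25, 0.05, w=0.0, p=90.0, r=45.0)
```

A list of such records, one per workpiece, is one plausible shape for the coordinate calculation unit's output that is passed on alongside the three-dimensional map.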

The relative position of the robot 14 and the three-dimensional measuring device 15 is assumed to have been determined in advance by calibration. A laser distance measuring device may be used as the three-dimensional measuring device 15 of the present invention in place of the three-dimensional vision sensor. That is, the three-dimensional position data and posture (x, y, z, w, p, r) of the plurality of workpieces 12 piled in bulk may be acquired by measuring, with laser scanning, the distance from the position where the three-dimensional measuring device 15 is installed to the surface of each workpiece 12, or by using various sensors such as a monocular camera or a tactile sensor.

In other words, in the present invention a three-dimensional measuring device 15 using any three-dimensional measurement method may be employed, as long as it can acquire, for example, the data (x, y, z, w, p, r) of each workpiece 12. The manner in which the three-dimensional measuring device 15 is installed is not particularly limited either: it may be fixed to a floor or wall, for example, or attached to the arm of the robot 14 or the like.

On a command from the control device 16, the three-dimensional measuring device 15 acquires a three-dimensional map of the plurality of workpieces 12 piled in bulk in the box 11. The coordinate calculation unit 19 acquires (calculates) the three-dimensional position (posture) data of the workpieces 12 from that three-dimensional map and outputs the data to the control device 16 and to the state quantity observation unit 21 and operation result acquisition unit 26 of the machine learning device 20, described later. In particular, the coordinate calculation unit 19 estimates, for example, the boundary between one workpiece 12 and another and the boundary between a workpiece 12 and the box 11 from the captured image data of the workpieces 12, and thereby acquires three-dimensional position data for each workpiece 12.

The three-dimensional position data for each workpiece 12 refers to, for example, data acquired by estimating the location of each workpiece 12 and the positions at which it can be held from the positions of a plurality of points on the surfaces of the workpieces 12 piled in bulk. The three-dimensional position data for each workpiece 12 may of course include data on the posture of the workpiece 12.

The acquisition of the three-dimensional position and posture data for each workpiece 12 in the coordinate calculation unit 19 may also use machine learning techniques. For example, object recognition or angle estimation from an input image or from a laser distance measuring device, using a technique such as the supervised learning described later, may be applied.

When the three-dimensional position data for each workpiece 12 is input from the three-dimensional measuring device 15 to the control device 16 via the coordinate calculation unit 19, the control device 16 controls the operation of the hand unit 13 for picking up a given workpiece 12 from the box 11. At this time, the motors (not shown) of the axes of the hand unit 13 and the robot 14 are driven on the basis of command values (manipulated variables) corresponding to the optimal position, posture, and pick-up direction of the hand unit 13 obtained by the machine learning device 20, described later.

The machine learning device 20 can also learn the variables of the imaging conditions of the camera used in the three-dimensional measuring device 15 (the measurement parameters of the three-dimensional measuring device 15: for example, the exposure time adjusted at the time of imaging using an exposure meter, or the illuminance of the illumination system that lights the object being imaged), and control the three-dimensional measuring device 15 via the control device 16 on the basis of the learned measurement-parameter manipulated variables. The variables of the position and posture estimation conditions that the three-dimensional measuring device 15 uses to estimate, from the measured positions of the plurality of workpieces 12, the location and posture of each workpiece 12 and the positions and postures at which it can be held may be included in the output data of the three-dimensional measuring device 15 described above.

Furthermore, as described above, the output data from the three-dimensional measuring device 15 can be processed in advance by, for example, the preprocessing unit 50 described in detail later with reference to FIG. 5, and the processed data (image data) can be supplied to the state quantity observation unit 21. The operation result acquisition unit 26 can acquire the result of taking out the workpiece 12 with the hand unit 13 of the robot 14 from, for example, the output data of the three-dimensional measuring device 15 (the output data of the coordinate calculation unit 19). Needless to say, it can also acquire other operation results through other means (for example, a camera or sensor provided in a downstream process), such as the degree of achievement when the taken-out workpiece 12 is passed to the downstream process and whether the taken-out workpiece 12 has undergone a state change such as damage. The state quantity observation unit 21 and the operation result acquisition unit 26 are functional blocks, and a single block can of course be regarded as achieving the functions of both.

Next, the machine learning device 20 shown in FIG. 1 will be described in detail. The machine learning device 20 has the function of analyzing the set of data input to it, extracting useful rules, knowledge representations, judgment criteria, and the like contained in that data, outputting the judgment results, and learning knowledge (machine learning). There are various machine learning techniques, but they can be broadly classified into, for example, "supervised learning", "unsupervised learning", and "reinforcement learning". In addition, there is a technique called "deep learning" that, in implementing these techniques, learns the extraction of the feature quantities themselves. The machine learning (machine learning device 20) may run on a general-purpose computer or processor, but faster processing is possible when GPGPU (General-Purpose computing on Graphics Processing Units), a large-scale PC cluster, or the like is applied.

First, supervised learning gives the machine learning device 20 a large number of data pairs, each consisting of an input and a result (label), so that the device learns the features in those data sets and inductively acquires a model that estimates the result from the input, that is, the relationship between them. When supervised learning is applied to the present embodiment, it can be used, for example, in the part that estimates workpiece positions from sensor input, or in the part that estimates the probability of successfully picking a candidate workpiece. It can be realized using an algorithm such as the neural network described later.

Unsupervised learning gives the learning device a large amount of input data only, so that the device learns how the input data is distributed and learns to compress, classify, or shape the input data without being given corresponding teacher output data. For example, features in those data sets can be clustered into groups of similar items. Using this result, outputs can be predicted by setting some criterion and assigning outputs so as to optimize it.

There is also a problem setting intermediate between unsupervised and supervised learning, called semi-supervised learning. This corresponds, for example, to the case where input and output data pairs exist for only part of the data and the rest is input-only data. In the present embodiment, learning can be performed efficiently by using, in unsupervised learning, data that can be acquired without actually moving the robot (image data, simulation data, and the like).

Next, reinforcement learning will be described. First, the problem setting of reinforcement learning is considered as follows.
・The robot observes the state of the environment and decides on an action.
・The environment changes according to some rule, and the robot's own actions may also change the environment.
・A reward signal is returned each time an action is taken.
・The quantity to be maximized is the total (discounted) reward over the future.
・Learning starts from a state in which the consequences of actions are completely unknown or only incompletely known. That is, the robot can obtain a result as data only after it actually acts, so it must search for the optimal action by trial and error.
・Learning can also be started from a good starting point by using, as the initial state, a state obtained by prior learning (with techniques such as the supervised learning described above or inverse reinforcement learning) so as to imitate human motion.
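The quantity to be maximized in the problem setting above, the discounted sum of future rewards, can be illustrated with a minimal Python sketch. The function name `discounted_return` and the sample reward sequence are assumptions made for illustration and do not appear in the source.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Walk the sequence backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two unrewarded actions followed by one successful pick.
episode_return = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

A reward that arrives later is weighted less, which is why an action that only pays off in the future (such as breaking up a pile) still receives some credit.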

Here, reinforcement learning is a technique that learns not only judgment and classification but also actions, and thereby learns appropriate actions in light of the interaction that actions have with the environment; that is, it learns how to maximize the reward obtained in the future. In the present embodiment, this means that actions that affect the future can be acquired, such as breaking up a pile of workpieces 12 so that the workpieces 12 become easier to pick later. The description below continues with Q-learning as an example, but the technique is not limited to Q-learning.

Q-learning is a method of learning the value Q(s, a) of selecting an action a under a given environmental state s. In other words, in a given state s, the action a with the highest value Q(s, a) should be selected as the optimal action. At first, however, the correct value of Q(s, a) is not known at all for any combination of state s and action a. The agent (the acting entity) therefore selects various actions a under a given state s and is given a reward for each action a taken. In this way, the agent learns to select better actions, that is, it learns the correct value Q(s, a).

Furthermore, because the goal is to maximize the total reward obtained in the future as a result of the actions, the aim is ultimately to achieve Q(s, a) = E[Σ γ^t r_t]. Here E[·] denotes the expected value, t is time, γ is a parameter called the discount rate (described later), r_t is the reward at time t, and Σ is the sum over time t. The expected value in this expression is taken for the case where the state changes according to the optimal action; since that is not known, it has to be learned while exploring. An update rule for such a value Q(s, a) can be expressed, for example, by the following equation (1).
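Equation (1) is not reproduced in this text. A reconstruction that matches the symbol definitions in the following paragraph (s_t, a_t, r_{t+1}, the max term multiplied by γ, and the learning coefficient α) is the standard Q-learning update, given here as an assumption rather than as the verbatim equation of the publication:

```latex
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Bigr)
\tag{1}
\]
```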

In equation (1) above, s_t represents the state of the environment at time t, and a_t represents the action at time t. The action a_t changes the state to s_{t+1}. r_{t+1} represents the reward obtained through that change of state. The term with max is the Q value obtained when the action a with the highest Q value known at that time is selected under the state s_{t+1}, multiplied by γ. Here, γ is a parameter with 0 < γ ≤ 1 and is called the discount rate. α is the learning coefficient, in the range 0 < α ≤ 1.

Equation (1) above represents a method of updating the evaluation value Q(s_t, a_t) of the action a_t in the state s_t based on the reward r_{t+1} returned as a result of the trial a_t. That is, if the sum of the reward r_{t+1} and the evaluation value Q(s_{t+1}, max a_{t+1}) of the best action max a in the next state reached by action a is larger than the evaluation value Q(s_t, a_t) of action a in state s, then Q(s_t, a_t) is increased; if it is smaller, Q(s_t, a_t) is decreased. In other words, the value of a given action in a given state is brought closer to the reward returned immediately as a result plus the value of the best action in the next state reached by that action.
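The update described above can be sketched for a table-based Q function as follows. This is a minimal illustration under stated assumptions: the function `q_update`, the state labels, the action labels, and the default values of `alpha` and `gamma` are all hypothetical and are not taken from the source.

```python
from collections import defaultdict


def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table q in place and return the new Q(s, a)."""
    # Value of the best action currently known in the next state.
    best_next = max(q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) a fraction alpha toward reward + gamma * best_next.
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    return q[(s, a)]


q = defaultdict(float)                 # unseen (state, action) pairs start at 0
actions = ["pick_top", "push_pile"]    # hypothetical action labels
new_value = q_update(q, "s0", "pick_top", reward=1.0, s_next="s1", actions=actions)
```

Starting from an all-zero table, a reward of 1.0 raises Q("s0", "pick_top") by alpha times the difference between the target and the current value, which is the "bring closer" behaviour described in the text.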

There are two ways to represent Q(s, a) on a computer: holding the values for all state-action pairs (s, a) as a table, or preparing a function that approximates Q(s, a). In the latter method, equation (1) above can be realized by adjusting the parameters of the approximating function with a technique such as stochastic gradient descent. The neural network described later can be used as the approximating function.
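As a rough sketch of the second representation, the example below approximates Q(s, a) with one linear weight vector per action and adjusts the weights with a stochastic-gradient step toward the target of equation (1). The linear form, the feature vectors, and all names (`q_value`, `sgd_q_step`) are assumptions chosen for brevity; the source itself suggests a neural network as the approximator.

```python
def q_value(weights, features, action):
    """Linear approximation: Q(s, a) = dot(weights[a], features(s))."""
    return sum(w * x for w, x in zip(weights[action], features))


def sgd_q_step(weights, features, action, reward, next_features,
               alpha=0.05, gamma=0.9):
    """One semi-gradient step; returns the temporal-difference error."""
    target = reward + gamma * max(q_value(weights, next_features, a) for a in weights)
    error = target - q_value(weights, features, action)
    # For a linear model the gradient of Q w.r.t. weights[action] is the feature vector.
    weights[action] = [w + alpha * error * x
                       for w, x in zip(weights[action], features)]
    return error


weights = {"pick_top": [0.0, 0.0], "push_pile": [0.0, 0.0]}   # hypothetical actions
td_error = sgd_q_step(weights, [1.0, 0.5], "pick_top", 1.0, [0.0, 1.0])
```

Only the weights of the action actually taken change, and states with similar features share what is learned, which is what makes this representation usable when the state space is too large for a table.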

A neural network can be used as the learning model for supervised and unsupervised learning, or as the approximation algorithm for the value function in reinforcement learning. FIG. 2 schematically shows a model of a neuron, and FIG. 3 schematically shows a three-layer neural network formed by combining the neurons shown in FIG. 2. That is, the neural network is composed of, for example, an arithmetic device and memory that imitate the neuron model shown in FIG. 2.

As shown in FIG. 2, the neuron produces an output (result) y for a plurality of inputs x (in FIG. 2, inputs x1 to x3 as an example). Each input x (x1, x2, x3) is multiplied by the weight w (w1, w2, w3) corresponding to that input. The neuron thereby outputs the result y expressed by the following equation (2). The input x, the result y, and the weight w are all vectors. In equation (2), θ is the bias and f_k is the activation function.
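Equation (2) is not reproduced in this text. Given the definitions above (inputs x, weights w, bias θ, activation function f_k), a form consistent with them is shown below; the upper summation limit n and the sign convention for θ are assumptions.

```latex
\[
y = f_k\!\left( \sum_{i=1}^{n} x_i w_i - \theta \right)
\tag{2}
\]
```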

The three-layer neural network formed by combining the neurons shown in FIG. 2 will be described with reference to FIG. 3. As shown in FIG. 3, a plurality of inputs x (here, inputs x1 to x3 as an example) are input from the left side of the neural network, and results y (here, results y1 to y3 as an example) are output from the right side. Specifically, the inputs x1, x2, and x3 are each multiplied by a corresponding weight and input to each of the three neurons N11 to N13. The weights applied to these inputs are collectively denoted W1.

The neurons N11 to N13 output z11 to z13, respectively. In FIG. 3, z11 to z13 are collectively denoted as the feature vector Z1, which can be regarded as a vector obtained by extracting the feature quantities of the input vector. The feature vector Z1 is the feature vector between the weights W1 and W2. z11 to z13 are each multiplied by a corresponding weight and input to each of the two neurons N21 and N22. The weights applied to these feature vectors are collectively denoted W2.

The neurons N21 and N22 output z21 and z22, respectively. In FIG. 3, z21 and z22 are collectively denoted as the feature vector Z2. The feature vector Z2 is the feature vector between the weights W2 and W3. z21 and z22 are each multiplied by a corresponding weight and input to each of the three neurons N31 to N33. The weights applied to these feature vectors are collectively denoted W3.

Finally, the neurons N31 to N33 output the results y1 to y3, respectively. The neural network operates in a learning mode and a value prediction mode. For example, the weights W are learned from a training data set in the learning mode, and the robot's action is decided in the prediction mode using those parameters. The term "prediction" is used for convenience; needless to say, a variety of tasks such as detection, classification, and inference are possible.
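The forward computation through the network of FIG. 3 (three inputs, neurons N11 to N13, N21 and N22, then N31 to N33) can be sketched as below. The sigmoid activation, the zero biases, and every weight value are placeholders assumed for illustration; the source does not specify them.

```python
import math


def layer(inputs, weights, biases):
    """One layer: each weight row feeds one neuron, y = f(sum_i x_i * w_i - theta)."""
    outputs = []
    for row, theta in zip(weights, biases):
        activation = sum(x * w for x, w in zip(inputs, row)) - theta
        outputs.append(1.0 / (1.0 + math.exp(-activation)))  # sigmoid, an assumption
    return outputs


def forward(x, params):
    """Propagate x through the layers in order and return the final outputs."""
    z = x
    for weights, biases in params:
        z = layer(z, weights, biases)
    return z


# Placeholder weights: W1 is 3x3, W2 is 2x3, W3 is 3x2.
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.3, 0.1, -0.2]]
W2 = [[0.5, -0.5, 0.25], [0.1, 0.2, 0.3]]
W3 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]
params = [(W1, [0.0] * 3), (W2, [0.0] * 2), (W3, [0.0] * 3)]
y = forward([1.0, 0.0, -1.0], params)   # y corresponds to (y1, y2, y3)
```

The intermediate lists produced inside `forward` play the role of the feature vectors Z1 and Z2 in the description.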

Here, the data obtained by actually moving the robot in the prediction mode can be learned immediately and reflected in the next action (online learning), or learning can be performed collectively on a data set collected in advance, after which the detection mode is run with those parameters from then on (batch learning). An intermediate approach is also possible, in which the learning mode is interposed each time a certain amount of data has accumulated.

The weights W1 to W3 can be learned by the error backpropagation method (backpropagation). The error information enters from the right side and flows to the left side. Backpropagation is a technique that adjusts (learns) each weight so as to reduce, for each neuron, the difference between the output y produced when the input x is applied and the true output y (teacher).
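A minimal sketch of one backpropagation step is given below for a small network with one hidden layer and a single output. The network size, the sigmoid activation, the squared-error loss, the omission of biases, and the learning rate are all simplifying assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation step; returns the squared error before the update."""
    # Forward pass (biases omitted for brevity).
    h = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in w_hidden]
    y = sigmoid(sum(hj * wj for hj, wj in zip(h, w_out)))
    # Backward pass: the error enters at the output and flows toward the input.
    delta_out = (y - target) * y * (1.0 - y)
    delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # Gradient-descent update of both weight sets.
    for j in range(len(h)):
        w_out[j] -= lr * delta_out * h[j]
        for i in range(len(x)):
            w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
    return (y - target) ** 2


w_hidden = [[0.1, -0.2], [0.3, 0.4]]   # placeholder initial weights
w_out = [0.2, -0.1]
errors = [train_step([1.0, 0.5], 1.0, w_hidden, w_out) for _ in range(200)]
```

Repeating the step on the same sample shrinks the gap between the output and the teacher value, which is the adjustment the paragraph describes.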

Such a neural network can be extended beyond three layers by adding further layers (this is referred to as deep learning). It is also possible to automatically acquire, from teacher data alone, an arithmetic device that extracts features of the input in stages and regresses the result.

Accordingly, to carry out the Q-learning described above, the machine learning device 20 of the present embodiment includes, as shown in FIG. 1, a state quantity observation unit 21, an operation result acquisition unit 26, a learning unit 22, and a decision-making unit 25. As noted above, however, the machine learning method applied to the present invention is not limited to Q-learning. That is, various techniques usable in a machine learning device are applicable, such as "supervised learning", "unsupervised learning", "semi-supervised learning", and "reinforcement learning". The machine learning (machine learning device 20) may run on a general-purpose computer or processor, but faster processing is possible when GPGPU, a large-scale PC cluster, or the like is applied.

That is, the present embodiment provides a machine learning device that learns the operation of a robot 14 that uses a hand unit 13 to take out a workpiece 12 from a plurality of randomly placed workpieces 12, including workpieces piled in bulk. The device comprises: a state quantity observation unit 21 that observes the state quantity of the robot 14, including the output data of a three-dimensional measuring device 15 that measures the three-dimensional position (x, y, z), or the three-dimensional position and posture (x, y, z, w, p, r), of each workpiece 12; an operation result acquisition unit 26 that acquires the result of the take-out operation in which the robot 14 takes out the workpiece 12 with the hand unit 13; and a learning unit 22 that receives the output of the state quantity observation unit 21 and the output of the operation result acquisition unit 26 and learns a manipulated variable, including command data that commands the robot 14 to perform the take-out operation of the workpiece 12, in association with the state quantity of the robot 14 and the result of the take-out operation.

The state quantity observed by the state quantity observation unit 21 may include, for example, state variables that respectively set the position, posture, and take-out direction of the hand unit 13 when a given workpiece 12 is taken out of the box 11. The manipulated variable to be learned may include, for example, command values such as torque, speed, and rotational position given from the control device 16 to the drive axes of the robot 14 and the hand unit 13 when the workpiece 12 is taken out of the box 11.

When one of the workpieces 12 piled in bulk is taken out, the learning unit 22 learns the above state variables in association with the result of the take-out operation of the workpiece 12 (the output of the operation result acquisition unit 26). Specifically, the control device 16 sets the output data of the three-dimensional measuring device 15 (coordinate calculation unit 19) and the command data of the hand unit 13 either at random or deliberately according to a predetermined rule, and the hand unit 13 performs the take-out operation of the workpiece 12. An example of such a predetermined rule is to take out the workpieces 12 piled in bulk in descending order of height (z). In this way, output data of the three-dimensional measuring device 15 and command data of the hand unit 13 correspond to each act of taking out a workpiece. Taking out the workpiece 12 then succeeds or fails, and each time such a success or failure occurs, the learning unit 22 evaluates the state variables composed of the output data of the three-dimensional measuring device 15 and the command data of the hand unit 13.

The learning unit 22 also stores the output data of the three-dimensional measuring device 15 and the command data of the hand unit 13 at the time the workpiece 12 is taken out in association with the evaluation of the result of the take-out operation. Examples of failure include the case where the hand unit 13 fails to grip the workpiece 12, and the case where the workpiece 12 is gripped but collides with or touches a wall of the box 11. Whether the workpiece 12 has been taken out successfully is determined from the value detected by the force sensor 17 and from the imaging data of the three-dimensional measuring device. The machine learning device 20 can also perform learning using, for example, part of the command data of the hand unit 13 output from the control device 16.

The learning unit 22 of the present embodiment preferably includes a reward calculation unit 23 and a value function update unit 24. For example, the reward calculation unit 23 calculates a reward, such as a score, based on the success or failure of taking out the workpiece 12 that results from the above state variables. The reward is made higher for a successful take-out of the workpiece 12 and lower for a failed take-out. The reward may also be calculated based on the number of times the workpiece 12 is successfully taken out within a predetermined time. Furthermore, the reward may be calculated according to each stage of taking out the workpiece 12, for example successful gripping by the hand unit 13, successful transport by the hand unit 13, and successful placement of the workpiece 12.
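A stage-wise reward of the kind described above could be sketched as follows. The stage names, the per-stage scores, and the failure penalty are hypothetical values; the source only states that success should score higher than failure and that each stage may contribute to the reward.

```python
# Hypothetical per-stage scores and failure penalty (not from the source).
STAGE_REWARDS = {"grip": 0.3, "transport": 0.3, "place": 0.4}
FAILURE_REWARD = -1.0
STAGE_ORDER = ("grip", "transport", "place")


def compute_reward(stage_results):
    """stage_results maps a stage name to True (succeeded) or False (failed)."""
    reward = 0.0
    for stage in STAGE_ORDER:
        if not stage_results.get(stage, False):
            # A failed stage ends the attempt; later stages earn nothing.
            return reward + FAILURE_REWARD
        reward += STAGE_REWARDS[stage]
    return reward


full_success = compute_reward({"grip": True, "transport": True, "place": True})
dropped = compute_reward({"grip": True, "transport": False})
```

With these placeholder numbers a complete success always scores higher than any partial attempt, so the ordering required by the text is preserved.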

The value function update unit 24 has a value function that determines the value of the take-out operation of the workpiece 12 and updates the value function according to the above reward. The update rule for the value Q(s, a) described above is used for this update. It is also preferable to create an action value table at the time of the update. The action value table here is a record that associates the output data of the three-dimensional measuring device 15 and the command data of the hand unit 13 at the time the workpiece 12 was taken out with the value function (that is, the evaluation value) updated according to the result of taking out the workpiece 12 at that time.

A function approximated using the neural network described above can also be used as this action value table, which is particularly effective when the amount of information in the state s is enormous, as with image data. The value function is not limited to one type. For example, a value function that evaluates the success or failure of gripping the workpiece 12 with the hand unit 13, or a value function that evaluates the time (cycle time) required for the hand unit 13 to grip and transport the workpiece 12, is conceivable.

Furthermore, a value function that evaluates interference between the box 11 and the hand unit 13 or the workpiece 12 during take-out may be used as the above value function. To calculate the reward used to update this value function, the state quantity observation unit 21 preferably observes the force acting on the hand unit 13, for example the value detected by the force sensor 17. When the amount of change in the force detected by the force sensor 17 exceeds a predetermined threshold, it can be inferred that such interference has occurred, so the reward in that case is preferably set to, for example, a negative value so that the value determined by the value function becomes lower.
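The threshold check on the change in force can be sketched as below. The function name `interference_reward`, the sample readings, the threshold, and the penalty value are hypothetical; the source only states that a change exceeding a predetermined threshold indicates interference and should yield, for example, a negative reward.

```python
def interference_reward(force_samples, threshold, penalty=-1.0):
    """Return penalty if any change between consecutive readings exceeds threshold."""
    for previous, current in zip(force_samples, force_samples[1:]):
        if abs(current - previous) > threshold:
            return penalty      # sudden change in force: treat as interference
    return 0.0                  # no abrupt change detected


# Hypothetical force-sensor traces in arbitrary units.
collision = interference_reward([1.0, 1.1, 5.0], threshold=2.0)
smooth = interference_reward([1.0, 1.1, 1.2], threshold=2.0)
```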

According to the present embodiment, the measurement parameters of the three-dimensional measuring device 15 can also be learned as manipulated variables. That is, the present embodiment provides a machine learning device that learns the operation of a robot 14 that uses a hand unit 13 to take out a workpiece 12 from a plurality of randomly placed workpieces 12, including workpieces piled in bulk. The device comprises: a state quantity observation unit 21 that observes the state quantity of the robot 14, including the output data of a three-dimensional measuring device 15 that measures the three-dimensional position (x, y, z), or the three-dimensional position and posture (x, y, z, w, p, r), of each workpiece 12; an operation result acquisition unit 26 that acquires the result of the take-out operation in which the robot 14 takes out the workpiece 12 with the hand unit 13; and a learning unit 22 that receives the output of the state quantity observation unit 21 and the output of the operation result acquisition unit 26 and learns a manipulated variable, including the measurement parameters of the three-dimensional measuring device 15, in association with the state quantity of the robot 14 and the result of the take-out operation.

Furthermore, the robot system 10 of the present embodiment may include an automatic hand changer (not shown) that replaces the hand unit 13 attached to the robot 14 with a hand unit 13 of a different form. In that case, the value function update unit 24 preferably holds the above value function for each form of hand unit 13 and updates the value function of the hand unit 13 in use after the change according to the reward. The optimal operation of the hand unit 13 can thereby be learned for each of the differently shaped hand units 13, so that the automatic hand changer can be made to select the hand unit 13 with the higher value function.

Next, the decision-making unit 25 preferably refers, for example, to the action value table created as described above and selects the output data of the three-dimensional measuring device 15 and the command data of the hand unit 13 that correspond to the highest evaluation value. The decision-making unit 25 then outputs the selected optimal data for the hand unit 13 and the three-dimensional measuring device 15 to the control device 16.
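Selecting the entry with the highest evaluation value from an action value table can be sketched as follows. The table layout (a dictionary keyed by a pair of measurement setting and hand command) and all of its entries are invented for illustration; the source does not specify how the table is stored.

```python
def select_best(action_value_table):
    """Return the key (measurement setting, hand command) with the highest value."""
    return max(action_value_table, key=action_value_table.get)


# Hypothetical table: keys pair a measurement setting with a hand command.
table = {
    (("exposure_ms", 10), ("approach", "vertical")): 0.42,
    (("exposure_ms", 20), ("approach", "vertical")): 0.57,
    (("exposure_ms", 20), ("approach", "tilted")): 0.31,
}
best = select_best(table)
```

This is a purely greedy choice. During learning, some exploration (for example, occasionally picking a random entry) would normally be mixed in; that is a general remark about Q-learning and not something the source states at this point.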

The control device 16 then uses the optimal data for the hand unit 13 and the three-dimensional measuring device 15 output by the learning unit 22 to control the three-dimensional measuring device 15 and the robot 14, respectively, and take out the workpiece 12. For example, the control device 16 preferably operates the drive axes of the hand unit 13 and the robot 14 based on the state variables, obtained by the learning unit 22, that respectively set the optimal position, posture, and take-out direction of the hand unit 13.

The robot system 10 of the embodiment described above includes one machine learning device 20 for one robot 14, as shown in FIG. 1. In the present invention, however, the numbers of robots 14 and machine learning devices 20 are not limited to one each. For example, the robot system 10 may include a plurality of robots 14, with one or more machine learning devices 20 provided for each robot 14. The robot system 10 preferably shares or mutually exchanges, over a communication medium such as a network, the optimal state variables of the three-dimensional measuring device 15 and the hand unit 13 acquired by the machine learning device 20 of each robot 14. As a result, even if the operating rate of one robot 14 is lower than that of another robot 14, the optimal operation results acquired by the machine learning device 20 of the other robot 14 can be used for the operation of the first robot 14. In addition, the time required for learning can be shortened by sharing the learning model among a plurality of robots, or by sharing the manipulated variables including the measurement parameters of the three-dimensional measuring device 15 together with the state quantities of the robots 14 and the results of the take-out operations.

Furthermore, the machine learning device 20 may be located inside or outside the robot 14. Alternatively, the machine learning device 20 may be located in the control device 16 or on a cloud server (not shown).

When the robot system 10 includes a plurality of robots 14, one robot 14 can transport the workpiece 12 gripped by its hand unit 13 while the hand unit of another robot 14 takes out a workpiece 12. The value function updating unit 24 can also update the value function during the time in which the robots 14 taking out workpieces 12 are switched. Furthermore, the machine learning device 20 can hold state variables for a plurality of hand models, run take-out simulations with those hand models during the take-out operation of the workpiece 12, and, according to the simulation results, learn the state variables of the hand models in association with the result of the take-out operation of the workpiece 12.

In the machine learning device 20 described above, the output data of the three-dimensional measuring instrument 15 obtained when the three-dimensional map of each workpiece 12 is acquired is transmitted from the three-dimensional measuring instrument 15 to the state quantity observation unit 21. Because such transmitted data may contain abnormal data, the machine learning device 20 can be given an abnormal-data filtering function, that is, a function for selecting whether or not data from the three-dimensional measuring instrument 15 is input to the state quantity observation unit 21. This allows the learning unit 22 of the machine learning device 20 to efficiently learn the optimum operation of the hand unit 13 performed by the three-dimensional measuring instrument 15 and the robot 14.

Furthermore, in the machine learning device 20 described above, output data from the learning unit 22 is input to the control device 16. Because this output data may also contain abnormal data, an abnormal-data filtering function may be provided, that is, a function for selecting whether or not data from the learning unit 22 is output to the control device 16. This allows the control device 16 to have the robot 14 execute the optimum operation of the hand unit 13 more safely.

The abnormal data described above can be detected by the following procedure: estimate the probability distribution of the input data, use that distribution to derive the occurrence probability of a new input, and, if the occurrence probability is at or below a certain threshold, regard the input as abnormal data that deviates significantly from typical behavior.
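The detection procedure can be sketched as follows. This is only an illustration under stated assumptions: the specification does not name a distribution or a threshold, so the single-variable Gaussian model, the function names, and the threshold value here are all hypothetical.

```python
import math


def fit_gaussian(samples):
    """Estimate the distribution of past inputs (assumed Gaussian here)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    # Guard against a zero variance when all samples are identical.
    return mean, max(var, 1e-12)


def density(x, mean, var):
    """Probability density of a new input under the fitted Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def is_abnormal(x, mean, var, threshold):
    """Treat the input as abnormal when its density is at or below the threshold."""
    return density(x, mean, var) <= threshold


# Hypothetical history of one measured value (for example, a height in mm).
history = [10.0, 10.2, 9.8, 10.1, 9.9]
mean, var = fit_gaussian(history)
accept = not is_abnormal(10.05, mean, var, threshold=1e-3)
reject = is_abnormal(25.0, mean, var, threshold=1e-3)
```

A filter of this kind could sit either in front of the state quantity observation unit or between the learning unit and the control device, passing on only the data for which `is_abnormal` returns `False`.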

Next, an example of the operation of the machine learning device 20 provided in the robot system 10 of the present embodiment will be described. FIG. 4 is a flowchart showing an example of the operation of the machine learning device shown in FIG. 1. As shown in FIG. 4, when the learning operation (learning process) starts in the machine learning device 20 shown in FIG. 1, the three-dimensional measuring instrument 15 performs a three-dimensional measurement and outputs the result (step S11 in FIG. 4). That is, in step S11, for example, a three-dimensional map of each randomly placed workpiece 12, including bulk-stacked workpieces (the output data of the three-dimensional measuring instrument 15), is acquired and output to the state quantity observation unit 21. In addition, the coordinate calculation unit 19 receives the three-dimensional map of each workpiece 12, calculates the three-dimensional position (x, y, z) of each workpiece 12, and outputs it to the state quantity observation unit 21, the operation result acquisition unit 26, and the control device 16. The coordinate calculation unit 19 may also calculate and output the posture (w, p, r) of each workpiece 12 from the output of the three-dimensional measuring instrument 15.

As described with reference to FIG. 5, the output (three-dimensional map) of the three-dimensional measuring instrument 15 may be input to the state quantity observation unit 21 via a preprocessing unit 50 that processes it before input. As described with reference to FIG. 7, only the output of the three-dimensional measuring instrument 15 may be input to the state quantity observation unit 21, and that output alone may also be input to the state quantity observation unit 21 via the preprocessing unit 50. The three-dimensional measurement and output in step S11 can thus take various forms.

Specifically, in the case of FIG. 1, the state quantity observation unit 21 observes state quantities (output data of the three-dimensional measuring instrument 15) such as the three-dimensional map of each workpiece 12 from the three-dimensional measuring instrument 15 and the three-dimensional position (x, y, z) and posture (w, p, r) of each workpiece 12 from the coordinate calculation unit 19. The operation result acquisition unit 26 acquires, from the output data of the three-dimensional measuring instrument 15 (the output data of the coordinate calculation unit 19), the result of the take-out operation of the robot 14 that takes out the workpiece 12 with the hand unit 13. Besides the output data of the three-dimensional measuring instrument, the operation result acquisition unit 26 can also acquire other results of the take-out operation, for example the degree of achievement when the taken-out workpiece 12 is passed to a subsequent process, or damage to the taken-out workpiece 12.

Next, for example, the machine learning device 20 determines the optimum operation based on the output data of the three-dimensional measuring instrument 15 (step S12 in FIG. 4), and the control device 16 outputs command data (manipulated variables) for the hand unit 13 (robot 14) to perform the take-out operation of the workpiece 12 (step S13 in FIG. 4). The result of the take-out is then acquired by the operation result acquisition unit 26 described above (step S14 in FIG. 4).

Next, whether the take-out of the workpiece 12 succeeded is determined from the output of the operation result acquisition unit 26 (step S15 in FIG. 4). If the take-out succeeded, a positive reward is set (step S16 in FIG. 4); if it failed, a negative reward is set (step S17 in FIG. 4). The action value table (value function) is then updated (step S18 in FIG. 4).

The success or failure of the take-out of the workpiece 12 can be determined, for example, from the output data of the three-dimensional measuring instrument 15 after the take-out operation. The determination is also not limited to evaluating whether the take-out itself succeeded. It may instead evaluate, for example, the degree of achievement when the taken-out workpiece 12 is passed to a subsequent process, whether the taken-out workpiece 12 has undergone a state change such as damage, or the time (cycle time) or energy (electric power consumption) required for the hand unit 13 to grip and transport the workpiece 12.

The reward value based on the success or failure determination is calculated by the reward calculation unit 23, and the action value table is updated by the value function updating unit 24. That is, when the take-out of the workpiece 12 succeeds, the learning unit 22 sets a positive reward as the reward in the update formula for the value Q(s, a) described above (S16); when the take-out fails, it sets a negative reward in that formula (S17). The learning unit 22 then updates the action value table each time a workpiece 12 is taken out (S18). By repeating steps S11 to S18, the learning unit 22 continues to update the action value table, that is, it continues to learn.
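Steps S15 to S18 can be illustrated with a small tabular sketch. It assumes the standard Q-learning update, Q(s, a) ← Q(s, a) + α(r + γ max Q(s', a') − Q(s, a)); the reward magnitudes (+1 and −1), the learning rate, the discount factor, and the string-valued states and actions are hypothetical choices made only for this example.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def reward_for(success):
    """S16 / S17: positive reward on success, negative reward on failure."""
    return 1.0 if success else -1.0


def update_q(q, state, action, success, next_state, actions):
    """S18: update one entry of the action value table."""
    r = reward_for(success)
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += ALPHA * (r + GAMMA * best_next - q[(state, action)])
    return q[(state, action)]


# Action value table with every entry starting at 0.
q_table = defaultdict(float)
actions = ["pick_from_top", "pick_from_side"]  # hypothetical action labels

# One successful and one failed take-out from the same observed state.
update_q(q_table, "s0", "pick_from_top", True, "s1", actions)
update_q(q_table, "s0", "pick_from_side", False, "s1", actions)
```

In a real system the state would be derived from the three-dimensional map and the action from the hand position, posture, and take-out direction, which is why the specification also allows the value function to be approximated, for example by a neural network, rather than stored as a table.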

In the above, the data input to the state quantity observation unit 21 is not limited to the output data of the three-dimensional measuring instrument 15. It may include, for example, the outputs of other sensors, and part of the command data from the control device 16 can also be used. In this way, the control device 16 causes the robot 14 to execute the take-out operation of the workpiece 12 using the command data (manipulated variables) output from the machine learning device 20. As described earlier, what the machine learning device 20 learns is not limited to the take-out operation of the workpiece 12; it may be, for example, the measurement parameters of the three-dimensional measuring instrument 15.

As described above, the robot system 10 including the machine learning device 20 of the present embodiment can learn the operation of the robot 14 that takes out a workpiece 12 with the hand unit 13 from a plurality of randomly placed workpieces 12, including bulk-stacked workpieces. The robot system 10 can therefore learn, without human intervention, how to select the optimum operation of the robot 14 for taking out bulk-stacked workpieces 12.

FIG. 5 is a block diagram showing the conceptual configuration of a robot system according to another embodiment of the present invention, to which supervised learning is applied. As is clear from a comparison of FIG. 5 with FIG. 1 described above, the robot system 10' shown in FIG. 5, which applies supervised learning, additionally includes a result (labeled) data recording unit 40 compared with the robot system 10 shown in FIG. 1, which applies Q-learning (reinforcement learning). The robot system 10' shown in FIG. 5 further includes a preprocessing unit 50 that preprocesses the output data of the three-dimensional measuring instrument 15. Needless to say, the preprocessing unit 50 can also be provided in the robot system 10 shown in FIG. 1.

As shown in FIG. 5, the machine learning device 30 in the robot system 10' to which supervised learning is applied includes a state quantity observation unit 31, an operation result acquisition unit 36, a learning unit 32, and a decision-making unit 35. The learning unit 32 includes an error calculation unit 33 and a learning model updating unit 34. In the robot system 10' of this embodiment as well, the machine learning device 30 learns and outputs manipulated variables such as command data that instructs the robot 14 to take out the workpiece 12, or measurement parameters of the three-dimensional measuring instrument 15.

That is, in the robot system 10' shown in FIG. 5, which applies supervised learning, the error calculation unit 33 and the learning model updating unit 34 correspond, respectively, to the reward calculation unit 23 and the value function updating unit 24 in the robot system 10 shown in FIG. 1, which applies Q-learning. The other components, such as the three-dimensional measuring instrument 15, the control device 16, and the robot 14, are the same as in FIG. 1 described above, and their description is omitted.

The error calculation unit 33 calculates the error between the result (label) output from the operation result acquisition unit 36 and the output of the learning model implemented in the learning unit. When, for example, the shape of the workpiece 12 and the processing performed by the robot 14 are unchanged, the result (labeled) data recording unit 40 can hold the result (labeled) data obtained up to the day before a given day on which the robot 14 is to perform work, and provide that data to the error calculation unit 33 on that day. Alternatively, data obtained by simulation or the like performed outside the robot system 10', or result (labeled) data from another robot system, can be provided to the error calculation unit 33 of the robot system 10' via a memory card or a communication line. Furthermore, the result (labeled) data recording unit 40 can be implemented with non-volatile memory such as flash memory and built into the learning unit 32, so that the learning unit 32 uses the result (labeled) data held in the recording unit 40 directly.

FIG. 6 illustrates an example of the processing performed by the preprocessing unit in the robot system shown in FIG. 5. FIG. 6(a) shows an example of the three-dimensional position (posture) data of a plurality of workpieces 12 stacked in bulk in the box 11, that is, the output data of the three-dimensional measuring instrument 15. FIGS. 6(b) to 6(d) show examples of the image data after preprocessing has been applied to the workpieces 121 to 123 in FIG. 6(a).

Here, the workpieces 12 (121 to 123) are assumed to be cylindrical metal parts, and the hand (13) is assumed to be, for example, a suction pad that picks up the longitudinal center portion of the cylindrical workpiece 12 by negative pressure, rather than a gripper that holds the workpiece with two claws. Accordingly, if the position of the longitudinal center portion of a workpiece 12 is known, the workpiece 12 can be taken out by moving the suction pad (13) to that position and applying suction. The numerical values in FIGS. 6(a) to 6(d) are given in millimeters [mm] along the x, y, and z directions. The z direction corresponds to the height (depth) direction of the image data captured by the three-dimensional measuring instrument 15 (for example, one having two cameras) installed above the box 11 in which the workpieces 12 are stacked in bulk.

As is clear from a comparison of FIG. 6(a) with FIGS. 6(b) to 6(d), one example of the processing performed by the preprocessing unit 50 in the robot system 10' shown in FIG. 5 is to take the workpieces 12 of interest (for example, the three workpieces 121 to 123) from the output data (three-dimensional image) of the three-dimensional measuring instrument 15, rotate each of them, and shift it so that the height of its center becomes 0.

That is, the output data of the three-dimensional measuring instrument 15 includes, for example, the three-dimensional position (x, y, z) and posture (w, p, r) of the longitudinal center portion of each workpiece 12. As shown in FIGS. 6(b), 6(c), and 6(d), each of the three workpieces of interest 121, 122, and 123 is rotated by −r and has z subtracted, so that all three are brought to the same conditions. Such preprocessing reduces the load on the machine learning device 30.
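The normalization can be sketched for a small set of measured points. The specification only states that each workpiece is rotated by −r and that z is subtracted; this sketch additionally assumes that r is a rotation about the z axis given in degrees and that the rotation is taken about the workpiece center, so x and y are also expressed relative to that center. The function name and the sample values are hypothetical.

```python
import math


def normalize_points(points, center, r_deg):
    """Rotate points by -r about the workpiece center and subtract the center height.

    points: iterable of (x, y, z) in mm from the three-dimensional map
    center: (x, y, z) of the longitudinal center portion of the workpiece
    r_deg:  posture component r, assumed to be a rotation about z in degrees
    """
    cx, cy, cz = center
    t = math.radians(-r_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = []
    for x, y, z in points:
        dx, dy = x - cx, y - cy
        out.append((dx * cos_t - dy * sin_t,
                    dx * sin_t + dy * cos_t,
                    z - cz))
    return out


# A point 10 mm along the axis of a workpiece whose axis is turned 90 degrees.
normalized = normalize_points([(100.0, 60.0, 30.0)], (100.0, 50.0, 30.0), 90.0)
```

After this step every workpiece of interest is presented to the learning model with the same orientation and with its center at height 0, which is the sense in which the workpieces are brought to the same conditions.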

The three-dimensional image shown in FIG. 6(a) is itself not the raw output data of the three-dimensional measuring instrument 15. It is, for example, an image obtained with a previously used program that defines the take-out order of the workpieces 12, with the selection threshold lowered; this processing can also be performed by the preprocessing unit 50. Needless to say, the processing performed by the preprocessing unit 50 can vary widely depending on conditions such as the shape of the workpiece 12 and the type of the hand 13.

The output data of the three-dimensional measuring instrument 15 (the three-dimensional map of each workpiece 12), processed in this way by the preprocessing unit 50, is then input to the state quantity observation unit 31. Referring again to FIG. 5, the error calculation unit 33 receives the result (label) output from the operation result acquisition unit 36. Taking y to be the output of the learning model, for example the neural network shown in FIG. 3, the error is regarded as −log(y) when the actual take-out operation of the workpiece 12 succeeded and −log(1−y) when it failed, and processing aims to minimize this error. The inputs to the neural network shown in FIG. 3 are, for example, the preprocessed image data of the workpieces of interest 121 to 123 as shown in FIGS. 6(b) to 6(d), together with the three-dimensional position and posture (x, y, z, w, p, r) of each of those workpieces.
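The error described here is the usual binary cross-entropy for a single sample, with y read as the predicted probability that the take-out succeeds. A minimal sketch follows; the function name and the clamping constant `eps`, which only keeps the logarithm finite, are assumptions and are not part of the specification.

```python
import math


def take_out_error(y, success, eps=1e-12):
    """Error for one take-out attempt given the model output y in [0, 1]."""
    # Clamp y away from 0 and 1 so that log() stays finite.
    y = min(max(y, eps), 1.0 - eps)
    return -math.log(y) if success else -math.log(1.0 - y)


# A confident prediction of success is cheap when the take-out succeeded
# and expensive when it failed.
low_error = take_out_error(0.9, success=True)
high_error = take_out_error(0.9, success=False)
```

Minimizing this error over the recorded result (labeled) data pushes y toward 1 for workpieces that were taken out successfully and toward 0 for those that were not.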

FIG. 7 is a block diagram showing a modification of the robot system shown in FIG. 1. As is clear from a comparison of FIG. 7 with FIG. 1, in the modification of the robot system 10 shown in FIG. 7 the coordinate calculation unit 19 is omitted, and the state quantity observation unit 21 receives only the three-dimensional map from the three-dimensional measuring instrument 15 to observe the state quantities of the robot 14. Needless to say, a component corresponding to the coordinate calculation unit 19 can be provided in the control device 16. The configuration shown in FIG. 7 can also be applied, for example, to the supervised-learning robot system 10' described with reference to FIG. 5. That is, in the robot system 10' shown in FIG. 5, the preprocessing unit 50 can be omitted so that the state quantity observation unit 31 receives only the three-dimensional map from the three-dimensional measuring instrument 15 to observe the state quantities of the robot 14. The embodiments described above can thus be changed and modified in various ways.

As described in detail above, the present embodiment makes it possible to provide a machine learning device, a robot system, and a machine learning method that can learn, without human intervention, the optimum operation of a robot for taking out randomly placed workpieces, including bulk-stacked workpieces. The machine learning devices 20 and 30 of the present invention are not limited to those applying reinforcement learning (for example, Q-learning) or supervised learning; various machine learning algorithms can be applied.

Although embodiments have been described above, all examples and conditions described herein are given to aid understanding of the invention and of the inventive concepts as applied to the technology, and the specifically described examples and conditions are not intended to limit the scope of the invention. Nor do such descriptions in the specification indicate advantages or disadvantages of the invention. Although embodiments of the invention have been described in detail, it should be understood that various changes, substitutions, and modifications can be made without departing from the spirit and scope of the invention.

10, 10' Robot system
11 Box
12 Workpiece
13 Hand unit
14 Robot
15 Three-dimensional measuring instrument
16 Control device
17 Force sensor
18 Support unit
19 Coordinate calculation unit
20, 30 Machine learning device
21, 31 State quantity observation unit
22, 32 Learning unit
23 Reward calculation unit
24 Value function updating unit
25, 35 Decision-making unit
26, 36 Operation result acquisition unit
33 Error calculation unit
34 Learning model updating unit
40 Result (labeled) data recording unit
50 Preprocessing unit

Claims

A machine learning device that learns the operation of a robot (14) for taking out a workpiece (12) with a hand unit (13) from a plurality of randomly placed workpieces (12), including workpieces stacked in bulk, the machine learning device comprising:
a state quantity observation unit (21, 31) that observes output data of a three-dimensional measuring instrument (15), which measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), and state quantities that respectively set the position, posture, and take-out direction of the hand unit (13) for the take-out operation of the workpiece (12) by the hand unit (13);
an operation result acquisition unit (26, 36) that acquires the result of the take-out operation of the robot (14) taking out the workpiece (12) with the hand unit (13); and
a learning unit (22, 32) that receives the state quantities observed by the state quantity observation unit (21, 31) and the determination result of the success or failure of the take-out operation of the robot (14) acquired by the operation result acquisition unit (26, 36), and learns the take-out operation of the workpiece (12),
wherein the learning unit (22) includes:
a reward calculation unit (23) that calculates a reward based on the determination result of the success or failure of the take-out of the workpiece output from the operation result acquisition unit (26); and
a value function updating unit (24) that has a value function determining the value of the take-out operation of the workpiece (12) and updates the value function according to the reward, and
wherein learning is performed that updates, based on the value function, the state variables obtained by the learning unit (22) that respectively set the optimum position, posture, and take-out direction of the hand unit (13), and the hand unit (13) and each drive axis of the robot (14) are operated based on the updated state variables.

A machine learning device that learns the operation of a robot (14) for taking out a workpiece (12) with a hand unit (13) from a plurality of randomly placed workpieces (12), including workpieces stacked in bulk, the machine learning device comprising:
a state quantity observation unit (21, 31) that observes output data of a three-dimensional measuring instrument (15), which measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), and state quantities that respectively set the position, posture, and take-out direction of the hand unit (13) for the take-out operation of the workpiece (12) by the hand unit (13);
an operation result acquisition unit (26, 36) that acquires the result of the take-out operation of the robot (14) taking out the workpiece (12) with the hand unit (13); and
a learning unit (22, 32) that receives the state quantities observed by the state quantity observation unit (21, 31) and the determination result of the success or failure of the take-out operation of the robot (14) acquired by the operation result acquisition unit (26, 36), and learns the take-out operation of the workpiece (12),
wherein the learning unit (32) implements a learning model for learning the take-out operation of the workpiece (12) and includes:
an error calculation unit (33) that calculates an error based on the result (label) output from the operation result acquisition unit (36) and the output of the learning model implemented in the learning unit (32); and
a learning model updating unit (34) that updates the learning model according to the error, and
wherein the hand unit (13) and each drive axis of the robot (14) are operated based on the state variables, obtained by the learning unit (32), that respectively set the optimum position, posture, and take-out direction of the hand unit (13).

The machine learning device according to claim 1 or 2, further comprising a decision-making unit (25, 35) that determines command data to be commanded to the robot (14) based on the state variables, obtained by the learning unit (22, 32), that respectively set the optimum position, posture, and take-out direction of the hand unit (13).

A machine learning device that learns an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the machine learning device comprising:
a state quantity observation unit (21, 31) that observes state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13), and a measurement parameter of the three-dimensional measuring instrument (15);
an operation result acquisition unit (26, 36) that acquires the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
a learning unit (22, 32) that receives the state quantities observed by the state quantity observation unit (21, 31) and the result, acquired by the operation result acquisition unit (26, 36), of determining whether the take-out operation of the robot (14) succeeded or failed, and learns the take-out operation of the workpiece (12),
wherein the learning unit (22) comprises:
a reward calculation unit (23) that calculates a reward based on the result, output from the operation result acquisition unit (26), of determining whether the take-out of the workpiece succeeded or failed; and
a value function updating unit (24) that has a value function determining the value of the take-out operation of the workpiece (12) and updates the value function according to the reward,
and wherein learning is performed that updates, based on the value function, the state variable, obtained by the learning unit (22), that sets the optimum position, posture, and take-out direction of the hand unit (13), and the drive axes of the hand unit (13) and the robot (14) are operated based on the updated state variable.
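
The reward calculation and value function update recited in the claim above can be illustrated with a minimal sketch. It assumes a tabular Q-learning update, which the claim does not mandate; the reward values of +1 and -1, the candidate names in `ACTIONS`, and the parameters `alpha`, `gamma`, and `epsilon` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical discretised candidates for (position, posture, take-out direction).
ACTIONS = ["grasp_A", "grasp_B", "grasp_C"]


def reward_from_result(success: bool) -> float:
    """Reward calculation: positive for a successful take-out, negative otherwise.
    The magnitudes are assumptions, not values taken from the claims."""
    return 1.0 if success else -1.0


def update_value(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on the value function q (a dict)."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy choice of the hand setting with the highest learned value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
```

In this sketch the dictionary `q` plays the role of the value function, and `choose_action` stands in for selecting the hand setting used to drive the robot.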

A machine learning device that learns an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the machine learning device comprising:
a state quantity observation unit (21, 31) that observes state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13), and a measurement parameter of the three-dimensional measuring instrument (15);
an operation result acquisition unit (26, 36) that acquires the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
a learning unit (22, 32) that receives the state quantities observed by the state quantity observation unit (21, 31) and the result, acquired by the operation result acquisition unit (26, 36), of determining whether the take-out operation of the robot (14) succeeded or failed, and learns the take-out operation of the workpiece (12),
wherein the learning unit (32) implements a learning model for learning the take-out operation of the workpiece (12) and comprises:
an error calculation unit (33) that calculates an error based on a label output from the operation result acquisition unit (36) and the output of the learning model implemented in the learning unit (32); and
a learning model updating unit (34) that updates the learning model according to the error,
and wherein the drive axes of the hand unit (13) and the robot (14) are operated based on the state variable, obtained by the learning unit (32), that sets the optimum position, posture, and take-out direction of the hand unit (13).
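
The error calculation and learning model update recited in the claim above can be sketched as follows. The sketch assumes a single-layer logistic model trained by gradient descent on a success/failure label. The claim does not specify the model or the update rule, so the function names, the feature vector, and the learning rate `lr` are all hypothetical.

```python
import math


def predict(weights, features):
    """Learning-model output: estimated probability that the take-out succeeds."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def compute_error(label, output):
    """Error calculation: model output minus the observed label (1 = success)."""
    return output - label


def update_model(weights, features, error, lr=0.5):
    """Model update: one gradient step of the cross-entropy loss,
    whose gradient for a logistic output is error * feature."""
    return [w - lr * error * x for w, x in zip(weights, features)]
```

Repeating `predict`, `compute_error`, and `update_model` over recorded trials moves the model output toward the labels supplied by the operation result acquisition unit.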

The machine learning device according to claim 4 or 5, further comprising
a decision making unit (25, 35) that determines, based on the state variable, obtained by the learning unit (22, 32), that sets the optimum position, posture, and take-out direction of the hand unit (13), command data for commanding the operation of the robot (14) and the measurement parameter of the three-dimensional measuring instrument (15).

The machine learning device according to any one of claims 1 to 6, wherein
the state quantity observation unit (21, 31) further receives three-dimensional position information for each workpiece (12), calculated by a coordinate calculation unit (19) based on the three-dimensional position information of the surfaces of the workpieces (12) output from the three-dimensional measuring instrument (15).

The machine learning device according to claim 7, wherein
the coordinate calculation unit (19) further calculates posture data for each workpiece (12) and outputs the calculated three-dimensional position data and posture data for each workpiece (12).

The machine learning device according to any one of claims 1 to 8, wherein
the operation result acquisition unit (26, 36) acquires the result of the take-out operation of the robot (14) using the output data of the three-dimensional measuring instrument (15).

The machine learning device according to any one of claims 1 to 9, further comprising
a pre-processing unit (50) that processes the output data of the three-dimensional measuring instrument (15), before it is input to the state quantity observation unit (21, 31), into a form that the state quantity observation unit (21, 31) can readily process,
wherein the state quantity observation unit (21, 31) receives the output data of the pre-processing unit (50).

The machine learning device according to claim 10, wherein
the pre-processing unit (50) applies rotation and positional alignment to the three-dimensional position data (x, y, z) and attitude data (w, p, r) included in the output data of the three-dimensional measuring instrument (15), so that the height of the center of each workpiece (12) is made uniform.
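
The rotation and alignment performed by the pre-processing unit can be illustrated with a minimal sketch. It assumes that only the attitude component `r` is compensated, that `r` is a rotation about the vertical axis given in degrees, and that every workpiece center is moved to a common reference height. The function name `align_workpiece` and the parameter `target_height` are hypothetical and do not appear in the claims.

```python
import math


def align_workpiece(points, center, r_deg, target_height=0.0):
    """Rotate surface points by -r_deg about the vertical axis through `center`,
    then shift them so the center sits at (0, 0, target_height)."""
    cx, cy, cz = center
    t = math.radians(-r_deg)
    c, s = math.cos(t), math.sin(t)
    aligned = []
    for x, y, z in points:
        dx, dy = x - cx, y - cy  # offset from the workpiece center
        aligned.append((c * dx - s * dy, s * dx + c * dy, z - cz + target_height))
    return aligned
```

Applying the same function to every workpiece gives the learner inputs with a common orientation and center height, which is one plausible reading of what the claimed alignment achieves.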

The machine learning device according to any one of claims 1 to 11, wherein
the operation result acquisition unit (26, 36) acquires, in addition to the success or failure of taking out the workpiece (12), at least one of the damage state of the workpiece (12), the time and energy required for the hand unit (13) to hold and transport the workpiece (12), and estimates of that time and energy.
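
The additional results named in the claim above could be carried in a single record and, as one possibility, folded into a scalar score for learning. The claim covers only the acquisition of these results. The record layout, the field names, the units, and the weights `w_damage`, `w_time`, and `w_energy` below are therefore assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationResult:
    """Hypothetical record of one take-out trial; optional fields may be absent."""
    success: bool
    damaged: Optional[bool] = None
    time_s: Optional[float] = None    # time to hold and transport, in seconds
    energy_j: Optional[float] = None  # energy to hold and transport, in joules


def score(result, w_damage=1.0, w_time=0.01, w_energy=0.001):
    """Combine the acquired results into one number (weights are assumptions)."""
    value = 1.0 if result.success else -1.0
    if result.damaged:
        value -= w_damage
    if result.time_s is not None:
        value -= w_time * result.time_s
    if result.energy_j is not None:
        value -= w_energy * result.energy_j
    return value
```

With these weights, a successful but damaging, slow, or energy-hungry take-out scores lower than a clean one.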

The machine learning device according to any one of claims 1 to 12, wherein
the machine learning device comprises a neural network.

A robot system (10, 10') comprising:
the machine learning device (20, 30) according to any one of claims 1 to 13;
the robot (14);
the three-dimensional measuring instrument (15); and
a control device (16) that controls each of the robot (14) and the three-dimensional measuring instrument (15).

The robot system according to claim 14, wherein
the robot system (10, 10') comprises a plurality of the robots (14),
the machine learning device (20, 30) is provided for each of the robots (14), and
the plurality of machine learning devices (20) provided for the plurality of robots (14) share or exchange data with each other via a communication medium.
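
The claim above does not state which data the machine learning devices share or how they combine it. As one hypothetical possibility, each device could hold a value table keyed by (state, action), and the tables could be merged by averaging the entries that devices have in common. The function name `merge_value_tables` and the averaging rule are assumptions for illustration only.

```python
def merge_value_tables(tables):
    """Merge per-device value tables (dicts) by averaging each key
    over the devices that actually hold a value for it."""
    merged = {}
    for key in set().union(*tables):
        values = [t[key] for t in tables if key in t]
        merged[key] = sum(values) / len(values)
    return merged
```

Each device would then continue learning from the merged table, so experience gathered by one robot benefits the others.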

The robot system according to claim 15, wherein
the machine learning device (20, 30) resides on a cloud server.

A machine learning method, performed by a machine learning device, for learning an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the method comprising:
observing state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), and a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13);
acquiring the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
receiving the observed state quantities and the acquired result of determining whether the take-out operation of the robot (14) succeeded or failed, and learning the take-out operation of the workpiece (12),
wherein the learning of the take-out operation includes:
calculating a reward based on the acquired result of determining whether the take-out of the workpiece succeeded or failed; and
holding a value function that determines the value of the take-out operation of the workpiece (12), and updating the value function according to the reward,
and wherein the state variable that sets the optimum position, posture, and take-out direction of the hand unit (13), obtained by learning the take-out operation, is updated based on the value function, and the drive axes of the hand unit (13) and the robot (14) are operated based on the updated state variable.

A machine learning method, performed by a machine learning device, for learning an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the method comprising:
observing state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), and a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13);
acquiring the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
receiving the observed state quantities and the acquired result of determining whether the take-out operation of the robot (14) succeeded or failed, and learning the take-out operation of the workpiece (12),
wherein the learning of the take-out operation includes:
implementing a learning model for learning the take-out operation of the workpiece (12);
calculating an error based on a given label and the output of the implemented learning model; and
updating the learning model according to the error,
and wherein the drive axes of the hand unit (13) and the robot (14) are operated based on the state variable, obtained by learning the take-out operation, that sets the optimum position, posture, and take-out direction of the hand unit (13).

A machine learning method, performed by a machine learning device, for learning an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the method comprising:
observing state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13), and a measurement parameter of the three-dimensional measuring instrument (15);
acquiring the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
receiving the observed state quantities and the acquired result of determining whether the take-out operation of the robot (14) succeeded or failed, and learning the take-out operation of the workpiece (12),
wherein the learning of the take-out operation includes:
calculating a reward based on the acquired result of determining whether the take-out of the workpiece succeeded or failed; and
holding a value function that determines the value of the take-out operation of the workpiece (12), and updating the value function according to the reward,
and wherein the state variable that sets the optimum position, posture, and take-out direction of the hand unit (13), obtained by learning the take-out operation, is updated based on the value function, and the drive axes of the hand unit (13) and the robot (14) are operated based on the updated state variable.

A machine learning method, performed by a machine learning device, for learning an operation of a robot (14) for taking out, by a hand unit (13), a workpiece (12) from a plurality of randomly placed workpieces (12), including workpieces in a bulk-stacked state, the method comprising:
observing state quantities including output data of a three-dimensional measuring instrument (15) that measures the plurality of bulk-stacked workpieces (12) and outputs three-dimensional position information of the surfaces of the plurality of workpieces (12), a state variable that sets the position, posture, and take-out direction of the hand unit (13) for the operation of taking out the workpiece (12) by the hand unit (13), and a measurement parameter of the three-dimensional measuring instrument (15);
acquiring the result of the take-out operation of the robot (14) for taking out the workpiece (12) by the hand unit (13); and
receiving the observed state quantities and the acquired result of determining whether the take-out operation of the robot (14) succeeded or failed, and learning the take-out operation of the workpiece (12),
wherein the learning of the take-out operation includes:
implementing a learning model for learning the take-out operation of the workpiece (12);
calculating an error based on a given label and the output of the implemented learning model; and
updating the learning model according to the error,
and wherein the drive axes of the hand unit (13) and the robot (14) are operated based on the state variable, obtained by learning the take-out operation, that sets the optimum position, posture, and take-out direction of the hand unit (13).