JP2018126799A

JP2018126799A - Control device, robot, and robot system

Info

Publication number: JP2018126799A
Application number: JP2017019314A
Authority: JP
Inventors: 公威溝部; Kimii Mizobe; 長谷川　浩; Hiroshi Hasegawa; 浩長谷川; 太郎田中; Taro Tanaka; 國益符; Guoyi Fu; ラッパユーリ; Rappa Yuri; ステファンリーアラン; stephen li Alan
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2017-02-06
Filing date: 2017-02-06
Publication date: 2018-08-16
Also published as: US20180222057A1; CN108393888A

Abstract

PROBLEM TO BE SOLVED: Configuring the various settings of a robot requires advanced know-how and is difficult. SOLUTION: A control device includes: a calculation unit that uses machine learning to calculate an image processing parameter relating to image processing of an image of an object captured by an imaging unit; a detection unit that detects the object on the basis of an image to which the image processing has been applied with the calculated image processing parameter; and a control unit that controls a robot on the basis of a detection result of the object. SELECTED DRAWING: Figure 1

Description

The present invention relates to a control device, a robot, and a robot system.

To make a robot perform work, various settings are required, and conventionally these settings have been made manually.

A technique that uses machine learning to optimize the frequency of tool correction in a machine tool is also known (Patent Document 1).

Patent Document 1: Japanese Patent No. 5969676

Making the various settings of a robot requires advanced know-how and has been difficult.

To solve at least one of the above problems, a control device includes: a calculation unit that uses machine learning to calculate an image processing parameter relating to image processing of an image of an object captured by an imaging unit; a detection unit that detects the object based on an image on which the image processing has been executed with the calculated image processing parameter; and a control unit that controls a robot based on the detection result of the object. With this configuration, an image processing parameter that detects the object more accurately than a manually determined image processing parameter can be calculated with high probability.

Further, the detection unit may be configured to detect the position and orientation of the object. With this configuration, the position and orientation of the object can be detected with high accuracy.

Further, the calculation unit may include a state observation unit that observes, as a state variable, at least an image on which image processing has been executed with the image processing parameter, and a learning unit that learns the image processing parameter based on the image as the state variable. With this configuration, an image processing parameter that detects the object with high accuracy can be calculated easily.

Further, the learning unit may be configured to determine, based on the image as the state variable, an action that changes the image processing parameter, and thereby optimize the image processing parameter. With this configuration, the image processing parameter can be optimized to suit the environment in which the robot is used.

Further, the learning unit may be configured to evaluate the reward for an action based on the quality of the detection result of the object. With this configuration, learning that improves the detection accuracy of the object can be executed.

Further, the learning unit may be configured to evaluate the reward for an action based on the quality of the work that the robot performed based on the detection result of the object. With this configuration, learning that makes the robot's work succeed can be executed.

Further, the calculation unit may be configured to optimize the image processing parameter by repeating the observation of the state variable, the determination of an action according to the state variable, and the evaluation of the reward obtained by the action. With this configuration, the image processing parameter can be optimized automatically.
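The repeated cycle of observing, choosing an action, and evaluating the reward can be sketched as follows. This is only a minimal illustration and not the learning method of the embodiment: the single tunable parameter (a matching threshold), the toy `detection_score` function, and the simple keep-if-better rule are all assumptions made for the example.

```python
import random


def detection_score(threshold: float) -> float:
    """Hypothetical stand-in for the quality of a detection result.

    It peaks at threshold = 0.30; a real system would score an actual
    detection instead.
    """
    return -abs(threshold - 0.30)


def optimize(threshold: float, steps: int = 1000, seed: int = 0) -> float:
    """Repeat observe -> act -> evaluate reward, keeping improvements."""
    rng = random.Random(seed)
    actions = (-0.01, 0.0, 0.01)  # decrease, keep, or increase the parameter
    for _ in range(steps):
        baseline = detection_score(threshold)         # observe the current result
        action = rng.choice(actions)                  # decide an action
        candidate = min(1.0, max(0.0, threshold + action))
        reward = detection_score(candidate) - baseline  # evaluate the reward
        if reward > 0:                                # keep only improvements
            threshold = candidate
    return threshold
```

Calling `optimize(0.8)` moves the hypothetical threshold toward the value where the toy score is highest; replacing `detection_score` with a measure of real detection quality is what would tie the loop to an actual robot.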

Further, the calculation unit may use machine learning to calculate an operation parameter relating to the operation of the robot, and the control unit may control the robot based on the operation parameter. With this configuration, the operating performance of the robot can be improved together with the detection accuracy of the object.

Further, the calculation unit may be configured to calculate the image processing parameter and the operation parameter based on an image of the object that is the work target of the robot, captured by an optical system. With this configuration, the robot can be operated so that the detection accuracy of the object improves.

A perspective view of a robot system.
A functional block diagram of a control device.
A diagram showing parameters.
A diagram showing acceleration/deceleration characteristics.
A flowchart of pickup processing.
A block diagram of the configuration related to the calculation unit.
A diagram showing an example of learning optical parameters.
A diagram showing an example of a multilayer neural network.
A flowchart of learning processing.
A diagram showing an example of learning operation parameters.
A diagram showing an example of learning force control parameters.

Hereinafter, embodiments of the present invention will be described in the following order with reference to the accompanying drawings. In each figure, corresponding components are denoted by the same reference numerals, and redundant description is omitted.
(1) Robot system configuration:
(2) Robot control:
(3) Pickup processing:
(4) Learning processing:
(4-1) Learning of optical parameters:
(4-2) Example of learning optical parameters:
(4-3) Learning of operation parameters:
(4-4) Example of learning operation parameters:
(4-5) Learning of force control parameters:
(4-6) Example of learning force control parameters:
(5) Other embodiments:

(1) Robot system configuration:
FIG. 1 is a perspective view showing robots controlled by a control device according to an embodiment of the present invention. As shown in FIG. 1, the robot system according to this embodiment includes robots 1 to 3. The robots 1 to 3 are six-axis robots each having an end effector, and a different end effector is attached to each of them. That is, an imaging unit 21 is attached to the robot 1, an illumination unit 22 is attached to the robot 2, and a gripper 23 is attached to the robot 3. Here, the imaging unit 21 and the illumination unit 22 are collectively called the optical system.

The robots 1 to 3 are controlled by a control device 40. The control device 40 is communicably connected to the robots 1 to 3 by cables. Components of the control device 40 may be provided in the robot 1. The control device 40 may also consist of a plurality of devices (for example, the learning unit and the control unit described later may be provided in different devices). A teaching device (not shown) can be connected to the control device 40 by a cable or by wireless communication. The teaching device may be a dedicated computer or a general-purpose computer on which a program for teaching the robot 1 is installed. Furthermore, the control device 40 and the teaching device may be configured as a single unit.

The robots 1 to 3 are single-arm robots used with various end effectors attached to their arms, and in this embodiment the robots 1 to 3 have the same arm and axis configuration. In FIG. 1, the reference signs describing the arm and axis configuration are attached to the robot 3. As shown for the robot 3, each of the robots 1 to 3 includes a base T, six arm members A1 to A6, and six joints J1 to J6. The base T is fixed to a workbench. The base T and the six arm members A1 to A6 are connected by the joints J1 to J6. The arm members A1 to A6 and the end effector are movable parts, and the robots 1 to 3 perform various kinds of work by operating these movable parts.

In this embodiment, the joints J2, J3, and J5 are bending joints, and the joints J1, J4, and J6 are torsional joints. A force sensor P and an end effector are mounted on the arm member A6, the most distal member of the arm A. By driving the six-axis arm, each of the robots 1 to 3 can place the end effector at an arbitrary position within the movable range and in an arbitrary orientation (angle).

The end effector of the robot 3 is the gripper 23, which can grip an object W. The end effector of the robot 2 is the illumination unit 22, which can irradiate its irradiation range with light. The end effector of the robot 1 is the imaging unit 21, which can capture an image within its field of view. In this embodiment, a position fixed relative to the end effector of each of the robots 1 to 3 is defined as the tool center point (TCP). The position of the TCP serves as the reference position of the end effector, and a TCP coordinate system is defined as a three-dimensional orthogonal coordinate system whose origin is the TCP and which is fixed relative to the end effector.

The force sensor P is a six-axis force detector. In a sensor coordinate system, which is a three-dimensional orthogonal coordinate system whose origin is a point on the force sensor, the force sensor P detects the magnitudes of the forces parallel to three mutually orthogonal detection axes and the magnitudes of the torques about those three detection axes. Although a six-axis robot is used as an example in this embodiment, the robots may take various forms, and the robots 1 to 3 may differ from one another in form. A force sensor serving as a force detector may also be provided at any one or more of the joints J1 to J5 other than the joint J6.

When the coordinate system that defines the space in which the robots 1 to 3 are installed is called the robot coordinate system, the robot coordinate system is a three-dimensional orthogonal coordinate system defined by an x axis and a y axis that are orthogonal to each other in a horizontal plane and a z axis whose positive direction is vertically upward (see FIG. 1). The negative direction of the z axis roughly coincides with the direction of gravity. The rotation angle about the x axis is denoted Rx, the rotation angle about the y axis Ry, and the rotation angle about the z axis Rz. An arbitrary position in three-dimensional space can be expressed by positions in the x, y, and z directions, and an arbitrary orientation in three-dimensional space can be expressed by rotation angles in the Rx, Ry, and Rz directions. Hereinafter, the term "position" may also mean orientation, and the term "force" may also mean torque.
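As an illustration of how the three rotation angles can describe an orientation, the sketch below builds a rotation matrix from Rx, Ry, and Rz. The text does not state the order in which the rotations are composed; the order R = Rz·Ry·Rx (rotation about the fixed x axis, then y, then z) and the function names are assumptions made for this example.

```python
import math


def rotation_matrix(rx: float, ry: float, rz: float) -> list:
    """Return the 3x3 matrix R = Rz(rz) * Ry(ry) * Rx(rx); angles in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def rotate(r: list, v: list) -> list:
    """Apply the rotation matrix r to the 3-vector v."""
    return [sum(r[i][j] * v[j] for j in range(3)) for i in range(3)]
```

For example, under this convention a rotation of 90 degrees about the z axis maps the x direction onto the y direction.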

In this embodiment, force control that controls the force acting on the robot can be executed; in force control, the force acting on an arbitrary point is controlled so that it becomes a target force. Forces acting on the various parts are defined in a force control coordinate system, which is a three-dimensional orthogonal coordinate system. The target force (including torque) can be expressed as a vector starting at the point of action of the force expressed in the force control coordinate system. Before the learning described later is performed, the starting point of the target force vector is the origin of the force control coordinate system, and the direction of the acting force coincides with one axis direction of the force control coordinate system. Once the learning described later has been performed, however, the starting point of the target force vector may differ from the origin of the force control coordinate system, and the direction of the target force vector may differ from the axis directions of the force control coordinate system.

In this embodiment, the relationships among the various coordinate systems are defined in advance, and coordinate values in the various coordinate systems can be converted into one another. That is, positions and vectors in the TCP coordinate system, the sensor coordinate system, the robot coordinate system, and the force control coordinate system are mutually convertible. For simplicity, the following description assumes that the control device 40 controls the position of the TCP and the force acting on the TCP in the robot coordinate system; however, because the positions of the robots 1 to 3 and the forces acting on them can be defined in various coordinate systems and converted into one another, positions and forces may be defined and controlled in any coordinate system. Of course, coordinate systems other than those described here (for example, an object coordinate system fixed to the object) may also be defined and made convertible.
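One common way to make coordinate systems mutually convertible is to store each relationship as a rigid transform (a rotation matrix plus a translation) and to compose or invert these transforms as needed. The sketch below illustrates that idea; the representation and the names `apply`, `compose`, and `invert` are assumptions, not the data format used by the control device 40.

```python
def apply(transform, point):
    """Map a point from the child frame into the parent frame."""
    r, t = transform
    return [sum(r[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]


def compose(a, b):
    """Return the transform equivalent to applying b first, then a."""
    ra, _ = a
    rb, tb = b
    r = [[sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    return r, apply(a, tb)


def invert(transform):
    """Return the inverse rigid transform (parent frame into child frame)."""
    r, t = transform
    rt = [[r[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    return rt, [-sum(rt[i][j] * t[j] for j in range(3)) for i in range(3)]


# Hypothetical example: TCP frame rotated 90 degrees about z in the robot frame,
# and an object frame offset 0.1 along the TCP z axis.
ROBOT_FROM_TCP = ([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.5, 0.0, 0.3])
TCP_FROM_OBJECT = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.1])
ROBOT_FROM_OBJECT = compose(ROBOT_FROM_TCP, TCP_FROM_OBJECT)
```

With such a chain, a point known in the object coordinate system can be expressed in the robot coordinate system with a single `apply(ROBOT_FROM_OBJECT, point)` call, and `invert` gives the conversion in the opposite direction.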

(2) Robot control:
The robot 1 is a general-purpose robot that can perform various kinds of work once taught, and as shown in FIG. 2 it includes motors M1 to M6 as actuators and encoders E1 to E6 as sensors. Controlling the arm means controlling the motors M1 to M6. The motors M1 to M6 and the encoders E1 to E6 are provided for the joints J1 to J6, respectively, and the encoders E1 to E6 detect the rotation angles of the motors M1 to M6. A power line that supplies electric power is connected to each of the motors M1 to M6, and each power line is provided with an ammeter. The control device 40 can therefore measure the current supplied to each of the motors M1 to M6.

The control device 40 includes hardware resources such as a computer and various software resources stored in a storage unit 44, and can execute programs. In this embodiment, the control device 40 functions as a calculation unit 41, a detection unit 42, and a control unit 43. The hardware resources may consist of a CPU, RAM, ROM, and the like, or may be implemented by an ASIC or the like; various configurations can be adopted.

In this embodiment, the detection unit 42 can execute processing for detecting an object, and the control unit 43 can drive the arms of the robots 1 to 3. The detection unit 42 is connected to the imaging unit 21 and the illumination unit 22, which constitute an optical system 20. The detection unit 42 can control the imaging unit 21 and acquire an image captured by the imaging sensor of the imaging unit 21. The detection unit 42 can also control the illumination unit 22 and change the brightness of its output light.

When an image is output from the imaging unit 21, the detection unit 42 performs template matching processing based on the captured image to detect the position (position and orientation) of the object. That is, the detection unit 42 executes the template matching processing based on template data 44c stored in the storage unit 44. The template data 44c consists of templates for each of a plurality of positions and orientations. Therefore, if a position and orientation is associated with each item of template data 44c by an ID or the like, the position and orientation of the object as seen from the detection unit 42 can be identified from which item of template data 44c matched.

Specifically, the detection unit 42 takes the template data 44c for each of the plurality of positions and orientations as the processing target in turn and compares it with the captured image while changing the size of the template data 44c. The detection unit 42 then detects, as the image of the object, an image region whose difference from the template data 44c is equal to or less than a threshold.
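The procedure above (iterate over the pose templates, vary the template size, accept a region whose difference is at or below a threshold) can be sketched as follows. This is a minimal illustration on small grayscale images stored as nested lists; the nearest-neighbour resizing, the mean absolute difference measure, and all function names are assumptions rather than the processing actually used by the detection unit 42.

```python
def resize(template, scale):
    """Nearest-neighbour resize of a 2-D list by a positive scale factor."""
    h, w = len(template), len(template[0])
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    return [[template[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
             for c in range(nw)] for r in range(nh)]


def mean_abs_diff(image, tmpl, top, left):
    """Mean absolute gray-level difference between tmpl and an image window."""
    h, w = len(tmpl), len(tmpl[0])
    total = sum(abs(image[top + r][left + c] - tmpl[r][c])
                for r in range(h) for c in range(w))
    return total / (h * w)


def match(image, templates, scales, threshold):
    """Search every pose template at every scale.

    templates maps a pose ID to a 2-D list. Returns the best match as
    (difference, pose_id, scale, top, left), or None if no window has a
    difference at or below the threshold.
    """
    best = None
    for pose_id, template in templates.items():
        for scale in scales:
            tmpl = resize(template, scale)
            h, w = len(tmpl), len(tmpl[0])
            for top in range(len(image) - h + 1):
                for left in range(len(image[0]) - w + 1):
                    diff = mean_abs_diff(image, tmpl, top, left)
                    if diff <= threshold and (best is None or diff < best[0]):
                        best = (diff, pose_id, scale, top, left)
    return best
```

The returned pose ID corresponds to the orientation associated with the matching template, while the scale and window position carry the information from which distance and lateral position can be derived.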

When the image of the object is detected, the detection unit 42 identifies the position and orientation of the object based on the predetermined relationships among the coordinate systems and the size of the matched template data 44c. That is, the distance between the imaging unit 21 and the object in the optical axis direction is determined from the size of the template data 44c, and the position in the directions perpendicular to the optical axis is determined from the position of the object detected in the image.
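The text does not give the formulas for these two steps. Under a simple pinhole camera model, which is an assumption here, the apparent size of the object is inversely proportional to its distance, so a template recorded at a known reference distance and matched at scale s suggests a distance of reference_distance / s. The sketch below illustrates that reasoning; every parameter name is hypothetical.

```python
def object_position_in_camera(scale, match_center_px, reference_distance,
                              focal_length_px, principal_point_px):
    """Estimate (x, y, z) of the object in the camera frame.

    scale              -- scale at which the template matched (1.0 = reference size)
    match_center_px    -- (u, v) pixel position of the matched region's center
    reference_distance -- distance at which the template was recorded
    focal_length_px    -- focal length expressed in pixels
    principal_point_px -- (cx, cy) pixel position of the optical axis
    """
    z = reference_distance / scale          # distance along the optical axis
    u, v = match_center_px
    cx, cy = principal_point_px
    x = (u - cx) * z / focal_length_px      # offset perpendicular to the axis
    y = (v - cy) * z / focal_length_px
    return x, y, z
```

For instance, a template recorded at 0.4 m that matches at twice its size would place the object at roughly 0.2 m under this model.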

Therefore, for example, if the optical axis of the imaging sensor of the imaging unit 21 and the two axes on the imaging plane are defined parallel to the respective axes of the TCP coordinate system, the detection unit 42 can identify the position of the object in the TCP coordinate system based on the size of the template data 44c and the position at which the template data 44c matched the image. The detection unit 42 can also identify the orientation of the object in the TCP coordinate system based on the ID of the matched template data 44c. Using the correspondence among the coordinate systems described above, the detection unit 42 can thus identify the position and orientation of the object in an arbitrary coordinate system, for example the robot coordinate system.

The template matching processing only needs to be processing that identifies the position and orientation of the object, and various kinds of processing can be adopted. For example, the difference between the template data 44c and the image may be evaluated by the difference in gradation values or by the difference in image features (for example, image gradients).

The detection unit 42 performs the template matching processing with reference to parameters. That is, various parameters 44a are stored in the storage unit 44, and the parameters 44a include parameters relating to detection by the detection unit 42. FIG. 3 is a diagram showing an example of the parameters 44a. In the example shown in FIG. 3, the parameters 44a include optical parameters, operation parameters, and force control parameters.

The optical parameters are parameters relating to detection by the detection unit 42. The operation parameters and the force control parameters are parameters used when controlling the robots 1 to 3 and are described in detail later. The optical parameters include imaging unit parameters relating to the imaging unit 21, illumination unit parameters relating to the illumination unit 22, and image processing parameters relating to image processing of the image of the object captured by the imaging unit 21.

FIG. 3 shows examples of these parameters. That is, the position at which the imaging unit 21 is placed when imaging the object is defined as the position of the imaging unit and is included in the imaging unit parameters. The imaging unit 21 also has a mechanism that can adjust the exposure time and the aperture, and the exposure time and aperture values used when imaging the object are included in the imaging unit parameters. The position of the imaging unit may be described in various ways; for example, the position of the TCP of the imaging unit 21 may be described in the robot coordinate system.

The detection unit 42 refers to the imaging unit parameters and passes the position of the imaging unit 21 to a position control unit 43a described later. As a result, the position control unit 43a generates a target position Lt and controls the robot 1 based on the target position Lt. The detection unit 42 also refers to the imaging unit parameters and sets the exposure time and aperture of the imaging unit 21. As a result, the imaging unit 21 is ready to capture images with that exposure time and aperture.

Likewise, the position at which the illumination unit 22 is placed when imaging the object is defined as the position of the illumination unit and is included in the illumination unit parameters. The illumination unit 22 has a mechanism that can adjust its brightness, and the brightness value used when imaging the object is included in the illumination unit parameters. The position of the illumination unit may also be described in various ways; for example, the position of the TCP of the illumination unit 22 may be described in the robot coordinate system.

The detection unit 42 refers to the illumination unit parameters and passes the position of the illumination unit 22 to the position control unit 43a described later. As a result, the position control unit 43a generates a target position Lt and controls the robot 2 based on the target position Lt. The detection unit 42 also refers to the illumination unit parameters and sets the brightness of the illumination unit 22. As a result, the illumination unit 22 outputs light of that brightness.

The detection unit 42 refers to the image processing parameters when applying the template matching processing to an image captured by the imaging unit 21. That is, the image processing parameters include an image processing sequence that indicates the order of processing when executing the template matching processing. In this embodiment, the threshold used in the template matching processing is variable, and the current template matching threshold is included in the image processing parameters. Furthermore, the detection unit 42 can execute various kinds of processing before comparing the template data 44c with the image. In FIG. 3, smoothing processing and sharpening processing are shown as examples of such processing, and the intensity of each is included in the image processing parameters.
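The optical parameters described so far can be pictured as a small nested data structure. The sketch below is one possible layout, not the format of the parameters 44a: the class names, field names, units, and default values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ImagingUnitParams:
    position: tuple           # TCP pose (x, y, z, Rx, Ry, Rz) in the robot frame
    exposure_time_ms: float   # unit is an assumption
    aperture_f_number: float  # representation is an assumption


@dataclass
class IlluminationUnitParams:
    position: tuple           # TCP pose of the illumination unit in the robot frame
    brightness: float         # scale (here 0 to 1) is an assumption


@dataclass
class ImageProcessingParams:
    # Order of processing; a step left out of the list is not executed.
    sequence: list = field(default_factory=lambda: ["smooth", "sharpen", "compare"])
    match_threshold: float = 0.1      # template matching threshold
    smoothing_strength: float = 0.5   # intensity of smoothing
    sharpening_strength: float = 0.5  # intensity of sharpening


@dataclass
class OpticalParams:
    imaging: ImagingUnitParams
    illumination: IlluminationUnitParams
    image_processing: ImageProcessingParams = field(default_factory=ImageProcessingParams)
```

Grouping the values this way makes it explicit which settings a learning process would be free to vary: positions, exposure, aperture, brightness, the processing sequence, the threshold, and the two intensities.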

When an image is output from the imaging unit 21, the detection unit 42 determines the order of image processing (including whether each process is executed at all) based on the image processing sequence and executes image processing such as smoothing and sharpening in that order. In doing so, the detection unit 42 executes the smoothing, sharpening, and other image processing at the intensities described in the image processing parameters. When executing the comparison included in the image processing sequence (the comparison between the template data 44c and the image), the detection unit 42 performs the comparison based on the threshold indicated by the image processing parameters.
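A sequence-driven pipeline of this kind can be sketched as follows. The code is an illustration under stated assumptions: smoothing is a blend with a 3x3 box blur, sharpening is a simple unsharp mask, the comparison is a mean absolute difference against a same-sized template, and the parameter dictionary keys are hypothetical.

```python
def box_blur(img):
    """3x3 box blur with the window clipped at the image border."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def smooth(img, strength):
    """Blend the image with its blurred version (strength 0 = unchanged)."""
    blurred = box_blur(img)
    return [[(1 - strength) * p + strength * b for p, b in zip(pr, br)]
            for pr, br in zip(img, blurred)]


def sharpen(img, strength):
    """Unsharp mask: add back the detail removed by blurring."""
    blurred = box_blur(img)
    return [[p + strength * (p - b) for p, b in zip(pr, br)]
            for pr, br in zip(img, blurred)]


OPERATIONS = {"smooth": smooth, "sharpen": sharpen}


def run_sequence(img, template, params):
    """Run the steps named in params["sequence"] in order.

    Returns (processed image, matched), where matched is None if the
    sequence contains no "compare" step.
    """
    matched = None
    for step in params["sequence"]:
        if step == "compare":
            n = len(img) * len(img[0])
            diff = sum(abs(p - t) for pr, tr in zip(img, template)
                       for p, t in zip(pr, tr)) / n
            matched = diff <= params["threshold"]
        else:
            img = OPERATIONS[step](img, params[step + "_strength"])
    return img, matched
```

Because the order and the presence of each step come from the parameters rather than from the code, changing the sequence (for example, dropping the sharpening step) requires no change to the pipeline itself.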

As described above, the detection unit 42 can determine the positions of the imaging unit 21 and the illumination unit 22 based on the optical parameters and operate the robot 1 and the robot 2 accordingly; however, the positions used when driving the robot 1 and the robot 2 may instead be given by the operation parameters or force control parameters described later.

In this embodiment, the control unit 43 includes the position control unit 43a, a force control unit 43b, a contact determination unit 43c, and a servo 43d. In the control unit 43, a correspondence U1 between combinations of the rotation angles of the motors M1 to M6 and the position of the TCP in the robot coordinate system is stored in a storage medium (not shown), and a correspondence U2 among the coordinate systems is defined and stored in a storage medium (not shown). The control unit 43 and the calculation unit 41 described later can therefore convert a vector in an arbitrary coordinate system into a vector in another coordinate system based on the correspondence U2. For example, the control unit 43 and the calculation unit 41 can acquire the force acting on the robots 1 to 3 in the sensor coordinate system based on the output of the force sensor P and convert it into the force acting at the position of the TCP in the robot coordinate system. The control unit 43 and the calculation unit 41 can also convert a target force expressed in the force control coordinate system into a target force at the position of the TCP in the robot coordinate system. Of course, the correspondences U1 and U2 may be stored in the storage unit 44.
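Converting a measured force and torque from the sensor coordinate system to the TCP in the robot coordinate system can be illustrated with standard rigid-body statics: the force is rotated into the robot frame, and the torque is rotated and then corrected for the lever arm between the TCP and the sensor origin. The sketch below assumes that the rotation matrix and the offset vector are available (for example, derived from something like the correspondence U2); the function and argument names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def wrench_at_tcp(force_sensor, torque_sensor, robot_from_sensor_rot, tcp_to_sensor):
    """Express a sensor-frame force and torque at the TCP in the robot frame.

    robot_from_sensor_rot -- 3x3 rotation from sensor axes to robot axes
    tcp_to_sensor         -- vector from the TCP to the sensor origin,
                             expressed in the robot frame
    """
    force = mat_vec(robot_from_sensor_rot, force_sensor)
    torque = mat_vec(robot_from_sensor_rot, torque_sensor)
    lever = cross(tcp_to_sensor, force)  # moment of the force about the TCP
    return force, [torque[i] + lever[i] for i in range(3)]
```

The force itself is unchanged apart from the rotation; only the torque depends on where the reference point is placed.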

By driving the arms, the control unit 43 can control the positions of the various parts that move together with the robots 1 to 3 and the forces acting on those parts. Position control is executed mainly by the position control unit 43a, and force control mainly by the force control unit 43b. The servo 43d can execute servo control: it performs feedback control that brings the rotation angle Da of the motors M1 to M6, indicated by the outputs of the encoders E1 to E6, into agreement with the target angle Dt, which is the control target. That is, the servo 43d can execute PID control in which the servo gains Kpp, Kpi, and Kpd are applied to the deviation between the rotation angle Da and the target angle Dt, the integral of that deviation, and the derivative of that deviation.

Furthermore, the servo 43d can execute PID control in which the servo gains Kvp, Kvi, and Kvd are applied to the deviation between the output produced with the servo gains Kpp, Kpi, and Kpd and the derivative of the rotation angle Da, the integral of that deviation, and the derivative of that deviation. The control by the servo 43d can be executed for each of the motors M1 to M6. Accordingly, the servo gains can be applied to each of the six axes of the robots 1 to 3. In the present embodiment, the control unit 43 can change the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd by outputting a control signal to the servo 43d.
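
The two-stage PID structure described above can be sketched as follows for a single axis. This is an illustrative discrete-time version under assumed conventions (a fixed sample time `dt`, a backward-difference derivative); the class and function names and the gain values are hypothetical and do not come from the patent.

```python
class PID:
    """Discrete PID: gains act on the error, its integral, and its derivative."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cascaded_step(pos_pid, vel_pid, target_angle, angle, angular_velocity, dt):
    """One servo update for a single axis."""
    # Outer loop: deviation between the target angle Dt and the measured angle Da.
    velocity_command = pos_pid.update(target_angle - angle, dt)
    # Inner loop: deviation between the outer-loop output and the derivative of Da.
    return vel_pid.update(velocity_command - angular_velocity, dt)


pos_pid = PID(kp=2.0, ki=0.0, kd=0.0)  # stands in for Kpp, Kpi, Kpd
vel_pid = PID(kp=0.5, ki=0.0, kd=0.0)  # stands in for Kvp, Kvi, Kvd
command = cascaded_step(pos_pid, vel_pid, target_angle=1.0, angle=0.0,
                        angular_velocity=0.0, dt=0.001)
```

In a six-axis arm, one such pair of controllers would be kept per motor, which is consistent with the statement that the control can be executed for each of the motors M1 to M6.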

In addition to the parameters 44a described above, the storage unit 44 stores a robot program 44b for controlling the robots 1 to 3. In the present embodiment, the parameters 44a and the robot program 44b are generated by teaching and stored in the storage unit 44, but they can be modified by the calculation unit 41 described later. The robot program 44b mainly indicates the sequence of work (the order of processes) performed by the robots 1 to 3 and is written as a combination of predefined commands. The parameters 44a are mainly the specific values needed to carry out each process and are written as arguments of the commands.

The parameters 44a for controlling the robots 1 to 3 include operation parameters and force control parameters in addition to the optical parameters described above. The operation parameters relate to the motion of the robots 1 to 3 and, in the present embodiment, are referred to during position control. That is, in the present embodiment, a series of work is divided into a plurality of processes, and the parameters 44a used to carry out each process are generated by teaching. The operation parameters include parameters indicating the start point and end point of each of these processes. The start and end points may be defined in various coordinate systems; in the present embodiment, the start point and end point of the TCP of the robot to be controlled are defined in the robot coordinate system. That is, a translational position and a rotational position are defined for each axis of the robot coordinate system.

The operation parameters also include the acceleration/deceleration characteristics of the TCP in the plurality of processes. The acceleration/deceleration characteristics indicate the period during which the TCP of the robots 1 to 3 moves from the start point to the end point of each process and the speed of the TCP at each time within that period. FIG. 4 shows an example of the acceleration/deceleration characteristics, in which the speed V of the TCP is defined at each time within the period from time t1, when the TCP starts moving at the start point, to time t4, when the TCP reaches the end point. In the present embodiment, the acceleration/deceleration characteristics include a constant-speed period.

The constant-speed period is the period from time t2 to time t3, during which the speed is constant. The TCP accelerates before this period and decelerates after it: it accelerates from time t1 to time t2 and decelerates from time t3 to time t4. The acceleration/deceleration characteristics may also be defined in various coordinate systems; in the present embodiment, they are speeds of the TCP of the robot to be controlled and are defined in the robot coordinate system. That is, a translational speed and a rotational speed (angular velocity) are defined for each axis of the robot coordinate system.
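
As an illustration of such a characteristic, the sketch below evaluates a speed profile with an acceleration period, a constant-speed period, and a deceleration period. The linear ramps are an assumption made here for simplicity (the text only states that the TCP accelerates and decelerates), and the function name and arguments are hypothetical.

```python
def tcp_speed(t, t1, t2, t3, t4, v_const):
    """Speed of a trapezoidal profile.

    Ramps up on [t1, t2], holds v_const on [t2, t3], ramps down on [t3, t4],
    and is zero outside [t1, t4]. Linear ramps are assumed.
    """
    if t <= t1 or t >= t4:
        return 0.0
    if t < t2:
        return v_const * (t - t1) / (t2 - t1)
    if t <= t3:
        return v_const
    return v_const * (t4 - t) / (t4 - t3)


# Sample the profile for one axis with illustrative times and speed.
samples = [tcp_speed(t, 0.0, 1.0, 3.0, 4.0, 2.0) for t in (0.5, 2.0, 3.5, 4.0)]
```

One such profile would be defined per axis of the robot coordinate system, for both translational and rotational speed.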

The operation parameters further include the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd. That is, the control unit 43 can adjust the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd by outputting a control signal to the servo 43d so that they take the values written as operation parameters. In the present embodiment, the servo gains are values set for each of the processes described above, but they may be set for shorter periods through the learning described later or the like.

The force control parameters relate to the force control of the robots 1 to 3 and, in the present embodiment, are referred to during force control. The start point, end point, acceleration/deceleration characteristics, and servo gains are parameters of the same kind as the operation parameters. The start point, end point, and acceleration/deceleration characteristics are defined for translation along and rotation about the three axes of the robot coordinate system, and the servo gains are defined for each of the motors M1 to M6. In force control, however, at least part of the start point and end point may be left undefined (that is, arbitrary). For example, when collision avoidance or profiling control is performed so that the force acting in a certain direction becomes zero, the start point and end point in that direction are not defined, and a state may be defined in which the position can change freely so as to bring the force in that direction to zero.

The force control parameters also include information indicating the force control coordinate system. The force control coordinate system is the coordinate system used to define the target force of force control. Before the learning described later is performed, its origin is the starting point of the target force vector, and one of its axes points in the direction of the target force vector. That is, when the various target forces for force control are defined during teaching, the point of action of the target force in each process of each task is taught. For example, when one point of an object is pressed against another object and the orientation of the object is changed while a constant target force is applied from the object to the other object at their contact point, the point where the object contacts the other object is the point of action of the target force, and a force control coordinate system with that point as its origin is defined. The force control parameters therefore include information for specifying the force control coordinate system, that is, the coordinate system whose origin is the point on which the target force of force control acts and one of whose axes points in the direction of the target force. This parameter can be defined in various ways; for example, it can be defined by data indicating the relationship between the force control coordinate system and another coordinate system (such as the robot coordinate system).

The force control parameters further include the target force. The target force is taught as the force that should act on an arbitrary point in each kind of work and is defined in the force control coordinate system. That is, the target force vector is defined by its starting point and six axis components from the starting point (translational forces along three axes and torques about three axes) and is expressed in the force control coordinate system. By using the relationship between the force control coordinate system and another coordinate system, the target force can be converted into a vector in any coordinate system, for example the robot coordinate system.

The force control parameters further include impedance parameters. That is, in the present embodiment, the force control performed by the force control unit 43b is impedance control. Impedance control realizes a virtual mechanical impedance by means of the motors M1 to M6. The mass that the TCP virtually has is defined as the virtual inertia coefficient m, the viscous resistance that the TCP virtually receives is defined as the virtual viscosity coefficient d, and the spring constant of the elastic force that the TCP virtually receives is defined as the virtual elasticity coefficient k. The impedance parameters are these m, d, and k, and they are defined for translation and rotation with respect to each axis of the robot coordinate system. In the present embodiment, the force control coordinate system, the target force, and the impedance parameters are values set for each of the processes described above, but they may be set for shorter periods through the learning described later or the like.

In the present embodiment, a series of work is divided into a plurality of processes, and the robot program 44b that carries out each process is generated by teaching. The position control unit 43a further subdivides each process indicated by the robot program 44b into micro-processes, one for each minute time ΔT. The position control unit 43a then generates a target position Lt for each micro-process based on the parameters 44a. The force control unit 43b acquires the target force fLt for each process of the series of work based on the parameters 44a.

That is, the position control unit 43a refers to the start point, end point, and acceleration/deceleration characteristics indicated by the operation parameters or the force control parameters and generates, as the target position Lt, the position of the TCP in each micro-process when the TCP moves from the start point to the end point with those acceleration/deceleration characteristics (or, for orientation, when the orientation changes accordingly). The force control unit 43b refers to the target force indicated by the force control parameters for each process and converts it into the target force fLt in the robot coordinate system based on the correspondence U2 between the force control coordinate system and the robot coordinate system. The target force fLt can be converted into a force acting on an arbitrary point. Here, however, the acting force described later is expressed as a force acting on the TCP, and the acting force and the target force fLt are analyzed with an equation of motion, so the description assumes that the target force fLt is converted into a force at the TCP position. Of course, depending on the process, the target force fLt may not be defined, in which case position control without force control is performed.
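
The subdivision of a process into micro-processes can be sketched as follows for one axis. This is a simplified illustration: the function name, the `fraction` callback, and the default constant-speed motion are assumptions introduced here, whereas the text derives the positions from the taught acceleration/deceleration characteristics.

```python
def micro_targets(start, end, duration, delta_t, fraction=lambda s: s):
    """Target position Lt at the end of each micro-process of length delta_t.

    fraction maps normalized time s in [0, 1] to the fraction of the path
    covered; the default corresponds to constant speed, and a profile with
    acceleration and deceleration can be passed instead.
    """
    steps = round(duration / delta_t)
    return [start + (end - start) * fraction((i + 1) / steps)
            for i in range(steps)]


# One axis moving from 0.0 to 10.0 in 1.0 s, split into 0.25 s micro-processes.
targets = micro_targets(start=0.0, end=10.0, duration=1.0, delta_t=0.25)
```

In practice ΔT would be far smaller than in this example, and the same subdivision would be applied to each of the six position and orientation components.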

Here, the letter L represents any one of the directions of the axes that define the robot coordinate system (x, y, z, Rx, Ry, Rz). L also represents the position in the L direction. For example, when L = x, the x-direction component of the target position set in the robot coordinate system is written Lt = xt, and the x-direction component of the target force is written fLt = fxt.

To execute position control and force control, the control unit 43 can acquire the states of the robots 1 to 3. That is, the control unit 43 can acquire the rotation angle Da of the motors M1 to M6 and convert it into the position L (x, y, z, Rx, Ry, Rz) of the TCP in the robot coordinate system based on the correspondence U1. The control unit 43 can also refer to the correspondence U2 and, based on the position L of the TCP and the detected value and position of the force sensor P, convert the force actually acting on the force sensor P into the acting force fL acting on the TCP and specify it in the robot coordinate system.

That is, the force acting on the force sensor P is defined in the sensor coordinate system. The control unit 43 therefore specifies the acting force fL acting on the TCP in the robot coordinate system based on the position L of the TCP in the robot coordinate system, the correspondence U2, and the detected value of the force sensor P. The torque acting on the robot can be calculated from the acting force fL and the distance from the tool contact point (the contact point between the end effector and the workpiece) to the force sensor P, and is specified as the torque component of fL (not shown). The control unit 43 also performs gravity compensation on the acting force fL. Gravity compensation is processing that removes the gravity component from the acting force fL. It can be realized, for example, by examining in advance the gravity component of the acting force fL acting on the TCP for each orientation of the TCP and subtracting that gravity component from the acting force fL.
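
The table-based gravity compensation mentioned above can be sketched as follows. The orientation keys, the table contents, and the function name are hypothetical; the sketch only assumes that a gravity component has been measured in advance for each TCP orientation of interest.

```python
def gravity_compensate(acting_force, posture, gravity_table):
    """Subtract the pre-measured gravity component for the given TCP orientation."""
    gravity = gravity_table[posture]
    return [f - g for f, g in zip(acting_force, gravity)]


# Hypothetical calibration table: gravity component of fL (in newtons)
# measured in advance for each TCP orientation.
gravity_table = {
    "tool_down": [0.0, 0.0, -9.8],
    "tool_sideways": [0.0, -9.8, 0.0],
}

# Raw acting force with the tool pointing down; the result excludes gravity.
compensated = gravity_compensate([1.0, 0.0, -14.8], "tool_down", gravity_table)
```

A real system would need to interpolate between calibrated orientations or compute the gravity term from the tool mass and the current orientation; the discrete lookup is used here only to keep the example short.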

When the acting force fL acting on the TCP, excluding gravity, and the target force fLt that should act on the TCP have been specified, the force control unit 43b acquires a correction amount ΔL by impedance control (hereinafter, the force-derived correction amount ΔL) in a state where an object such as the target object is present at the TCP and a force can act on the TCP. That is, the force control unit 43b refers to the parameters 44a to acquire the target force fLt and the impedance parameters m, d, and k, and substitutes them into the equation of motion (1) to acquire the force-derived correction amount ΔL. The force-derived correction amount ΔL is the amount by which the position L of the TCP should move in order to eliminate the force deviation ΔfL(t) between the target force fLt and the acting force fL when the TCP is subject to the mechanical impedance.

The left side of equation (1) consists of a first term in which the second derivative of the TCP position L is multiplied by the virtual inertia coefficient m, a second term in which the derivative of the TCP position L is multiplied by the virtual viscosity coefficient d, and a third term in which the TCP position L is multiplied by the virtual elasticity coefficient k. The right side of equation (1) is the force deviation ΔfL(t) obtained by subtracting the actual acting force fL from the target force fLt. Differentiation in equation (1) means differentiation with respect to time.
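
Equation (1) itself is not reproduced in this text. From the term-by-term description above, it can plausibly be reconstructed as shown below. The reconstruction assumes that the quantity being differentiated is the force-derived correction amount ΔL, which is the quantity the equation is solved for; the description above refers to it as the TCP position L.

```latex
m\,\Delta\ddot{L}(t) + d\,\Delta\dot{L}(t) + k\,\Delta L(t) = \Delta f_L(t),
\qquad \Delta f_L(t) = f_{Lt} - f_L
\tag{1}
```

Read this way, the equation describes a virtual mass-damper-spring system driven by the force deviation, with one such equation per axis L of the robot coordinate system.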

Once the force-derived correction amount ΔL has been obtained, the control unit 43 converts the operating position in the direction of each axis defining the robot coordinate system into the target angle Dt, which is the target rotation angle of each of the motors M1 to M6, based on the correspondence U1. The servo 43d calculates the drive position deviation De (= Dt − Da) by subtracting the output of the encoders E1 to E6 (the rotation angle Da), which is the actual rotation angle of the motors M1 to M6, from the target angle Dt. The servo 43d refers to the parameters 44a to acquire the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd, and derives the control amount Dc by adding the value obtained by multiplying the drive position deviation De by the servo gains Kpp, Kpi, and Kpd to the value obtained by multiplying the drive speed deviation by the servo gains Kvp, Kvi, and Kvd, where the drive speed deviation is the difference from the drive speed, that is, the time derivative of the actual rotation angle Da. The control amount Dc is specified for each of the motors M1 to M6, and each motor is controlled with its own control amount Dc. The signals with which the control unit 43 controls the motors M1 to M6 are PWM (pulse width modulation) signals.

The mode in which the control amount Dc is derived from the target force fLt based on the equation of motion, as described above, and used to control the motors M1 to M6 is referred to as the force control mode. In a process in a non-contact state, in which components such as the end effector receive no force from the object W, the control unit 43 does not perform force control and controls the motors M1 to M6 with rotation angles derived from the target position by linear computation. The mode in which the motors M1 to M6 are controlled with rotation angles derived from the target position by linear computation is referred to as the position control mode. The control unit 43 can also control the robot 1 in a hybrid mode, in which the rotation angle derived from the target position by linear computation and the rotation angle derived by substituting the target force into the equation of motion are integrated, for example by linear combination, and the motors M1 to M6 are controlled with the integrated rotation angle. These modes are determined in advance by the robot program 44b.

When control is performed in the position control mode or the hybrid mode, the position control unit 43a acquires the target position Lt for each micro-process. Once the target position Lt for each micro-process has been obtained, the control unit 43 converts the operating position in the direction of each axis defining the robot coordinate system into the target angle Dt, which is the target rotation angle of each of the motors M1 to M6, based on the correspondence U1. The servo 43d refers to the parameters 44a to acquire the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd, and derives the control amount Dc based on the target angle Dt. The control amount Dc is specified for each of the motors M1 to M6, and each motor is controlled with its own control amount Dc. As a result, in each process, the TCP moves from the start point to the end point according to the acceleration/deceleration characteristics, passing through the target position Lt of each micro-process.

In the hybrid mode, the control unit 43 specifies the operating position (Lt + ΔL) by adding the force-derived correction amount ΔL to the target position Lt of each micro-process, acquires the target angle Dt based on that operating position, and then acquires the control amount Dc.
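
A minimal single-axis sketch of how the force-derived correction amount and the hybrid operating position could be computed numerically is given below. It assumes that the impedance relation takes the mass-damper-spring form m*a + d*v + k*x = Δf for the correction x = ΔL, and it integrates that relation with a semi-implicit Euler step; the integration scheme, function names, and numerical values are all illustrative assumptions rather than the patent's implementation.

```python
def impedance_step(state, force_deviation, m, d, k, dt):
    """Advance m*a + d*v + k*x = force_deviation by one time step.

    state is (x, v): the correction amount and its time derivative for one
    axis. Semi-implicit Euler: update the velocity first, then the position.
    """
    x, v = state
    a = (force_deviation - d * v - k * x) / m
    v = v + a * dt
    x = x + v * dt
    return (x, v)


def hybrid_operating_position(target_position, correction):
    """Operating position Lt + dL used in the hybrid mode."""
    return target_position + correction


# Illustrative values only: a constant force deviation of 2 N on one axis.
state = (0.0, 0.0)
for _ in range(20000):
    state = impedance_step(state, force_deviation=2.0,
                           m=1.0, d=20.0, k=100.0, dt=0.001)
operating_position = hybrid_operating_position(0.5, state[0])
```

With a constant force deviation, the correction settles at the deviation divided by k, so a stiffer virtual spring yields a smaller positional correction for the same force error.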

The contact determination unit 43c executes a function of determining whether the robots 1 to 3 have come into contact with an object that is not expected in the work. In the present embodiment, the contact determination unit 43c acquires the output of the force sensor P of each of the robots 1 to 3 and, when the output exceeds a predetermined reference value, determines that the robots 1 to 3 have come into contact with an unexpected object. Various kinds of processing may be performed in this case; in the present embodiment, the contact determination unit 43c stops the robots 1 to 3 by setting their control amount Dc to 0. The control amount used for stopping may take various values; for example, the robots 1 to 3 may be operated with a control amount that cancels the immediately preceding control amount Dc.
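
The contact check described above can be sketched as follows. The text only states that the sensor output is compared with a reference value; using the Euclidean norm of the force vector as that output is an assumption made here, and the function names are hypothetical.

```python
import math


def contact_detected(sensor_force, reference_value):
    """True when the magnitude of the force sensor output exceeds the reference value."""
    return math.sqrt(sum(f * f for f in sensor_force)) > reference_value


def apply_contact_check(control_amounts, sensor_force, reference_value):
    """Return the control amounts Dc, zeroed for every motor on unexpected contact."""
    if contact_detected(sensor_force, reference_value):
        return [0.0] * len(control_amounts)
    return control_amounts


# Six motors, a 30 N reading, and a 20 N reference value: the robot is stopped.
stopped = apply_contact_check([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.0, 0.0, 30.0], 20.0)
```

The alternative mentioned in the text, a control amount that cancels the immediately preceding one, would replace the zero list with the negated previous control amounts.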

(3) Pickup processing:
Next, the operation of the robots 1 to 3 in the above configuration will be described. Here, the description takes as an example work in which the object W, illuminated by the illumination unit 22 of the robot 2 and imaged by the imaging unit 21 of the robot 1, is picked up by the gripper 23 of the robot 3. Of course, the work performed by the robots 1 to 3 is not limited to pickup work; the configuration is applicable to various other kinds of work (for example, screw tightening, insertion, drilling, deburring, polishing, assembly, and product inspection). The pickup processing is realized by processing that the detection unit 42 and the control unit 43 execute according to a robot control program written with the commands described above. In the present embodiment, the pickup processing is executed with the object W placed on a workbench.

FIG. 5 shows an example of a flowchart of the pickup processing. When the pickup processing starts, the detection unit 42 acquires an image captured by the imaging unit 21 (step S100). That is, the detection unit 42 refers to the parameters 44a to specify the position of the illumination unit 22 and passes that position to the position control unit 43a. The position control unit 43a then executes position control with the current position of the illumination unit 22 as the start point and the position of the illumination unit 22 indicated by the parameters 44a as the end point, moving the illumination unit 22 to the position indicated by the parameters 44a. Next, the detection unit 42 refers to the parameters 44a to specify the brightness of the illumination unit 22 and controls the illumination unit 22 to set the illumination to that brightness.

Furthermore, the detection unit 42 refers to the parameters 44a to specify the position of the imaging unit 21 and passes that position to the position control unit 43a. The position control unit 43a then executes position control with the current position of the imaging unit 21 as the start point and the position of the imaging unit 21 indicated by the parameters 44a as the end point, moving the imaging unit 21 to the position indicated by the parameters 44a. Next, the detection unit 42 refers to the parameters 44a to specify the exposure time and aperture of the imaging unit 21 and controls the imaging unit 21 to set them accordingly. When the exposure time and aperture have been set, the imaging unit 21 captures an image and outputs it to the detection unit 42, which acquires the image.

Next, the detection unit 42 determines, based on the image, whether detection of the object has succeeded (step S105). That is, the detection unit 42 refers to the parameters 44a to specify the image processing sequence and executes each processing operation in that sequence at the intensity indicated by the parameters 44a. The detection unit 42 also refers to the template data 44c, compares the difference between the template data 44c and the image with a threshold, and determines that detection of the object has succeeded when the difference is equal to or less than the threshold.

If it is not determined in step S105 that detection of the object has succeeded, the detection unit 42 changes at least one of the relative position between the template data 44c and the image and the size of the template data 44c, and repeats the processing from step S100. If, on the other hand, it is determined in step S105 that detection of the object has succeeded, the control unit 43 specifies a control target (step S110).
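
The threshold comparison and the retry over relative positions can be illustrated with the simplified sketch below. It assumes grayscale images stored as nested lists and uses the mean absolute difference as the difference measure; the measure, the exhaustive position search, and the function names are assumptions for illustration, and the change of template size and the re-acquisition of the image mentioned in the text are omitted.

```python
def template_difference(image, template, top, left):
    """Mean absolute difference between the template and the image patch at (top, left)."""
    rows, cols = len(template), len(template[0])
    total = sum(abs(image[top + r][left + c] - template[r][c])
                for r in range(rows) for c in range(cols))
    return total / (rows * cols)


def detect(image, template, threshold):
    """Try each relative position of the template in turn.

    Returns the first (top, left) whose difference is at or below the
    threshold, or None if detection does not succeed at any position.
    """
    rows, cols = len(template), len(template[0])
    for top in range(len(image) - rows + 1):
        for left in range(len(image[0]) - cols + 1):
            if template_difference(image, template, top, left) <= threshold:
                return (top, left)
    return None


image = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0]]
template = [[9, 8],
            [7, 9]]
position = detect(image, template, threshold=0.5)
```

A looser threshold makes detection succeed more often at the cost of false matches, which is one reason the image processing parameters that shape the input image matter for detection accuracy.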

The pickup processing in this example is work in which the gripper 23 of the robot 3 is moved and its orientation changed to match the position and orientation of the object W detected by the detection unit 42, the object W is picked up with the gripper 23, carried to a predetermined position, and released from the gripper 23. The position control unit 43a and the force control unit 43b therefore specify the plurality of processes that make up the series of work based on the robot program 44b.

The step for which the control target is specified is the earliest unprocessed step in the time series. If that step is a force control mode step, the force control unit 43b refers to the force control parameters in the parameters 44a and acquires the force control coordinate system and the target force. Based on the force control coordinate system, the force control unit 43b converts the target force into a target force f_Lt in the robot coordinate system. The force control unit 43b also converts the output of the force sensor P into the acting force f_L acting on the TCP. Further, the force control unit 43b refers to the force control parameters in the parameters 44a and, based on the impedance parameters m, d, and k, acquires the force-derived correction amount ΔL as the control target.
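The way the impedance parameters m, d, and k shape the force-derived correction amount ΔL can be sketched for a single axis. The sketch below assumes the common impedance form m·ΔL'' + d·ΔL' + k·ΔL = f_Lt - f_L and integrates it with a semi-implicit Euler step. The function name, the integration scheme, and the sign convention are illustrative assumptions, not details taken from this description.

```python
def impedance_step(delta, velocity, f_target, f_actual, m, d, k, dt):
    """Advance the force-derived correction for one axis by one time step dt.

    Assumed model: m * acc + d * velocity + k * delta = f_target - f_actual.
    Returns the updated (delta, velocity) pair.
    """
    force_error = f_target - f_actual
    acc = (force_error - d * velocity - k * delta) / m
    velocity = velocity + acc * dt   # semi-implicit Euler: update velocity first
    delta = delta + velocity * dt    # then update the correction with the new velocity
    return delta, velocity
```

With k greater than zero and a constant force error, repeated calls settle toward a correction of (f_Lt - f_L) / k, which is one way to see why the choice of m, d, and k affects both the final correction and how quickly it is reached.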

If the step for which the control target is specified is a position control mode step, the position control unit 43a subdivides the step into micro-steps. The position control unit 43a then refers to the operation parameters in the parameters 44a and, based on the start point, end point, and acceleration/deceleration characteristics, acquires the target position L_t for each micro-step as the control target. If the step is a hybrid mode step, the position control unit 43a subdivides the step into micro-steps, refers to the force control parameters in the parameters 44a, acquires the target position L_t for each micro-step based on the start point, end point, and acceleration/deceleration characteristics, and acquires the force-derived correction amount ΔL based on the force control coordinate system, the target force f_Lt, the impedance parameters, and the acting force f_L. The target position L_t and the force-derived correction amount ΔL are the control targets.
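Subdividing a position control step into micro-steps can be illustrated with a trapezoidal velocity profile along one axis. This is a minimal sketch: the description only states that the start point, end point, and acceleration/deceleration characteristics are used, so the trapezoidal profile, the function name, and the parameters `v_max`, `accel`, and `dt` are assumptions.

```python
import math


def micro_step_targets(start, end, v_max, accel, dt):
    """Return one target position per micro-step of length dt (one axis).

    Assumes v_max, accel and dt are positive. Uses a trapezoidal velocity
    profile, falling back to a triangular one for short moves.
    """
    distance = abs(end - start)
    if distance == 0.0:
        return [end]
    direction = 1.0 if end > start else -1.0
    t_acc = v_max / accel
    if accel * t_acc ** 2 > distance:
        # The move is too short to reach v_max: triangular profile.
        t_acc = math.sqrt(distance / accel)
    v_peak = accel * t_acc
    total = 2.0 * t_acc + (distance - accel * t_acc ** 2) / v_peak
    n_steps = math.ceil(total / dt)
    targets = []
    for i in range(1, n_steps + 1):
        t = min(i * dt, total)
        if t < t_acc:                      # acceleration phase
            s = 0.5 * accel * t * t
        elif t <= total - t_acc:           # constant-velocity phase
            s = 0.5 * accel * t_acc ** 2 + v_peak * (t - t_acc)
        else:                              # deceleration phase
            s = distance - 0.5 * accel * (total - t) ** 2
        targets.append(start + direction * s)
    return targets
```

Each element of the returned list plays the role of a target position L_t for one micro-step; the last element coincides with the end point.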

Once the control target is specified, the servo 43d controls the robot 3 with the current control target (step S115). That is, if the current step is a force control mode or hybrid mode step, the servo 43d refers to the force control parameters in the parameters 44a, specifies the control amount D_c corresponding to the control target based on the servo gains, and controls each of the motors M1 to M6. If the current step is a position control mode step, the servo 43d refers to the operation parameters in the parameters 44a, specifies the control amount D_c corresponding to the control target based on the servo gains, and controls each of the motors M1 to M6.

Next, the control unit 43 determines whether the current step has finished (step S120). This determination may be made using various end determination conditions. For position control, examples include the TCP reaching the target position or the TCP settling at the target position. For force control, examples include the acting force changing from a state in which it matches the target force to at least a specified magnitude or to at most a specified magnitude, or the TCP moving outside a specified range. The former applies, for example, to completion of the gripping operation or of the grip release operation in pickup work. The latter applies, for example, to the moment a drill breaks through in work that drills through an object.

Of course, a step may also be determined to have finished when it is estimated to have failed. In that case, however, the work is preferably stopped or suspended. End determination conditions for judging that a step has failed include, for example, the moving speed or acceleration of the TCP exceeding an upper limit, or a timeout occurring. Various sensors, such as the force sensor P, the imaging unit 21, and other sensors, may be used to determine whether an end determination condition is satisfied.
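The end determination for a position control step, including the failure conditions mentioned above, can be sketched as a small status check. The function name, the three status strings, and the threshold parameters are hypothetical; only the kinds of condition (target reached, speed limit exceeded, timeout) come from the description.

```python
def check_step_status(distance_to_target, tcp_speed, elapsed,
                      position_tolerance, speed_limit, timeout):
    """Classify a position control step as 'failed', 'finished' or 'running'."""
    # Failure conditions are checked first so that the work can be stopped.
    if tcp_speed > speed_limit or elapsed > timeout:
        return "failed"
    # Success condition: the TCP is within tolerance of the target position.
    if distance_to_target <= position_tolerance:
        return "finished"
    return "running"
```

A caller would treat "failed" as a reason to stop or suspend the work rather than move on to the next step.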

If the current step is not determined to have finished in step S120, the control unit 43 executes the processing from step S115 for the next micro-step after a minute time ΔT. That is, if the current step is in position control mode or hybrid mode, the position control unit 43a controls the robot 3 with the target position L_t of the next micro-step as the control target. If the current step is in force control mode or hybrid mode, the force control unit 43b again acquires the acting force f_L based on the output of the force sensor P and controls the robot 3 with the force-derived correction amount ΔL, specified from the latest acting force f_L, as the control target.

If the current step is determined to have finished in step S120, the control unit 43 determines whether the work has finished (step S125). That is, if the step determined to have finished in step S120 is the final step, the control unit 43 determines that the work has finished. If the work is not determined to have finished in step S125, the control unit 43 makes the next step in the work sequence the current step (step S130) and executes the processing from step S110. If the work is determined to have finished in step S125, the control unit 43 ends the pickup process.
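The flow of steps S110 to S130 can be summarized as a nested loop: an outer loop over the steps of the work sequence and an inner loop over micro-steps. The sketch below is a simplified reading in which the three callables and the iteration limit are hypothetical stand-ins for the control target specification, the servo, and the end determination.

```python
def run_work_sequence(steps, specify_control_target, apply_servo,
                      is_step_finished, max_iterations=10_000):
    """Run every step in order; return (completed, log of applied targets)."""
    log = []
    for index, step in enumerate(steps):          # S130: move on to the next step
        iteration = 0
        while True:
            target = specify_control_target(step, iteration)  # S110: control target
            apply_servo(target)                               # S115: drive the robot
            log.append((index, target))
            iteration += 1
            if is_step_finished(step, iteration):             # S120: step finished?
                break
            if iteration >= max_iterations:                   # treat as a timeout
                return False, log
    return True, log                              # S125: the final step has finished
```

For example, with each step given as a list of target positions, `specify_control_target=lambda step, i: step[i]` and `is_step_finished=lambda step, i: i >= len(step)` visit every target once and then report completion.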

(4) Learning process:
As described above, the control device 40 according to this embodiment can control the robots 1 to 3 based on the parameters 44a. In the embodiment described above, the parameters 44a are generated by teaching, but it is difficult to optimize the parameters 44a through manual teaching.

For example, in the detection of the object W by the detection unit 42, even for the same object W, different optical parameters can change various factors, such as the position of the object W, the image and position of the object within the captured image, and the shadows cast on the object W. Changing the optical parameters can therefore change the accuracy with which the detection unit 42 detects the object W. How the detection accuracy changes when the optical parameters are changed is not necessarily obvious.

The operation parameters and force control parameters are used to control the robots 1 to 3, and a robot with multiple degrees of freedom (movable axes), such as the robots 1 to 3, can operate in an extremely large number of patterns. For the robots 1 to 3, the pattern must be chosen so that undesirable behavior such as vibration, abnormal noise, and overshoot does not occur. Furthermore, when various devices are attached as end effectors, the center of gravity of the robots 1 to 3 can change, so the optimal operation parameters and force control parameters can change as well. How the operation of the robots 1 to 3 changes when the operation parameters or force control parameters are changed is not necessarily obvious.

Further, the force control parameters are used when force control is performed in the robots 1 to 3, but for each task performed by the robots 1 to 3, how the operation of the robots changes when the force control parameters are changed is not necessarily obvious. For example, it is difficult to estimate, for every work step, which impedance parameters are optimal in which direction. Consequently, an extremely large number of trials and errors is needed to raise the detection accuracy of the detection unit 42 or to bring out the potential performance of the robots 1 to 3.

However, because performing an extremely large number of trials by hand is difficult, it is difficult to manually achieve a state in which the detection accuracy for the object W is sufficiently high and is estimated to have nearly reached its upper limit, or a state in which the potential performance of the robots 1 to 3 has been brought out (a state in which further improvement in performance, such as required time or power consumption, is difficult). In addition, adjusting the parameters 44a requires an operator who is thoroughly familiar with how changes in the parameters 44a change the detection accuracy and the operation of the robots 1 to 3, and it is difficult for an operator without that familiarity to adjust the parameters 44a. A system that always requires a skilled operator is also very inconvenient.

This embodiment therefore provides a configuration for determining the parameters 44a automatically, without manual determination work. According to this embodiment, it is possible to achieve a state in which small changes to the parameters 44a are estimated not to improve the detection accuracy further (the detection accuracy is estimated to be at a local maximum), or a state in which small changes to the parameters 44a are estimated not to improve the performance of the robots 1 to 3 further (the performance is estimated to be at a local maximum). In this embodiment, these states are called optimized states.

In this embodiment, the control device 40 includes the calculation unit 41 for automatically determining the parameters 44a. The calculation unit 41 can calculate the optical parameters, the operation parameters, and the force control parameters using machine learning. FIG. 6 shows the configuration of the calculation unit 41; it omits part of the configuration shown in FIG. 2 and shows the calculation unit 41 in detail. The storage unit 44 shown in FIG. 6 is the same storage medium as the storage unit 44 shown in FIG. 2, and each figure omits part of the stored information.

The calculation unit 41 includes a state observation unit 41a that observes state variables and a learning unit 41b that learns the parameters 44a based on the observed state variables. In this embodiment, the state observation unit 41a observes, as state variables, the results produced by changing the parameters 44a. To this end, the state observation unit 41a can acquire the control results of the servo 43d, the values of the encoders E1 to E6, the output of the force sensor P, and the image acquired by the detection unit 42 as state variables.

Specifically, the state observation unit 41a observes the current values supplied to the motors M1 to M6 as the control results of the servo 43d. These current values correspond to the torques output by the motors M1 to M6. The outputs of the encoders E1 to E6 are converted into the position of the TCP in the robot coordinate system based on the correspondence U1. The state observation unit 41a can therefore observe the position of the imaging unit 21 for the robot 1, the position of the illumination unit 22 for the robot 2, and the position of the gripper 23 for the robot 3.

The output of the force sensor P is converted, based on the correspondence U2, into the acting force on the TCP in the robot coordinate system. The state observation unit 41a can therefore observe the forces acting on the robots 1 to 3 as state variables. The image acquired by the detection unit 42 is the image captured by the imaging unit 21, and the state observation unit 41a can observe this image as a state variable. The state observation unit 41a can select the state variables to observe as appropriate for the parameters 44a being learned.

The learning unit 41b only needs to be able to optimize the parameters 44a by learning; in this embodiment, it optimizes the parameters 44a by reinforcement learning. Specifically, the learning unit 41b determines, based on the state variables, an action that changes the parameters 44a, and executes that action. Evaluating the reward according to the state after the action reveals the action value of that action. The calculation unit 41 therefore optimizes the parameters 44a by repeating the observation of the state variables, the determination of an action according to those state variables, and the evaluation of the reward obtained by that action.

In this embodiment, the calculation unit 41 can select the parameters to be learned from among the parameters 44a and perform learning on them. The learning of the optical parameters, the learning of the operation parameters, and the learning of the force control parameters can each be executed independently.

(4-1) Learning optical parameters:
FIG. 7 illustrates an example of learning the optical parameters in terms of a reinforcement learning model consisting of an agent and an environment. The agent shown in FIG. 7 corresponds to the function of selecting an action a according to a predetermined policy, and is realized by the learning unit 41b. The environment corresponds to the function of determining the next state s' based on the action a selected by the agent and the current state s, and of determining the immediate reward r based on the action a, the state s, and the state s'; it is realized by the state observation unit 41a and the learning unit 41b.

This embodiment adopts Q-learning, in which the action value function Q(s, a) of an action a in a state s is calculated by repeating a process in which the learning unit 41b selects an action a according to a predetermined policy and the state observation unit 41a updates the state. That is, in this example, the action value function is updated by the following equation (2). Once the action value function Q(s, a) has converged properly, the action a that maximizes Q(s, a) is regarded as the optimal action, and the parameters 44a indicated by that action a are regarded as the optimized parameters.
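Equation (2) itself is not reproduced in this text. Judging from the symbols defined for it (the reward r_{t+1}, the learning rate α, the discount rate γ, and the maximizing action a'), it presumably takes the form of the standard Q-learning update, given here as a reconstruction rather than a quotation:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)
  \tag{2}
```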

Here, the action value function Q(s, a) is the expected value of the return (in this example, the discounted sum of rewards) obtained into the future when the action a is taken in the state s. The reward is r, and the subscript t on the state s, action a, and reward r is a number (called the trial number) indicating one step in the trial process repeated over time; the trial number is incremented when the state changes after an action is determined. Accordingly, the reward r_{t+1} in equation (2) is the reward obtained when the action a_t is selected in the state s_t and the state becomes s_{t+1}. α is the learning rate and γ is the discount rate. Further, a' is the action that maximizes the action value function Q(s_{t+1}, a_{t+1}) among the actions a_{t+1} available in the state s_{t+1}, and max_{a'} Q(s_{t+1}, a') is the action value function maximized by selecting the action a'.
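The roles of the learning rate α, the discount rate γ, and the maximizing action a' can be illustrated with a tabular Q-learning update. This is a minimal sketch of the standard algorithm, assuming a dictionary-based table and hashable states; it is not the DQN-based implementation that the embodiment goes on to use, and the function and argument names are illustrative.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table q and return the new value.

    q maps (state, action) pairs to values; missing entries default to 0.0.
    """
    # Value of the best action a' available in the next state.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)
```

Calling `q_update` once per trial with the observed state, the chosen action, the reward, and the next state moves the stored value toward the discounted return.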

In learning the optical parameters, changing an optical parameter corresponds to determining an action, and action information 44d indicating the parameters to be learned and the available actions is recorded in the storage unit 44 in advance. That is, the optical parameters described as learning targets in the action information 44d are the ones learned. FIG. 7 shows an example in which some of the imaging unit parameters, illumination unit parameters, and image processing parameters among the optical parameters are learning targets.

Specifically, among the imaging unit parameters, the x and y coordinates of the imaging unit 21 are learning targets. In this example, therefore, the z coordinate and the rotation (orientation) about the x, y, and z axes are not learning targets; the imaging unit 21 faces the workbench on which the object W is placed, and its movement within the x-y plane is a learning target. Of course, other imaging unit parameters, such as the orientation or z coordinate of the imaging unit 21, the exposure time, or the aperture, may also be learning targets.

In the example shown in FIG. 7, among the illumination unit parameters, the x and y coordinates of the illumination unit 22 and the brightness of the illumination unit are learning targets. In this example, therefore, the z coordinate and the rotation (orientation) about the x, y, and z axes are not learning targets; the illumination unit 22 faces the workbench on which the object W is placed, and its movement within the x-y plane is a learning target. Of course, other illumination unit parameters, such as the orientation or z coordinate of the illumination unit 22, may also be learning targets.

Further, in the example shown in FIG. 7, among the image processing parameters, the strength of the smoothing process, the strength of the sharpening process, and the template matching threshold are learning targets. In this example, therefore, the image processing sequence is not a learning target, and the order of the image processing applied to the image captured by the imaging unit 21 does not change (of course, an embodiment in which the image processing sequence is a learning target can also be adopted).

In the example shown in FIG. 7, the actions consist of increasing a value by a fixed amount and decreasing a value by a fixed amount. Across the eight parameters shown in FIG. 7, there are therefore 16 possible actions in total (actions a1 to a16). Because the action information 44d indicates the parameters to be learned and the available actions, in the example shown in FIG. 7 the eight illustrated parameters are described in the action information 44d as learning targets. Information for identifying each action (the action ID, the amount of increase or decrease for each action, and so on) is also described in the action information 44d.
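The count of 16 actions follows from pairing each of the eight learned parameters with one increase and one decrease. The sketch below enumerates such an action set; the parameter names and step sizes are hypothetical placeholders, since the description does not give the contents of the action information 44d.

```python
# Hypothetical step size for each of the eight learned optical parameters.
PARAMETER_STEPS = {
    "camera_x": 1.0,
    "camera_y": 1.0,
    "light_x": 1.0,
    "light_y": 1.0,
    "light_brightness": 5.0,
    "smoothing_strength": 0.1,
    "sharpening_strength": 0.1,
    "template_threshold": 0.01,
}


def build_actions(parameter_steps):
    """Return (name, delta) pairs: one increase and one decrease per parameter."""
    actions = []
    for name, step in parameter_steps.items():
        actions.append((name, +step))
        actions.append((name, -step))
    return actions


def apply_action(parameters, action):
    """Return a copy of parameters with the action's delta applied."""
    name, delta = action
    updated = dict(parameters)
    updated[name] += delta
    return updated


ACTIONS = build_actions(PARAMETER_STEPS)
```

The index of an entry in `ACTIONS` can serve as the action ID, and its delta as the amount of increase or decrease.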

In the example shown in FIG. 7, the reward is determined by whether detection of the object W succeeds. That is, after changing the optical parameters as the action a, the learning unit 41b operates the robots 1 and 2 according to those optical parameters and obtains, via the detection unit 42, the image captured by the imaging unit 21. The learning unit 41b then executes the template matching process based on those optical parameters and determines whether detection of the object W succeeded. The learning unit 41b further determines the reward for the action a and the states s and s' according to whether detection succeeded. The reward only needs to be determined based on the success or failure of detection of the object W; for example, a configuration can be adopted that gives a positive reward (for example, +1) for a successful detection and a negative reward (for example, -1) for a failed detection. With this configuration, optimization can be performed so as to raise the detection accuracy for the object.
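The reward rule can be sketched by combining a threshold test on the template difference with the +1 / -1 assignment. The mean absolute difference used here is an assumed difference measure chosen for illustration, and the function names are hypothetical; the description only states that a difference is compared with a threshold.

```python
def mean_abs_difference(template, patch):
    """Mean absolute gradation difference between two equally sized 2D lists."""
    total = 0.0
    count = 0
    for template_row, patch_row in zip(template, patch):
        for t, p in zip(template_row, patch_row):
            total += abs(t - p)
            count += 1
    return total / count


def detection_reward(template, patch, threshold):
    """Return +1.0 if the difference is within the threshold, otherwise -1.0."""
    detected = mean_abs_difference(template, patch) <= threshold
    return 1.0 if detected else -1.0
```

Because the threshold is itself one of the learned parameters, the same image can yield a different reward after an action that changes the threshold.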

The next state s' when the action a is taken in the current state s can be identified by operating the robots 1 and 2 after the parameter change constituting the action a has been made and having the state observation unit 41a observe the state. Note that the robot 3 does not operate during the learning of the optical parameters in this example. In the example shown in FIG. 7, the state variables include the x and y coordinates of the imaging unit 21, the x and y coordinates and brightness of the illumination unit 22, the strength of the smoothing process, the strength of the sharpening process, the template matching threshold, and the image captured by the imaging unit 21.

In this example, therefore, after the action a has been executed, the state observation unit 41a converts the outputs of the encoders E1 to E6 of the robot 1 based on U1 and observes the x and y coordinates of the imaging unit 21. Likewise, after the action a has been executed, the state observation unit 41a converts the outputs of the encoders E1 to E6 of the robot 2 based on U1 and observes the x and y coordinates of the illumination unit 22.

In this embodiment, the brightness of the illumination unit 22 is regarded as adjustable without error through the parameters 44a (or any error is regarded as having no effect), so the state observation unit 41a acquires the illumination brightness contained in the parameters 44a and treats the state variable as having been observed. Of course, the brightness of the illumination unit 22 may instead be measured with a sensor or the like, or observed from the image captured by the imaging unit 21 (for example, from the average gradation value). For the strength of the smoothing process, the strength of the sharpening process, and the template matching threshold as well, the state observation unit 41a acquires the current values by referring to the parameters 44a and treats the state variables as having been observed.

Further, the state observation unit 41a acquires, as a state variable, the image captured by the imaging unit 21 and acquired by the detection unit 42 (the thick frame in FIG. 7). That is, the state observation unit 41a observes, as state variables, the gradation value of each pixel of the image captured by the imaging unit 21 (which may be an image of a region of interest in which the object may be present). The x coordinate of the imaging unit and the like are both actions and states to be observed, whereas the image captured by the imaging unit 21 is not an action. In this sense, the captured image is a state variable that can change in ways that are difficult to estimate directly from changes in the optical parameters. Moreover, because the detection unit 42 detects the object based on this image, the image is a state variable that can directly affect whether detection succeeds. Observing the image as a state variable therefore makes it possible to improve parameters that are difficult to improve manually and to optimize the optical parameters so as to effectively raise the detection accuracy of the detection unit 42.

(4-2) Optical parameter learning example:
Next, an example of learning the optical parameters is described. Information indicating the variables and functions referred to during learning is stored in the storage unit 44 as learning information 44e. That is, the calculation unit 41 adopts a configuration that makes the action value function Q(s, a) converge by repeating the observation of the state variables, the determination of an action according to those state variables, and the evaluation of the reward obtained by that action. In this example, the time-series values of the state variables, actions, and rewards are therefore recorded successively in the learning information 44e during learning.

The action value function Q(s, a) may be calculated by various methods and may be calculated from a large number of trials; this embodiment adopts DQN (Deep Q-Network), one method of calculating Q(s, a) approximately. In DQN, the action value function Q(s, a) is estimated using a multilayer neural network. This example adopts a multilayer neural network that takes the state s as input and outputs the values of the action value function Q(s, a) for each of the N selectable actions.

FIG. 8 schematically shows the multilayer neural network adopted in this example. In FIG. 8, the multilayer neural network takes M state variables (M is an integer of 2 or more) as input and outputs N values of the action value function Q (N is an integer of 2 or more). For example, in the example shown in FIG. 7, M is the sum of the eight state variables, from the x coordinate of the imaging unit through the template matching threshold, and the number of pixels in the captured image, and the values of these M state variables are input to the multilayer neural network. In FIG. 8, the M states at trial number t are denoted s_1t to s_Mt.
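How the M inputs are assembled can be sketched by concatenating the eight scalar state variables with the per-pixel gradation values of the image. The names in `SCALAR_STATE_NAMES` and the list-of-rows image representation are assumptions made for illustration.

```python
# Hypothetical names for the eight scalar state variables of FIG. 7.
SCALAR_STATE_NAMES = [
    "camera_x", "camera_y",
    "light_x", "light_y", "light_brightness",
    "smoothing_strength", "sharpening_strength", "template_threshold",
]


def build_state_vector(parameters, image):
    """Concatenate the eight scalar state variables with all pixel values.

    parameters: dict holding a value for every name in SCALAR_STATE_NAMES.
    image: 2D list of gradation values (rows of pixels).
    """
    scalars = [float(parameters[name]) for name in SCALAR_STATE_NAMES]
    pixels = [float(value) for row in image for value in row]
    return scalars + pixels
```

For an image of height H and width W, the resulting vector has M = 8 + H × W elements.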

N is the number of selectable actions a, and each output of the multilayer neural network is the value of the action value function Q when a specific action a is selected in the input state s. In FIG. 8, the action value functions Q for the actions a_1t to a_Nt selectable at trial number t are shown as Q(s_t, a_1t) to Q(s_t, a_Nt). The s_t in Q is a symbol that collectively represents the input states s_1t to s_Mt. In the example shown in FIG. 7, 16 actions are selectable, so N = 16. Of course, the content and number of the actions a (the value of N) and the content and number of the states s (the value of M) may change according to the trial number t.

The multilayer neural network shown in FIG. 8 is a model in which each node of each layer multiplies the input from the immediately preceding layer (the state s in the first layer) by a weight w, adds a bias b, and, where necessary, passes the result through an activation function to obtain an output, which becomes the input to the next layer. In this example, there are P layers DL (P is an integer of 1 or more), and each layer has a plurality of nodes.
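
The layer-by-layer computation described above can be sketched in plain Python as follows. This is an illustrative forward pass only: the helper names `dense`, `relu`, and `q_values`, the choice of ReLU as the activation function, and the toy weights are assumptions, since the specification does not fix any of them.

```python
def relu(z):
    """Assumed activation function; the specification does not name one."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """One layer: each node computes sum(w * x) + b, then an optional activation."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def q_values(state, layers):
    """Feed the state through every layer; the last layer yields N action values."""
    x = state
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Toy network: M = 3 inputs, one hidden layer of 2 nodes, N = 2 outputs.
layers = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.5], None),
]
q = q_values([1.0, 2.0, 3.0], layers)
```

The weights and biases held in `layers` play the role of θ: they are the values that change during learning, while the layer order and activation function stay fixed.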

The multilayer neural network shown in FIG. 8 is specified by the weight w and bias b in each layer, the activation function, the order of the layers, and so on. Therefore, in the present embodiment, the parameters for specifying the multilayer neural network (the information necessary to obtain an output from an input) are recorded in the storage unit 44 as the learning information 44e. During learning, the variable values among these parameters (for example, the weight w and the bias b) are updated. Here, the parameters of the multilayer neural network that can change in the course of learning are denoted θ. Using θ, the action value functions Q(s_t, a_1t) to Q(s_t, a_Nt) described above can also be written as Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t).

Next, the procedure of the learning process will be described with reference to the flowchart shown in FIG. 9. The learning process for the optical parameters may be performed while the robots 1 and 2 are in operation, or may be executed in advance before actual operation. Here, the learning process is described for a configuration in which it is executed in advance before actual operation (a configuration in which, once θ representing the multilayer neural network has been optimized, that information is saved and used in subsequent operation).

When the learning process is started, the calculation unit 41 initializes the learning information 44e (step S200). That is, the calculation unit 41 specifies the initial value of θ referred to when learning starts. The initial value may be determined by various methods. If no learning has been performed in the past, an arbitrary value, a random value, or the like may be used as the initial value of θ; alternatively, a simulation environment that simulates the robots 1 and 2 and the optical characteristics of the imaging unit 21 and the illumination unit 22 may be prepared, and θ learned or estimated in that environment may be used as the initial value.

If learning has been performed in the past, the learned θ is adopted as the initial value. If learning has been performed in the past on a similar object, θ from that learning may be used as the initial value. The past learning may have been performed by the user using the robots 1 and 2, or by the manufacturer of the robots 1 and 2 before the robots are sold. In the latter case, the manufacturer may prepare a plurality of sets of initial values according to the types of objects and work, and the user may select an initial value set when performing learning. When the initial value of θ is determined, it is stored in the learning information 44e as the current value of θ.

Next, the calculation unit 41 initializes the parameters (step S205). Here, since the optical parameters are the learning target, the calculation unit 41 initializes the optical parameters. That is, the calculation unit 41 converts the outputs of the encoders E1 to E6 of the robot 1 using the correspondence U1 and sets the resulting position of the imaging unit 21 as an initial value. The calculation unit 41 also sets a predetermined initial exposure time (or the latest exposure time if learning has been performed in the past) as the initial value of the exposure time of the imaging unit 21. Furthermore, the calculation unit 41 outputs a control signal to the imaging unit 21 and sets the current aperture value as an initial value.

Furthermore, the calculation unit 41 converts the outputs of the encoders E1 to E6 of the robot 2 using the correspondence U1 and sets the resulting position of the illumination unit 22 as an initial value. The calculation unit 41 also sets a predetermined initial brightness (or the latest brightness if learning has been performed in the past) as the initial value of the brightness of the illumination unit 22. In addition, the calculation unit 41 sets predetermined initial values (or the latest values if learning has been performed in the past) for the smoothing strength, the sharpening strength, the template matching threshold, and the image processing sequence. The initialized parameters are stored in the storage unit 44 as the current parameters 44a.

Next, the state observation unit 41a observes the state variables (step S210). That is, the control unit 43 controls the robots 1 and 2 with reference to the parameters 44a and the robot program 44b. The detection unit 42 executes the detection process for the object W (corresponding to steps S100 and S105 described above) based on the image captured by the imaging unit 21 in the controlled state. The state observation unit 41a then converts the outputs of the encoders E1 to E6 of the robot 1 based on U1 and observes the x and y coordinates of the imaging unit 21. Likewise, it converts the outputs of the encoders E1 to E6 of the robot 2 based on U1 and observes the x and y coordinates of the illumination unit 22. Furthermore, the state observation unit 41a refers to the parameters 44a to acquire the brightness to be set for the illumination unit 22, and treats this value as an observed state variable.

Furthermore, the state observation unit 41a refers to the parameters 44a to acquire the current values of the smoothing strength, the sharpening strength, and the template matching threshold, and treats these values as observed state variables. The state observation unit 41a also acquires the image captured by the imaging unit 21 and obtained by the detection unit 42, and acquires the gradation value of each pixel as a state variable.

Next, the learning unit 41b calculates the action values (step S215). That is, the learning unit 41b refers to the learning information 44e to acquire θ, inputs the latest state variables to the multilayer neural network indicated by the learning information 44e, and calculates the N action value functions Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t).

The latest state variables are the observation results of step S210 in the first execution and the observation results of step S225 in the second and subsequent executions. The trial number t is 0 in the first execution and 1 or more in the second and subsequent executions. If the learning process has not been performed in the past, θ indicated by the learning information 44e has not been optimized, so the values of the action value function Q may be inaccurate; however, the action value function Q is gradually optimized by repeating the processing from step S215 onward. During this repetition, the state s, the action a, and the reward r are stored in the storage unit 44 in association with each trial number t and can be referred to at any time.

Next, the learning unit 41b selects and executes an action (step S220). In the present embodiment, processing is performed in which the action a that maximizes the action value function Q(s, a) is regarded as the optimal action. The learning unit 41b therefore identifies the maximum among the N action value function values Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t) calculated in step S215, and selects the action that gave the maximum value. For example, if Q(s_t, a_Nt; θ_t) is the maximum among the N values Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t), the learning unit 41b selects the action a_Nt.

When an action is selected, the learning unit 41b changes the parameters 44a corresponding to that action. For example, in the example shown in FIG. 7, when the action a1, which increases the x coordinate of the imaging unit by a fixed amount, is selected, the learning unit 41b increases the x coordinate of the imaging unit position indicated by the imaging unit parameters of the optical parameters by that fixed amount. When the parameters 44a have been changed, the control unit 43 controls the robots 1 and 2 with reference to the parameters 44a. The detection unit 42 executes the detection process for the object W based on the image captured by the imaging unit 21 in the controlled state.
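
Steps S215 to S220 amount to picking the index of the largest network output and applying the corresponding parameter change. The following Python sketch illustrates this under stated assumptions: the parameter names, the step sizes, and the four-entry action table are hypothetical stand-ins for the 16 actions of FIG. 7.

```python
# Hypothetical action table: each action changes one parameter by a fixed amount.
ACTIONS = [
    ("camera_x", +1.0),
    ("camera_x", -1.0),
    ("light_brightness", +5.0),
    ("light_brightness", -5.0),
]


def select_action(q_values):
    """Return the index of the action with the largest action value (greedy choice)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def apply_action(params, action_index):
    """Return a copy of the parameters with the selected fixed-amount change applied."""
    name, delta = ACTIONS[action_index]
    updated = dict(params)
    updated[name] += delta
    return updated


params = {"camera_x": 10.0, "light_brightness": 50.0}
chosen = select_action([0.1, 0.7, 0.3, 0.2])
params_after = apply_action(params, chosen)
```

The sketch follows the purely greedy selection described in the text; it does not add any exploration strategy, which the passage above does not mention.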

Next, the state observation unit 41a observes the state variables (step S225). That is, the state observation unit 41a performs the same processing as the observation of the state variables in step S210 and acquires, as state variables, the x and y coordinates of the imaging unit 21, the x and y coordinates of the illumination unit 22, the brightness to be set for the illumination unit 22, the smoothing strength, the sharpening strength, the template matching threshold, and the gradation value of each pixel of the image captured by the imaging unit 21. If the current trial number is t (that is, if the selected action is a_t), the state s acquired in step S225 is s_{t+1}.

Next, the learning unit 41b evaluates the reward (step S230). In this example, the reward is determined based on whether the detection of the object W succeeded. The learning unit 41b therefore acquires the success or failure of the object detection (the success or failure of step S105) from the detection unit 42, and obtains a predetermined positive reward if the detection succeeded and a predetermined negative reward if it failed. If the current trial number is t, the reward r acquired in step S230 is r_{t+1}.

The present embodiment aims to update the action value function Q as shown in Expression (2); to update the action value function Q appropriately, however, the multilayer neural network representing the action value function Q must be optimized (that is, θ must be optimized). For the multilayer neural network shown in FIG. 8 to output the action value function Q properly, teacher data serving as the target of that output is required. In other words, the multilayer neural network is expected to be optimized by improving θ so as to minimize the error between the output of the multilayer neural network and the target.

However, in the present embodiment, nothing is known about the action value function Q until learning is complete, so it is difficult to specify the target. Therefore, in the present embodiment, θ representing the multilayer neural network is improved using an objective function that minimizes the second term of Expression (2), the so-called TD error (temporal difference error). That is, (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_t)) is used as the target, and θ is learned so that the error between the target and Q(s_t, a_t; θ_t) is minimized. However, because the target (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_t)) includes the θ being learned, in the present embodiment the target is fixed over a certain number of trials (for example, it is fixed at the most recently learned θ, or at the initial value of θ during the first round of learning). In the present embodiment, the predetermined number of trials over which the target is fixed is determined in advance.

Because learning is performed on this premise, when the reward has been evaluated in step S230, the learning unit 41b calculates the objective function (step S235). That is, the learning unit 41b calculates an objective function for evaluating the TD error in each trial (for example, a function proportional to the expected value of the square of the TD error, or the sum of the squares of the TD errors). Since the TD error is calculated with the target fixed, if the fixed target is written as (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_-)), the TD error is (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_-) - Q(s_t, a_t; θ_t)). In this expression, the reward r_{t+1} is the reward obtained in step S230 as a result of the action a_t.

Further, max_{a'} Q(s_{t+1}, a'; θ_-) is the maximum of the outputs obtained when the state s_{t+1}, which is obtained in step S225 as a result of the action a_t, is input to the multilayer neural network specified by the fixed θ_-. Q(s_t, a_t; θ_t) is the output corresponding to the action a_t among the outputs obtained when the state s_t before the action a_t was selected is input to the multilayer neural network specified by θ_t at trial number t.
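
The TD error with a fixed target can be illustrated with a short Python sketch. It takes the network outputs as plain lists rather than evaluating a network; the function names, the default discount factor γ = 0.9, and the sample numbers are assumptions chosen for illustration, and the sum of squares is only one of the objective functions the text allows.

```python
def td_target(reward, next_q_fixed, gamma):
    """r_{t+1} + gamma * max over a' of Q(s_{t+1}, a'), using the fixed network's outputs."""
    return reward + gamma * max(next_q_fixed)


def td_error(reward, next_q_fixed, q_current, action, gamma=0.9):
    """Fixed target minus Q(s_t, a_t) taken from the current network's outputs."""
    return td_target(reward, next_q_fixed, gamma) - q_current[action]


def objective(td_errors):
    """Sum of squared TD errors over the recorded trials."""
    return sum(e * e for e in td_errors)


# next_q_fixed: outputs for s_{t+1} from the network with the fixed parameters.
# q_current:    outputs for s_t from the network with the current parameters.
error = td_error(reward=1.0, next_q_fixed=[0.5, 2.0],
                 q_current=[0.2, 1.0], action=1, gamma=0.5)
loss = objective([error, -2.0])
```

Keeping the target on a separate, fixed set of parameters means the value being chased does not move on every update of θ.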

When the objective function has been calculated, the learning unit 41b determines whether learning has ended (step S240). In the present embodiment, a threshold for determining whether the TD error is sufficiently small is determined in advance, and if the objective function is equal to or less than the threshold, the learning unit 41b determines that learning has ended.

If it is not determined in step S240 that learning has ended, the learning unit 41b updates the action values (step S245). That is, the learning unit 41b identifies, based on the partial derivative of the TD error with respect to θ, a change in θ that reduces the objective function, and changes θ accordingly. Of course, θ can be changed by various methods; for example, a gradient descent method such as RMSProp can be employed. Adjustment by a learning rate or the like may also be applied as appropriate. Through the above processing, θ can be changed so that the action value function Q approaches the target.
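
As one possible reading of the RMSProp option mentioned above, the following Python sketch applies the standard RMSProp update to a list of parameters. The hyperparameter values and the toy objective (θ - 3)² are assumptions for illustration; in the embodiment the gradient would instead come from the objective function of step S235.

```python
import math


def rmsprop_step(theta, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: scale each gradient by a running RMS of past gradients."""
    new_theta, new_cache = [], []
    for p, g, c in zip(theta, grads, cache):
        c = decay * c + (1.0 - decay) * g * g
        new_cache.append(c)
        new_theta.append(p - lr * g / (math.sqrt(c) + eps))
    return new_theta, new_cache


# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3).
theta, cache = [0.0], [0.0]
for _ in range(1000):
    grads = [2.0 * (theta[0] - 3.0)]
    theta, cache = rmsprop_step(theta, grads, cache)
```

Here `lr` corresponds to the learning-rate adjustment the text mentions, and `cache` holds the running average of squared gradients that RMSProp uses to normalize each step.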

However, in the present embodiment, because the target is fixed as described above, the learning unit 41b further determines whether to update the target. Specifically, the learning unit 41b determines whether the predetermined number of trials has been performed (step S250), and if so, updates the target (step S255). That is, the learning unit 41b updates the θ referred to when calculating the target to the latest θ. The learning unit 41b then repeats the processing from step S215 onward. If it is not determined in step S250 that the predetermined number of trials has been performed, the learning unit 41b skips step S255 and repeats the processing from step S215 onward.
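
The interplay of steps S245 to S255 can be sketched as a loop that updates θ on every trial but copies it into the fixed target parameters only once per predetermined number of trials. In this Python sketch, `run_trials`, `sync_interval`, and the stand-in `update_theta` callback are hypothetical names; the callback replaces the real observation, reward, and gradient steps.

```python
def run_trials(num_trials, sync_interval, update_theta, theta):
    """Update theta every trial; refresh the fixed target copy every sync_interval trials."""
    theta_fixed = list(theta)  # parameters used to compute the target
    sync_trials = []
    for t in range(1, num_trials + 1):
        theta = update_theta(theta, theta_fixed)   # corresponds to step S245
        if t % sync_interval == 0:                 # corresponds to step S250
            theta_fixed = list(theta)              # corresponds to step S255
            sync_trials.append(t)
    return theta, theta_fixed, sync_trials


# Stand-in update that just increments each parameter, to make the schedule visible.
final_theta, final_fixed, syncs = run_trials(
    num_trials=7, sync_interval=3,
    update_theta=lambda theta, fixed: [p + 1 for p in theta],
    theta=[0],
)
```

After seven trials with an interval of three, the fixed copy lags the current parameters by one trial, which shows that the target only changes at the synchronization points.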

If it is determined in step S240 that learning has ended, the learning unit 41b updates the learning information 44e (step S260). That is, the learning unit 41b records the θ obtained by learning in the learning information 44e as the θ to be referred to during work by the robots 1 and 2 and detection by the detection unit 42. When the learning information 44e containing this θ has been recorded, the detection unit 42 performs the object detection process based on the parameters 44a when the robots 1 and 2 perform work as in steps S100 to S105. In a process in which imaging by the imaging unit 21 is repeated until detection by the detection unit 42 succeeds, the observation of the current state by the state observation unit 41a and the selection of an action by the learning unit 41b are repeated. At this time, of course, the learning unit 41b selects the action a that gives the maximum value among the outputs Q(s, a) calculated with the state as input. When the action a is selected, the parameters 44a are updated to the values corresponding to the state in which the action a has been performed.

With the above configuration, the detection unit 42 can execute the object detection process while selecting the action a that maximizes the action value function Q. The action value function Q has been optimized through many repeated trials in the processing described above. These trials are performed automatically by the calculation unit 41, so a number of trials far larger than could be performed manually can be executed easily. Therefore, according to the present embodiment, the object can be detected with high accuracy at a higher probability than with manually determined optical parameters.

Furthermore, in the present embodiment, the detection unit 42 is configured to detect the position and orientation of the object, so the position and orientation of the object can be detected with high accuracy. Furthermore, according to the present embodiment, the imaging unit parameters, which are optical parameters, can be calculated based on the optimized action value function Q, so the imaging unit 21 can be adjusted to increase the detection accuracy of the object. Likewise, the illumination unit parameters, which are optical parameters, can be calculated based on the optimized action value function Q, so the illumination unit 22 can be adjusted to increase the detection accuracy of the object.

Furthermore, according to the present embodiment, the image processing parameters, which are optical parameters, can be calculated based on the optimized action value function Q, so image processing that increases the detection accuracy of the object can be executed. Moreover, since the action value function Q is optimized automatically, optical parameters that detect the object with high accuracy can be calculated easily. Because the optimization of the action value function Q is performed automatically, the optimal optical parameters can also be calculated automatically.

Furthermore, in the present embodiment, the learning unit 41b determines the action that changes the optical parameters based on the image as a state variable, and optimizes the optical parameters. The optical parameters can therefore be optimized based on images actually captured by the imaging unit 21 in the real environment illuminated by the illumination unit 22, and can thus be optimized for the environment in which the robots 1 and 2 are used.

In the present embodiment, the position of the imaging unit 21 and the position of the illumination unit 22 are included in the actions, and by optimizing the action value function Q based on these actions, the parameters 44a relating to the position of the imaging unit 21 and the position of the illumination unit 22 can be optimized. Therefore, after learning, at least the relative positional relationship between the imaging unit 21 and the illumination unit 22 is idealized. If the object W is placed at a fixed or substantially fixed position on the work table, the positions of the imaging unit 21 and the illumination unit 22 in the robot coordinate system can also be considered to be idealized after learning. Furthermore, in the present embodiment, the image captured by the imaging unit 21 is observed as a state. Therefore, according to the present embodiment, the position of the imaging unit 21 and the position of the illumination unit 22 are idealized for each of the various image states.

(4-3) Learning of operation parameters:
In learning the operation parameters as well, the parameters to be learned can be selected; one example is described here. FIG. 10 is a diagram illustrating an example of learning the operation parameters using the same model as FIG. 7. In this example as well, the action value function Q(s, a) is optimized based on Expression (2). Accordingly, the action a that maximizes the optimized action value function Q(s, a) is regarded as the optimal action, and the parameters 44a indicating that action a are regarded as the optimized parameters.

In learning the operation parameters as well, changing the operation parameters corresponds to determining an action, and action information 44d indicating the parameters to be learned and the actions that can be taken is recorded in the storage unit 44 in advance. That is, the operation parameters described in the action information 44d as learning targets are learned. In FIG. 10, among the operation parameters of the robot 3, the servo gains and the acceleration/deceleration characteristics are learning targets, while the start point and end point of the operation are not. The start point and end point of the operation are teaching positions, but no other positions are taught in the present embodiment. Accordingly, in the present embodiment, the learning targets do not include the teaching positions taught to the robot 3.

Specifically, the servo gains Kpp, Kpi, Kpd, Kvp, Kvi, and Kvd among the operation parameters are defined for each of the motors M1 to M6 and can be increased or decreased for each of the six axes. Therefore, in the present embodiment, each of the six servo gains per axis can be increased or decreased, so 36 actions for increases and 36 actions for decreases, 72 actions in total (actions a1 to a72), can be selected.

Meanwhile, the acceleration/deceleration characteristics among the operation parameters are characteristics such as those shown in FIG. 4 and are defined for each of the motors M1 to M6 (for the six axes). In the present embodiment, the acceleration in the acceleration region, the acceleration in the deceleration region, and the length of the period during which the speed is greater than 0 (t₄ shown in FIG. 4) can be changed. In the present embodiment, the curves in the acceleration and deceleration regions are defined by increasing or decreasing the acceleration; for example, the acceleration after the change indicates the slope at the center of the curve, and the curve around the center changes according to a predetermined rule. Of course, various other methods can be used to adjust the acceleration/deceleration characteristics.

いずれにしても、本実施形態においては、１軸あたり３個の要素（加速域、減速域、期
間）で加減速特性を調整可能であり、各要素に応じた数値（加速度、期間長）を増加また
は減少させることが可能である。従って、増加について１８個の行動、減少についても１
８個の行動、計３６個の行動（行動ａ７３〜ａ１０８）を選択し得る。本実施形態におい
ては、以上のようにして予め定義された行動の選択肢に対応するパラメーターが、行動情
報４４ｄに学習対象として記述される。また、各行動を特定するための情報（行動のＩＤ
、各行動での増減量等）が行動情報４４ｄに記述される。 In any case, in this embodiment, the acceleration/deceleration characteristics can be adjusted through three elements per axis (acceleration region, deceleration region, period), and the numerical value corresponding to each element (acceleration, period length) can be increased or decreased. Therefore, 18 actions for increase and 18 actions for decrease, a total of 36 actions (actions a73 to a108), can be selected. In the present embodiment, the parameters corresponding to the action options defined in advance as described above are described in the action information 44d as learning targets. In addition, information for identifying each action (the action ID, the amount of increase or decrease in each action, and the like) is described in the action information 44d.
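As a non-limiting illustration, the action options described above (72 servo-gain actions and 36 acceleration/deceleration actions) could be enumerated as follows. This is a minimal sketch: the gain labels other than Kpp, the element labels, the `Action` record, and the ordering of the IDs beyond a1 (increase Kpp of motor M1) are assumptions made for the example and are not taken from the action information 44d.

```python
from dataclasses import dataclass
from itertools import product

AXES = [f"M{i}" for i in range(1, 7)]  # motors M1 to M6
# Assumed labels for the six servo gains per axis.
SERVO_GAINS = ["Kpp", "Kpi", "Kpd", "Kvp", "Kvi", "Kvd"]
# Assumed labels for the three acceleration/deceleration elements per axis.
ACCEL_ELEMENTS = ["accel_region", "decel_region", "period"]


@dataclass(frozen=True)
class Action:
    action_id: str  # "a1" .. "a108"
    axis: str       # motor the action applies to
    target: str     # parameter that is changed
    sign: int       # +1 = increase, -1 = decrease


def build_actions():
    """Enumerate increase/decrease actions for every (axis, parameter) pair."""
    actions = []
    for targets in (SERVO_GAINS, ACCEL_ELEMENTS):
        for sign in (+1, -1):
            for axis, target in product(AXES, targets):
                actions.append(Action(f"a{len(actions) + 1}", axis, target, sign))
    return actions
```

With this ordering, the first 72 entries change servo gains and the remaining 36 change acceleration/deceleration elements, which gives N = 108 selectable actions.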

図１０に示す例において、報酬はロボット３が行った作業の良否に基づいて評価される
。すなわち、学習部４１ｂは、行動ａとして動作パラメーターを変化させた後、当該動作
パラメーターによってロボット３を動作させ、検出部４２によって検出された対象物をピ
ックアップする作業を実行する。さらに、学習部４１ｂは、作業の良否を観測し、作業の
良否を評価する。そして、学習部４１ｂは、作業の良否によって行動ａ、状態ｓ、ｓ'の
報酬を決定する。 In the example shown in FIG. 10, the reward is evaluated based on the quality of the work performed by the robot 3. That is, after changing the operation parameters as the action a, the learning unit 41b operates the robot 3 with those operation parameters and has it execute the work of picking up the object detected by the detection unit 42. The learning unit 41b then observes and evaluates the quality of the work, and determines the reward for the action a and the states s and s' according to that quality.

なお、作業の良否は作業の成否（ピックアップ成否等）のみならず、作業の質を含む。
具体的には、学習部４１ｂは、図示しない計時回路に基づいて作業の開始から終了まで（
ステップＳ１１０の開始からステップＳ１２５で終了と判定されるまで）の所要時間を取
得する。そして、学習部４１ｂは、作業の所要時間が基準よりも短い場合に正（例えば＋
１）、作業の所要時間が基準よりも長い場合に負（例えば−１）の報酬を与える。なお、
基準は種々の要素によって特定されて良く、例えば、前回の作業の所要時間であっても良
いし、過去の最短所要時間であっても良いし、予め決められた時間であっても良い。 The quality of the work includes not only the success or failure of the work (such as success or failure of the pickup) but also the quality with which it is performed. Specifically, the learning unit 41b acquires, based on a timing circuit (not shown), the time required from the start to the end of the work (from the start of step S110 until the end is determined in step S125). The learning unit 41b then gives a positive reward (for example, +1) when the time required for the work is shorter than a reference, and a negative reward (for example, -1) when the time required is longer than the reference. The reference may be specified by various factors, and may be, for example, the time required for the previous work, the shortest required time in the past, or a predetermined time.
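The comparison against a reference described above can be sketched as follows, purely for illustration. The class name `Reference`, the mode names, and the choice of a reward of 0 when the measurement equals the reference (or when no reference exists yet) are assumptions; the description only specifies the positive and negative cases.

```python
def sign_reward(measured, reference):
    """+1 if measured is below the reference, -1 if above, 0 if equal (assumed)."""
    if measured < reference:
        return 1
    if measured > reference:
        return -1
    return 0


class Reference:
    """Reference value that is the previous, the best (smallest), or a fixed value."""

    def __init__(self, mode, fixed=None):
        if mode not in ("previous", "best", "fixed"):
            raise ValueError(f"unknown mode: {mode}")
        if mode == "fixed" and fixed is None:
            raise ValueError("fixed mode needs a reference value")
        self.mode = mode
        self.value = fixed

    def score(self, measured):
        """Return the reward for one measurement, then update the reference."""
        reward = 0 if self.value is None else sign_reward(measured, self.value)
        if self.mode == "previous":
            self.value = measured
        elif self.mode == "best" and (self.value is None or measured < self.value):
            self.value = measured
        return reward
```

The same helper applies to any criterion for which a smaller value is better, such as the required time in seconds: `Reference("best").score(elapsed)` rewards a trial that beats the shortest time so far.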

さらに、学習部４１ｂは、作業の各工程において、ロボット３のエンコーダーＥ１〜Ｅ
６の出力をＵ１に基づいて変換してグリッパー２３の位置を取得する。そして、学習部４
１ｂは、各工程の目標位置（終点）と、工程終了の際のグリッパー２３の位置とのずれ量
を取得し、グリッパー２３の位置と目標位置とのずれ量が基準よりも小さい場合に正、基
準よりも大きい場合の負の報酬を与える。なお、基準は種々の要素によって特定されて良
く、例えば、前回のずれ量であっても良いし、過去の最短のずれ量であっても良いし、予
め決められたずれ量であっても良い。 Furthermore, in each process of the work, the learning unit 41b converts the outputs of the encoders E1 to E6 of the robot 3 based on U1 to acquire the position of the gripper 23. The learning unit 41b then acquires the amount of deviation between the target position (end point) of each process and the position of the gripper 23 at the end of the process, and gives a positive reward when the amount of deviation is smaller than a reference and a negative reward when it is larger than the reference. The reference may be specified by various factors, and may be, for example, the previous deviation amount, the smallest deviation amount in the past, or a predetermined deviation amount.

さらに、学習部４１ｂは、作業の各工程において取得したグリッパー２３の位置を、整
定する以前の所定期間にわたって取得し、当該期間における振動強度を取得する。そして
、学習部４１ｂは、当該振動強度の程度が基準よりも小さい場合に正、基準よりも大きい
場合の負の報酬を与える。なお、基準は種々の要素によって特定されて良く、例えば、前
回の振動強度の程度であっても良いし、過去の最小の振動強度の程度であっても良いし、
予め決められた振動強度の程度であっても良い。振動強度の程度は、種々の手法で特定さ
れて良く、目標位置からの乖離の積分値や閾値以上の振動が生じている期間など、種々の
手法を採用可能である。なお、所定期間は、種々の期間とすることが可能であり、工程の
始点から終点にわたる期間であれば、動作中の振動強度による報酬が評価され、工程の終
期が所定期間とされれば、残留振動の強度による報酬が評価される。 Further, the learning unit 41b acquires the position of the gripper 23 in each process of the work over a predetermined period before settling, and acquires the vibration intensity in that period. The learning unit 41b then gives a positive reward when the vibration intensity is smaller than a reference and a negative reward when it is larger than the reference. The reference may be specified by various factors, and may be, for example, the previous vibration intensity, the smallest vibration intensity in the past, or a predetermined vibration intensity. The vibration intensity may be specified by various methods, such as the integral of the deviation from the target position or the length of the period during which vibration at or above a threshold occurs. The predetermined period can be set in various ways: if it extends from the start point to the end point of the process, the reward reflects the vibration intensity during the motion, and if it is set to the final stage of the process, the reward reflects the intensity of the residual vibration.

さらに、学習部４１ｂは、作業の各工程の終期において取得したグリッパー２３の位置
を、整定する以前の所定期間にわたって取得し、当該期間における目標位置からの乖離の
最大値をオーバーシュート量として取得する。そして、学習部４１ｂは、当該オーバーシ
ュート量が基準よりも小さい場合に正、基準よりも大きい場合の負の報酬を与える。なお
、基準は種々の要素によって特定されて良く、例えば、前回のオーバーシュート量の程度
であっても良いし、過去の最小のオーバーシュート量であっても良いし、予め決められた
オーバーシュート量であっても良い。 Further, the learning unit 41b acquires the position of the gripper 23 at the final stage of each process of the work over a predetermined period before settling, and acquires the maximum deviation from the target position in that period as the overshoot amount. The learning unit 41b then gives a positive reward when the overshoot amount is smaller than a reference and a negative reward when it is larger than the reference. The reference may be specified by various factors, and may be, for example, the previous overshoot amount, the smallest overshoot amount in the past, or a predetermined overshoot amount.
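For illustration only, the vibration intensity and the overshoot amount mentioned above could be computed from sampled gripper positions as sketched below. The function names, the fixed sampling interval `dt`, the use of the Euclidean distance as the deviation, and the rectangle-rule integration are assumptions of this sketch rather than details given in the description.

```python
import math


def deviations(positions, target):
    """Euclidean distance of each sampled position from the target position."""
    return [math.dist(p, target) for p in positions]


def vibration_integral(positions, target, dt):
    """Integral of the deviation from the target (rectangle rule)."""
    return sum(deviations(positions, target)) * dt


def vibration_duration(positions, target, dt, threshold):
    """Length of time during which the deviation is at or above the threshold."""
    return sum(1 for d in deviations(positions, target) if d >= threshold) * dt


def overshoot(positions, target):
    """Maximum deviation from the target within the sampled period."""
    return max(deviations(positions, target))
```

Either vibration measure, or the overshoot amount, would then be compared with its reference to obtain a positive or negative reward.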

さらに、本実施形態においては、制御装置４０、ロボット１〜３、作業台等の少なくと
も１カ所に集音装置が取り付けられており、学習部４１ｂは、作業中に集音装置が取得し
た音を示す情報を取得する。そして、学習部４１ｂは、作業中の発生音の大きさが基準よ
りも小さい場合に正、基準よりも大きい場合の負の報酬を与える。なお、基準は種々の要
素によって特定されて良く、例えば、前回の作業または工程の発生音の大きさの程度であ
っても良いし、過去の発生音の大きさの最小値であっても良いし、予め決められた大きさ
であっても良い。また、発生音の大きさは、音圧の最大値で評価されても良いし、所定期
間内の音圧の統計値（平均値等）で評価されても良く、種々の構成を採用可能である。 Furthermore, in this embodiment, a sound collecting device is attached to at least one of the control device 40, the robots 1 to 3, the work table, and the like, and the learning unit 41b acquires information indicating the sound picked up by the sound collecting device during the work. The learning unit 41b then gives a positive reward when the loudness of the sound generated during the work is smaller than a reference and a negative reward when it is larger than the reference. The reference may be specified by various factors, and may be, for example, the loudness of the sound generated in the previous work or process, the minimum loudness of sound generated in the past, or a predetermined loudness. The loudness may be evaluated by the maximum sound pressure or by a statistic (such as the average) of the sound pressure within a predetermined period, and various configurations can be adopted.

現在の状態ｓにおいて行動ａが採用された場合における次の状態ｓ'は、行動ａとして
のパラメーターの変化が行われた後にロボット３を動作させ、状態観測部４１ａが状態を
観測することによって特定可能である。なお、本例にかかる動作パラメーターの学習は、
ロボット１，２による対象物の検出完了後に、ロボット３に関して実行される。 The next state s' that results when the action a is adopted in the current state s can be identified by operating the robot 3 after the parameter change corresponding to the action a has been made and having the state observation unit 41a observe the state. Note that the learning of the operation parameters in this example is executed for the robot 3 after the detection of the object by the robots 1 and 2 is completed.

図１０に示す例において、状態変数には、モーターＭ１〜Ｍ６の電流、エンコーダーＥ
１〜Ｅ６の値、力覚センサーＰの出力が含まれている。従って、状態観測部４１ａは、サ
ーボ４３ｄの制御結果として、モーターＭ１〜Ｍ６に供給される電流値を観測することが
できる。当該電流値は、モーターＭ１〜Ｍ６で出力されるトルクに相当する。また、エン
コーダーＥ１〜Ｅ６の出力は、対応関係Ｕ１に基づいてロボット座標系におけるＴＣＰの
位置に変換される。従って、状態観測部４１ａは、ロボット３が備えるグリッパー２３の
位置情報を観測することになる。 In the example shown in FIG. 10, the state variables include the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P. Therefore, the state observation unit 41a can observe the current values supplied to the motors M1 to M6 as the control result of the servo 43d. These current values correspond to the torques output by the motors M1 to M6. The outputs of the encoders E1 to E6 are converted into the position of the TCP in the robot coordinate system based on the correspondence U1. Therefore, the state observation unit 41a observes the position information of the gripper 23 provided in the robot 3.

力覚センサーＰの出力は積分することによってロボットの位置に変換することができる
。すなわち、状態観測部４１ａは、対応関係Ｕ２に基づいてロボット座標系においてＴＣ
Ｐへの作用力を積分することでＴＣＰの位置を取得する。従って、本実施形態において状
態観測部４１ａは、力覚センサーＰの出力も利用してロボット３が備えるグリッパー２３
の位置情報を観測する。なお、状態は、各種の手法で観測されて良く、上述の変換が行わ
れない値（電流値やエンコーダー、力覚センサーの出力値）が状態として観測されても良
い。 The output of the force sensor P can be converted into the position of the robot by integration. That is, the state observation unit 41a acquires the position of the TCP by integrating the force acting on the TCP in the robot coordinate system based on the correspondence U2. Therefore, in the present embodiment, the state observation unit 41a also uses the output of the force sensor P to observe the position information of the gripper 23 provided in the robot 3. The state may be observed by various methods, and values to which the above-described conversion is not applied (the current values and the output values of the encoders and the force sensor) may be observed as the state.

状態観測部４１ａは、行動であるサーボゲインや加減速特性の調整結果を直接的に観測
しているのではなく、調整の結果、ロボット３で得られた変化をモーターＭ１〜Ｍ６の電
流、エンコーダーＥ１〜Ｅ６の値、力覚センサーＰの出力として観測している。従って、
行動による影響を間接的に観測していることになり、この意味で、本実施形態の状態変数
は、動作パラメーターの変化から直接的に推定することが困難な変化をし得る状態変数で
ある。 The state observation unit 41a does not directly observe the results of adjusting the servo gains and the acceleration/deceleration characteristics, which are the actions; rather, it observes the changes produced in the robot 3 as a result of the adjustment, in the form of the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P. Therefore, the influence of the action is observed indirectly, and in this sense the state variables of this embodiment are state variables whose changes are difficult to estimate directly from the changes in the operation parameters.

また、モーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の値、力覚センサーＰの出
力は、ロボット３の動作を直接的に示しており、当該動作は作業の良否を直接的に示して
いる。従って、状態変数として、モーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の
値、力覚センサーＰの出力を観測することにより、人為的に改善することが困難なパラメ
ーターの改善を行い、効果的に作業の質を高めるように動作パラメーターを最適化するこ
とが可能になる。この結果、人為的に決められた動作パラメーターよりも高性能な動作を
行う動作パラメーターを高い確率で算出することができる。 Further, the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P directly indicate the motion of the robot 3, and that motion directly indicates the quality of the work. Therefore, by observing the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P as state variables, parameters that are difficult to improve manually can be improved, and the operation parameters can be optimized so as to effectively raise the quality of the work. As a result, operation parameters that yield higher-performance motion than manually determined operation parameters can be calculated with high probability.

（４−４）動作パラメーターの学習例：
次に、動作パラメーターの学習例を説明する。学習の過程で参照される変数や関数を示
す情報は、学習情報４４ｅとして記憶部４４に記憶される。すなわち、算出部４１は、状
態変数の観測と、当該状態変数に応じた行動の決定と、当該行動によって得られる報酬の
評価とを繰り返すことによって行動価値関数Ｑ（ｓ，ａ）を収束させる構成が採用されて
いる。そこで、本例において、学習の過程で状態変数と行動と報酬との時系列の値が、順
次、学習情報４４ｅに記録されていく。 (4-4) Example of learning operation parameters:
Next, an example of learning the operation parameters will be described. Information indicating the variables and functions referred to in the learning process is stored in the storage unit 44 as learning information 44e. That is, the calculation unit 41 is configured to make the action value function Q(s, a) converge by repeating the observation of the state variables, the determination of an action according to the state variables, and the evaluation of the reward obtained by the action. Accordingly, in this example, the time-series values of the state variables, the actions, and the rewards are sequentially recorded in the learning information 44e during the learning process.

なお、本実施形態において、動作パラメーターの学習は位置制御モードで実行される。
位置制御モードでの学習を実行するためには、位置制御モードのみで構成される作業がロ
ボット３のロボットプログラム４４ｂとして生成されても良いし、任意のモードを含む作
業がロボット３のロボットプログラム４４ｂとして生成されている状況において、その中
の位置制御モードのみを用いて学習してもよい。 In the present embodiment, the learning of the operation parameters is executed in the position control mode. To execute learning in the position control mode, a work consisting only of the position control mode may be generated as the robot program 44b of the robot 3, or, when a work including arbitrary modes has been generated as the robot program 44b of the robot 3, learning may be performed using only the position control mode portions of it.

行動価値関数Ｑ（ｓ，ａ）は、種々の手法で算出されて良く、多数回の試行に基づいて
算出されても良いが、ここでは、ＤＱＮによって行動価値関数Ｑを最適化する例を説明す
る。行動価値関数Ｑの最適化に利用される多層ニューラルネットワークは、上述の図８に
おいて模式的に示される。図１０に示すような状態が観測される本例であれば、ロボット
３におけるモーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の値、力覚センサーＰの
出力（６軸の出力）が状態であるため、状態ｓの数Ｍ＝１８である。また、図１０に示す
行動が選択され得る本例であれば、１０８個の行動が選択可能であるためＮ＝１０８であ
る。むろん、行動ａの内容や数（Ｎの値）、状態ｓの内容や数（Ｍの値）は試行番号ｔに
応じて変化しても良い。 The action value function Q(s, a) may be calculated by various methods, and may be calculated based on a large number of trials; here, an example in which the action value function Q is optimized by DQN will be described. The multilayer neural network used for optimizing the action value function Q is schematically shown in FIG. 8 described above. In this example, in which the states shown in FIG. 10 are observed, the states are the currents of the motors M1 to M6 of the robot 3, the values of the encoders E1 to E6, and the output of the force sensor P (six-axis output), so the number of states s is M = 18. In addition, in this example, in which the actions shown in FIG. 10 can be selected, 108 actions are selectable, so N = 108. Of course, the contents and number of the actions a (the value of N) and the contents and number of the states s (the value of M) may change according to the trial number t.
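As a rough illustration of the input and output dimensions stated above (M = 18 state values in, N = 108 action values out), a small fully connected network can be sketched as follows. The single hidden layer, its width, the ReLU activation, and the random initialization are assumptions made for this sketch; the actual structure of the network in FIG. 8 is not reproduced here.

```python
import random

M_STATES = 18    # 6 motor currents + 6 encoder values + 6-axis force sensor output
N_ACTIONS = 108  # actions a1 to a108
HIDDEN = 32      # assumed hidden-layer width


def init_theta(seed=0):
    """Create network parameters theta as a list of (weights, biases) per layer."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases

    return [layer(M_STATES, HIDDEN), layer(HIDDEN, N_ACTIONS)]


def q_values(state, theta):
    """Forward pass: map an M-dimensional state to N action values Q(s, a; theta)."""
    x = list(state)
    for index, (weights, biases) in enumerate(theta):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(theta) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x
```

One output is produced per action, so a single forward pass yields Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t) for the observed state.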

本実施形態においても、当該多層ニューラルネットワークを特定するためのパラメータ
ー（入力から出力を得るために必要な情報）が学習情報４４ｅとして記憶部４４に記録さ
れる。ここでも学習の過程で変化し得る多層ニューラルネットワークのパラメーターをθ
と表記する。当該θを使用すると、上述の行動価値関数Ｑ（ｓ_t，ａ_1t）〜Ｑ（ｓ_t，ａ_Nt
）は、Ｑ（ｓ_t，ａ_1t；θ_t）〜Ｑ（ｓ_t，ａ_Nt；θ_t）とも表記できる。 Also in the present embodiment, the parameters for specifying the multilayer neural network (the information necessary for obtaining the output from the input) are recorded in the storage unit 44 as learning information 44e. Here too, the parameters of the multilayer neural network that can change during the learning process are denoted by θ. Using θ, the action value functions Q(s_t, a_1t) to Q(s_t, a_Nt) described above can also be written as Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t).

次に、図９に示すフローチャートに沿って学習処理の手順を説明する。動作パラメータ
ーの学習処理は、ロボット３の運用過程において実施されても良いし、実運用の前に事前
に学習処理が実行されてもよい。ここでは、実運用の前に事前に学習処理が実行される構
成（多層ニューラルネットワークを示すθが最適化されると、その情報が保存され、次回
以降の運用で利用される構成）に従って学習処理を説明する。 Next, the procedure of the learning process will be described with reference to the flowchart shown in FIG. 9. The learning process for the operation parameters may be performed during operation of the robot 3, or may be executed in advance before actual operation. Here, the learning process is described for the configuration in which it is executed in advance before actual operation (a configuration in which, once θ representing the multilayer neural network has been optimized, that information is saved and used in subsequent operation).

学習処理が開始されると、算出部４１は、学習情報４４ｅを初期化する（ステップＳ２
００）。すなわち、算出部４１は、学習を開始する際に参照されるθの初期値を特定する
。初期値は、種々の手法によって決められて良く、過去に学習が行われていない場合にお
いては、任意の値やランダム値等がθの初期値となっても良いし、ロボット３や対象物を
模擬するシミュレーション環境を準備し、当該環境に基づいて学習または推定したθを初
期値としてもよい。 When the learning process is started, the calculation unit 41 initializes the learning information 44e (step S200). That is, the calculation unit 41 specifies the initial value of θ to be referred to when learning starts. The initial value may be determined by various methods. If no learning has been performed in the past, an arbitrary value, a random value, or the like may be used as the initial value of θ, or a simulation environment that simulates the robot 3 and the object may be prepared, and θ learned or estimated based on that environment may be used as the initial value.

過去に学習が行われた場合は、当該学習済のθが初期値として採用される。また、過去
に類似の対象についての学習が行われた場合は、当該学習におけるθが初期値とされても
良い。過去の学習は、ロボット３を用いてユーザーが行ってもよいし、ロボット３の製造
者がロボット３の販売前に行ってもよい。この場合、製造者は、対象物や作業の種類に応
じて複数の初期値のセットを用意しておき、ユーザーが学習する際に初期値を選択する構
成であっても良い。θの初期値が決定されると、当該初期値が現在のθの値として学習情
報４４ｅに記憶される。 When learning has been performed in the past, the learned θ is adopted as an initial value. Further, when learning is performed on a similar target in the past, θ in the learning may be set as an initial value. The past learning may be performed by the user using the robot 3, or may be performed by the manufacturer of the robot 3 before the robot 3 is sold. In this case, the manufacturer may prepare a plurality of sets of initial values according to the object and the type of work, and select the initial values when the user learns. When the initial value of θ is determined, the initial value is stored in the learning information 44e as the current value of θ.

次に、算出部４１は、パラメーターを初期化する（ステップＳ２０５）。ここでは、動
作パラメーターが学習対象であるため、算出部４１は、動作パラメーターを初期化する。
すなわち、学習が行われていない状態であれば、算出部４１は、教示によって生成された
パラメーター４４ａに含まれる動作パラメーターを初期値として設定する。過去に何らか
の学習が行われた状態であれば、算出部４１は、学習の際に最後に利用されていたパラメ
ーター４４ａに含まれる動作パラメーターを初期値として設定する。 Next, the calculation unit 41 initializes parameters (step S205). Here, since the operation parameter is a learning target, the calculation unit 41 initializes the operation parameter.
That is, if learning is not performed, the calculation unit 41 sets an operation parameter included in the parameter 44a generated by teaching as an initial value. If any learning has been performed in the past, the calculation unit 41 sets an operation parameter included in the parameter 44a used last in the learning as an initial value.

次に、状態観測部４１ａは、状態変数を観測する（ステップＳ２１０）。すなわち、制
御部４３は、パラメーター４４ａおよびロボットプログラム４４ｂを参照してロボット３
を制御する（上述のステップＳ１１０〜Ｓ１３０に相当）。この後、状態観測部４１ａは
、モーターＭ１〜Ｍ６に供給される電流値を観測する。また、状態観測部４１ａは、エン
コーダーＥ１〜Ｅ６の出力を取得し、対応関係Ｕ１に基づいてロボット座標系におけるＴ
ＣＰの位置に変換する。さらに、状態観測部４１ａは、力覚センサーＰの出力を積分し、
ＴＣＰの位置を取得する。 Next, the state observation unit 41a observes the state variables (step S210). That is, the control unit 43 controls the robot 3 with reference to the parameters 44a and the robot program 44b (corresponding to steps S110 to S130 described above). Thereafter, the state observation unit 41a observes the current values supplied to the motors M1 to M6. The state observation unit 41a also acquires the outputs of the encoders E1 to E6 and converts them into the position of the TCP in the robot coordinate system based on the correspondence U1. Furthermore, the state observation unit 41a integrates the output of the force sensor P to acquire the position of the TCP.

行動が選択されると、学習部４１ｂは、当該行動に対応するパラメーター４４ａを変化
させる。例えば、図１０に示す例において、モーターＭ１のサーボゲインＫppを一定値増
加させる行動ａ１が選択された場合、学習部４１ｂは、動作パラメーターが示すモーター
Ｍ１のサーボゲインＫppの値を一定値増加させる。パラメーター４４ａの変化が行われる
と、制御部４３は、当該パラメーター４４ａを参照してロボット３を制御し、一連の作業
を実行させる。なお、本実施形態においては、行動選択のたびに一連の作業が実行される
が、行動選択のたびに一連の作業の一部が実行される構成（一連の作業を構成する複数の
工程の少なくとも１工程が実行される構成）であっても良い。 When an action is selected, the learning unit 41b changes the parameter 44a corresponding to the action. For example, in the example shown in FIG. 10, when the action a1 that increases the servo gain Kpp of the motor M1 by a constant value is selected, the learning unit 41b increases the value of the servo gain Kpp of the motor M1 indicated by the operation parameters by that constant value. When the parameter 44a has been changed, the control unit 43 controls the robot 3 with reference to the parameter 44a and has it execute a series of works. In the present embodiment, the series of works is executed each time an action is selected, but a configuration in which only part of the series of works is executed each time an action is selected (a configuration in which at least one of the plurality of processes constituting the series of works is executed) may also be used.
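The parameter change for a selected action can be sketched as below, purely as an illustration. The nested dictionary layout of the parameters, the function name `apply_action`, and the step size are assumptions; the description only states that the value is increased or decreased by a constant amount.

```python
def apply_action(params, axis, target, sign, step):
    """Return a copy of params with params[axis][target] changed by sign * step."""
    updated = {name: dict(values) for name, values in params.items()}
    updated[axis][target] += sign * step
    return updated


# Example: action a1 increases the servo gain Kpp of motor M1 by a constant value.
params = {"M1": {"Kpp": 10.0}}
after_a1 = apply_action(params, "M1", "Kpp", +1, 0.5)  # 0.5 is an assumed step
```

Returning a copy rather than modifying the input keeps the previous parameter set available, which can be convenient if a trial needs to be compared with or rolled back to the earlier values.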

次に、状態観測部４１ａは、状態変数を観測する（ステップＳ２２５）。すなわち、状
態観測部４１ａは、ステップＳ２１０における状態変数の観測と同様の処理を行って、状
態変数として、モーターＭ１〜Ｍ６に供給される電流値、エンコーダーＥ１〜Ｅ６の出力
に基づいて特定されるＴＣＰの位置、力覚センサーＰの出力に基づいて特定されるＴＣＰ
の位置を取得する。なお、現在の試行番号がｔである場合（選択された行動がａ_tである
場合）、ステップＳ２２５で取得される状態ｓはｓ_t+1である。 Next, the state observation unit 41a observes the state variables (step S225). That is, the state observation unit 41a performs the same processing as the observation of the state variables in step S210, and acquires, as the state variables, the current values supplied to the motors M1 to M6, the position of the TCP specified based on the outputs of the encoders E1 to E6, and the position of the TCP specified based on the output of the force sensor P. When the current trial number is t (when the selected action is a_t), the state s acquired in step S225 is s_{t+1}.

次に、学習部４１ｂは、報酬を評価する（ステップＳ２３０）。すなわち、学習部４１
ｂは、図示しない計時回路に基づいて作業の開始から終了までの所要時間を取得し、作業
の所要時間が基準よりも短い場合に正、作業の所要時間が基準よりも長い場合に負の報酬
を取得する。さらに、学習部４１ｂは、作業の各工程の終了段階におけるグリッパー２３
の位置を取得し、各工程の目標位置とのずれ量を取得する。そして、学習部４１ｂは、グ
リッパー２３の位置と目標位置とのずれ量が基準よりも小さい場合に正、基準よりも大き
い場合の負の報酬を取得する。一連の作業が複数の工程で構成される場合、各工程の報酬
の和が取得されても良いし、統計値（平均値等）が取得されても良い。 Next, the learning unit 41b evaluates the reward (step S230). That is, the learning unit 41b acquires the time required from the start to the end of the work based on a timing circuit (not shown), and obtains a positive reward when the time required for the work is shorter than the reference and a negative reward when it is longer than the reference. Further, the learning unit 41b acquires the position of the gripper 23 at the end of each process of the work, and acquires the amount of deviation from the target position of each process. The learning unit 41b then obtains a positive reward when the amount of deviation between the position of the gripper 23 and the target position is smaller than the reference and a negative reward when it is larger than the reference. When the series of works consists of a plurality of processes, the sum of the rewards of the processes may be acquired, or a statistic (such as the average) may be acquired.
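When a series of works consists of several processes, the per-process rewards can be combined as a sum or as a statistic, as noted above. A minimal sketch follows; the function name and the `how` argument are hypothetical, and the mean is used as the example statistic.

```python
from statistics import mean


def aggregate_rewards(process_rewards, how="sum"):
    """Combine per-process rewards into one reward for the series of works."""
    if how == "sum":
        return sum(process_rewards)
    if how == "mean":
        return mean(process_rewards)
    raise ValueError(f"unknown aggregation: {how}")
```

For example, three processes rewarded +1, -1 and +1 give a summed reward of 1, whereas the mean keeps the result within the range of a single-process reward regardless of how many processes the work contains.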

さらに、学習部４１ｂは、作業の各工程において取得したグリッパー２３の位置に基づ
いて振動強度を取得する。そして、学習部４１ｂは、当該振動強度の程度が基準よりも小
さい場合に正、基準よりも大きい場合の負の報酬を取得する。一連の作業が複数の工程で
構成される場合、各工程の報酬の和が取得されても良いし、統計値（平均値等）が取得さ
れても良い。 Furthermore, the learning unit 41b acquires the vibration intensity based on the position of the gripper 23 acquired in each process. Then, the learning unit 41b acquires a positive reward when the degree of vibration intensity is smaller than the reference and a negative reward when larger than the reference. When a series of operations is composed of a plurality of processes, the sum of rewards of each process may be acquired, or a statistical value (an average value or the like) may be acquired.

さらに、学習部４１ｂは、作業の各工程の終期において取得したグリッパー２３の位置
に基づいてオーバーシュート量を取得する。そして、学習部４１ｂは、当該オーバーシュ
ート量が基準よりも小さい場合に正、基準よりも大きい場合の負の報酬を取得する一連の
作業が複数の工程で構成される場合、各工程の報酬の和が取得されても良いし、統計値（
平均値等）が取得されても良い。 Furthermore, the learning unit 41b acquires the overshoot amount based on the position of the gripper 23 acquired at the final stage of each process of the work. The learning unit 41b then obtains a positive reward when the overshoot amount is smaller than the reference and a negative reward when it is larger than the reference. When the series of works consists of a plurality of processes, the sum of the rewards of the processes may be acquired, or a statistic (such as the average) may be acquired.

さらに、学習部４１ｂは、作業中に集音装置が取得した音を示す情報を取得する。そし
て、学習部４１ｂは、作業中の発生音の大きさが基準よりも小さい場合に正、基準よりも
大きい場合の負の報酬を取得する。なお、現在の試行番号がｔである場合、ステップＳ２
３０で取得される報酬ｒはｒ_t+1である。 Furthermore, the learning unit 41b acquires information indicating the sound picked up by the sound collecting device during the work. The learning unit 41b then obtains a positive reward when the loudness of the sound generated during the work is smaller than the reference and a negative reward when it is larger than the reference. When the current trial number is t, the reward r acquired in step S230 is r_{t+1}.

本例においても、式（２）に示す行動価値関数Ｑの更新を目指しているが、行動価値関
数Ｑを適切に更新していくためには、行動価値関数Ｑを示す多層ニューラルネットワーク
を最適化（θを最適化）していかなくてはならない。そして、図８に示す多層ニューラル
ネットワークによって行動価値関数Ｑを適正に出力させるためには、当該出力のターゲッ
トとなる教師データが必要になる。すなわち、多層ニューラルネットワークの出力と、タ
ーゲットとの誤差を最小化するようにθを改善すると、多層ニューラルネットワークが最
適化されることが期待される。 In this example as well, the aim is to update the action value function Q shown in Expression (2); however, in order to update the action value function Q appropriately, the multilayer neural network representing the action value function Q must be optimized (that is, θ must be optimized). In order for the multilayer neural network shown in FIG. 8 to output the action value function Q properly, teacher data serving as the target of the output is required. That is, if θ is improved so as to minimize the error between the output of the multilayer neural network and the target, the multilayer neural network can be expected to be optimized.

しかし、本実施形態において、学習が完了していない段階では行動価値関数Ｑの知見が
なく、ターゲットを特定することは困難である。そこで、本実施形態においては、式（２
）の第２項、いわゆるＴＤ誤差を最小化する目的関数によって多層ニューラルネットワー
クを示すθの改善を実施する。すなわち、（ｒ_t+1＋γｍａｘ_ａ'Ｑ（ｓ_t+1，ａ'；θ_t）
）をターゲットとし、ターゲットとＱ（ｓ_t，ａ_t；θ_t）との誤差が最小化するようにθ
を学習する。ただし、ターゲット（ｒ_t+1＋γｍａｘ_ａ'Ｑ（ｓ_t+1，ａ'；θ_t））は、学
習対象のθを含んでいるため、本実施形態においては、ある程度の試行回数にわたりター
ゲットを固定する（例えば、最後に学習したθ（初回学習時はθの初期値）で固定する）
。本実施形態においては、ターゲットを固定する試行回数である既定回数が予め決められ
ている。 However, in this embodiment, at a stage where learning has not been completed, there is no knowledge of the action value function Q, and it is difficult to specify the target. Therefore, in this embodiment, θ representing the multilayer neural network is improved by an objective function that minimizes the second term of Expression (2), the so-called TD error. That is, (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_t)) is taken as the target, and θ is learned so that the error between the target and Q(s_t, a_t; θ_t) is minimized. However, since the target (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ_t)) includes the θ being learned, in this embodiment the target is fixed over a certain number of trials (for example, fixed at the most recently learned θ, or at the initial value of θ for the first learning). In the present embodiment, a predetermined number, which is the number of trials over which the target is fixed, is determined in advance.
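The target and the TD error described above, together with the idea of holding the target's θ fixed for a predetermined number of trials, can be sketched as follows. This is an illustrative outline only: the class name `FrozenTarget`, the representation of θ as an arbitrary copyable object, and the concrete values of γ and of the predetermined number are assumptions.

```python
import copy


def td_target(reward, next_q_values, gamma):
    """r_{t+1} + gamma * max_a' Q(s_{t+1}, a'), with Q taken from the fixed theta."""
    return reward + gamma * max(next_q_values)


def td_error(q_sa, reward, next_q_values, gamma):
    """Difference between the target and the current estimate Q(s_t, a_t)."""
    return td_target(reward, next_q_values, gamma) - q_sa


class FrozenTarget:
    """Holds the theta used for the target and refreshes it every `period` trials."""

    def __init__(self, theta, period):
        self.theta = copy.deepcopy(theta)
        self.period = period
        self.count = 0

    def step(self, theta):
        """Count one trial; copy the learned theta when the period is reached."""
        self.count += 1
        if self.count >= self.period:
            self.theta = copy.deepcopy(theta)
            self.count = 0
            return True
        return False
```

During learning, `next_q_values` would be computed with `FrozenTarget.theta` rather than with the θ currently being updated, so that the target does not move on every trial.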

ステップＳ２４０において学習が終了したと判定された場合、学習部４１ｂは、学習情
報４４ｅを更新する（ステップＳ２６０）。すなわち、学習部４１ｂは、学習によって得
られたθを、ロボット３による作業の際に参照されるべきθとして学習情報４４ｅに記録
する。当該θを含む学習情報４４ｅが記録されている場合、ステップ１１０〜Ｓ１３０の
ようにロボット３による作業が行われる際に、制御部４３はパラメーター４４ａに基づい
てロボット３を制御する。そして、当該作業の過程においては、状態観測部４１ａによる
現在の状態の観測と、学習部４１ｂによる行動の選択が繰り返される。むろん、この際、
学習部４１ｂは、状態を入力として算出された出力Ｑ（ｓ，ａ）の中で最大値を与える行
動ａを選択する。そして、行動ａが選択された場合、行動ａが行われた状態に相当する値
となるようにパラメーター４４ａが更新される。 When it is determined in step S240 that the learning has been completed, the learning unit 41b updates the learning information 44e (step S260). That is, the learning unit 41b records the θ obtained by learning in the learning information 44e as the θ to be referred to when the robot 3 performs work. When the learning information 44e including this θ has been recorded, the control unit 43 controls the robot 3 based on the parameters 44a when the robot 3 performs work as in steps S110 to S130. In the course of the work, the observation of the current state by the state observation unit 41a and the selection of an action by the learning unit 41b are repeated. At this time, of course, the learning unit 41b selects the action a that gives the maximum value among the outputs Q(s, a) calculated with the state as input. When the action a is selected, the parameters 44a are updated to the values corresponding to the state in which the action a has been performed.
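Selecting the action that gives the maximum output Q(s, a) can be sketched as below, for illustration. The function names are hypothetical, and breaking ties in favor of the lowest index is an assumption, since the description does not address ties.

```python
def select_greedy(q_values):
    """Return the index of the largest action value (lowest index on ties)."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def action_id(index):
    """Map a zero-based output index to an action label such as 'a1'."""
    return f"a{index + 1}"
```

For an output list of 108 values, `action_id(select_greedy(outputs))` yields one of the labels a1 to a108.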

以上の構成によれば、制御部４３は、行動価値関数Ｑが最大化される行動ａを選択しな
がら作業を実行することができる。当該行動価値関数Ｑは、上述の処理により、多数の試
行が繰り返された結果、最適化されている。そして、当該試行は、算出部４１によって自
動で行われ、人為的に実施不可能な程度の多数の試行を容易に実行することができる。従
って、本実施形態によれば、人為的に決められた動作パラメーターよりも高い確率でロボ
ット３の作業の質を高めることができる。 According to the above configuration, the control unit 43 can execute the work while selecting the action a that maximizes the action value function Q. The action value function Q has been optimized through the above-described processing as a result of a large number of repeated trials. These trials are performed automatically by the calculation unit 41, so a number of trials too large to carry out manually can be executed easily. Therefore, according to the present embodiment, the quality of the work of the robot 3 can be improved with a higher probability than with manually determined operation parameters.

さらに、本実施形態においては、行動によってパラメーター４４ａとしてのサーボゲイ
ンが変化する。従って、人為的な調整によって適切な設定を行うことが困難な、モーター
を制御するためのサーボゲインを自動的に調整することができる。さらに、本実施形態に
おいては、行動によってパラメーター４４ａとしての加減速特性が変化する。従って、人
為的な調整によって適切な設定を行うことが困難な加減速特性を自動的に調整することが
できる。 Furthermore, in the present embodiment, the servo gains as the parameters 44a change according to the action. Therefore, the servo gains for controlling the motors, which are difficult to set appropriately by manual adjustment, can be adjusted automatically. Furthermore, in the present embodiment, the acceleration/deceleration characteristics as the parameters 44a change according to the action. Therefore, the acceleration/deceleration characteristics, which are difficult to set appropriately by manual adjustment, can be adjusted automatically.

さらに、本実施形態においては、行動によってロボットの動作の始点および終点が変化
しない。従って、本実施形態においては、ロボット３が予定された始点および終点から外
れ、利用者の意図しない動作が行われることを防止することができる。さらに、本実施形
態においては、行動によってロボットに対する教示位置である始点および終点は変化しな
い。従って、本実施形態においては、ロボット３が教示された位置から外れ、利用者の意
図しない動作が行われることを防止することができる。なお、本実施形態において、教示
位置は始点および終点であるが、他の位置が教示位置となってもよい。例えば、始点と終
点との間で通過すべき位置や取るべき姿勢がある場合、これらが教示位置（教示姿勢）で
あっても良い。 Furthermore, in the present embodiment, the start point and the end point of the robot motion do not change depending on the behavior. Therefore, in the present embodiment, it is possible to prevent the robot 3 from deviating from the scheduled start point and end point and performing an operation not intended by the user. Furthermore, in this embodiment, the start point and the end point that are teaching positions for the robot do not change depending on the behavior. Therefore, in the present embodiment, it is possible to prevent the robot 3 from moving out of the taught position and performing an operation not intended by the user. In the present embodiment, the teaching position is the start point and the end point, but other positions may be the teaching position. For example, when there is a position to be passed or a posture to be taken between the start point and the end point, these may be teaching positions (teaching postures).

さらに、本実施形態においては、ロボット３が行った作業の良否に基づいて行動による
報酬を評価するため、ロボット３の作業を成功させるようにパラメーターを最適化するこ
とができる。さらに、本実施形態においては、作業の所要時間が基準よりも短い場合に報
酬を正と評価するため、ロボット３を短い時間で作業させる動作パラメーターを容易に算
出することができる。さらに、本実施形態においては、ロボット３の位置と目標位置との
ずれ量が基準よりも小さい場合に報酬を正と評価するため、ロボット３を目標位置に正確
に移動させる動作パラメーターを容易に算出することができる。 Furthermore, in this embodiment, since the reward by action is evaluated based on the quality of the work performed by the robot 3, the parameters can be optimized so that the work of the robot 3 is successful. Furthermore, in this embodiment, since the reward is evaluated as positive when the time required for the work is shorter than the reference, it is possible to easily calculate an operation parameter that causes the robot 3 to work in a short time. Furthermore, in this embodiment, when the amount of deviation between the position of the robot 3 and the target position is smaller than the reference, the reward is evaluated as positive, and therefore, an operation parameter for accurately moving the robot 3 to the target position is easily calculated. can do.

さらに、本実施形態においては、振動強度が基準よりも小さい場合に報酬を正と評価す
るため、ロボット３の動作による振動を発生させる可能性が低い動作パラメーターを容易
に算出することができる。さらに、本実施形態においては、ロボット３の位置のオーバー
シュートが基準よりも小さい場合に報酬を正と評価するため、ロボット３がオーバーシュ
ートする可能性が低い動作パラメーターを容易に算出することができる。さらに、本実施
形態においては、発生音が基準よりも小さい場合に報酬を正と評価するため、ロボット３
に異常を発生させる可能性が低い動作パラメーターを容易に算出することができる。 Furthermore, in this embodiment, since the reward is evaluated as positive when the vibration intensity is smaller than the reference, operation parameters that are less likely to cause vibration due to the motion of the robot 3 can be calculated easily. Furthermore, in the present embodiment, since the reward is evaluated as positive when the overshoot of the position of the robot 3 is smaller than the reference, operation parameters with which the robot 3 is less likely to overshoot can be calculated easily. Furthermore, in this embodiment, since the reward is evaluated as positive when the generated sound is smaller than the reference, operation parameters that are less likely to cause an abnormality in the robot 3 can be calculated easily.

さらに、本実施形態によれば、自動で行動価値関数Ｑが最適化されるため、高性能な動
作を行う動作パラメーターを容易に算出することができる。また、行動価値関数Ｑの最適
化は自動的に行われるため、最適な動作パラメーターの算出も自動的に行うことができる
。 Furthermore, according to the present embodiment, the action value function Q is automatically optimized, so that operation parameters for performing high-performance operations can be easily calculated. Further, since the behavior value function Q is automatically optimized, the optimum operation parameter can be automatically calculated.

さらに、本実施形態においては、ロボット３において汎用的に使用される力覚センサー
Ｐによってロボット３の位置情報を取得するため、ロボット３で汎用的に使用されるセン
サーに基づいて位置情報を算出することができる。 Furthermore, in the present embodiment, the position information of the robot 3 is acquired by the force sensor P, which is generally used in the robot 3, so the position information can be calculated based on a sensor that is generally used in the robot 3.

さらに、本実施形態において学習部４１ｂは、状態変数としてのロボット３の動作結果
を実測し、動作パラメーターを最適化する。従って、ロボット３によって作業が行われて
いる実環境下において合わせて動作パラメーターを最適化することができる。従って、ロ
ボット３の使用環境に応じた動作パラメーターとなるように最適化することができる。 Further, in the present embodiment, the learning unit 41b actually measures the operation result of the robot 3 as the state variable and optimizes the operation parameter. Therefore, it is possible to optimize the operation parameters in accordance with the actual environment where the robot 3 is performing work. Accordingly, the operation parameters can be optimized according to the usage environment of the robot 3.

さらに、本実施形態において状態観測部４１ａは、ロボット３にエンドエフェクターと
してのグリッパー２３が設けられた状態で状態変数を観測する。また、学習部４１ｂは、
ロボット３にエンドエフェクターとしてのグリッパー２３が設けられた状態で行動として
のパラメーター４４ａの変更が実行される。この構成によれば、エンドエフェクターとし
てのグリッパー２３を用いた動作を行うロボット３に適した動作パラメーターを容易に算
出することができる。 Furthermore, in this embodiment, the state observation unit 41a observes the state variables in a state where the gripper 23 as an end effector is provided on the robot 3. In addition, the learning unit 41b changes the parameter 44a as the action in a state where the gripper 23 as the end effector is provided on the robot 3. According to this configuration, an operation parameter suitable for the robot 3 that performs an operation using the gripper 23 as an end effector can be easily calculated.

さらに、本実施形態において状態観測部４１ａは、エンドエフェクターとしてのグリッ
パー２３が対象物を把持した状態で状態変数を観測する。また、学習部４１ｂは、エンド
エフェクターとしてのグリッパー２３が対象物を把持した状態で行動としてのパラメータ
ー４４ａの変更が実行される。この構成によれば、エンドエフェクターとしてのグリッパ
ー２３で対象物を把持して動作を行うロボット３に適した動作パラメーターを容易に算出
することができる。 Furthermore, in this embodiment, the state observation unit 41a observes the state variable in a state where the gripper 23 as an end effector grips the object. In addition, the learning unit 41b changes the parameter 44a as an action in a state where the gripper 23 as an end effector grips the target object. According to this configuration, it is possible to easily calculate operation parameters suitable for the robot 3 that operates by gripping an object with the gripper 23 as an end effector.

（４−５）力制御パラメーターの学習：
力制御パラメーターの学習においても、学習対象のパラメーターを選択することが可能
であり、ここでは、その一例を説明する。図１１は、力制御パラメーターの学習例を図７
と同様のモデルで説明した図である。本例も式（２）に基づいて行動価値関数Ｑ（ｓ，ａ
）を最適化する。従って、最適化後の行動価値関数Ｑ（ｓ，ａ）を最大化する行動ａが最
適な行動であると見なされ、当該行動ａを示すパラメーター４４ａが最適化されたパラメ
ーターであると見なされる。 (4-5) Force control parameter learning:
In the learning of force control parameters as well, the parameters to be learned can be selected, and an example is described here. FIG. 11 illustrates a learning example for force control parameters using the same model as FIG. 7. In this example as well, the action value function Q(s, a) is optimized based on Expression (2). Therefore, the action a that maximizes the optimized action value function Q(s, a) is regarded as the optimum action, and the parameter 44a indicating that action a is regarded as the optimized parameter.

力制御パラメーターの学習においても、力制御パラメーターを変化させることが行動の
決定に相当しており、学習対象のパラメーターと取り得る行動とを示す行動情報４４ｄが
記憶部４４に予め記録される。すなわち、当該行動情報４４ｄに学習対象として記述され
た力制御パラメーターが学習対象となる。図１１においては、ロボット３における力制御
パラメーターの中のインピーダンスパラメーターと、力制御座標系と、目標力と、ロボッ
ト３の動作の始点および終点が学習対象である。なお、力制御における動作の始点および
終点は教示位置であるが、力制御パラメーターの学習によって変動し得る。また、力制御
座標系の原点は、ロボット３のＴＣＰ（ツールセンターポイント）からのオフセット点で
あり、学習前においては目標力が作用する作用点である。従って、力制御座標系（原点座
標と軸回転角）と目標力とが変化すると、ＴＣＰからのオフセット点の位置が変化するこ
とになり、目標力の作用点が力制御座標系の原点ではない場合も生じ得る。 In the learning of the force control parameters as well, changing a force control parameter corresponds to determining an action, and action information 44d indicating the parameters to be learned and the actions that can be taken is recorded in the storage unit 44 in advance. That is, the force control parameters described as learning targets in the action information 44d are the learning targets. In FIG. 11, the learning targets are the impedance parameters among the force control parameters of the robot 3, the force control coordinate system, the target force, and the start point and end point of the operation of the robot 3. Note that the start point and end point of an operation in force control are teaching positions, but they can vary as a result of learning the force control parameters. The origin of the force control coordinate system is an offset point from the TCP (tool center point) of the robot 3, and before learning it is the action point on which the target force acts. Therefore, when the force control coordinate system (origin coordinates and axis rotation angles) and the target force change, the position of the offset point from the TCP changes, and the action point of the target force may no longer be the origin of the force control coordinate system.

力制御パラメーターの中のインピーダンスパラメーターｍ，ｋ，ｄは、ロボット座標系
の各軸に対する並進と回転について定義される。従って、本実施形態においては、１軸あ
たり３個のインピーダンスパラメーターｍ，ｄ，ｋのそれぞれを増加または減少させるこ
とが可能であり、増加について１８個の行動、減少についても１８個の行動、計３６個の
行動（行動ａ１〜ａ３６）を選択し得る。 The impedance parameters m, k, and d among the force control parameters are defined for translation along and rotation about each axis of the robot coordinate system. Therefore, in the present embodiment, each of the three impedance parameters m, d, and k per axis can be increased or decreased, giving 18 actions for increases and 18 actions for decreases, for a total of 36 selectable actions (actions a1 to a36).

一方、力制御座標系は当該座標系の原点座標と、力制御座標系の軸の回転角度と、がロ
ボット座標系を基準として表現されることによって定義される。従って、本実施形態にお
いては、原点座標の３軸方向への増減と、３軸の軸回転角の増減とが可能であり、原点座
標の増加について３個、減少について３個、軸回転角の増加について３個、減少について
３個の行動が可能であり、計１２個の行動（行動ａ３７〜ａ４８）を選択し得る。目標力
は、目標力ベクトルで表現され、目標力の作用点と、力制御座標系の６軸それぞれの成分
の大きさ（３軸の並進力、３軸のトルク）によって定義される。従って、本実施形態にお
いては、目標力の作用点の３軸方向への増減について６個、６軸それぞれの成分の増加に
ついて６個、減少について６個の行動が可能であり、計１８個の行動（行動ａ４９〜ａ６
６）を選択し得る。 On the other hand, the force control coordinate system is defined by expressing the origin coordinates of the coordinate system and the rotation angles of its axes with reference to the robot coordinate system. Therefore, in the present embodiment, the origin coordinates can be increased or decreased along the three axis directions and the rotation angles about the three axes can be increased or decreased: three actions for increasing the origin coordinates, three for decreasing them, three for increasing the axis rotation angles, and three for decreasing them, for a total of 12 selectable actions (actions a37 to a48). The target force is expressed by a target force vector and is defined by the action point of the target force and the magnitudes of the components along each of the six axes of the force control coordinate system (translational forces along three axes and torques about three axes). Therefore, in the present embodiment, six actions are possible for increasing or decreasing the action point of the target force along the three axis directions, six for increasing the components of the six axes, and six for decreasing them, for a total of 18 selectable actions (actions a49 to a66).

ロボット３の動作の始点および終点は、ロボット座標系の各軸方向に沿って座標の増減
が可能であり、始点の増減について６個、終点の増減について６個の計１２個の行動（行
動ａ６７〜ａ７８）を選択し得る。本実施形態においては、以上のようにして予め定義さ
れた行動の選択肢に対応するパラメーターが、行動情報４４ｄに学習対象として記述され
る。また、各行動を特定するための情報（行動のＩＤ、各行動での増減量等）が行動情報
４４ｄに記述される。 The coordinates of the start point and end point of the operation of the robot 3 can each be increased or decreased along each axis direction of the robot coordinate system, giving six actions for the start point and six for the end point, for a total of 12 selectable actions (actions a67 to a78). In the present embodiment, the parameters corresponding to the action options defined in advance in this way are described as learning targets in the action information 44d. Information for identifying each action (the action ID, the amount of increase or decrease in each action, and so on) is also described in the action information 44d.
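
The action counts given above (36 + 12 + 18 + 12 = 78) can be checked with a minimal Python sketch. The tuple layout, the group and target names, and the ordering of actions inside each group are assumptions made for illustration; the description only fixes the group sizes and the ID ranges a1 to a78.

```python
from itertools import product

AXES6 = ("x", "y", "z", "rx", "ry", "rz")  # 3 translations + 3 rotations
AXES3 = ("x", "y", "z")
UP, DOWN = +1, -1  # increase or decrease by a fixed step


def build_actions():
    """Return the 78 actions as (group, target, sign) tuples, index 0 = a1."""
    actions = []
    # a1-a36: impedance parameters m, d, k on each of the 6 axes
    for sign in (UP, DOWN):
        for axis, param in product(AXES6, ("m", "d", "k")):
            actions.append(("impedance", f"{param}_{axis}", sign))
    # a37-a48: force control coordinate system (origin and axis rotation angle)
    for part, sign, axis in product(("origin", "rotation"), (UP, DOWN), AXES3):
        actions.append(("frame", f"{part}_{axis}", sign))
    # a49-a66: target force (action point on 3 axes, then the 6 components)
    for sign, axis in product((UP, DOWN), AXES3):
        actions.append(("target_force", f"point_{axis}", sign))
    for sign, axis in product((UP, DOWN), AXES6):
        actions.append(("target_force", f"component_{axis}", sign))
    # a67-a78: start point and end point of the operation
    for point, sign, axis in product(("start", "end"), (UP, DOWN), AXES3):
        actions.append(("endpoints", f"{point}_{axis}", sign))
    return actions


ACTIONS = build_actions()
```

A table like `ACTIONS` plays the role of the action IDs in the action information 44d; the per-action step sizes would be stored alongside it.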

図１１に示す例において、報酬はロボット３が行った作業の良否に基づいて評価される
。すなわち、学習部４１ｂは、行動ａとして力制御パラメーターを変化させた後、当該力
制御パラメーターによってロボット３を動作させ、検出部４２によって検出された対象物
をピックアップする作業を実行する。さらに、学習部４１ｂは、作業の良否を観測し、作
業の良否を評価する。そして、学習部４１ｂは、作業の良否によって行動ａ、状態ｓ、ｓ
'の報酬を決定する。 In the example shown in FIG. 11, the reward is evaluated based on the quality of the work performed by the robot 3. That is, after changing a force control parameter as the action a, the learning unit 41b operates the robot 3 with that force control parameter and executes the work of picking up the object detected by the detection unit 42. Furthermore, the learning unit 41b observes and evaluates the quality of the work. The learning unit 41b then determines the reward for the action a and the states s and s' according to the quality of the work.

さらに、学習部４１ｂは、作業の各工程において、ロボット３のエンコーダーＥ１〜Ｅ
６の出力をＵ１に基づいて変換してグリッパー２３の位置を取得する。そして、学習部４
１ｂは、作業の各工程において取得したグリッパー２３の位置を、整定する以前の所定期
間にわたって取得し、当該期間における振動強度を取得する。そして、学習部４１ｂは、
当該振動強度の程度が基準よりも小さい場合に正、基準よりも大きい場合の負の報酬を与
える。なお、基準は種々の要素によって特定されて良く、例えば、前回の振動強度の程度
であっても良いし、過去の最小の振動強度の程度であっても良いし、予め決められた振動
強度の程度であっても良い。 Furthermore, in each step of the work, the learning unit 41b converts the outputs of the encoders E1 to E6 of the robot 3 based on U1 to acquire the position of the gripper 23. The learning unit 41b then acquires the position of the gripper 23 in each step of the work over a predetermined period before settling, and acquires the vibration intensity in that period. The learning unit 41b gives a positive reward when the degree of vibration intensity is smaller than the reference and a negative reward when it is larger than the reference. The reference may be specified by various factors; for example, it may be the degree of the previous vibration intensity, the smallest degree of vibration intensity in the past, or a predetermined degree of vibration intensity.
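
As a rough illustration of the three reference choices mentioned above (previous value, past minimum, or a predetermined value), the following Python sketch picks a reference and assigns the sign of the reward. The function names, the reward magnitude of 1, and the tie rule (0 when the intensity equals the reference) are assumptions; the description only states positive below the reference and negative above it.

```python
def pick_reference(history, mode="previous", fixed=None):
    """Choose the reference vibration intensity.

    mode: "previous" (last trial), "best" (smallest so far), or "fixed".
    Falls back to the fixed value when there is no history yet.
    """
    if mode == "previous" and history:
        return history[-1]
    if mode == "best" and history:
        return min(history)
    if fixed is None:
        raise ValueError("no history and no fixed reference given")
    return fixed


def vibration_reward(intensity, reference):
    """+1 below the reference, -1 above it, 0 on a tie (tie rule assumed)."""
    if intensity < reference:
        return 1
    if intensity > reference:
        return -1
    return 0
```

With a history of `[0.8, 0.5, 0.7]`, the "previous" mode compares against 0.7 while the "best" mode compares against 0.5, so the same trial can be rewarded under one choice and penalized under the other.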

振動強度の程度は、種々の手法で特定されて良く、目標位置からの乖離の積分値や閾値
以上の振動が生じている期間など、種々の手法を採用可能である。なお、所定期間は、種
々の期間とすることが可能であり、工程の始点から終点にわたる期間であれば、動作中の
振動強度による報酬が評価され、工程の終期が所定期間とされれば、残留振動の強度によ
る報酬が評価される。なお、力制御においては、前者の振動強度による報酬の方が重要で
ある場合が多い。前者の振動強度による報酬の方が重要であれば、後者の残留振動の強度
による報酬は評価されない構成とされても良い。 The degree of vibration intensity may be specified by various methods, such as the integral of the deviation from the target position or the length of the period during which vibration at or above a threshold occurs. The predetermined period can also be chosen in various ways: if it spans the process from start point to end point, the reward reflects the vibration intensity during operation, and if it is the final phase of the process, the reward reflects the intensity of residual vibration. In force control, the former reward based on vibration intensity during operation is often more important; in that case, a configuration in which the latter reward based on residual vibration intensity is not evaluated may be used.
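
The two intensity measures and the two evaluation periods named above can be sketched in Python as follows. This is a simplified one-dimensional illustration: the uniform sampling interval `dt`, the rectangle-rule integration, and the `tail` fraction used for the residual-vibration window are all assumptions, not values from the description.

```python
def select_window(samples, mode="whole", tail=0.2):
    """Pick the evaluation period: the whole step, or only its final part."""
    if mode == "whole":
        return list(samples)
    count = max(1, int(len(samples) * tail))  # residual-vibration window
    return list(samples[-count:])


def integrated_deviation(positions, target, dt):
    """Approximate the integral of |position - target| (rectangle rule)."""
    return sum(abs(p - target) for p in positions) * dt


def time_above_threshold(positions, target, threshold, dt):
    """Total time during which the deviation is at or above the threshold."""
    return sum(1 for p in positions if abs(p - target) >= threshold) * dt
```

Passing the output of `select_window(..., mode="whole")` to either measure corresponds to evaluating vibration during operation, while any other mode restricts the measure to residual vibration near the end of the step.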

なお、力制御パラメーターの学習においては、動作パラメーターの学習において報酬と
されていた、目標位置からの乖離は報酬に含まれない。すなわち、力制御パラメーターの
学習においては、工程の始点や終点が学習に応じて変動し得るため、報酬には含まれてい
ない。 In the learning of the force control parameters, the deviation from the target position, which was used as a reward in the learning of the operation parameters, is not included in the reward. That is, because the start point and end point of a process can vary with learning in the learning of the force control parameters, this deviation is not included in the reward.

現在の状態ｓにおいて行動ａが採用された場合における次の状態ｓ'は、行動ａとして
のパラメーターの変化が行われた後にロボット３を動作させ、状態観測部４１ａが状態を
観測することによって特定可能である。なお、本例にかかる力制御パラメーターの学習は
、ロボット１，２による対象物の検出完了後に、ロボット３に関して実行される。 The next state s ′ when the action a is adopted in the current state s is specified by operating the robot 3 after the parameter change as the action a is performed, and the state observation unit 41a observes the state. Is possible. Note that the learning of the force control parameter according to the present example is executed for the robot 3 after the detection of the object by the robots 1 and 2 is completed.

図１１に示す例において、状態変数には、モーターＭ１〜Ｍ６の電流、エンコーダーＥ
１〜Ｅ６の値、力覚センサーＰの出力が含まれている。従って、状態観測部４１ａは、サ
ーボ４３ｄの制御結果として、モーターＭ１〜Ｍ６に供給される電流値を観測する。当該
電流値は、モーターＭ１〜Ｍ６で出力されるトルクに相当する。また、エンコーダーＥ１
〜Ｅ６の出力は、対応関係Ｕ１に基づいてロボット座標系におけるＴＣＰの位置に変換さ
れる。従って、状態観測部４１ａは、ロボット３が備えるグリッパー２３の位置情報を観
測することになる。 In the example shown in FIG. 11, the state variables include the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P. Therefore, the state observation unit 41a observes the current values supplied to the motors M1 to M6 as the control result of the servo 43d. These current values correspond to the torques output by the motors M1 to M6. The outputs of the encoders E1 to E6 are converted into the position of the TCP in the robot coordinate system based on the correspondence U1. Therefore, the state observation unit 41a observes the position information of the gripper 23 provided on the robot 3.

本実施形態においては、ロボットの運動中に力覚センサーＰによって検出された出力を
積分することによってロボットの位置を算出することができる。すなわち、状態観測部４
１ａは、対応関係Ｕ２に基づいてロボット座標系において運動中のＴＣＰへの作用力を積
分することでＴＣＰの位置を取得する。従って、本実施形態において状態観測部４１ａは
、力覚センサーＰの出力も利用してロボット３が備えるグリッパー２３の位置情報を観測
する。なお、状態は、各種の手法で観測されて良く、上述の変換が行われない値（電流値
やエンコーダー、力覚センサーの出力値）が状態として観測されても良い。 In the present embodiment, the position of the robot can be calculated by integrating the output detected by the force sensor P during the movement of the robot. That is, the state observation unit 41a acquires the position of the TCP by integrating the force acting on the moving TCP in the robot coordinate system based on the correspondence U2. Accordingly, in the present embodiment, the state observation unit 41a also uses the output of the force sensor P to observe the position information of the gripper 23 provided on the robot 3. The state may be observed by various methods, and values to which the above conversion has not been applied (current values and the output values of the encoders and the force sensor) may be observed as the state.
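
One possible reading of "integrating the force to obtain the position" is a double time integration of acceleration. The Python sketch below shows that reading for a single axis only. It is an assumption-heavy illustration: the known effective mass `mass`, the semi-implicit Euler scheme, and the omission of the correspondence U2 (which maps sensor output into the robot coordinate system) are not taken from the description.

```python
def integrate_position(forces, mass, dt, x0=0.0, v0=0.0):
    """Integrate force samples twice to estimate position along one axis.

    Semi-implicit Euler: a = F / m, v += a * dt, x += v * dt.
    Returns the estimated position after each sample.
    """
    x, v = x0, v0
    trajectory = []
    for force in forces:
        v += (force / mass) * dt  # velocity from acceleration
        x += v * dt               # position from velocity
        trajectory.append(x)
    return trajectory
```

Because integration accumulates sensor noise and bias, an estimate of this kind would typically drift over time, which may be one reason the description treats the encoder-based position as the primary source and the force sensor as an additional one.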

状態観測部４１ａは、行動であるインピーダンスパラメーターや力制御座標系、工程の
始点および終点の調整結果を直接的に観測しているのではなく、調整の結果、ロボット３
で得られた変化をモーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の値、力覚センサ
ーＰの出力として観測している。従って、行動による影響を間接的に観測していることに
なり、この意味で、本実施形態の状態変数は、力制御パラメーターの変化から直接的に推
定することが困難な変化をし得る状態変数である。 The state observation unit 41a does not directly observe the results of adjusting the impedance parameters, the force control coordinate system, and the start and end points of the process, which are the actions; rather, it observes the changes produced in the robot 3 as a result of the adjustment as the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P. It therefore observes the influence of the actions indirectly, and in this sense the state variables of the present embodiment are state variables whose changes can be difficult to estimate directly from changes in the force control parameters.

また、モーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の値、力覚センサーＰの出
力は、ロボット３の動作を直接的に示しており、当該動作は作業の良否を直接的に示して
いる。従って、状態変数として、モーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の
値、力覚センサーＰの出力を観測することにより、人為的に改善することが困難なパラメ
ーターの改善を行い、効果的に作業の質を高めるように力制御パラメーターを最適化する
ことが可能になる。この結果、人為的に決められた力制御パラメーターよりも高性能な動
作を行う力制御パラメーターを高い確率で算出することができる。 Further, the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P directly indicate the operation of the robot 3, and the operation directly indicates the quality of the work. Therefore, by observing the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P as state variables, parameters that are difficult to improve artificially are effectively improved. It is possible to optimize the force control parameters to improve the quality of work. As a result, it is possible to calculate with high probability a force control parameter that performs a higher-performance operation than an artificially determined force control parameter.

（４−６）力制御パラメーターの学習例：
次に、力制御パラメーターの学習例を説明する。学習の過程で参照される変数や関数を
示す情報は、学習情報４４ｅとして記憶部４４に記憶される。すなわち、算出部４１は、
状態変数の観測と、当該状態変数に応じた行動の決定と、当該行動によって得られる報酬
の評価とを繰り返すことによって行動価値関数Ｑ（ｓ，ａ）を収束させる構成が採用され
ている。そこで、本例において、学習の過程で状態変数と行動と報酬との時系列の値が、
順次、学習情報４４ｅに記録されていく。 (4-6) Force control parameter learning example:
Next, an example of learning force control parameters will be described. Information indicating the variables and functions referred to in the learning process is stored in the storage unit 44 as learning information 44e. That is, the calculation unit 41 adopts a configuration in which the action value function Q(s, a) is made to converge by repeating the observation of the state variables, the determination of an action according to the state variables, and the evaluation of the reward obtained by the action. In this example, the time-series values of the state variables, actions, and rewards are therefore recorded sequentially in the learning information 44e during the learning process.

なお、本実施形態において、力制御パラメーターの学習は力制御モードで実行される（
位置制御のみが行われる位置制御モードでは力制御パラメーターの学習は行われない）。
力制御モードでの学習を実行するためには、力制御モードのみで構成される作業がロボッ
ト３のロボットプログラム４４ｂとして生成されても良いし、任意のモードを含む作業が
ロボット３のロボットプログラム４４ｂとして生成されている状況において、その中の力
制御モードのみを用いて学習してもよい。 In the present embodiment, the learning of the force control parameters is executed in the force control mode (force control parameters are not learned in the position control mode, in which only position control is performed). To execute learning in the force control mode, work consisting only of the force control mode may be generated as the robot program 44b of the robot 3, or, where work including arbitrary modes has been generated as the robot program 44b of the robot 3, learning may use only the force control mode portions of it.

行動価値関数Ｑ（ｓ，ａ）は、種々の手法で算出されて良く、多数回の試行に基づいて
算出されても良いが、ここでは、ＤＱＮによって行動価値関数Ｑを最適化する例を説明す
る。行動価値関数Ｑの最適化に利用される多層ニューラルネットワークは、上述の図８に
おいて模式的に示される。図１１に示すような状態が観測される本例であれば、ロボット
３におけるモーターＭ１〜Ｍ６の電流、エンコーダーＥ１〜Ｅ６の値、力覚センサーＰの
出力（６軸の出力）が状態であるため、状態ｓの数Ｍ＝１８である。図１１に示すような
行動が選択され得る本例であれば、６０個の行動が選択可能であるためＮ＝７８である。
むろん、行動ａの内容や数（Ｎの値）、状態ｓの内容や数（Ｍの値）は試行番号ｔに応じ
て変化しても良い。 The action value function Q(s, a) may be calculated by various methods and may be calculated based on a large number of trials, but here an example in which the action value function Q is optimized by DQN is described. The multilayer neural network used to optimize the action value function Q is shown schematically in FIG. 8 described above. In this example, in which the states shown in FIG. 11 are observed, the state consists of the currents of the motors M1 to M6, the values of the encoders E1 to E6, and the output of the force sensor P (six-axis output) of the robot 3, so the number of states s is M = 18. In this example, in which the actions shown in FIG. 11 can be selected, N = 78 since 60 actions can be selected.
Of course, the contents and number of actions a (value of N) and the contents and number of states s (value of M) may be changed according to the trial number t.
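
The dimensions stated above, M = 18 state inputs and N = 78 action outputs (matching the action IDs a1 to a78 defined earlier), can be made concrete with a small Python sketch of a Q-network forward pass and greedy action selection. The single hidden layer, its width, the ReLU activation, and the random initialization are assumptions; the actual network of FIG. 8 is not specified in this passage.

```python
import random

M, N = 18, 78  # number of state inputs and of selectable actions


def make_state(currents, encoders, force):
    """Concatenate 6 motor currents, 6 encoder values, 6 force-sensor axes."""
    assert len(currents) == len(encoders) == len(force) == 6
    return list(currents) + list(encoders) + list(force)


def init_theta(hidden=32, seed=0):
    """Random weights theta for an M -> hidden -> N network (no biases)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(M)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(N)]
    return w1, w2


def q_values(state, theta):
    """Return Q(s, a_1; theta) ... Q(s, a_N; theta) for one state."""
    w1, w2 = theta
    hidden = [max(0.0, sum(w * s for w, s in zip(row, state))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def greedy_action(state, theta):
    """Index (0-based) of the action with the largest Q value."""
    q = q_values(state, theta)
    return max(range(N), key=q.__getitem__)
```

During learning, a DQN would normally mix this greedy choice with exploratory actions; only the greedy part is shown here.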

本実施形態においても、当該多層ニューラルネットワークを特定するためのパラメータ
ー（入力から出力を得るために必要な情報）が学習情報４４ｅとして記憶部４４に記録さ
れる。ここでも学習の過程で変化し得る多層ニューラルネットワークのパラメーターをθ
と表記する。当該θを使用すると、上述の行動価値関数Ｑ（ｓ_t，ａ_1t）〜Ｑ（ｓ_t，ａ_Nt
）は、Ｑ（ｓ_t，ａ_1t；θ_t）〜Ｑ（ｓ_t，ａ_Nt；θ_t）とも表記できる。 Also in the present embodiment, the parameters for specifying the multilayer neural network (the information necessary to obtain the output from the input) are recorded in the storage unit 44 as learning information 44e. Here too, the parameters of the multilayer neural network that can change during the learning process are denoted θ. Using θ, the action value functions Q(s_t, a_1t) to Q(s_t, a_Nt) described above can also be written as Q(s_t, a_1t; θ_t) to Q(s_t, a_Nt; θ_t).

次に、図９に示すフローチャートに沿って学習処理の手順を説明する。力制御パラメー
ターの学習処理は、ロボット３の運用過程において実施されても良いし、実運用の前に事
前に学習処理が実行されてもよい。ここでは、実運用の前に事前に学習処理が実行される
構成（多層ニューラルネットワークを示すθが最適化されると、その情報が保存され、次
回以降の運用で利用される構成）に従って学習処理を説明する。 Next, the procedure of the learning process will be described along the flowchart shown in FIG. 9. The learning process for the force control parameters may be performed during operation of the robot 3, or it may be executed in advance before actual operation. Here, the learning process is described for a configuration in which it is executed in advance before actual operation (a configuration in which, once θ representing the multilayer neural network has been optimized, that information is saved and used in subsequent operation).

次に、算出部４１は、パラメーターを初期化する（ステップＳ２０５）。ここでは、力
制御パラメーターが学習対象であるため、算出部４１は、力制御パラメーターを初期化す
る。すなわち、学習が行われていない状態であれば、算出部４１は、教示によって生成さ
れたパラメーター４４ａに含まれる力制御パラメーターを初期値として設定する。過去に
何らかの学習が行われた状態であれば、算出部４１は、学習の際に最後に利用されていた
パラメーター４４ａに含まれる力制御パラメーターを初期値として設定する。 Next, the calculation unit 41 initializes parameters (step S205). Here, since the force control parameter is a learning target, the calculation unit 41 initializes the force control parameter. That is, if learning is not performed, the calculation unit 41 sets a force control parameter included in the parameter 44a generated by teaching as an initial value. If any learning has been performed in the past, the calculation unit 41 sets a force control parameter included in the parameter 44a used last in the learning as an initial value.

行動が選択されると、学習部４１ｂは、当該行動に対応するパラメーター４４ａを変化
させる。例えば、図１１に示す例において、ロボット座標系のｘ軸に関するインピーダン
スパラメーターｍを一定値増加させる行動ａ１が選択された場合、学習部４１ｂは、力制
御パラメーターが示すｘ軸に関するインピーダンスパラメーターｍを一定値増加させる。
パラメーター４４ａの変化が行われると、制御部４３は、当該パラメーター４４ａを参照
してロボット３を制御し、一連の作業を実行させる。なお、本実施形態においては、行動
選択のたびに一連の作業が実行されるが、行動選択のたびに一連の作業の一部が実行され
る構成（一連の作業を構成する複数の工程の少なくとも１工程が実行される構成）であっ
ても良い。 When an action is selected, the learning unit 41b changes the parameter 44a corresponding to that action. For example, in the example shown in FIG. 11, when the action a1 that increases the impedance parameter m for the x axis of the robot coordinate system by a fixed value is selected, the learning unit 41b increases the impedance parameter m for the x axis indicated by the force control parameters by that fixed value.
When the parameter 44a has been changed, the control unit 43 refers to the parameter 44a and controls the robot 3 to execute a series of work. In the present embodiment, the series of work is executed every time an action is selected, but a configuration in which part of the series of work is executed every time an action is selected (a configuration in which at least one of the plurality of steps constituting the series of work is executed) may also be used.
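
Applying a selected action to the parameters, as in the a1 example above, amounts to shifting one parameter by a fixed step. A minimal Python sketch follows; the dictionary representation, the key name `m_x`, and the step values are hypothetical and stand in for the parameter 44a and the per-action increments recorded in the action information 44d.

```python
def apply_action(params, action, step_sizes):
    """Return a copy of params with one entry shifted by its fixed step.

    action is a (name, sign) pair; sign is +1 (increase) or -1 (decrease).
    """
    name, sign = action
    updated = dict(params)  # leave the caller's parameters untouched
    updated[name] = params[name] + sign * step_sizes[name]
    return updated


# Hypothetical values: impedance mass m on the x axis, step of 0.5
params = {"m_x": 2.0, "d_x": 10.0, "k_x": 100.0}
steps = {"m_x": 0.5, "d_x": 1.0, "k_x": 5.0}
after_a1 = apply_action(params, ("m_x", +1), steps)
```

Returning a copy rather than editing in place keeps the previous parameter set available, which is convenient if a trial has to be repeated or rolled back.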

次に、学習部４１ｂは、報酬を評価する（ステップＳ２３０）。すなわち、学習部４１
ｂは、図示しない計時回路に基づいて作業の開始から終了までの所要時間を取得し、作業
の所要時間が基準よりも短い場合に正、作業の所要時間が基準よりも長い場合に負の報酬
を取得する。さらに、学習部４１ｂは、作業の各工程におけるグリッパー２３の位置を取
得し、作業の各工程において取得したグリッパー２３の位置に基づいて振動強度を取得す
る。そして、学習部４１ｂは、当該振動強度の程度が基準よりも小さい場合に正、基準よ
りも大きい場合の負の報酬を取得する。一連の作業が複数の工程で構成される場合、各工
程の報酬の和が取得されても良いし、統計値（平均値等）が取得されても良い。 Next, the learning unit 41b evaluates the reward (step S230). That is, the learning unit 41b acquires the time required from the start to the end of the work based on a timing circuit (not shown), and acquires a positive reward when the time required for the work is shorter than the reference and a negative reward when it is longer than the reference. Furthermore, the learning unit 41b acquires the position of the gripper 23 in each step of the work and acquires the vibration intensity based on those positions. The learning unit 41b then acquires a positive reward when the degree of vibration intensity is smaller than the reference and a negative reward when it is larger than the reference. When the series of work consists of a plurality of steps, the sum of the rewards of the steps may be acquired, or a statistical value (such as an average) may be acquired.
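
The reward evaluation of step S230 can be sketched in Python as below. The unit reward magnitudes, the tie rule, and the choice to add the time reward to the aggregated per-step vibration rewards are assumptions for illustration; the description states only the signs of the individual rewards and that per-step rewards may be summed or averaged.

```python
from statistics import mean


def sign_reward(value, reference):
    """+1 if value is below the reference, -1 if above, 0 on a tie."""
    return (value < reference) - (value > reference)


def evaluate_reward(duration, ref_duration, vibrations, ref_vibration,
                    aggregate="sum"):
    """Combine the work-time reward with per-step vibration rewards.

    vibrations holds one vibration intensity per step of the work.
    aggregate is "sum" or "mean" over the per-step rewards.
    """
    time_reward = sign_reward(duration, ref_duration)
    step_rewards = [sign_reward(v, ref_vibration) for v in vibrations]
    if aggregate == "sum":
        vibration_reward = sum(step_rewards)
    else:
        vibration_reward = mean(step_rewards)
    return time_reward + vibration_reward
```

Summing makes work with more steps weigh vibration more heavily relative to time, whereas averaging keeps the two terms on the same scale regardless of the number of steps.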

以上の構成によれば、制御部４３は、行動価値関数Ｑが最大化される行動ａを選択しな
がら作業を実行することができる。当該行動価値関数Ｑは、上述の処理により、多数の試
行が繰り返された結果、最適化されている。そして、当該試行は、算出部４１によって自
動で行われ、人為的に実施不可能な程度の多数の試行を容易に実行することができる。従
って、本実施形態によれば、人為的に決められた力制御パラメーターよりも高い確率でロ
ボット３の作業の質を高めることができる。 According to the above configuration, the control unit 43 can perform work while selecting the action a that maximizes the action value function Q. The action value function Q is optimized as a result of many trials being repeated by the above-described processing. And the said trial is automatically performed by the calculation part 41, and many trials of the grade which cannot be implemented artificially can be performed easily. Therefore, according to the present embodiment, it is possible to improve the work quality of the robot 3 with higher probability than the artificially determined force control parameter.

さらに、本実施形態においては、行動によってパラメーター４４ａとしてのインピーダ
ンスパラメーターが変化する。従って、人為的な調整によって適切な設定を行うことが困
難な、インピーダンスパラメーターを自動的に調整することができる。さらに、本実施形
態においては、行動によってパラメーター４４ａとしての始点と終点が変化する。従って
、人為的に設定された始点や終点を、より高性能に力制御を行うように自動的に調整する
ことができる。 Furthermore, in the present embodiment, the impedance parameter as the parameter 44a changes depending on the behavior. Therefore, it is possible to automatically adjust the impedance parameter, which is difficult to set appropriately by artificial adjustment. Furthermore, in the present embodiment, the start point and the end point as the parameter 44a change depending on the action. Accordingly, the artificially set start point and end point can be automatically adjusted to perform force control with higher performance.

さらに、本実施形態においては、行動によってパラメーター４４ａとしての力制御座標
系が変化する。この結果、ロボット３のＴＣＰからのオフセット点の位置が変化する。従
って、人為的な調整によって適切な設定を行うことが困難な、ＴＣＰからのオフセット点
の位置を自動的に調整することができる。さらに、本実施形態においては、行動によって
パラメーター４４ａとしての目標力が変化し得る。従って、人為的な調整によって適切な
設定を行うことが困難な、目標力を自動的に調整することができる。特に、力制御座標系
と目標力との組み合わせを人為的に理想化することは困難であるため、これらの組が自動
的に調整される構成は、有用である。 Furthermore, in this embodiment, the force control coordinate system as the parameter 44a changes depending on the action. As a result, the position of the offset point from the TCP of the robot 3 changes. Therefore, it is possible to automatically adjust the position of the offset point from the TCP, which is difficult to set appropriately by artificial adjustment. Further, in the present embodiment, the target force as the parameter 44a can change depending on the behavior. Therefore, it is possible to automatically adjust the target force, which is difficult to make an appropriate setting by artificial adjustment. In particular, since it is difficult to artificially idealize the combination of the force control coordinate system and the target force, a configuration in which these sets are automatically adjusted is useful.

さらに、本実施形態においては、ロボット３が行った作業の良否に基づいて行動による
報酬を評価するため、ロボット３の作業を成功させるようにパラメーターを最適化するこ
とができる。さらに、本実施形態においては、作業の所要時間が基準よりも短い場合に報
酬を正と評価するため、ロボット３を短い時間で作業させる力制御パラメーターを容易に
算出することができる。 Furthermore, in this embodiment, since the reward by action is evaluated based on the quality of the work performed by the robot 3, the parameters can be optimized so that the work of the robot 3 is successful. Furthermore, in this embodiment, since the reward is evaluated as positive when the time required for the work is shorter than the reference, it is possible to easily calculate a force control parameter that causes the robot 3 to work in a short time.

さらに、本実施形態においては、振動強度が基準よりも小さい場合に報酬を正と評価す
るため、ロボット３の動作による振動を発生させる可能性が低い力制御パラメーターを容
易に算出することができる。さらに、本実施形態においては、ロボット３の位置のオーバ
ーシュートが基準よりも小さい場合に報酬を正と評価するため、ロボット３がオーバーシ
ュートする可能性が低い力制御パラメーターを容易に算出することができる。さらに、本
実施形態においては、発生音が基準よりも小さい場合に報酬を正と評価するため、ロボッ
ト３に異常を発生させる可能性が低い力制御パラメーターを容易に算出することができる
。 Furthermore, in this embodiment, since the reward is evaluated as positive when the vibration intensity is smaller than the reference, a force control parameter that is less likely to cause vibration due to the operation of the robot 3 can be easily calculated. Furthermore, in this embodiment, since the reward is evaluated as positive when the overshoot of the position of the robot 3 is smaller than the reference, a force control parameter with a low possibility of the robot 3 overshooting can be easily calculated. Furthermore, in the present embodiment, since the reward is evaluated as positive when the generated sound is smaller than the reference, a force control parameter that is less likely to cause an abnormality in the robot 3 can be easily calculated.

さらに、本実施形態によれば、自動で行動価値関数Ｑが最適化されるため、高性能な力
制御を行う力制御パラメーターを容易に算出することができる。また、行動価値関数Ｑの
最適化は自動的に行われるため、最適な力制御パラメーターの算出も自動的に行うことが
できる。 Furthermore, according to this embodiment, since the action value function Q is automatically optimized, a force control parameter for performing high-performance force control can be easily calculated. Further, since the optimization of the action value function Q is automatically performed, the optimum force control parameter can be automatically calculated.

さらに、本実施形態において学習部４１ｂは、状態変数としてのロボット３の動作結果
を実測し、力制御パラメーターを最適化する。従って、ロボット３によって作業が行われ
ている実環境下において合わせて力制御パラメーターを最適化することができる。従って
、ロボット３の使用環境に応じた力制御パラメーターとなるように最適化することができ
る。 Furthermore, in the present embodiment, the learning unit 41b actually measures the operation result of the robot 3 as the state variable and optimizes the force control parameter. Therefore, it is possible to optimize the force control parameter in the actual environment where the work is performed by the robot 3. Therefore, it is possible to optimize the force control parameter according to the usage environment of the robot 3.

さらに、本実施形態において状態観測部４１ａは、ロボット３にエンドエフェクターと
してのグリッパー２３が設けられた状態で状態変数を観測する。また、学習部４１ｂは、
ロボット３にエンドエフェクターとしてのグリッパー２３が設けられた状態で行動として
のパラメーター４４ａの変更が実行される。この構成によれば、エンドエフェクターとし
てのグリッパー２３を用いた動作を行うロボット３に適した力制御パラメーターを容易に
算出することができる。 Furthermore, in this embodiment, the state observation unit 41a observes the state variables in a state where the gripper 23 as an end effector is provided on the robot 3. In addition, the learning unit 41b changes the parameter 44a as the action in a state where the gripper 23 as the end effector is provided on the robot 3. According to this configuration, a force control parameter suitable for the robot 3 that performs an operation using the gripper 23 as an end effector can be easily calculated.

さらに、本実施形態において状態観測部４１ａは、エンドエフェクターとしてのグリッ
パー２３が対象物を把持した状態で状態変数を観測する。また、学習部４１ｂは、エンド
エフェクターとしてのグリッパー２３が対象物を把持した状態で行動としてのパラメータ
ー４４ａの変更が実行される。この構成によれば、エンドエフェクターとしてのグリッパ
ー２３で対象物を把持して動作を行うロボット３に適した力制御パラメーターを容易に算
出することができる。 Furthermore, in this embodiment, the state observation unit 41a observes the state variable in a state where the gripper 23 as an end effector grips the object. In addition, the learning unit 41b changes the parameter 44a as an action in a state where the gripper 23 as an end effector grips the target object. According to this configuration, it is possible to easily calculate a force control parameter suitable for the robot 3 that operates by gripping an object with the gripper 23 as an end effector.

（５）他の実施形態：
以上の実施形態は本発明を実施するための一例であり、他にも種々の実施形態を採用可
能である。例えば、制御装置は、ロボットに内蔵されていても良いし、ロボットの設置場
所と異なる場所、例えば外部のサーバー等に備えられていても良い。また、制御装置は、
複数の装置で構成されていても良く、制御部４３と算出部４１とが異なる装置で構成され
ても良い。また、制御装置は、ロボットコントローラー、ティーチングペンダント、ＰＣ
、ネットワークにつながるサーバー等であっても良いし、これらが含まれていても良い。
さらに、上述の実施形態の一部の構成が省略されてもよいし、処理の順序が変動または省
略されてもよい。さらに、上述の実施形態においては、ＴＣＰについて目標位置や目標力
の初期ベクトルが設定されたが、他の位置、例えば力覚センサーＰについてのセンサー座
標系の原点やネジの先端等について目標位置や目標力の初期ベクトルが設定されても良い
。 (5) Other embodiments:
The above embodiment is an example for carrying out the present invention, and various other embodiments can be adopted. For example, the control device may be built into the robot, or may be provided in a location different from the installation location of the robot, such as an external server. The control device may also be composed of a plurality of devices, and the control unit 43 and the calculation unit 41 may be implemented in different devices. The control device may be a robot controller, a teaching pendant, a PC, a server connected to a network, or the like, or may include these.
Furthermore, part of the configuration of the above embodiment may be omitted, and the order of processing may be changed or steps may be omitted. In the above embodiment, the target position and the initial vector of the target force are set for the TCP, but they may instead be set for another position, for example the origin of the sensor coordinate system of the force sensor P or the tip of a screw.

The robot only needs to be able to perform some task with a movable part of some form. The end effector is the part used for work on the object, and any tool may be attached to it. The object may be anything that the robot works on: it may be an object gripped by the end effector or an object handled by a tool on the end effector, and a wide variety of objects can serve as the object.

The target force applied to the robot is the force to be applied to the robot when it is driven under force control. For example, when a force detected by a force detection unit such as a force sensor (or a force calculated from that force) is controlled to a specific value, that value is the target force. Control may also be performed so that a force detected by a sensor other than a force sensor, for example an acceleration sensor (or a force calculated from that force), becomes the target force, or so that an acceleration or angular velocity takes a specific value.

Furthermore, in the learning process described above, the action value is updated by updating θ after every trial, and the target is held fixed until a predetermined number of trials has been performed; however, θ may instead be updated only after several trials have been performed. For example, the target may be held fixed until a first predetermined number of trials has been performed, while θ is held fixed until a second predetermined number of trials (smaller than the first predetermined number) has been performed. In this case, after the second predetermined number of trials, θ is updated based on that number of samples, and once the number of trials exceeds the first predetermined number, the target is updated with the latest θ.
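
The two-counter schedule in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the names `update_schedule`, `n_target`, and `n_theta` are hypothetical, the schedule is assumed to repeat periodically, and the actual update of θ from the collected samples is omitted.

```python
def update_schedule(num_trials, n_target, n_theta):
    """Return (trial, update_theta, update_target) for each trial.

    n_target: first predetermined number (the target is held fixed this long).
    n_theta:  second predetermined number (theta is held fixed this long);
              assumed to be smaller than n_target, as in the text.
    """
    if not 0 < n_theta < n_target:
        raise ValueError("expected 0 < n_theta < n_target")
    events = []
    for trial in range(1, num_trials + 1):
        # Assumption: theta is updated from the last n_theta samples.
        update_theta = trial % n_theta == 0
        # Assumption: the target is refreshed with the latest theta periodically.
        update_target = trial % n_target == 0
        events.append((trial, update_theta, update_target))
    return events


schedule = update_schedule(num_trials=12, n_target=6, n_theta=2)
theta_updates = [trial for trial, theta, _ in schedule if theta]
target_updates = [trial for trial, _, target in schedule if target]
```

With these illustrative values, θ is updated every second trial and the target every sixth trial, so the target changes less often than θ.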

Furthermore, various known techniques may be employed in the learning process; for example, experience replay or reward clipping may be performed. In FIG. 8 there are P layers DL (P being an integer of 1 or more), each containing a plurality of nodes, but various structures can be adopted for the layers. For example, the number of layers and the number of nodes may vary, various functions may be used as the activation function, and the network may have a convolutional neural network structure or the like. The forms of input and output are likewise not limited to the example shown in FIG. 8; for example, a configuration in which the state s and the action a are both input, or a configuration in which the action a that maximizes the action value function Q is output as a one-hot vector, may be used.
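
Experience replay and reward clipping, mentioned above as known techniques, might look like the following minimal sketch. The class `ReplayBuffer`, the function `clip_reward`, and the clipping range of [-1, 1] are assumptions made for illustration and are not taken from the embodiment.

```python
import random
from collections import deque


def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a raw reward into [low, high] (range chosen for illustration)."""
    return max(low, min(high, reward))


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) samples."""

    def __init__(self, capacity):
        # A bounded deque discards the oldest sample once capacity is reached.
        self._items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._items.append((state, action, clip_reward(reward), next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored samples.
        count = min(batch_size, len(self._items))
        return rng.sample(list(self._items), count)

    def __len__(self):
        return len(self._items)
```

Sampling past trials at random, rather than learning only from the most recent trial, reduces the correlation between consecutive training samples.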

In the embodiment described above, the action value function is optimized while trials are performed by acting under a greedy policy based on the action value function, and the greedy policy with respect to the optimized action value function is regarded as the optimal policy. This process is a so-called value iteration method, but learning may be performed by other methods, for example a policy iteration method. In addition, various kinds of normalization may be applied to variables such as the state s, the action a, and the reward r.

Various machine learning techniques can be adopted, and trials may be performed under an ε-greedy policy based on the action value function Q. The reinforcement learning technique is also not limited to Q-learning as described above; a technique such as SARSA may be used. A technique that models the policy and the action value function separately, for example an Actor-Critic algorithm, may also be used. When an Actor-Critic algorithm is used, an actor μ(s; θ) representing the policy and a critic Q(s, a; θ) representing the action value function are defined, actions are generated and tried according to a policy obtained by adding noise to μ(s; θ), and the actor and critic are updated based on the trial results, so that both the policy and the action value function are learned.
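
To make the difference between these alternatives concrete, the following sketch shows ε-greedy action selection together with the bootstrap targets used by Q-learning and SARSA. The function names, the discount factor `gamma`, and the use of a plain list for the action values are assumptions for illustration; the embodiment represents Q with a neural network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q_values[next_action]
```

With `epsilon` set to 0 the selection reduces to the greedy policy of the embodiment; a positive `epsilon` occasionally tries actions that currently look worse.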

The calculation unit only needs to be able to calculate the parameters to be learned by using machine learning, and the parameters may be at least one of the optical parameters, the image processing parameters, the operation parameters, and the force control parameters. Machine learning here means any process that learns better parameters from sample data; besides the reinforcement learning described above, each parameter may be learned by various techniques such as supervised learning or clustering.

The optical system can image the object. That is, it is configured to acquire an image whose field of view contains the region that includes the object. As described above, the optical system preferably includes an imaging unit and an illumination unit, and may include various other components. Also as described above, the imaging unit and the illumination unit may be movable by a robot arm, may be movable by a two-dimensional movement mechanism, or may be fixed. Naturally, the imaging unit and the illumination unit may be replaceable. The band of the light used in the optical system (the light detected by the imaging unit and the light output by the illumination unit) is not limited to the visible band; any electromagnetic wave, such as infrared, ultraviolet, or X-rays, may be used.

An optical parameter may be any value that can change the state of the optical system; a numerical value or the like that directly or indirectly specifies the state of an optical system composed of an imaging unit, an illumination unit, and so on is an optical parameter. For example, not only values indicating the position or angle of the imaging unit or illumination unit, but also values indicating the type of imaging unit or illumination unit (an ID, a model number, or the like) can be optical parameters.

The detection unit can detect the object based on the result of imaging by the optical system under the calculated optical parameters. That is, the detection unit operates the optical system with the learned optical parameters to image the object, and executes the object detection process based on the imaging result.

The detection unit only needs to be able to detect the object. Besides a configuration that detects the position and orientation of the object as in the embodiment described above, a configuration that detects only the presence or absence of the object may be used, and various other configurations can be adopted. The position and orientation of an object can be defined, for example, by six parameters: the positions along three axes and the rotation angles about those three axes. Naturally, any number of these parameters may be left out as needed. For example, for an object placed on a plane, at least one position parameter may be treated as known and excluded from detection. For an object placed on a plane in a fixed orientation, the orientation parameters may also be excluded from detection.
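
The idea of excluding known pose parameters from detection can be sketched as below. The `Pose` class, the field names, and both helper functions are hypothetical; they only illustrate how six parameters shrink to a smaller detection problem when some values are known in advance.

```python
from dataclasses import dataclass

# Positions along three axes and rotation angles about those axes.
POSE_FIELDS = ("x", "y", "z", "rx", "ry", "rz")


@dataclass(frozen=True)
class Pose:
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    rx: float = 0.0
    ry: float = 0.0
    rz: float = 0.0


def parameters_to_detect(known):
    """List the pose fields that still need detection, given known values."""
    invalid = set(known) - set(POSE_FIELDS)
    if invalid:
        raise KeyError(f"not pose fields: {sorted(invalid)}")
    return [name for name in POSE_FIELDS if name not in known]


def complete_pose(known, detected):
    """Merge known and detected values into a full six-parameter pose."""
    return Pose(**{**known, **detected})
```

For example, an object lying flat on a table at a known height leaves only `x`, `y`, and `rz` to detect when `z`, `rx`, and `ry` are supplied as known values.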

The object may be anything that is imaged by the optical system and detected; various objects are conceivable, such as a workpiece the robot works on, an object near the workpiece, or a part of the robot. Various techniques can also be adopted for detecting the object from the imaging result: the object may be detected by extracting image features, or by its motion (for example, detection of a moving body such as a person).

The control unit can control the robot based on the detection result for the object. That is, the control unit is configured to determine the control content of the robot according to the detection result. Accordingly, besides control for grasping the object as described above, various other kinds of robot control may be performed. For example, control that positions the robot relative to the object, or control that starts or stops a robot operation based on the object, is conceivable.

The robot may take various forms; besides the vertical articulated robot of the embodiment described above, it may be a Cartesian robot, a horizontal articulated robot, a dual-arm robot, or the like. Robots of various forms may also be combined. Naturally, the number of axes, the number of arms, the form of the end effector, and so on may vary. For example, the imaging unit 21 and the illumination unit 22 may be attached to a plane located above the robot 3 and be movable on that plane.

The state observation unit only needs to be able to observe the result of a change caused by a trial of an action or the like. The state may be observed by various sensors, or, when control is performed to change from one state to another, the other state may be regarded as observed as long as no control failure (an error or the like) is observed. Observation by sensors, the former case, includes acquiring images with an imaging sensor in addition to detecting positions and the like.

Furthermore, the actions, states, and rewards in the embodiment described above are examples; configurations that include other actions, states, or rewards, or that omit some actions or states, may be used. For example, for robots 1 and 2, in which the imaging unit 21 and the illumination unit 22 are replaceable, a change of the type of imaging unit 21 or illumination unit 22 may be selectable as an action, and the type may be observable as a state. The reward may also be determined based on the determination result of the contact determination unit 43c. That is, if, during learning by the learning unit 41b, the contact determination unit 43c determines that the robot has come into contact with an object that is not expected in the task, the reward for the immediately preceding action can be set to a negative value. With this configuration, the parameter 44a can be optimized so that the robot does not come into contact with unexpected objects.

As another example, when optimizing the optical parameters, the robots 1 to 3 may perform a task based on the detection result for the object (for example, the pickup task described above), and the learning unit 41b may evaluate the reward for an action based on the quality of the task that the robots 1 to 3 performed using that detection result. In the rewards shown in FIG. 7, for example, the success or failure of the task (for example, whether the pickup succeeded) may be used as the reward instead of, or in addition to, the detection of the object.

The success or failure of a task can be defined, for example, by the determination result of step S120 in a process for which success can be judged (such as the pickup process). In this case, the actions and states may include actions and states relating to the operation of the robots 1 to 3. Furthermore, in this configuration it is preferable to use as the state an image of the object that the robots 1 to 3 work on, captured by the optical system comprising the imaging unit 21 and the illumination unit 22. With this configuration, the optical parameters can be optimized so that the robot's task succeeds. Note that the image observed as the state for learning the optical parameters, operation parameters, or force control parameters may be the image captured by the imaging unit 21 as it is, or that image after image processing (for example, the smoothing or sharpening described above) has been applied.

Furthermore, rather than optimizing the optical parameters, operation parameters, and force control parameters separately, two or more of these kinds of parameters may be optimized together. For example, if the example shown in FIG. 7 is extended to include actions that change the operation parameters or force control parameters, those parameters can be optimized together with the optical parameters. In this case, the robots 1 to 3 are controlled based on the optimized operation parameters and force control parameters. With this configuration, the parameters for a task that involves detecting an object can be optimized, and learning that improves the detection accuracy for the object can be carried out.

An image processing parameter may be any value that can change the image obtained by imaging the object; the parameters are not limited to the examples shown in FIG. 3, and parameters may be added or removed. For example, numerical values that specify the image processing algorithm to be executed, such as whether each image process is applied, its intensity, and the order of the processes (including flags indicating the processing order), can be image processing parameters. More specifically, examples of image processing include binarization, straight line detection, circle detection, color detection, and OCR.

Furthermore, the image processing may combine several kinds of image processing. For example, circle detection and OCR may be combined into a process that recognizes characters inside a circle. In any case, parameters indicating whether each image process is applied and its intensity can be image processing parameters, and changes to these parameters can be actions.
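
A toy illustration of image processing parameters that encode presence, intensity, and order is given below. It operates on a flat list of pixel intensities rather than a real image, and the operations `smooth` and `binarize`, the `OPERATIONS` table, and the tuple layout of `params` are all assumptions made for this sketch.

```python
def binarize(pixels, threshold):
    """Map each pixel to 255 if it reaches the threshold, otherwise 0."""
    return [255 if p >= threshold else 0 for p in pixels]


def smooth(pixels, strength):
    """Blend each pixel with the mean of itself and its two neighbours.

    strength is in [0, 1]; 0 leaves the pixels unchanged.
    """
    out = []
    last = len(pixels) - 1
    for i, p in enumerate(pixels):
        left = pixels[max(i - 1, 0)]
        right = pixels[min(i + 1, last)]
        mean = (left + p + right) / 3.0
        out.append((1.0 - strength) * p + strength * mean)
    return out


OPERATIONS = {"smooth": smooth, "binarize": binarize}


def run_pipeline(pixels, params):
    """Apply the enabled steps in the listed order.

    params is a list of (name, enabled, value) tuples, so the order,
    the on/off flags, and the intensities are all parameters.
    """
    for name, enabled, value in params:
        if enabled:
            pixels = OPERATIONS[name](pixels, value)
    return pixels
```

In this sketch an action in the reinforcement learning sense would be a small edit to `params`, such as toggling a flag, changing a value, or swapping two steps.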

The operation parameters are not limited to those given in the embodiment described above. For example, the operation parameters to be learned may include a servo gain for control based on an inertial sensor provided in the robots 1 to 3. That is, in a configuration in which the motors M1 to M6 are controlled by a control loop based on the output of an inertial sensor, the servo gain in that control loop may be changed by an action. One example is a configuration in which the angular velocity of a specific part of the robots 1 to 3 is calculated from the encoders E1 to E6 attached to the robots 1 to 3, the angular velocity of that part is also detected by a gyro sensor (a kind of inertial sensor), and feedback control is performed by multiplying the difference between the two by a gyro servo gain; here the gyro servo gain is changed by an action. With this configuration, control can be performed that suppresses the vibration component of the angular velocity arising at the specific part of the robot. Naturally, the inertial sensor is not limited to a gyro sensor; where similar feedback control is performed with an acceleration sensor or the like, the acceleration gain may be changed by an action. These configurations make it possible to adjust automatically a servo gain for inertial-sensor-based control, which is difficult to set appropriately by hand. Note that an acceleration sensor detects the acceleration produced by the robot's motion, whereas the force sensor described above detects the force acting on the robot. The two are normally different sensors, but where one can substitute for the function of the other, it may serve as the other.
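
The gyro feedback term described above amounts to a gain multiplied by a difference of two angular velocities. The following sketch shows that term and a hypothetical action that nudges the gain. The sign convention (encoder-derived value minus gyro value), the function names, and the clamping range are assumptions and are not specified in the text.

```python
def gyro_feedback(encoder_rate, gyro_rate, gyro_servo_gain):
    """Feedback term: gain times the difference of the two angular velocities.

    encoder_rate: angular velocity of the part computed from the encoders.
    gyro_rate:    angular velocity of the same part measured by the gyro.
    """
    return gyro_servo_gain * (encoder_rate - gyro_rate)


def apply_gain_action(gain, delta, lower=0.0, upper=10.0):
    """An action changes the gain by delta; the result is clamped to a
    hypothetical permitted range [lower, upper]."""
    return max(lower, min(upper, gain + delta))
```

A learning unit would then try actions such as `apply_gain_action(gain, +0.1)` or `apply_gain_action(gain, -0.1)` and observe how the residual vibration changes.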

Naturally, the force control parameters are likewise not limited to those given in the embodiment described above, and the parameters to be learned may be selected as appropriate. For example, all or some of the six axis components of the target force may be made unselectable as actions (that is, fixed). One conceivable case is a task in which the robot inserts a gripped object into a fixed object (such as a narrow tube): the target force has components that are fixed relative to a certain point on the fixed object, while learning proceeds so that the force control coordinate system changes as the robot's insertion proceeds.

The learning unit 41b may be configured to evaluate the reward as negative in at least one of the following cases: the robot 3 drops the gripped object before the task is completed; part of the object that the robot 3 works on separates before the task is completed; the robot 3 is damaged; or the object that the robot 3 works on is damaged. When the reward is evaluated as negative if the robot 3 drops the gripped object before completing the task, operation parameters and force control parameters that are likely to complete the task without dropping the object can be calculated easily.

When the reward is evaluated as negative if part of the object that the robot 3 works on separates before the task is completed, operation parameters and force control parameters that are likely to complete the task without separating the object can be calculated easily. When the reward is evaluated as negative if the robot 3 is damaged, operation parameters and force control parameters that are unlikely to damage the robot 3 can be calculated easily.

When the reward is evaluated as negative if the object that the robot 3 works on is damaged, operation parameters and force control parameters that are unlikely to damage the object can be calculated easily. Whether the robot 3 has dropped the gripped object before completing the task, whether part of the object has separated before the task is completed, whether the robot 3 has been damaged, and whether the object has been damaged can each be detected by various sensors, for example the imaging unit 21.

Furthermore, the learning unit 41b may be configured to evaluate the reward as positive when the task performed by the robot 3 is completed normally. With this configuration, operation parameters and force control parameters that make the robot 3's task succeed can be calculated easily.
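
The negative and positive reward cases described in the preceding paragraphs could be combined as in the following sketch. The `TrialOutcome` class, its field names, the reward magnitudes of -1 and +1, and the choice to let any failure override a normal completion are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TrialOutcome:
    dropped_object: bool = False         # gripped object dropped before completion
    object_part_separated: bool = False  # part of the object separated before completion
    robot_damaged: bool = False
    object_damaged: bool = False
    completed_normally: bool = False


def evaluate_reward(outcome, penalty=-1.0, bonus=1.0):
    """Return a negative reward for any failure case, a positive reward for
    normal completion, and zero otherwise."""
    failures = (
        outcome.dropped_object,
        outcome.object_part_separated,
        outcome.robot_damaged,
        outcome.object_damaged,
    )
    if any(failures):
        return penalty
    if outcome.completed_normally:
        return bonus
    return 0.0
```

The flags themselves would have to come from sensors such as the imaging unit 21; how they are detected is outside the scope of this sketch.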

Furthermore, the position detection unit for detecting the position of the robot 3 is not limited to the encoders and force sensor of the embodiment described above; it may be another sensor, such as a dedicated inertial sensor, an optical sensor such as the imaging unit 21, or a distance sensor. The sensor may be built into the robot or placed outside the robot. Using a position detection unit placed outside the robot allows position information to be calculated without being affected by the robot's motion.

Furthermore, the calculation unit 41 may be configured to calculate operation parameters and force control parameters common to a plurality of different robot operations, based on those operations. The plurality of operations only needs to include operations executed using the optimized operation parameters. Accordingly, the plurality of operations may be several tasks of different kinds (a pickup task, a polishing task, a screw-tightening task, and so on) or several tasks of the same kind (for example, screw-tightening tasks with screws of different sizes). With this configuration, general-purpose operation parameters and force control parameters applicable to various operations can be calculated easily.

DESCRIPTION OF SYMBOLS: 1 to 3 ... robot, 20 ... optical system, 21 ... imaging unit, 22 ... illumination unit, 23 ... gripper, 40 ... control device, 41 ... calculation unit, 41a ... state observation unit, 41b ... learning unit, 42 ... detection unit, 43 ... control unit, 43a ... position control unit, 43b ... force control unit, 43c ... contact determination unit, 43d ... servo, 44 ... storage unit, 44a ... parameter, 44b ... robot program, 44c ... template data, 44d ... action information, 44e ... learning information

Claims

1. A control device comprising:
a calculation unit that uses machine learning to calculate image processing parameters relating to image processing performed on an image of an object captured by an imaging unit;
a detection unit that detects the object based on an image on which the image processing has been executed with the calculated image processing parameters; and
a control unit that controls a robot based on the detection result for the object.

2. The control device according to claim 1, wherein the detection unit detects a position and orientation of the object.

3. The control device according to claim 1 or 2, wherein the calculation unit includes:
a state observation unit that observes, as a state variable, at least the image on which the image processing has been executed with the image processing parameters; and
a learning unit that learns the image processing parameters based on the image as the state variable.

4. The control device according to claim 3, wherein the learning unit determines, based on the image as the state variable, an action that changes the image processing parameters, and thereby optimizes the image processing parameters.

5. The control device according to claim 4, wherein the learning unit evaluates a reward for the action based on the quality of the detection result for the object.

6. The control device according to claim 4 or 5, wherein the learning unit evaluates a reward for the action based on the quality of work performed by the robot based on the detection result for the object.

7. The control device according to any one of claims 4 to 6, wherein the calculation unit optimizes the image processing parameters by repeating observation of the state variable, determination of the action according to the state variable, and evaluation of the reward obtained by the action.

8. The control device according to any one of claims 1 to 7, wherein the calculation unit uses machine learning to calculate operation parameters relating to the operation of the robot, and the control unit controls the robot based on the operation parameters.

9. The control device according to claim 8, wherein the calculation unit calculates the image processing parameters and the operation parameters based on an image, captured by the imaging unit, of the object that is the work target of the robot.

10. A robot controlled by the control device according to claim 1.

11. A robot system comprising: the control device according to any one of claims 1 to 9; the robot controlled by the control device; and the imaging unit.