JP2022018654A - Assembly work order planning device, and assembly work order planning method

Info

Publication number: JP2022018654A
Application number: JP2020121910A
Authority: JP
Inventors: 利浩森澤; Toshihiro Morisawa
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-07-16
Filing date: 2020-07-16
Publication date: 2022-01-27
Anticipated expiration: 2040-07-16
Also published as: JP7474653B2; WO2022014128A1

Abstract

To plan the assembly order of the components constituting a product and the work order of work subjects that may include robots and workers. SOLUTION: An assembly work order planning device includes: an information acquisition unit that acquires assembly state transition information, including information indicating the process by which a plurality of components are assembled, via sub-assemblies, into a final product; an assembly work definition unit that defines, on the basis of the assembly state transition information, work that a work subject can perform on an assembly state comprising two of the components or sub-assemblies before they are joined; a constraint condition setting unit that sets constraint conditions on whether the work can be performed; a learning unit that, in accordance with the constraint conditions, learns by reinforcement learning how to select the work for the assembly state; and an assembly work order generation unit that generates the assembly work order of the product on the basis of the result of the reinforcement learning. SELECTED DRAWING: Figure 1

Description

The present invention relates to an assembly work order planning device and an assembly work order planning method.

In assembling a product made up of a plurality of parts, an efficient work method and order are planned in advance. At an assembly site, workers and robots may work separately, or they may work side by side. Where workers and robots work together, it is important to make assembly work order planning more efficient and thereby raise production efficiency. Robot work in particular must be fully defined in advance. As the robot configurations of lines and cells diversify with the spread and falling cost of robots, automation of assembly work order planning is desired.

Regarding methods of planning an assembly work order, Patent Document 1, for example, describes a technique that extracts, from a three-dimensional CAD (Computer Aided Design) model of a product, the part attributes, part arrangement, and adjacency relationships with other parts for each part; generates the joining precedence relationships between parts as a directed graph; generates an assembly graph from the adjacency information; and obtains a disassembly order for the parts, the reverse of which is taken as the assembly order.

Patent Document 2, for example, describes a technique that takes three-dimensional CAD data as input, plans the assembly work with a task planner, models the assembly work with a Petri net, and generates a robot program offline by searching for an optimal path.

Further, Patent Document 3, for example, describes a technique in which a simulation of a robot's work motion is executed and a control command is judged on the basis of the execution result, and the control command is learned using, as training data, result labels for the cases where the judgment is good and where it is bad.

Patent Document 1: Japanese Patent No. 6199210
Patent Document 2: Japanese Patent No. 3705672
Patent Document 3: Japanese Patent No. 6457421

The technique described in Patent Document 1 generates an assembly order using a directed graph and an assembly graph based on the joining precedence and adjacency relationships between parts extracted from three-dimensional CAD data, so an enormous number of assembly orders is generated depending on the number of parts in the product. In addition, the joining precedence relationships and the like must be set manually, which makes it difficult to determine an assembly order efficiently. Furthermore, the technique cannot generate a work order for the work subjects.

The technique described in Patent Document 2 includes an assembly work planning method in which the assembly order is layered into a product state level, an object movement level, and a hand movement level and is modeled with a Petri net. The assembly order is fixed once the Petri net has been constructed, but because the technique provides no way to construct (model) the Petri net automatically, it cannot generate an assembly work order for work subjects consisting of robots and workers.

The technique described in Patent Document 3 uses reinforcement learning to learn control commands for machines including robots, and advances the learning by running simulations of work motions. Because the assembly work order is already defined when the series of work motions is specified, the technique does not generate an assembly work order.

The present invention has been made in view of the above points, and its object is to make it possible to plan the assembly order of the parts constituting a product and the work order of work subjects that may include robots and workers.

The present application includes a plurality of means for solving at least part of the above problem; one example is as follows.

To solve the above problem, an assembly work order planning device according to one aspect of the present invention comprises: an information acquisition unit that acquires assembly state transition information including information representing the process by which a plurality of parts are assembled, via sub-assemblies, into a final product; an assembly work definition unit that, based on the assembly state transition information, defines work that a work subject can perform on an assembly state consisting of two of the parts or sub-assemblies before they are joined; a constraint condition setting unit that sets constraint conditions on whether the work can be performed; a learning unit that, in accordance with the constraint conditions, learns by reinforcement learning how to select the work for the assembly state; and an assembly work order generation unit that generates an assembly work order for the product based on the result of the reinforcement learning.

According to the present invention, it is possible to plan the assembly order of the parts constituting a product and the work order of work subjects that may include robots and workers.

Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

FIG. 1 is a diagram showing a configuration example of an assembly work order planning device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of an assembly work environment.
FIG. 3 is a diagram showing an example of a product composed of a plurality of parts.
FIG. 4 is a diagram showing an AND/OR tree representing the assembly state transitions of the product shown in FIG. 3.
FIG. 5 is a diagram showing a list of the transitions to each assembly state corresponding to FIG. 4.
FIG. 6 is a diagram showing an example of an assembly work order, and the assembly state for each work, with one robot set as the work subject, based on the assembly state transitions of the product.
FIG. 7 is a diagram showing an example of an assembly work order, and the assembly state for each work, with one robot set as the work subject, based on the assembly state transitions and the assembly work environment.
FIG. 8 is a diagram showing an example of an assembly work order, and the assembly state for each work, with two robots set as the work subjects, based on the assembly state transitions and the assembly work environment.
FIG. 9 is a flowchart illustrating an example of the assembly work order planning process performed by the assembly work order planning device according to the first embodiment.
FIG. 10 is a diagram showing a configuration example of an assembly work order planning device according to a second embodiment of the present invention.
FIG. 11 is a diagram showing an example of a product composed of a plurality of parts.
FIG. 12 is a diagram showing a list of the transitions to each assembly state corresponding to the assembly work of FIG. 11.
FIG. 13 is a diagram showing an example of an assembly work order, and the assembly state for each work, with one robot and one worker set as the work subjects, based on the assembly state transitions of the product and the assembly work environment.
FIG. 14 is a flowchart illustrating an example of the assembly work order planning process performed by the assembly work order planning device according to the second embodiment.
FIG. 15 is a diagram for explaining the outline of A3C.
FIG. 16 is a diagram for explaining the training process for the action selection function and the state value function.

Hereinafter, a plurality of embodiments of the present invention are described with reference to the drawings. In all the drawings used to describe the embodiments, the same members are in principle given the same reference numerals, and repeated description of them is omitted. In the following embodiments, it goes without saying that the constituent elements (including element steps and the like) are not necessarily essential, except where explicitly stated or where they are clearly essential in principle. Likewise, expressions such as "consisting of A", "made of A", "having A", and "including A" do not exclude other elements, except where it is explicitly stated that only that element is meant. Similarly, references in the following embodiments to the shape, positional relationship, and the like of constituent elements include those that substantially approximate or resemble that shape and the like, except where explicitly stated or where this is clearly not the case in principle.

<First Embodiment>
FIG. 1 shows a configuration example of an assembly work order planning device 10 according to the first embodiment of the present invention.

The assembly work order planning device 10 plans the assembly work order used when work subjects, which may include robots and workers, assemble parts to complete a product.

The assembly work order planning device 10 includes the following functional blocks: a calculation unit 11, a storage unit 12, an input unit 13, an output unit 14, and a communication unit 15.

The assembly work order planning device 10 consists of a general-purpose computer, such as a personal computer, that includes a processor such as a CPU (Central Processing Unit), memory such as DRAM (Dynamic Random Access Memory), storage such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), input devices such as a keyboard, mouse, and touch panel, an output device such as a display, and a communication module such as an NIC (Network Interface Card).

The calculation unit 11 is realized by the processor of the computer. The calculation unit 11 has the following functional blocks: an information acquisition unit 111, an assembly work definition unit 112, a constraint condition setting unit 113, an action selection / value function configuration unit 114, a reward setting unit 115, a learning unit 116, and an assembly work order generation unit 117. These functional blocks are realized by the processor executing a predetermined program loaded into memory. However, some or all of these functional blocks may be realized in hardware, for example as an integrated circuit.

The information acquisition unit 111 acquires assembly work environment / product information 121 and assembly state transition information 122, via the communication unit 15, from a CAD (Computer Aided Design) system 20 connected to a network 1 such as the Internet or a mobile phone network, and stores them in the storage unit 12.

Here, the assembly work environment / product information 121 includes information representing the shape and position of the objects present in the assembly work environment (for example, robots, stages, part trays, transfer devices, hands and tools attached to robot arms, and frame structures), which have been modeled in advance as CAD data by the CAD system 20. The assembly work environment / product information 121 also includes information on the plurality of parts constituting the product.

The assembly state transition information 122 includes information representing the process by which a plurality of parts are assembled, via sub-assemblies, into the final product, together with contact constraint conditions. A contact constraint condition expresses a restriction on the direction in which a part can move that results from contact between parts. The contact constraint conditions yield the transition relationships between assembly states during the assembly process.

The assembly work definition unit 112 refers to the assembly state transition information 122, defines assembly work for each assembly state consisting of two parts or sub-assemblies before they are joined, and stores the defined assembly work in the storage unit 12 as assembly work information 123. The assembly work definition unit 112 can also set the pre-work state of the assembly work environment (robot hands, trays, stage, and so on). In addition, it can define work that changes the state of the assembly work environment, such as the state of the stage.

The constraint condition setting unit 113 sets constraint conditions and stores them in the storage unit 12 as constraint condition information 124. Setting a constraint condition means associating an assembly state with the work that cannot be performed in it. When assembly work is defined from the assembly state transitions and a work is selected for an assembly state during reinforcement learning, the constraint conditions cause the work to be judged unselectable if the assembly state is not one that was assumed as the precondition of that work. The same applies to selecting work for a state of the assembly work environment.

Constraint conditions may also be set on the order of the assembly work. For example, if a certain work must be performed before a certain part is assembled, that work must be selected in a state in which the part is present. Conversely, if a certain work must be performed after a certain part has been assembled, that work must be selected in a state in which the part is not present.

The action selection / value function configuration unit 114 constructs the functions used in the reinforcement learning that selects work for each assembly state and sets the assembly work order: an action selection function expressing the relationship between state and action, and a value function expressing the relationship between state and state value. Constructing these functions requires the state and action items, which are obtained from the assembly state transition information 122 and the assembly work information 123. In this embodiment, the assembly state corresponds to the state, and the work corresponds to the action, in the terminology of reinforcement learning.

Many algorithms exist for realizing reinforcement learning, including tabular Q-learning, function approximation methods, and modifications and extensions of these. This embodiment adopts A3C (Asynchronous Advantage Actor-Critic), an on-policy method and one example of deep reinforcement learning that uses a deep neural network as the function approximator. In A3C, a neural network is constructed that takes the state as input and outputs the selected action and the value function. The number of intermediate layers of the neural network and the number of nodes in each layer are set in advance.
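The network shape described above (state in, action selection and state value out) can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the binary state encoding, the layer sizes, the omission of bias terms, the random initial weights, and every name in the code (`ActorCritic`, `encode`, and so on) are assumptions made here, and no training step is shown.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Weight matrix with small random values (bias terms omitted for brevity)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def dense(weights, x):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(z):
    """Numerically stable softmax, turning scores into probabilities."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class ActorCritic:
    """One shared hidden layer with a policy head and a value head."""

    def __init__(self, n_state, n_hidden, n_action, seed=0):
        rng = random.Random(seed)
        self.hidden = make_layer(n_state, n_hidden, rng)
        self.policy_head = make_layer(n_hidden, n_action, rng)
        self.value_head = make_layer(n_hidden, 1, rng)

    def forward(self, state_vec):
        h = [math.tanh(v) for v in dense(self.hidden, state_vec)]
        probs = softmax(dense(self.policy_head, h))   # action selection
        value = dense(self.value_head, h)[0]          # state value
        return probs, value

def encode(state, parts="ABCDE"):
    """Assumed encoding: 1.0 for each part already assembled, else 0.0."""
    return [1.0 if p in state else 0.0 for p in parts]
```

For a hypothetical five-part product, `ActorCritic(5, 8, 5).forward(encode("AB"))` returns one probability per candidate work and a single scalar value estimate.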

Another deep reinforcement learning method is DQN (Deep Q-Network), an off-policy method, in which the action value function is constructed as a deep neural network. In tabular Q-learning, a Q-table (array) that associates states with action values is constructed. Any of these methods may be adopted, but reinforcement learning requires a mechanism that derives from the state both the action selection and a value for evaluating that selection. Reinforcement learning is described later with reference to FIGS. 15 and 16.

The reward setting unit 115 sets a negative reward when a work is selected that cannot be performed in the current assembly state because it violates a constraint condition. It sets a positive reward when the product is completed and the assembly-complete state is reached.

The learning unit 116 executes reinforcement learning by repeating episodes. An episode is a series of actions that starts from the initial assembly state and ends either in assembly failure, when a work selection for an assembly state fails, or in assembly success, when the work selections succeed and the assembly work is completed.

Specifically, episodes are repeated so that the action selection for each state is learned, and reinforcement learning is complete once assembly-success results are obtained. Completing reinforcement learning yields an action selection function that selects a good action for each state. Within one episode, an action is selected for the initial state and the next state is obtained. A further action is then selected and the state after that is obtained. Action selection is repeated step by step in this way to advance the state. If a wrong action is selected, or a state that is not permitted is reached, a negative reward is obtained and the episode ends. Conversely, if the success state is reached, a positive reward is obtained and the episode ends. Hereinafter, learning may also be referred to as training.
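The episode loop can be illustrated with a small sketch. It deliberately uses tabular Q-learning, which the description names as one available algorithm, rather than the A3C method adopted in the embodiment, because a table is easier to follow. The five-part product (base A, plates B and C, screws D and E), the reward values of -1 and +1, the hyperparameters, and all function names are assumptions made for this illustration.

```python
import random

# Hypothetical works: each attaches one part to the current assembly.
PART_OF = {"W2": "B", "W3": "C", "W4": "D", "W5": "E"}
# Parts that must already be in place before each part can be attached.
NEEDS = {"B": "A", "C": "A", "D": "AB", "E": "AC"}
GOAL = "ABCDE"

def step(state, work):
    """Apply a work to an assembly state; return (next_state, reward, done)."""
    part = PART_OF[work]
    if part in state or any(p not in state for p in NEEDS[part]):
        return state, -1.0, True            # constraint violated: episode fails
    nxt = "".join(sorted(state + part))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Repeat episodes from the initial state, updating a Q-table."""
    rng = random.Random(seed)
    works = list(PART_OF)
    q = {}
    for _ in range(episodes):
        state, done = "A", False            # base part A assumed on the stage
        while not done:
            values = q.setdefault(state, {w: 0.0 for w in works})
            if rng.random() < epsilon:      # occasional random exploration
                work = rng.choice(works)
            else:
                work = max(works, key=values.get)
            nxt, reward, done = step(state, work)
            if done:
                future = 0.0
            else:
                future = max(q.setdefault(nxt, {w: 0.0 for w in works}).values())
            values[work] += alpha * (reward + gamma * future - values[work])
            state = nxt
    return q

def greedy_order(q):
    """Follow the highest-valued work from the initial state."""
    state, order, done = "A", [], False
    while not done and len(order) < 10:
        work = max(q[state], key=q[state].get)
        order.append(work)
        state, _, done = step(state, work)
    return order, state
```

Calling `greedy_order(train())` returns the learned list of works together with the final assembly state, so the same table serves both for learning and for reading out a work order afterwards.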

The assembly work order generation unit 117 generates the assembly work order based on the result of the reinforcement learning performed by the learning unit 116. Specifically, using the action selection function obtained through reinforcement learning, it generates as the assembly work order the series of actions selected on the way from the initial assembly state to the assembly-success state in which the assembly work is complete. It also generates the assembly state and the state of the assembly work environment at each work stage.

The storage unit 12 is realized by the memory and storage of the computer. The storage unit 12 stores the assembly work environment / product information 121, the assembly state transition information 122, the assembly work information 123, and the constraint condition information 124. Other information may also be stored in the storage unit 12.

The input unit 13 is realized by the input devices of the computer. The input unit 13 accepts various operations from an operator (user). The output unit 14 is realized by the output device of the computer. The output unit 14 displays, for example, an operation input screen. The communication unit 15 is realized by the communication module of the computer. The communication unit 15 connects to the CAD system 20 via the network 1 and receives predetermined information from the CAD system 20.

The CAD system 20 supplies the assembly work environment / product information 121 and the assembly state transition information 122 in response to requests from the assembly work order planning device 10.

Next, FIG. 2 shows an example of an assembly work environment.

This assembly work environment contains two robots R1 and R2, a stage 312 provided on a pedestal 311, trays 321 and 322, and hand stands 331 and 332.

The robot R1 has an interchangeable hand, and in the figure a hand 303 is attached. Similarly, the robot R2 has an interchangeable hand, and in the figure a hand 304 is attached. In this embodiment, a hand means the end effector of a robot (manipulator) and is not limited to one with a gripping structure. Hands also include tools such as a screwdriver.

The trays 321 and 322 hold the parts that, in the next work, will be moved to and placed on the stage 312 to be assembled. The hand stands 331 and 332 hold replacement hands 333 and 334, which can be swapped with the currently attached hands 303 and 304.

In the assembly work environment of the figure, the two robots R1 and R2 can operate at the same time to assemble the parts placed on the stage 312.

Next, FIG. 3 shows an example of a product 40 whose assembly work order is planned by the assembly work order planning device 10. The product 40 is completed by placing plate parts B and C on a base part A and fastening them with screw parts D and E, respectively.

FIG. 4 shows an AND/OR tree representing the assembly state transitions of the product 40 shown in FIG. 3.

The AND/OR tree has a tree structure. Each assembly state, which represents a single part, a sub-assembly of two or more parts, or the product, is represented by a node drawn as an ellipse. A transition from one assembly state to the next is represented by an edge connecting one node to another.

There are 13 assembly states in total: five single-part states (base part A, plate parts B and C, and screw parts D and E); two states for the two-part sub-assemblies AB and AC; three states for the three-part sub-assemblies ABC, ABD, and ACE; two states for the four-part sub-assemblies ABCD and ABCE; and one state for the finished product ABCDE, which consists of five parts.

For example, node A represents the assembly state in which the base part A exists on its own. Attaching the plate part B to the base part A causes a transition to node AB, which represents the assembly state of the sub-assembly AB. Similarly, node AB represents the assembly state of the sub-assembly in which the plate part B has been attached to the base part A. Attaching the plate part C to the sub-assembly AB causes a transition to node ABC, which represents the assembly state of the sub-assembly ABC.

The transitions between nodes in the AND/OR tree reflect the contact constraint conditions between parts. For example, the screw part D cannot be attached unless the plate part B has been placed on the base part A. Accordingly, the work of attaching the screw part D can be selected only if the state has already transitioned to a node in which the plate part B is placed on the base part A: node AB (sub-assembly AB), node ABC (sub-assembly ABC), or node ABCE (sub-assembly ABCE).

FIG. 5 shows a list of the transitions to each assembly state in the AND/OR tree shown in FIG. 4.

For the product 40, either plate part B or plate part C may be attached to the base part A first. Likewise, either screw part D or screw part E may be attached first to the sub-assembly ABC. There are therefore six assembly orders leading from the single parts to the finished product, and 12 transitions between assembly states, listed as No. 1 to No. 12 in FIG. 5.
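The counts of six assembly orders and 12 transitions can be checked with a short enumeration. The sketch below encodes only the contact constraints stated in the text (plates B and C need base A; screw D needs plate B; screw E needs plate C). The `REQUIRES` table and the function names are illustrative assumptions, and the numbering of the transitions in FIG. 5 is not reproduced.

```python
from itertools import permutations

# A part can be attached only after the parts listed here are in place.
REQUIRES = {"B": {"A"}, "C": {"A"}, "D": {"A", "B"}, "E": {"A", "C"}}

def attachable(state, part):
    """True if `part` is not yet assembled and its prerequisites are present."""
    return part not in state and REQUIRES[part] <= state

def enumerate_transitions():
    """Walk outward from base part A, collecting every reachable transition."""
    start = frozenset("A")
    seen, frontier, transitions = {start}, [start], []
    while frontier:
        state = frontier.pop()
        for part in "BCDE":
            if attachable(state, part):
                nxt = state | {part}
                transitions.append(
                    ("".join(sorted(state)), part, "".join(sorted(nxt)))
                )
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return sorted(transitions)

def enumerate_orders():
    """All orders of attaching B, C, D, E to A that respect the constraints."""
    orders = []
    for order in permutations("BCDE"):
        state = {"A"}
        valid = True
        for part in order:
            if not attachable(state, part):
                valid = False
                break
            state.add(part)
        if valid:
            orders.append("".join(order))
    return orders
```

Each entry returned by `enumerate_transitions()` is a triple of the state before, the part attached, and the state after, for example `("AB", "D", "ABD")`.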

Next, a method of setting the assembly work order based on the assembly state transitions (AND/OR tree) of the product 40 (FIG. 4) is described.

The assembly work environment of FIG. 2 contains two robots R1 and R2. The case where the work subject is a single robot (one of robots R1 and R2) is described first, followed by the case where the work subjects are a plurality of (for example, two) robots.

なお、前提として、１台のロボットによる１回の作業により、組立状態が１つだけ遷移するものとする。 As a premise, it is assumed that only one assembly state is changed by one operation by one robot.

The works defined based on the assembly state transitions in FIG. 4 are the following works W1 to W5.
Work W1: Assembly of base component A.
Work W2: Assembly of plate component B.
Work W3: Assembly of plate component C.
Work W4: Assembly of screw component D.
Work W5: Assembly of screw component E.

The constraint conditions and rewards for the assembly work are derived directly from the assembly state transitions. For example, work W4 (assembly of screw component D) requires that one of the assembly states sub-assembly AB, sub-assembly ABC, or sub-assembly ABCE exist before execution. Therefore, a negative reward is set for selecting work W4 in a state where none of sub-assembly AB, sub-assembly ABC, and sub-assembly ABCE exists.

Further, for example, if base component A is assumed to be already placed on the stage 312 in the initial state of the assembly work, a negative reward is set for selecting work W1 as the first work of the assembly.
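The way a constraint turns into a negative reward can be sketched as a lookup of permitted pre-states. This is a minimal illustration, not the device's implementation: the W4 entry follows the text, the W2 entry follows the pre-states given later for work W2, the W3 and W5 entries are filled in by symmetry as an assumption, and the penalty value `PENALTY` is hypothetical.

```python
# Assembly states in which each work may be selected (single robot, product 40).
# W1 has no permitted pre-state because base A is assumed to be on the stage already.
PRE_STATES = {
    "W1": set(),
    "W2": {"A", "AC", "ACE"},
    "W3": {"A", "AB", "ABD"},      # assumed by symmetry with W2
    "W4": {"AB", "ABC", "ABCE"},
    "W5": {"AC", "ABC", "ABCD"},   # assumed by symmetry with W4
}
PENALTY = -1.0  # hypothetical magnitude; the source only says "negative reward"

def constraint_reward(state, work):
    """Return 0.0 if `work` is permitted in `state`, otherwise the negative reward."""
    return 0.0 if state in PRE_STATES[work] else PENALTY

print(constraint_reward("A", "W4"))    # -1.0: plate B is not yet in place
print(constraint_reward("ABC", "W4"))  # 0.0
print(constraint_reward("A", "W1"))    # -1.0: base A is already on the stage
```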

If the work of placing base component A on the stage 312 is also to be planned, two states may be provided for base component A, namely a state of being placed on the tray 321 (or 322) and a state of being placed on the stage 312, so that executing work W1 in the former state causes a transition to the latter state.

図６は、製品４０の組立状態遷移（図４）に基づいて設定した組立作業順序の一例、及び作業毎の組立状態を示している。 FIG. 6 shows an example of the assembly work order set based on the assembly state transition (FIG. 4) of the product 40, and the assembly state for each work.

該組立作業順序は、作業主体を１台のロボットのみとし、製品４０の組立状態に基づいて、作業Ｗ１～業Ｗ５の５種類の作業を選択させる強化学習によって設定される。 The assembly work order is set by reinforcement learning in which the work subject is only one robot and five types of work, work W1 to work W5, are selected based on the assembly state of the product 40.

The assembly work order shown in the figure consists of work steps 0 to 4, where step 0 is the initial state and step 4 is the completed state. A state value of 0 for a component or the like means that it does not exist, and a state value of 1 means that it exists. The initial state of the assembly work is defined as the state in which each component exists as a single unit and no sub-assembly or product exists. The completed state of the assembly work is defined as the state in which the product ABCDE exists.

作業ステップ０番は初期状態である。次の作業ステップ１番では、ロボットが作業Ｗ２（板部品Ｂの組付）を実行する。これにより、ベース部品Ａ、及び板部品Ｂが消滅し、部組品ＡＢが出現する。 Work step 0 is the initial state. In the next work step 1, the robot executes work W2 (assembly of the plate component B). As a result, the base component A and the plate component B disappear, and the assembly product AB appears.

In the next work step 2, the robot executes work W3 (assembly of plate component C). As a result, plate component C and sub-assembly AB disappear, and sub-assembly ABC appears. In the next work step 3, the robot executes work W4 (assembly of screw component D). As a result, screw component D and sub-assembly ABC disappear, and sub-assembly ABCD appears. In the next work step 4, the robot executes work W5 (assembly of screw component E). As a result, screw component E and sub-assembly ABCD disappear, and the product ABCDE appears. The completed state is thereby obtained, and the assembly work ends.
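The 0/1 state values that change at each work step can be sketched as follows. The column list `ITEMS` is an assumption (FIG. 6 itself is not reproduced here), and `apply_work` is a hypothetical helper that makes the attached part and the current body disappear and the resulting assembly appear.

```python
# Hypothetical column set: single components plus the assemblies reachable in FIG. 4.
ITEMS = ["A", "B", "C", "D", "E", "AB", "AC", "ABC", "ABD", "ACE", "ABCD", "ABCE", "ABCDE"]
PART_OF_WORK = {"W2": "B", "W3": "C", "W4": "D", "W5": "E"}

def initial_state():
    """State value 1 for every single component, 0 for sub-assemblies and the product."""
    return {item: int(len(item) == 1) for item in ITEMS}

def apply_work(state, work):
    """Consume the current body and the attached part, and create the resulting assembly."""
    part = PART_OF_WORK[work]
    # The body is whichever item containing base A currently exists.
    body = next(item for item in ITEMS if "A" in item and state[item] == 1)
    result = "".join(sorted(body + part))
    if state[part] != 1 or result not in state:
        raise ValueError(f"{work} cannot be executed on {body}")
    new_state = dict(state)
    new_state[part] = 0
    new_state[body] = 0
    new_state[result] = 1
    return new_state

state = initial_state()
for work in ["W2", "W3", "W4", "W5"]:  # work steps 1 to 4
    state = apply_work(state, work)
print([item for item, value in state.items() if value == 1])  # ['ABCDE']
```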

Work W2 (assembly of plate component B) and work W3 (assembly of plate component C) may be executed in either order; their order cannot be determined from the assembly state transitions alone.

Work W2 can be selected only in the assembly state of base component A, sub-assembly AC, or sub-assembly ACE. If a constraint condition were set requiring plate component B to be assembled before plate component C, the only assembly state in which work W2 could be selected would be base component A alone.

次に、製品４０の組立状態遷移に加え、組立作業環境の状態にも基づいて作業を設定する場合の例を説明する。組立作業環境の状態としては、ロボットのハンドを交換する作業を導入する。 Next, an example of setting the work based on the state of the assembly work environment in addition to the assembly state transition of the product 40 will be described. As the state of the assembly work environment, the work of exchanging the hands of the robot is introduced.

ネジ部品Ｄ，Ｅを組付ける場合、ロボットにはドライバハンドを装着し、板部品Ｂ，Ｃを組付ける場合、ロボットにはグリップハンドを装着するものとする。 When assembling the screw parts D and E, the driver hand shall be attached to the robot, and when assembling the plate parts B and C, the grip hand shall be attached to the robot.

In this case, "grip hand" and "driver hand" are added as states of the assembly work environment relating to the robot's hand, and the following works W6 to W8 are defined in addition to the works W1 to W5 described above.

Work W6: Attachment of the grip hand.
Work W7: Attachment of the driver hand.
Work W8: Removal of the hand.

そして、前提条件として、作業Ｗ２（板部品Ｂの組付），Ｗ３（板部品Ｃの組付）については、グリップハンドが装着されている状態である場合にのみ作業可能であると設定する。同様に、作業Ｗ４（ネジ部品Ｄの組付），Ｗ５（ネジ部品Ｅの組付）については、ドライバハンドが装着されている状態である場合にのみ作業可能であると設定する。 Then, as a precondition, it is set that the work W2 (assembly of the plate component B) and W3 (assembly of the plate component C) can be performed only when the grip hand is attached. Similarly, it is set that the work W4 (assembly of the screw component D) and W5 (assembly of the screw component E) can be performed only when the driver hand is attached.

作業Ｗ６（グリップハンドの装着），Ｗ７（ドライバハンド）の装着については、ロボットにハンドが装着されていない状態である場合にのみ作業可能であると設定する。作業Ｗ８（ハンドの取り外し）については、ロボットにグリップハンドまたはドライバハンドが装着されている状態である場合にのみ作業可能であると設定する。 It is set that the work W6 (attachment of the grip hand) and W7 (attachment of the driver hand) can be performed only when the hand is not attached to the robot. Regarding the work W8 (removal of the hand), it is set that the work can be performed only when the grip hand or the driver hand is attached to the robot.

The initial state of the assembly work is defined as the state in which each component exists as a single unit and neither the grip hand nor the driver hand is attached to the robot. The completed state of the assembly work is defined as the state in which the product ABCDE has been obtained and no hand is attached to the robot.

なお、組立作業環境には、ドライバハンド及びグリップハンド以外のハンドを準備してもよい。さらに、１台のロボットが複数のアームを有し、ドライバハンド及びグリップハンドを同時に装着できるようにしてもよい。 A hand other than the driver hand and the grip hand may be prepared in the assembly work environment. Further, one robot may have a plurality of arms so that a driver hand and a grip hand can be attached at the same time.

図７は、製品４０の組立状態遷移（図４）、及び組立作業環境（ロボットのハンドの状態）に基づいて設定した組立作業順序の一例、及び作業毎の組立状態を示している。 FIG. 7 shows an example of the assembly work order set based on the assembly state transition (FIG. 4) of the product 40, the assembly work environment (the state of the robot hand), and the assembly state for each work.

該組立作業順序は、作業主体を１台のロボットのみとし、製品４０の組立状態及び組立作業環境に基づいて、作業Ｗ１～Ｗ８までの８種類の作業を選択させる強化学習によって設定される。 The assembly work order is set by reinforcement learning in which the work subject is only one robot and eight types of work from work W1 to W8 are selected based on the assembly state and the assembly work environment of the product 40.

同図に示す組立作業順序は作業ステップ０番から８番まであり、０番が初期状態、８番が完了状態である。部品Ａ等に対応して記載されている状態値は図６の場合と同様である。 The assembly work order shown in the figure is from work steps 0 to 8, where 0 is the initial state and 8 is the completed state. The state values described corresponding to the component A and the like are the same as in the case of FIG.

For example, in the initial state of work step 0, only single components exist, and neither the grip hand nor the driver hand is attached to the robot.

In the next work step 1, the robot executes work W6 (attachment of the grip hand). This makes works W2 (assembly of plate component B) and W3 (assembly of plate component C) executable by the robot. In the next work step 2, the robot executes work W2. As a result, base component A and plate component B disappear, and sub-assembly AB appears. In the next work step 3, the robot executes work W3. As a result, plate component C and sub-assembly AB disappear, and sub-assembly ABC appears.

次の作業ステップ４番では、ロボットが作業Ｗ８（ハンドの取り外し）を実行する。これにより、ロボットによる作業Ｗ６（グリップハンドの装着），Ｗ７（ドライバハンドの装着）が実行可能となる。次の作業ステップ５番では、ロボットが作業Ｗ７を実行する。これにより、ロボットによる作業Ｗ４（ネジ部品Ｄの組付），Ｗ５（ネジ部品Ｅの組付）が実行可能となる。 In the next work step 4, the robot executes work W8 (removal of the hand). As a result, the work W6 (attachment of the grip hand) and W7 (attachment of the driver hand) by the robot can be executed. In the next work step 5, the robot executes the work W7. As a result, the work W4 (assembly of the screw component D) and W5 (assembly of the screw component E) by the robot can be executed.

In the next work step 6, the robot executes work W4. As a result, screw component D and sub-assembly ABC disappear, and sub-assembly ABCD appears. In the next work step 7, the robot executes work W5. As a result, screw component E and sub-assembly ABCD disappear, and the product ABCDE appears. In the next work step 8, the robot executes work W8 (removal of the hand). The completed state is thereby obtained, and the assembly work ends.
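The hand-related preconditions and the eight-step order described above can be checked with a small state machine. This is a sketch under stated assumptions: the hand state is represented as `None`, `"grip"`, or `"driver"`, and `next_hand` is a hypothetical helper that raises an error where the learning procedure would instead assign a negative reward.

```python
GRIP_WORKS = {"W2", "W3"}    # plate components B and C need the grip hand
DRIVER_WORKS = {"W4", "W5"}  # screw components D and E need the driver hand

def next_hand(hand, work):
    """Return the hand state after `work`, or raise ValueError if `work` is not allowed."""
    if work in ("W6", "W7"):
        if hand is not None:
            raise ValueError(f"{work} requires that no hand is attached")
        return "grip" if work == "W6" else "driver"
    if work == "W8":
        if hand is None:
            raise ValueError("W8 requires an attached hand")
        return None
    if work in GRIP_WORKS:
        required = "grip"
    elif work in DRIVER_WORKS:
        required = "driver"
    else:
        required = None  # W1 is never allowed here: base A is already placed
    if required is None or hand != required:
        raise ValueError(f"{work} is not allowed with hand={hand}")
    return hand

SEQUENCE = ["W6", "W2", "W3", "W8", "W7", "W4", "W5", "W8"]  # work steps 1 to 8

hand = None
for work in SEQUENCE:
    hand = next_hand(hand, work)
print(hand)  # None: no hand is attached in the completed state
```

The sketch tracks only the hand; the assembly-state preconditions are checked separately.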

次に、作業主体を複数のロボット（一例として、図２の２台のロボットＲ１，Ｒ２）とする場合について説明する。 Next, a case where the work subject is a plurality of robots (for example, the two robots R1 and R2 in FIG. 2) will be described.

In this case, "grip hand" and "driver hand" states are added for the hand of each of the two robots as states of the assembly work environment. The actions that the two robots can select are the works W1 to W8 defined above. However, since one of the two robots may not work (or may be unable to work) in a given step, the following work W0 is additionally defined.
Work W0: Standby.

There are no restrictions on the states in which work W0 can be selected or on the states that can be reached from work W0.

２台のロボットに対しては、同時に同じ作業を選択してもよい。ただし、作業前の状態に制約があって、一方のロボットに対して選択した作業により作業前の状態が無くなってしまうのであれば、他方のロボットに対して同じ作業を選択した場合には負の報酬を設定するようにする。 The same work may be selected for two robots at the same time. However, if there are restrictions on the state before work and the state before work disappears due to the work selected for one robot, it will be negative if the same work is selected for the other robot. Try to set rewards.

組立作業の初期状態は、各部品が単体で存在し、２台のロボットそれぞれにハンドが装着されていない状態と定義する。組立作業の完了状態は、製品ＡＢＣＤＥが得られ、さらに、２台のロボットそれぞれにハンドが装着されていない状態と定義する。 The initial state of the assembly work is defined as a state in which each part exists as a single unit and no hand is attached to each of the two robots. The completed state of the assembly work is defined as the state in which the product ABCDE is obtained and the hands are not attached to each of the two robots.

図８は、製品４０の組立状態遷移（図４）、及び組立作業環境（ハンドの状態）に基づいて設定した組立作業順序の一例、及び作業毎の組立状態を示している。 FIG. 8 shows an example of the assembly work order set based on the assembly state transition (FIG. 4) of the product 40 and the assembly work environment (hand state), and the assembly state for each work.

The assembly work order is set by reinforcement learning in which the work subjects performing the assembly of the product 40 are two robots and the nine works W0 to W8 are selected based on the assembly state of the product 40 and the assembly work environment.

なお、同図においては、２台のロボットをロボットＲ１，Ｒ２とし、ロボットＲ１のハンドの状態をＲ１グリップ及びＲ１ドライバ、ロボットＲ２のハンドの状態をＲ２グリップ及びＲ２ドライバとしている。 In the figure, the two robots are the robots R1 and R2, the state of the hand of the robot R1 is the R1 grip and the R1 driver, and the state of the hand of the robot R2 is the R2 grip and the R2 driver.

同図に示す組立作業順序は作業ステップ０番から５番まであり、０番が初期状態、５番が完了状態である。部品Ａ等に対応して記載されている状態値は図６の場合と同様である。 The assembly work order shown in the figure is from work steps 0 to 5, where 0 is the initial state and 5 is the completed state. The state values described corresponding to the component A and the like are the same as in the case of FIG.

For example, in the initial state of work step 0, only single components exist, and no sub-assembly or product exists. Neither the grip hand nor the driver hand is attached to either of the two robots R1 and R2.

In the next work step 1, robot R1 executes work W6 (attachment of the grip hand), and robot R2 executes work W7 (attachment of the driver hand). This makes works W2 (assembly of plate component B) and W3 (assembly of plate component C) executable by robot R1, and works W4 (assembly of screw component D) and W5 (assembly of screw component E) executable by robot R2.

In the next work step 2, robot R1 executes work W2. As a result, base component A and plate component B disappear, and sub-assembly AB appears. Meanwhile, robot R2 executes work W0 (standby). In the next work step 3, robot R1 executes work W3 and robot R2 executes work W4 (assembly of screw component D). As a result, plate component C, screw component D, and sub-assembly AB disappear, and sub-assembly ABCD appears.

次の作業ステップ４番では、ロボットＲ１が作業Ｗ８（ハンドの取り外し）を実行し、ロボットＲ２が作業Ｗ５（ネジ部品Ｅの組付）を実行する。これにより、ネジ部品Ｅ、及び部組品ＡＢＣＤが消滅し、製品ＡＢＣＤＥが出現する。次の作業ステップ５番では、ロボットＲ１が作業Ｗ０（待機）を実行し、ロボットＲ２が作業Ｗ８（ハンドの取り外し）を実行する。これにより、完了状態が得られて組立作業が終了される。 In the next work step 4, the robot R1 executes the work W8 (removal of the hand), and the robot R2 executes the work W5 (assembly of the screw component E). As a result, the screw component E and the component ABCD disappear, and the product ABCDE appears. In the next work step 5, the robot R1 executes the work W0 (standby), and the robot R2 executes the work W8 (removal of the hand). As a result, the completed state is obtained and the assembly work is completed.
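The two-robot order described above can be replayed as a sequence of joint actions. The sketch below is illustrative only. It assumes that the two works of a step are applied in the order R1 then R2, that screw component D needs plate component B and screw component E needs plate component C, and it raises an error where learning would assign a negative reward. `joint_step` is a hypothetical helper.

```python
ATTACH = {"W2": ("B", "grip"), "W3": ("C", "grip"),
          "W4": ("D", "driver"), "W5": ("E", "driver")}
NEEDS = {"B": set(), "C": set(), "D": {"B"}, "E": {"C"}}  # assumed precedence

def joint_step(attached, hands, actions):
    """Apply one work per robot; return the new (attached parts, hand states)."""
    attached, hands = set(attached), list(hands)
    for robot, work in enumerate(actions):
        if work == "W0":                      # standby: nothing changes
            continue
        if work in ("W6", "W7"):              # attach a hand
            if hands[robot] is not None:
                raise ValueError(f"robot {robot + 1}: a hand is already attached")
            hands[robot] = "grip" if work == "W6" else "driver"
        elif work == "W8":                    # remove the hand
            if hands[robot] is None:
                raise ValueError(f"robot {robot + 1}: no hand to remove")
            hands[robot] = None
        else:                                 # assemble a part
            part, tool = ATTACH[work]
            if hands[robot] != tool or part in attached or not NEEDS[part] <= attached:
                raise ValueError(f"robot {robot + 1}: {work} is not allowed")
            attached.add(part)
    return attached, tuple(hands)

STEPS = [("W6", "W7"), ("W2", "W0"), ("W3", "W4"), ("W8", "W5"), ("W0", "W8")]

attached, hands = set(), (None, None)
for actions in STEPS:                         # work steps 1 to 5
    attached, hands = joint_step(attached, hands, actions)
print(sorted(attached), hands)  # ['B', 'C', 'D', 'E'] (None, None)
```

Selecting the same assembly work for both robots in one step fails for the second robot, because the first robot has already consumed the part. This corresponds to the case in which the text sets a negative reward.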

The assembly work order can also be set with three or more robots as the work subjects. In addition to the hands, states of the trays 321 and 322, the stage 312, and the like may be added as the assembly work environment. Further, the work of a transport device that carries components from one predetermined position to another may be added to the work definitions.

<Assembly work order planning process by the assembly work order planning device 10>
Next, FIG. 9 is a flowchart illustrating an example of the assembly work order planning process performed by the assembly work order planning device 10.

前提として、ＣＡＤシステム２０は、製品の形状モデル及びロボット等から構成される組立作業環境の形状モデルをモデリング済みであり、組立作業環境・製品情報１２１、及び組立状態遷移情報（ＡＮＤ／ＯＲ木）１２２を組立作業順序計画装置１０に供給可能であるとする。 As a premise, the CAD system 20 has already modeled the shape model of the assembly work environment composed of the product shape model and the robot, and the assembly work environment / product information 121 and the assembly state transition information (AND / OR tree). It is assumed that 122 can be supplied to the assembly work sequence planning device 10.

該組立作業順序計画処理は、例えば、ユーザからの所定の操作に応じて開始される。 The assembly work sequence planning process is started, for example, in response to a predetermined operation from the user.

始めに、情報取得部１１１が、ＣＡＤシステム２０から組立作業環境・製品情報１２１、を取得して記憶部１２に格納する（ステップＳ１）。次に、情報取得部１１１が、ＣＡＤシステム２０から組立状態遷移情報１２２を取得して記憶部１２に格納する（ステップＳ２）。 First, the information acquisition unit 111 acquires the assembly work environment / product information 121 from the CAD system 20 and stores it in the storage unit 12 (step S1). Next, the information acquisition unit 111 acquires the assembly state transition information 122 from the CAD system 20 and stores it in the storage unit 12 (step S2).

Next, the assembly work definition unit 112 sets the pre-work state of the assembly work environment (robot hands, trays, stage, and the like) (step S3). Next, based on the assembly state transition information 122, the assembly work definition unit 112 defines assembly work for each assembly state consisting of two components or sub-assemblies before assembly, and stores the defined assembly work in the storage unit 12 as assembly work information 123 (step S4).

次に、制約条件設定部１１３が、制約条件を設定し、制約条件情報１２４として記憶部１２に格納する（ステップＳ５）。 Next, the constraint condition setting unit 113 sets the constraint condition and stores it in the storage unit 12 as the constraint condition information 124 (step S5).

次に、行動選択・価値関数構成部１１４が、強化学習に用いる行動選択関数、及び、価値関数を定義する（ステップＳ６）。次に、報酬設定部１１５が、各組立状態に対して選択された作業に対して報酬を設定する（ステップＳ７）。 Next, the action selection / value function component 114 defines the action selection function and the value function used for reinforcement learning (step S6). Next, the reward setting unit 115 sets a reward for the work selected for each assembly state (step S7).

Next, the learning unit 116 executes reinforcement learning by repeating episodes (step S8). An episode is a series of actions that starts from the initial state of assembly and ends either in assembly failure, when a work selection for an assembly state fails, or in assembly success, when the work selections succeed and the assembly work is completed.

Specifically, episodes are repeated to learn the action selection for each state, and the reinforcement learning is completed once episodes end in assembly success. Completion of the reinforcement learning means that an action selection function that selects a good action for each state has been obtained. In the processing of one episode, an action is selected for the initial state and the next state is obtained. A further action is then selected and the state after that is obtained. Action selection is repeated step by step in this way to advance the state; if a wrong action is selected or a disallowed state is reached, a negative reward is given and the episode ends. Conversely, if the successful state is reached, a positive reward is given and the episode ends.

強化学習では、エピソードの繰返し回数が少ない期間は成功に至らずに失敗となるが、エピソードの繰返し回数が増えて学習が進むと、エピソードが成功するようになる。そこで、エピソードが所定回数（例えば、３回）連続して成功した場合に強化学習を終了するようにする。 In reinforcement learning, a period in which the number of repetitions of an episode is small does not lead to success and fails, but as the number of repetitions of an episode increases and learning progresses, the episode becomes successful. Therefore, when the episode succeeds a predetermined number of times (for example, three times) in a row, the reinforcement learning is terminated.
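The episode loop and the stopping rule (a predetermined number of consecutive successes) can be illustrated with tabular Q-learning on the single-robot example of product 40. Tabular Q-learning is a stand-in chosen for brevity, not the action selection function and value function formulation of the device. The precedence table, the reward values (+1 and -1), and the parameters `alpha` and `gamma` are assumptions. Exploration here relies on zero-initialized values with random tie-breaking rather than an explicit exploration rate.

```python
import random

REQUIRES = {"B": set(), "C": set(), "D": {"B"}, "E": {"C"}}  # assumed precedence
PART = {"W2": "B", "W3": "C", "W4": "D", "W5": "E"}
ACTIONS = ["W1", "W2", "W3", "W4", "W5"]
GOAL = "ABCDE"

def env_step(state, work):
    """Return (next_state, reward, done). A disallowed selection ends the episode."""
    part = PART.get(work)  # W1 has no entry: base A is already on the stage
    if part is None or part in state or not REQUIRES[part] <= set(state):
        return state, -1.0, True
    nxt = "".join(sorted(state + part))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(alpha=0.5, gamma=0.9, needed=3, max_episodes=1000, seed=0):
    """Repeat episodes until `needed` consecutive episodes end in success."""
    rng = random.Random(seed)
    q = {}       # (state, work) -> estimated value, 0.0 when unseen
    streak = 0   # current number of consecutive successful episodes
    for episode in range(1, max_episodes + 1):
        state, done, path = "A", False, []
        while not done:
            values = [q.get((state, a), 0.0) for a in ACTIONS]
            best = max(values)
            work = rng.choice([a for a, v in zip(ACTIONS, values) if v == best])
            nxt, reward, done = env_step(state, work)
            future = 0.0 if done else max(q.get((nxt, a), 0.0) for a in ACTIONS)
            old = q.get((state, work), 0.0)
            q[(state, work)] = old + alpha * (reward + gamma * future - old)
            path.append(work)
            state = nxt
        streak = streak + 1 if state == GOAL else 0
        if streak == needed:
            return episode, path
    return None, None

episodes, path = train()
print(episodes, path)
```

The returned `path` is the work order of the last successful episode. Early episodes fail on disallowed selections, whose values turn negative so that they are not selected again, which matches the qualitative behavior described above.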

次に、組立作業順序生成部１１７が、強化学習結果として得られた行動選択関数、及び状態価値関数を使って、初期状態からエピソードを試行することにより、製品が完成するまでの作業の一連の選択結果をつなげて組立作業順序を設定する（ステップＳ９）。 Next, the assembly work sequence generation unit 117 uses the action selection function and the state value function obtained as a result of reinforcement learning to try episodes from the initial state, thereby completing a series of work until the product is completed. The selection results are connected to set the assembly work order (step S9).

以上に説明した組立作業順序計画装置１０による組立作業順序計画処理によれば、製品を構成する各部品の組立順序と、ロボットを作業主体とする作業順序とを計画することが可能となる。 According to the assembly work order planning process by the assembly work order planning device 10 described above, it is possible to plan the assembly order of each component constituting the product and the work order in which the robot is the main work subject.

<Second embodiment>
Next, FIG. 10 shows a configuration example of the assembly work order planning device 100 according to the second embodiment of the present invention.

組立作業順序計画装置１００は、ロボット及び作業員を含み得る作業主体により部品を組付けて製品を完成させる際の組立作業順序を計画するためのものである。 The assembly work order planning device 100 is for planning an assembly work order when a product is completed by assembling parts by a work subject including a robot and a worker.

組立作業順序計画装置１００は、本発明の第１の実施形態に係る組立作業順序計画装置１０（図１）における組立作業定義部１１２を、組立状態ベース組立作業定義部１１２Ａ、及び作業状態ベース組立作業定義部１１２Ｂに分割し、シミュレーション指示部１１８を追加したものである。 The assembly work sequence planning device 100 includes the assembly work definition unit 112 in the assembly work sequence planning device 10 (FIG. 1) according to the first embodiment of the present invention, the assembly state-based assembly work definition unit 112A, and the work state-based assembly. It is divided into the work definition unit 112B and the simulation instruction unit 118 is added.

また、組立作業順序計画装置１００は、記憶部１２に格納される情報として、制約条件判定シミュレーション情報１２５、及び組立作業順序シミュレーション情報１２６を追加したものである。 Further, the assembly work order planning device 100 adds constraint condition determination simulation information 125 and assembly work order simulation information 126 as information stored in the storage unit 12.

Further, a robot simulator 30, to which the assembly work order planning device 100 can connect via the network 1, is added outside the assembly work order planning device 100. In accordance with instructions from the assembly work order planning device 100, the robot simulator 30 simulates work by the robots R1 and R2 installed in the assembly work environment and outputs the simulation result to the assembly work order planning device 100.

なお、組立作業順序計画装置１００の構成要素のうち、組立作業順序計画装置１０（図１）の構成要素と共通するものについては同一の符号を付してその説明を省略する。 Of the components of the assembly work sequence planning device 100, those that are common to the components of the assembly work sequence planning device 10 (FIG. 1) are designated by the same reference numerals and the description thereof will be omitted.

組立状態ベース組立作業定義部１１２Ａは、組立作業順序計画装置１０（図１）における組立作業定義部１１２と同様に、組立状態遷移情報１２２を参照して、組立作業を定義する。 The assembly state-based assembly work definition unit 112A defines the assembly work with reference to the assembly state transition information 122, similarly to the assembly work definition unit 112 in the assembly work sequence planning device 10 (FIG. 1).

作業状態ベース組立作業定義部１１２Ｂは、組立作業環境・製品情報１２１を参照し、組立作業環境の状態に対して組立に必要となる作業を定義する。例えば、ある部品を組付けるためにドライバハンドが必要であるならば、該部品を組付ける前の組立状態において、ロボットにドライバハンドを装着する作業を定義する。この場合、該部品を組み付ける前の組立状態であって、ロボットにドライバハンドが装着されている状態であれば、該部品を組付ける作業が実行可能となる。 The work state-based assembly work definition unit 112B refers to the assembly work environment / product information 121 and defines the work required for assembly with respect to the state of the assembly work environment. For example, if a driver hand is required to assemble a part, the work of attaching the driver hand to the robot in the assembled state before assembling the part is defined. In this case, if the assembly state before assembling the parts and the driver hand is attached to the robot, the work of assembling the parts can be executed.

なお、組立作業環境としての状態を設定し得るものはロボットのハンドだけではない。例えば、組立作業環境に複数のステージが存在している場合には、組立に利用中であるか否かという状態を設定できる。また、作業主体となるロボットや作業者は、ある作業を実施しているので、作業主体が実施している作業の種類を作業主体の状態とみなして設定してもよい。 It should be noted that the robot hand is not the only one that can set the state as the assembly work environment. For example, when there are a plurality of stages in the assembly work environment, it is possible to set the state of whether or not the stage is being used for assembly. Further, since the robot or the worker who is the work subject is carrying out a certain work, the type of the work being carried out by the work subject may be regarded as the state of the work subject and set.

The simulation instruction unit 118 instructs the externally provided robot simulator 30 to perform an individual assembly work simulation for setting constraint conditions, and acquires the simulation result. The constraint condition setting unit 113 can then modify the constraint conditions based on that simulation result. The simulation instruction unit 118 also instructs the robot simulator 30 to perform an assembly work simulation that follows the finally obtained assembly work order, and acquires the simulation result. This simulation result can be used to confirm that the finally obtained assembly work order is valid.

制約条件判定シミュレーション情報１２５は、制約条件を設定するための個別組立作業シミュレーションをロボットシミュレータ３０に指示した際の条件と、そのシミュレーション結果を含む情報である。 The constraint condition determination simulation information 125 is information including a condition when the robot simulator 30 is instructed to perform an individual assembly work simulation for setting a constraint condition, and a simulation result thereof.

組立作業順序シミュレーション情報１２６は、組立作業順序シミュレーションをロボットシミュレータ３０に指示した際の条件と、そのシミュレーション結果を含む情報である。 The assembly work order simulation information 126 is information including a condition when the robot simulator 30 is instructed to perform an assembly work order simulation and a simulation result thereof.

次に、図１１は、組立作業順序計画装置１００にて組立作業順序を計画する製品５０の一例を示している。 Next, FIG. 11 shows an example of the product 50 in which the assembly work order is planned by the assembly work order planning device 100.

製品５０は、ベース部品Ａの上にボックス部品Ｂを配置して、ネジ部品Ｃ，Ｄによって締結、固定し、ベース部品Ａとボックス部品Ｂとを配線部品Ｅ，Ｆによって結線することにより完成する構造を有する。 The product 50 is completed by arranging the box component B on the base component A, fastening and fixing it with the screw components C and D, and connecting the base component A and the box component B with the wiring components E and F. Has a structure.

製品５０を完成させるまでの複数の組立作業のうち、ベース部品Ａの上にボックス部品Ｂを配置する作業、及び、ネジ部品Ｃ，Ｄを締結する作業をロボットが実行し、配線部品Ｅ，Ｆを結線する作業を作業員が実行するものとする。 Of the plurality of assembly work until the product 50 is completed, the robot executes the work of arranging the box part B on the base part A and the work of fastening the screw parts C and D, and the wiring parts E and F. It is assumed that the worker performs the work of connecting the wires.

図１２は、製品５０の組立作業における各組立状態への遷移の一覧を示している。 FIG. 12 shows a list of transitions to each assembly state in the assembly work of the product 50.

Due to the structure of the product 50, box component B must first be assembled to base component A to produce sub-assembly AB. Screw components C and D and wiring components E and F may then be assembled to sub-assembly AB in any order. Accordingly, there are 1 × 4! = 24 assembly orders from the single components to the finished product.

The product 50 has 22 assembly states in total: six states for the single components (base component A, box component B, screw components C and D, and wiring components E and F); one state for the two-component sub-assembly AB; four states for the three-component sub-assemblies ABC, ABD, ABE, and ABF; six states for the four-component sub-assemblies ABCD, ABCE, ABCF, ABDE, ABDF, and ABEF; four states for the five-component sub-assemblies ABCDE, ABCDF, ABCEF, and ABDEF; and one state for the six-component finished product ABCDEF.

製品５０の各組立状態への遷移は、図１２に示すＮｏ１～Ｎｏ３３の３３通りとなる。このうち、部品Ｅ，Ｆを組付ける遷移は作業員による作業に応じて行われ、他の部品を組付ける遷移はロボットによる作業に応じて行われる。 The transition of the product 50 to each assembled state is 33 ways of No1 to No33 shown in FIG. Of these, the transition for assembling parts E and F is performed according to the work by the worker, and the transition for assembling other parts is performed according to the work by the robot.
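The state, transition, and order counts for the product 50 follow from the rule that sub-assembly AB comes first and the remaining four parts may be added in any order. The following sketch enumerates them; the variable names are illustrative.

```python
from itertools import combinations
from math import factorial

SINGLE_PARTS = "ABCDEF"
OPTIONAL = "CDEF"  # parts that may be attached to sub-assembly AB in any order

# Sub-assemblies and the finished product: AB plus any subset of C, D, E, F.
assemblies = ["AB" + "".join(c) for k in range(5) for c in combinations(OPTIONAL, k)]
states = list(SINGLE_PARTS) + assemblies

# One transition A -> AB, then one transition per part not yet attached.
transitions = [("A", "AB")]
for s in assemblies:
    for p in OPTIONAL:
        if p not in s:
            transitions.append((s, "".join(sorted(s + p))))

orders = 1 * factorial(len(OPTIONAL))
print(len(states), len(transitions), orders)  # 22 33 24
```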

以下、作業主体を１台のロボット、及び１名の作業員とする。ロボットは、ボックス部品Ｂの配置にグリップハンドを使用し、ネジ部品Ｃ，Ｄの締結にドライバハンドを使用するものとする。 Hereinafter, the work subject will be one robot and one worker. The robot shall use the grip hand for arranging the box component B and the driver hand for fastening the screw components C and D.

この場合、作業主体のロボット及び作業員に対して定義される作業は以下のとおりである。 In this case, the work defined for the robot and the worker who are the main workers is as follows.

Work by the robot
Work W0: Standby.
Work W1: Attach the box part B.
Work W2: Attach the screw part C.
Work W3: Attach the screw part D.
Work W4: Mount the grip hand.
Work W5: Mount the driver hand.
Work W6: Remove the hand.
Work by the worker
Work P0: Standby.
Work P1: Attach the wiring part E.
Work P2: Attach the wiring part F.

Examples of constraint conditions are as follows. When no hand is mounted, the robot can select work W4 (mount the grip hand) or W5 (mount the driver hand); when the grip hand or the driver hand is mounted, it can select work W6 (remove the hand). Selecting work W1 (attach the box part B) requires that the grip hand be mounted. Selecting work W2 (attach the screw part C) or W3 (attach the screw part D) requires that the driver hand be mounted. For the worker, selecting work P1 (attach the wiring part E) or P2 (attach the wiring part F) requires that the state before the assembly state transition be at least the sub-assembly AB, in which the box part B is attached to the base part A, or a later state.
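These constraint conditions can be read as a filter over the defined works. The sketch below is one possible encoding, not the patented implementation: the function name, the representation of the assembly state as a set of parts already attached to the base part A, and the two extra rules (a part is never attached twice, and the screws need the sub-assembly AB because of the product structure) are assumptions added for illustration.

```python
from typing import List, Optional, Set, Tuple


def selectable_works(attached: Set[str], hand: Optional[str]) -> Tuple[List[str], List[str]]:
    """Return (robot works, worker works) that satisfy the constraint conditions.

    attached: parts already joined to base part A, e.g. {"B", "E"} (assumed encoding).
    hand: None, "grip", or "driver".
    """
    robot = ["W0"]                       # standby is always selectable
    if hand is None:
        robot += ["W4", "W5"]            # a hand can be mounted only when none is mounted
    else:
        robot.append("W6")               # a hand can be removed only when one is mounted
    if hand == "grip" and "B" not in attached:
        robot.append("W1")               # box part B needs the grip hand
    if hand == "driver" and "B" in attached:
        # screw parts need the driver hand (and, by assumption, sub-assembly AB)
        robot += [w for w, p in (("W2", "C"), ("W3", "D")) if p not in attached]
    worker = ["P0"]
    if "B" in attached:
        # wiring parts need sub-assembly AB or a later state
        worker += [w for w, p in (("P1", "E"), ("P2", "F")) if p not in attached]
    return robot, worker


print(selectable_works(set(), None))      # (['W0', 'W4', 'W5'], ['P0'])
print(selectable_works({"B"}, "driver"))  # (['W0', 'W6', 'W2', 'W3'], ['P0', 'P1', 'P2'])
```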

The initial state of the assembly work is defined as the state in which only individual parts exist, the base part A has been placed on the stage 312 in advance, and neither the grip hand nor the driver hand is mounted on the robot. The completed state of the assembly work is defined as the state in which the product ABCDEF has been obtained and neither the grip hand nor the driver hand is mounted on the robot.

FIG. 13 shows an example of an assembly work order set on the basis of the assembly state transitions of the product 50 (not shown) and the assembly work environment (the state of the hand), together with the assembly state at each work step.

This assembly work order is set by reinforcement learning in which, with one robot and one worker as the work subjects, works are selected from ten kinds in total (seven kinds of robot work, W0 to W6, and three kinds of worker work, P0 to P2) on the basis of the assembly state of the product 50 and the assembly work environment.

The assembly work order shown in the figure runs from work step 0 to work step 7; step 0 is the initial state and step 7 is the completed state. The state values listed for part A and the other items have the same meaning as in FIG. 6.

For example, in the initial state at work step 0, only individual parts exist, and neither the grip hand nor the driver hand is mounted on the robot.

In work step 1, the robot executes work W4 (mount the grip hand) and the worker executes work P0 (standby). This makes work W1 (attach the box part B) available to the robot.

In work step 2, the robot executes work W1 and the worker executes work P0 (standby). As a result, the base part A and the box part B disappear, and the sub-assembly AB appears. In work step 3, the robot executes work W6 (remove the hand) and the worker executes work P1 (attach the wiring part E). As a result, the wiring part E and the sub-assembly AB disappear, and the sub-assembly ABE appears.

In work step 4, the robot executes work W5 (mount the driver hand) and the worker executes work P2 (attach the wiring part F). This makes works W2 (attach the screw part C) and W3 (attach the screw part D) available to the robot. In addition, the wiring part F and the sub-assembly ABE disappear, and the sub-assembly ABEF appears.

In work step 5, the robot executes work W2 (attach the screw part C) and the worker executes work P0 (standby). As a result, the screw part C and the sub-assembly ABEF disappear, and the sub-assembly ABCEF appears.

In work step 6, the robot executes work W3 (attach the screw part D) and the worker executes work P0 (standby). As a result, the screw part D and the sub-assembly ABCEF disappear, and the product ABCDEF appears.

In work step 7, the robot executes work W6 (remove the hand) and the worker executes work P0 (standby). The completed state is thereby reached, and the assembly work ends.
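The seven work steps described for FIG. 13 can be replayed against the constraint conditions to confirm that each selected pair of works is feasible and that the completed state is reached. This is an illustrative sketch only: the helper names, the set-of-parts state encoding, and the rule that both works of a step are checked against the state before the step are assumptions.

```python
PART_OF = {"W1": "B", "W2": "C", "W3": "D", "P1": "E", "P2": "F"}  # part attached by each work
HAND_FOR = {"W1": "grip", "W2": "driver", "W3": "driver"}          # hand required by robot works
MOUNT = {"W4": "grip", "W5": "driver"}                             # hand mounted by each work


def feasible(work, parts, hand):
    """Check one work against the constraint conditions (assumed encoding)."""
    if work in ("W0", "P0"):
        return True
    if work in MOUNT:
        return hand is None
    if work == "W6":
        return hand is not None
    part = PART_OF[work]
    if part in parts:
        return False                      # already attached
    if work in HAND_FOR and hand != HAND_FOR[work]:
        return False                      # wrong or missing hand
    return part == "B" or "B" in parts    # everything but B needs sub-assembly AB


def apply(work, parts, hand):
    """Return the (parts, hand) state after executing one work."""
    if work in MOUNT:
        return parts, MOUNT[work]
    if work == "W6":
        return parts, None
    if work in PART_OF:
        return parts | {PART_OF[work]}, hand
    return parts, hand


# (robot work, worker work) for work steps 1 to 7 as described in the text.
FIG13 = [("W4", "P0"), ("W1", "P0"), ("W6", "P1"), ("W5", "P2"),
         ("W2", "P0"), ("W3", "P0"), ("W6", "P0")]

parts, hand = frozenset("A"), None        # initial state: base part on stage, no hand
trace = []
for robot, worker in FIG13:
    assert feasible(robot, parts, hand) and feasible(worker, parts, hand)
    parts, hand = apply(robot, parts, hand)
    parts, hand = apply(worker, parts, hand)
    trace.append("".join(sorted(parts)))

done = parts == frozenset("ABCDEF") and hand is None
print(trace, done)
```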

The assembly work order can still be planned when the number of parts, and hence the number of required assembly works, increases, or when there are multiple robots or workers as work subjects. It can also be planned when the states of the trays 321 and 322, the stage 312, and the like are added to the assembly work environment in addition to the hand, or when work by a transfer device that carries a part from one predetermined position to another is added to the work definitions.

<Assembly work sequence planning process by the assembly work sequence planning device 100>
FIG. 14 is a flowchart illustrating an example of the assembly work sequence planning process performed by the assembly work sequence planning device 100. Of steps S21 to S32 of this process, steps S21 to S23, S26, and S28 to S31 are the same as steps S1 to S3, S5, and S6 to S9 of the assembly work sequence planning process performed by the assembly work sequence planning device 10 (FIG. 9), so their description is omitted as appropriate.

First, the information acquisition unit 111 acquires the assembly work environment / product information 121 and the assembly state transition information 122 from the CAD system 20 and stores them in the storage unit 12 (steps S21 and S22). Next, the assembly work definition unit 112 sets the pre-work state of the assembly work environment (step S23).

Next, the assembly-state-based assembly work definition unit 112A defines, on the basis of the assembly state transition information 122, an assembly work for each assembly state consisting of two parts or sub-assemblies before attachment, and stores the defined assembly works in the storage unit 12 as the assembly work information 123 (step S24).

Next, the work-state-based assembly work definition unit 112B refers to the assembly work environment / product information 121 and defines the works required for assembly with respect to the state of the assembly work environment (step S25).

Next, the constraint condition setting unit 113 sets the constraint conditions and stores them in the storage unit 12 as the constraint condition information 124 (step S26).

Next, the simulation instruction unit 118 instructs the robot simulator 30 to run individual assembly work simulations for setting constraint conditions, and acquires the simulation results (step S27). Specifically, the assembly state and the state of the assembly work environment before each assembly work are set in the robot simulator 30, and the target assembly work is simulated. If the simulation reveals a problem, for example that the robot interferes (collides) with another robot or the like, or that the robot cannot take the posture needed to attach the target part because of an insufficient joint range of motion, that assembly work is judged to be infeasible. Assembly works that cannot be performed are then added to the constraint conditions according to the simulation results. Alternatively, the definition of an assembly work judged to be infeasible may be deleted.

Next, the action selection / value function configuration unit 114 defines the action selection function and the value function used for reinforcement learning (step S28). Next, the reward setting unit 115 sets a reward for the work selected in each assembly state (step S29).

Next, the learning unit 116 performs reinforcement learning by repeating episodes. An episode is a series of actions that starts from the initial assembly state and ends either in assembly failure, when a work selection for an assembly state fails, or in assembly success, when the work selections succeed and the assembly work is completed. Reinforcement learning ends when a predetermined number of consecutive episodes (for example, three) succeed (step S30).
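The stopping rule of step S30 (end after a predetermined number of consecutive successful episodes) can be sketched as a simple counter. `run_episode` is a hypothetical callback that runs and trains on one episode and returns True on assembly success; the `max_episodes` cap is an assumption added so that the loop always terminates.

```python
def train_until_stable(run_episode, required_successes=3, max_episodes=10_000):
    """Repeat episodes until `required_successes` succeed in a row.

    Returns the number of episodes run, or None if the cap is reached first.
    """
    streak = 0
    for episode in range(1, max_episodes + 1):
        # A failed episode resets the count of consecutive successes.
        streak = streak + 1 if run_episode() else 0
        if streak >= required_successes:
            return episode
    return None


# Usage with a scripted sequence of outcomes instead of a real learner.
outcomes = iter([False, True, True, False, True, True, True])
print(train_until_stable(lambda: next(outcomes)))  # 7
```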

Next, the assembly work order generation unit 117 runs an episode from the initial state using the action selection function and the state value function obtained by reinforcement learning, and generates an assembly work order by concatenating the works selected until the product is completed (step S31).

Next, the simulation instruction unit 118 instructs the robot simulator 30 to simulate the assembly work order obtained in step S31, acquires the simulation result, and confirms that no problem occurs (step S32).

With the assembly work sequence planning process performed by the assembly work sequence planning device 100 described above, it is possible to plan both the assembly order of the parts constituting the product and the work order of the work subjects, which include a robot and a worker.

<About reinforcement learning>
Next, the reinforcement learning adopted in the embodiments of the present invention described above will be explained.

In general, reinforcement learning is based on the Markov decision process (MDP) model: in a given state an action is selected and the state is updated, and each action or state has an associated goodness, that is, a value.

Known reinforcement learning methods include Q-learning, in which actions are selected using the values in an action value table and the table is updated as action selection is learned; DQN (Deep Q-Network), in which the action value function is a deep neural network and actions are selected from the action values; and A3C (Asynchronous Advantage Actor-Critic), in which the action selection function (policy) and the state value function are deep neural networks.

Many other methods exist, but all of them model the reinforcement learning problem in terms of states, actions, and rewards.

A3C, which is adopted in the embodiments described above, is explained in detail below. In reinforcement learning, at each step of an episode an action is selected for the current state, the state transitions to the next state, and a reward is given for the result of the action selection or for the state. If the reward satisfies the end condition of the episode, the episode ends. Episodes are repeated, and reinforcement learning ends once actions can be selected correctly for each state, or once the action selected for each state becomes deterministic.

As the word "asynchronous" indicates, A3C is a method in which multiple agents (also called actors) run episodes independently, and every agent trains (learns) a single action selection function and state value function and also uses them for action selection.

FIG. 15 is a diagram outlining A3C. In this figure there are three agents. The action selection function and state value function 1401 are configured as a shared deep neural network (DNN), referred to as the shared DNN. In this figure, the input layer has three state variables and the output layer has three actions. The output layer always has exactly one value (the state value).

There is one intermediate layer with four nodes. The activation function is softmax for the action output layer, linear for the value output layer, and ReLU (Rectified Linear Unit) for the intermediate layer. Each agent has a memory that stores states, actions, next states, and rewards, and a DNN of its own for selecting actions. For example, agent 1 has a memory 1411 and a DNN 1421, agent 2 has a memory 1412 and a DNN 1422, and agent 3 has a memory 1413 and a DNN 1423.
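The network shape described here (three state inputs, one intermediate layer of four ReLU nodes, a three-way softmax policy output, and a single linear value output) can be written out as a forward pass. The sketch uses only the Python standard library; the random weight initialization, its range, and all names are assumptions for illustration and say nothing about how the actual DNN is initialized or trained.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)                              # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]


def dense(x, w, b):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init(n_out, n_in, rng):
    """Small random weights and zero biases (assumed initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
W_h, b_h = init(4, 3, rng)    # intermediate layer: 3 state variables -> 4 nodes
W_pi, b_pi = init(3, 4, rng)  # action output layer: 4 nodes -> 3 actions
W_v, b_v = init(1, 4, rng)    # value output layer: 4 nodes -> 1 state value


def forward(state):
    h = relu(dense(state, W_h, b_h))
    policy = softmax(dense(h, W_pi, b_pi))  # probabilities over the 3 actions
    value = dense(h, W_v, b_v)[0]           # single linear state value
    return policy, value


policy, value = forward([1.0, 0.0, 1.0])
print(policy, value)
```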

Each agent trains the shared DNN on its own whenever step data has accumulated during its episode processing. After training, it copies the shared DNN to its own DNN and uses its own DNN to select actions for states. Because the timing of DNN training and action selection is asynchronous among the agents, this configuration keeps training and action selection from conflicting.

When reinforcement learning is performed with a single agent, a shared DNN is not needed. The learning method in that case is called A2C (Advantage Actor-Critic). "Advantage" refers to the difference between the action value and the state value, a quantity that expresses how good an action selection is, and "Actor-Critic" means that action selection and state value evaluation are computed separately.

Next, FIG. 16 is a diagram for explaining the training of the action selection function and the state value function, and schematically shows a DNN (deep neural network) 1501.

The DNN 1501 takes a state s as input and outputs the action selection (policy) π(s) and the state value V(s). The parameters of the functions (activation functions) defined for the nodes of the DNN 1501 are denoted θ. The policy is also called a stochastic policy and is written π(a|s), the probability of taking action a in state s.

Training the action selection function and the state value function means optimizing the parameters θ of the DNN 1501 so that it predicts the relationships among the state s, the action a, the next state s', and the reward r found in the data. It is particularly important that the relationship between actions and state values be estimated correctly.

Under the policy π(s), the state value function V(s) is expressed by the following equation (1).
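The formula for equation (1) is not reproduced in this text. Assuming the usual Bellman form, with r the reward and s' the next state, a version consistent with the symbols explained in the surrounding paragraphs would read:

```latex
V(s) = \mathbb{E}_{\pi(s)}\bigl[\, r + \gamma\, V(s') \,\bigr] \tag{1}
```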

Here, E denotes the expected value under the policy π(s) given as its subscript. The discount rate γ is a coefficient that converts the next state value (a future value) into a present value. As an example γ is 0.99, but it can be tuned as a reinforcement learning parameter.

The value of a policy is expressed by the following equation (2) as an expected value over the distribution ρ of the state s.
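Equation (2) is likewise not reproduced here. Read literally from the sentence above, and writing the policy value as J(π) (a symbol assumed for illustration), it would take a form such as:

```latex
J(\pi) = \mathbb{E}_{s \sim \rho}\bigl[\, V(s) \,\bigr] \tag{2}
```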

Equation (2) is called the discounted reward function. Its change with respect to the DNN parameters θ is given by the policy gradient theorem, expressed by the following equation (3).
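The policy gradient theorem is commonly stated in the advantage form below. This is offered as an assumption about what equation (3) contains, with J(π) denoting the discounted reward function (notation assumed) and A(s, a) the advantage function:

```latex
\nabla_\theta J(\pi) =
  \mathbb{E}_{s \sim \rho^{\pi},\; a \sim \pi(s)}
  \bigl[\, A(s,a)\, \nabla_\theta \log \pi(a \mid s) \,\bigr] \tag{3}
```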

Here, ∇ denotes the gradient (the first-order partial derivatives with respect to the basis). The expectation is taken over states s following the distribution ρ under the policy π and over actions a following the policy π for the state s. The action value function Q(s, a) and the advantage function A(s, a) are given by the following equations (4) and (5) and can be computed from data.
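Equations (4) and (5) are not reproduced in this text. Given that the advantage is described elsewhere in this section as the difference between the action value and the state value, and assuming the one-step estimate of the action value that is usual for actor-critic methods, they can be written as:

```latex
\begin{align}
Q(s,a) &= r + \gamma\, V(s') \tag{4} \\
A(s,a) &= Q(s,a) - V(s) = r + \gamma\, V(s') - V(s) \tag{5}
\end{align}
```

Every quantity on the right-hand side is available from a stored (s, a, s', r) tuple and the current DNN output, which is why both functions can be computed from data.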

If the network parameters θ are optimized to maximize the discounted reward function (equation (2)), a relationship between policy and state value that yields good rewards for each state is obtained.

If the policy loss is defined as the negative of the discounted reward function, it suffices to minimize the policy loss. Besides the policy loss, there is a value loss representing the magnitude of the advantage, because the action value and the state value should ideally agree. In addition, since the policy should ideally be determined uniquely for each state but is determined stochastically, a regularization term modeled by entropy can also be used in the optimization. The loss function L is therefore defined by the following equation (6) using the policy loss L_π, the value loss L_v, and the regularization term L_reg, and this loss function is minimized.
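Equation (6) is not reproduced here. Since the text names three loss terms and two coefficients, c_v and c_reg, the natural reading (an assumption) is a weighted sum:

```latex
L = L_{\pi} + c_{v}\, L_{v} + c_{reg}\, L_{reg} \tag{6}
```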

Here, c_v and c_reg are coefficients. From the definition of the discounted reward function, the policy loss L_π is given by the following equation (7), where n is the number of step data items used for training.
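A sample-average form of the policy loss that matches this description, given as an assumption with (s_i, a_i) denoting the i-th stored step, is:

```latex
L_{\pi} = -\frac{1}{n} \sum_{i=1}^{n} A(s_i, a_i)\, \log \pi(a_i \mid s_i) \tag{7}
```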

As shown in the following equation (8), the value loss L_v is the square of the advantage function A(s, a).
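Averaged over the n training steps (the averaging is an assumption; the text states only that the loss is the squared advantage), this can be written as:

```latex
L_{v} = \frac{1}{n} \sum_{i=1}^{n} A(s_i, a_i)^{2} \tag{8}
```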

The regularization term is obtained by computing the entropy H(π(s)), as shown in the following equations (9) and (10).
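For a discrete policy, the entropy has the standard definition below, with a_k denoting the k-th action (notation assumed). Which of equations (9) and (10) carries this definition, and the sign and scaling with which the entropy enters L_reg, cannot be recovered from this text, so only the textbook form is given:

```latex
H(\pi(s)) = -\sum_{k=1}^{n_{action}} \pi(a_k \mid s)\, \log \pi(a_k \mid s)
```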

Here, n_action is the number of actions.

For a deep neural network, training by optimization uses the gradient method. This training is the same as training a deep neural network outside reinforcement learning, and can be implemented with existing deep neural network techniques.
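To make the quantity minimized by the gradient method concrete, the sketch below evaluates a combined loss on a small batch of stored steps. It is an illustration under stated assumptions, not the patent's formulas: the advantage is taken as r + γV(s') − V(s), the three terms are sample averages, the regularization term is the mean policy entropy, and the coefficient values `c_v=0.5` and `c_reg=0.01` are placeholders. Whether the entropy should enter with a positive or negative weight is not determined by the text.

```python
import math


def a2c_losses(batch, gamma=0.99, c_v=0.5, c_reg=0.01):
    """Evaluate (total, policy, value, regularization) losses on a batch.

    Each item of `batch` is a dict with keys (assumed layout):
      'pi'     list of action probabilities at state s
      'a'      index of the action taken
      'r'      reward received
      'v'      V(s) from the network
      'v_next' V(s') from the network (0.0 when the episode ended)
    """
    n = len(batch)
    # One-step advantage estimate for every stored step.
    adv = [t["r"] + gamma * t["v_next"] - t["v"] for t in batch]
    # Policy loss: negative advantage-weighted log-probability of the taken action.
    l_pi = -sum(a * math.log(t["pi"][t["a"]]) for a, t in zip(adv, batch)) / n
    # Value loss: mean squared advantage.
    l_v = sum(a * a for a in adv) / n
    # Regularization: mean entropy of the policy output.
    entropies = [-sum(p * math.log(p) for p in t["pi"] if p > 0) for t in batch]
    l_reg = sum(entropies) / n
    return l_pi + c_v * l_v + c_reg * l_reg, l_pi, l_v, l_reg


step = {"pi": [0.5, 0.5], "a": 0, "r": 1.0, "v": 0.0, "v_next": 0.0}
total, l_pi, l_v, l_reg = a2c_losses([step])
print(total, l_pi, l_v, l_reg)
```

In an actual training loop these values would be differentiated with respect to θ by a deep learning framework; only the forward evaluation is shown here.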

For the various reinforcement learning algorithms, methods that balance exploration against exploitation of what has been learned, such as the ε-greedy method, may be used as standard reinforcement learning techniques.
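As a reminder of how the ε-greedy method trades exploration against exploitation, a minimal sketch follows. Restricting the choice to an `allowed` list of action indices is an assumption made here for illustration; the function name and parameters are likewise hypothetical.

```python
import random


def epsilon_greedy(policy, allowed, epsilon, rng=random):
    """Pick an action index from `allowed`.

    With probability epsilon a random allowed action is explored;
    otherwise the allowed action with the highest policy output is exploited.
    """
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: policy[a])


# Pure exploitation: action 1 has the highest output but is not allowed.
print(epsilon_greedy([0.1, 0.7, 0.2], allowed=[0, 2], epsilon=0.0))  # 2
```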

For work selection in generating the assembly work order, the defined works are the actions. The states are the assembly state and the state of the assembly work environment, and each state variable takes the value 0 or 1. As an example of reward settings, a positive reward may be 1, a negative reward −1, and the reward 0 when none is given.
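For the product 50, a 0/1 state vector and the example reward values could look like the sketch below. The ordering of the variables, the use of two hand flags, and the function names are assumptions for illustration; the rule that a constraint violation earns the negative reward and completion earns the positive reward follows the episode description in the claims.

```python
from itertools import combinations

PARTS = list("ABCDEF")
SUBS = ["AB" + "".join(c) for k in range(5) for c in combinations("CDEF", k)]
ASSEMBLY_STATES = PARTS + SUBS   # 6 individual parts + 16 sub-assemblies/product = 22
HANDS = ["grip", "driver"]       # assembly work environment: which hand is mounted


def encode(present, hand):
    """Return the 0/1 state vector: one flag per assembly state, then per hand."""
    flags = [1 if s in present else 0 for s in ASSEMBLY_STATES]
    return flags + [1 if hand == h else 0 for h in HANDS]


def reward(constraint_violated, completed):
    """Example reward values from the text: -1, +1, or 0."""
    if constraint_violated:
        return -1
    if completed:
        return 1
    return 0


initial = encode(set("ABCDEF"), None)             # only individual parts, no hand
after_step3 = encode({"ABE", "C", "D", "F"}, None)
print(len(initial), sum(initial), sum(after_step3))  # 24 6 4
```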

This concludes the description of the training of the action selection function and the state value function, and with it the description of the reinforcement learning adopted in the embodiments of the present invention.

The present invention is not limited to the embodiments described above, and various modifications are possible. For example, the embodiments have been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations having all of the elements described. Part of the configuration of one embodiment may be replaced with, or added to, the configuration of another embodiment.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as integrated circuits. They may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. Information such as programs, tables, and files that realize the functions can be stored in a memory, in a storage device such as a hard disk or SSD, or on a recording medium such as an IC card, SD card, or DVD. The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the configurations may be considered to be interconnected.

10 ... assembly work sequence planning device, 11 ... calculation unit, 111 ... information acquisition unit, 112 ... assembly work definition unit, 112A ... assembly-state-based assembly work definition unit, 112B ... work-state-based assembly work definition unit, 113 ... constraint condition setting unit, 114 ... action selection / value function configuration unit, 115 ... reward setting unit, 116 ... learning unit, 117 ... assembly work order generation unit, 118 ... simulation instruction unit, 12 ... storage unit, 121 ... assembly work environment / product information, 122 ... assembly state transition information, 123 ... assembly work information, 124 ... constraint condition information, 125 ... constraint condition determination simulation information, 126 ... assembly work order simulation information, 13 ... input unit, 14 ... output unit, 15 ... communication unit, 20 ... CAD system, 30 ... robot simulator, 40, 50 ... product, 100 ... assembly work sequence planning device, 303, 304 ... hand, 311 ... pedestal, 312 ... stage, 321, 322 ... tray, 331, 332 ... hand stand, 333, 334 ... replacement hand

Claims

An assembly work sequence planning device comprising:
an information acquisition unit that acquires assembly state transition information including information representing the process by which a plurality of parts are assembled, via sub-assemblies, into a final product;
an assembly work definition unit that defines, based on the assembly state transition information, work that a work subject can perform on an assembly state consisting of two parts or sub-assemblies before attachment;
a constraint condition setting unit that sets a constraint condition on whether the work can be performed;
a learning unit that performs, in accordance with the constraint condition, reinforcement learning of how to select the work for the assembly state; and
an assembly work order generation unit that generates an assembly work order for the product based on the result of the reinforcement learning.

The assembly work sequence planning device according to claim 1, wherein
the information acquisition unit acquires assembly work environment / product information including information on objects present in the assembly work environment and information on the plurality of parts constituting the product,
the assembly work definition unit defines, based on the assembly work environment / product information, work that can be performed on the state of the assembly work environment, and
the learning unit performs, in accordance with the constraint condition, reinforcement learning of how to select the work for the assembly state and the state of the assembly work environment.

The assembly work sequence planning device according to claim 2, wherein, in the reinforcement learning, the learning unit
defines the relationship among the state of the assembly work environment, the work, and the constraint condition by an action selection function and a state value function,
defines an initial state and a work completion state of the assembly work environment,
and, in learning by episodes,
repeats a step of selecting an action of the work subject for the state of the assembly work environment during the assembly work and obtaining the next state,
gives a negative reward and ends the episode if the selected action does not satisfy the constraint condition of the work,
gives a positive reward and ends the episode if the work is completed by the selected action,
trains the action selection function and the state value function from the obtained state, action, and reward data, and
learns how to select the work for the state of the assembly work environment by repeating the episodes.

The assembly work sequence planning device according to claim 2,
wherein the assembly work environment includes at least one of a robot as the work subject, a hand of the robot, a stage, a tray, and a transfer device.

The assembly work sequence planning device according to claim 1,
wherein the constraint condition setting unit sets, as the constraint condition, the assembly state before execution of the work that the work subject can perform on the assembly state, the work being defined based on the assembly state transition information.

The assembly work sequence planning device according to claim 2,
wherein the constraint condition setting unit sets, as the constraint condition, the assembly state and the state of the assembly work environment before execution of the work that can be performed on the state of the assembly work environment, the work being defined based on the assembly work environment / product information.

The assembly work sequence planning device according to claim 2,
wherein the assembly work environment includes at least one robot and at least one worker as the work subjects.

The assembly work sequence planning device according to claim 1,
further comprising a simulation instruction unit that instructs a robot simulator to perform an individual assembly work simulation for setting the constraint condition,
wherein the constraint condition setting unit corrects the constraint condition based on the result of the individual assembly work simulation.

The assembly work sequence planning device according to claim 8,
wherein the simulation instruction unit instructs the robot simulator to perform an assembly work simulation according to the generated assembly work order.

An assembly work sequence planning method performed by an assembly work sequence planning device, the method comprising the steps of:
acquiring assembly state transition information including information representing the process by which a plurality of parts are assembled, via sub-assemblies, into a final product;
defining, based on the assembly state transition information, work that a work subject can perform on an assembly state consisting of two parts or sub-assemblies before attachment;
setting a constraint condition on whether the work can be performed;
performing, in accordance with the constraint condition, reinforcement learning of how to select the work for the assembly state; and
generating an assembly work order for the product based on the result of the reinforcement learning.