JP7287414B2

JP7287414B2 - Logistics rule learning device, logistics rule execution device, logistics rule learning method, and logistics rule learning program

Info

Publication number: JP7287414B2
Application number: JP2021059592A
Authority: JP
Inventors: 有介吉成
Original assignee: JFE Steel Corp
Current assignee: JFE Steel Corp
Priority date: 2020-06-25
Filing date: 2021-03-31
Publication date: 2023-06-06
Anticipated expiration: 2041-03-31
Also published as: JP2022008030A

Description

The present invention relates to a logistics rule learning device, a logistics rule execution device, a logistics rule learning method, and a logistics rule learning program that create logistics rules for a plurality of transport devices that transport objects.

Patent Document 1 describes a learning device with three parts. A learning data reception unit receives learning data that includes a learning objective and the allowable requirements for the learning operations performed while learning control. A neural network performs learning based on that data. An output unit outputs the neural network's learning result. The neural network first performs first learning to achieve an initial stage of the learning objective. Based on that result, it performs second learning to learn a control range within which the learning operations satisfy the allowable requirements. Based on the second result, it performs third learning to achieve the learning objective within that control range.

Patent Document 1: JP 2018-200539 A

The learning device of Patent Document 1 has difficulty learning logistics rules for a plurality of transport devices that can interfere with one another. It is hard for it to learn a control range that keeps the transport devices from interfering, and learning a control range alone cannot prevent that interference. In addition, logistics rules for a plurality of mutually interfering transport devices become very complicated when written in an if-then format or similar, which makes management tasks such as maintenance difficult.

The present invention has been made in view of the above problems. Its object is to provide a logistics rule learning device, a logistics rule execution device, a logistics rule learning method, and a logistics rule learning program that can easily create logistics rules for a plurality of transport devices that interfere with one another.

A logistics rule learning device according to the present invention creates logistics rules for a plurality of transport devices that transport objects. It includes the following units:
- a simulator unit that reproduces a logistics environment, including the movements of the transport devices;
- a learning-time next-action determination unit that determines the next action of a transport device in the logistics environment and instructs that transport device accordingly;
- a learning-time state interpretation unit that interprets the state of the logistics environment obtained when the simulator unit reproduces the next action instructed to the transport device;
- a reward calculation unit that calculates a reward value evaluating the transport device's next action, based on the state interpreted by the learning-time state interpretation unit;
- a learning unit that revises and learns the logistics rules. It takes as inputs the next action instructed by the learning-time next-action determination unit, the state interpreted by the learning-time state interpretation unit, and the reward value calculated by the reward calculation unit.

In the above logistics rule learning device, the simulator unit is a discrete event simulator that reproduces the logistics environment, including the movements of the transport devices.

In the above logistics rule learning device, the state of the logistics environment interpreted by the learning-time state interpretation unit is the state of the logistics environment at the time a discrete event occurs.

In the above logistics rule learning device, the state of the logistics environment at the time a discrete event occurs includes the following items:
- the position of each transport device and its destination, including a stopped state;
- whether an object is loaded on each transport device and, if so, that object's destination;
- the location and destination of each object waiting to be loaded onto a transport device.
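As an illustration, the state items listed above could be held in small records and flattened into a numeric vector for the learning unit. This is a minimal sketch, not the patent's format. The field names are assumptions, and so are the integer encodings (1-based segment indices, destination IDs, 0 for "none") and the flattening order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CraneState:
    position: int        # current track segment (hypothetical 1-based index)
    destination: int     # target segment; equal to position when stopped
    load_dest: int = 0   # destination ID of the loaded object, 0 if empty


@dataclass
class PendingLoad:
    location: int        # generation location ID (e.g. 1..3 for P1..P3)
    dest: int            # destination ID (e.g. 1..4 for D1..D4)


@dataclass
class EnvState:
    cranes: List[CraneState]
    pending: List[PendingLoad] = field(default_factory=list)

    def to_vector(self, max_pending: int = 3) -> List[int]:
        """Flatten into a fixed-length integer vector (hypothetical layout)."""
        vec: List[int] = []
        for c in self.cranes:
            stopped = 1 if c.destination == c.position else 0
            has_load = 1 if c.load_dest else 0
            vec += [c.position, c.destination, stopped, has_load, c.load_dest]
        # Pad or truncate the waiting-object list to a fixed size.
        padded = self.pending[:max_pending]
        padded += [PendingLoad(0, 0)] * (max_pending - len(padded))
        for p in padded:
            vec += [p.location, p.dest]
        return vec
```

A fixed-length layout matters because a neural network expects the same number of inputs at every discrete event, even when the number of waiting objects changes.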

In the above logistics rule learning device, the reward calculation unit calculates the reward value from at least one of the following:
- a value for a transport device having transported an object correctly;
- a value relating to the transport time within one learning unit;
- a value relating to transport failures within one learning unit.
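A minimal sketch of such a reward, assuming a weighted combination of the three terms, clipped to [-1.0, 1.0] (the range the embodiment uses). The function name, the weights, and the normalization of elapsed time by an episode horizon are all hypothetical. The patent's actual reward values are defined in its Table 7, which is not reproduced here.

```python
def reward(delivered_correctly: bool, elapsed: float, horizon: float,
           failed: bool, w_ok: float = 1.0, w_time: float = 0.5,
           w_fail: float = 1.0) -> float:
    """Hypothetical reward for one evaluated action, clipped to [-1, 1].

    delivered_correctly: the crane just unloaded a cargo at its correct destination
    elapsed / horizon:   transport time used so far in the episode, normalized
    failed:              the episode just ended unsuccessfully
    """
    r = 0.0
    if delivered_correctly:
        r += w_ok                              # correct-transport term
    r -= w_time * min(elapsed / horizon, 1.0)  # transport-time term
    if failed:
        r -= w_fail                            # transport-failure term
    return max(-1.0, min(1.0, r))
```

For example, a correct delivery at the very start of an episode scores 1.0. An episode that has used half its horizon with no delivery yet scores -0.25.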

In the above logistics rule learning device, the learning unit performs learning using a neural network.

In the above logistics rule learning device, when several transport devices move line-symmetrically to one another, the learning unit uses the same neural network for all of them.
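One way to share one network between two mirror-image cranes is to reflect the state before inference and then reflect the chosen action back. This sketch assumes the 11-segment track of the embodiment (segments L01 to L11, indexed 1 to 11), with the mirror axis at the center segment. The function names are hypothetical. Only crane positions are reflected here. Any other location-valued features, and the order of crane roles in the input, would need the same mapping in practice.

```python
NUM_SEGMENTS = 11  # track segments L01..L11 in the embodiment


def mirror_segment(seg: int, n: int = NUM_SEGMENTS) -> int:
    """Reflect a 1-based segment index about the track center (1 <-> n)."""
    if not 1 <= seg <= n:
        raise ValueError(f"segment {seg} outside 1..{n}")
    return n + 1 - seg


def act_with_shared_net(shared_net, positions: list) -> int:
    """Choose a destination segment for a mirror-image crane.

    shared_net is any callable returning one Q value per segment, trained
    from the perspective of the non-mirrored crane (hypothetical interface).
    """
    mirrored = [mirror_segment(p) for p in positions]
    q = shared_net(mirrored)                        # Q values in the mirrored frame
    best_mirrored = max(range(len(q)), key=q.__getitem__) + 1
    return mirror_segment(best_mirrored)            # map back to the real track
```

For example, if the shared network prefers segment 2 in the mirrored frame, the mirror-image crane is sent to segment 10.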

A logistics rule execution device according to the present invention includes the following units:
- a learning result storage unit that stores the learning results of the learning unit in the logistics rule learning device according to the present invention;
- a learning result reading unit that reads the learning results from the learning result storage unit;
- a runtime state interpretation unit that interprets either the state of the logistics environment in real space or the state of a logistics environment reproduced by a second simulator unit;
- a runtime next-action determination unit that determines the next action of a transport device in the logistics environment and instructs that transport device accordingly. It bases this decision on the learning results read by the learning result reading unit and the state interpreted by the runtime state interpretation unit.

In the above logistics rule execution device, the state of the logistics environment interpreted by the runtime state interpretation unit is the state of the logistics environment at the time a discrete event occurs.

In the above logistics rule execution device, the state of the logistics environment at the time a discrete event occurs includes the following items:
- the position of each transport device and its destination, including a stopped state;
- whether an object is loaded on each transport device and, if so, that object's destination;
- the location and destination of each object waiting to be loaded onto a transport device.

A logistics rule learning method according to the present invention creates logistics rules for a plurality of transport devices that transport objects. It includes the following steps:
- a first step of reproducing a logistics environment, including the movements of the transport devices;
- a second step of determining the next action of a transport device in the logistics environment and instructing that transport device accordingly;
- a third step of interpreting the state of the logistics environment obtained when the next action instructed to the transport device is reproduced in the first step;
- a fourth step of calculating a reward value that evaluates the transport device's next action, based on the state interpreted in the third step;
- a fifth step of revising and learning the logistics rules, using as inputs the next action instructed in the second step, the state interpreted in the third step, and the reward value calculated in the fourth step.

A logistics rule learning program according to the present invention causes a computer to create logistics rules for a plurality of transport devices that transport objects. It causes the computer to execute the following processes:
- a first process of reproducing a logistics environment, including the movements of the transport devices;
- a second process of determining the next action of a transport device in the logistics environment and instructing that transport device accordingly;
- a third process of interpreting the state of the logistics environment obtained when the next action instructed to the transport device is reproduced in the first process;
- a fourth process of calculating a reward value that evaluates the transport device's next action, based on the state interpreted in the third process;
- a fifth process of revising and learning the logistics rules, using as inputs the next action instructed in the second process, the state interpreted in the third process, and the reward value calculated in the fourth process.

The logistics rule learning device, logistics rule execution device, logistics rule learning method, and logistics rule learning program according to the present invention make it easy to create logistics rules for a plurality of transport devices that interfere with one another.

FIG. 1 is a block diagram showing the configuration of a logistics control system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the flow of the logistics rule learning process according to an embodiment of the present invention.
FIG. 3 is a diagram showing a configuration example of a simulation model.
FIG. 4 is a diagram showing a configuration example of a DQN.
FIG. 5 is a flowchart showing the flow of the model execution process shown in FIG. 2.
FIG. 6 is a diagram showing an example in which the simulation model shown in FIG. 3 has learned, according to the present invention, logistics rules for transporting 10 cargoes.

A logistics control system according to an embodiment of the present invention is described below with reference to the drawings.

〔Configuration〕
First, the configuration of a logistics control system according to an embodiment of the present invention is described with reference to FIG. 1.

FIG. 1 is a block diagram showing the configuration of a logistics control system according to an embodiment of the present invention. As shown in FIG. 1, the logistics control system 1 includes a learning device 2, a common device 3, and an execution device 4.

The learning device 2 is a well-known information processing device such as a computer. It creates, through a learning process, logistics rules for a plurality of transport devices that transport objects. In the present embodiment, the arithmetic processing unit inside the information processing device executes a computer program, and the learning device 2 thereby functions as a simulator unit 21, a learning-time next-action determination unit 22, a learning-time state interpretation unit 23, a reward calculation unit 24, and a learning unit 25.

The simulator unit 21 uses the information output from the learning-time next-action determination unit 22 to reproduce, by simulation, the logistics environment of the plurality of transport devices as it would operate in real space, including the movements of the transport devices.

The learning-time next-action determination unit 22 determines the next action of a transport device in the logistics environment reproduced by the simulator unit 21. It then outputs information indicating that next action to the simulator unit 21.

The learning-time state interpretation unit 23 interprets the event state of the logistics environment produced by the next action that the learning-time next-action determination unit 22 determined. It then outputs information indicating that event state to the reward calculation unit 24 and the learning unit 25.

The reward calculation unit 24 calculates a reward value that evaluates the next action determined by the learning-time next-action determination unit 22. It bases this on the event state interpreted by the learning-time state interpretation unit 23, and outputs information indicating the calculated reward value to the learning unit 25.

The learning unit 25 repeats a neural-network learning process. Its learning data are the next action determined by the learning-time next-action determination unit 22, the event state interpreted by the learning-time state interpretation unit 23, and the reward value calculated by the reward calculation unit 24. The result is a neural network that takes the event state of the logistics environment as input and outputs the values that determine each transport device's next action. This network serves as the logistics rules for the plurality of transport devices.

The common device 3 is connected to the learning device 2 and the execution device 4, and includes a learning result storage unit 31. The learning result storage unit 31 is a non-volatile storage device. It stores, as the learning result, information about the structure and weights of the neural network learned by the learning unit 25.

The execution device 4 is a well-known information processing device such as a computer. While objects are being transported (at execution time), it uses the logistics rules created by the learning process to determine each transport device's next action. In the present embodiment, the arithmetic processing unit inside the information processing device executes a computer program, and the execution device 4 thereby functions as a learning result reading unit 41, a runtime state interpretation unit 42, and a runtime next-action determination unit 43.

The learning result reading unit 41 reads the information about the structure and weights of the learned neural network from the learning result storage unit 31. It then reconstructs the learned neural network (the logistics rules).
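Storing and reconstructing a learned network can be sketched as two steps: serialize its layer sizes (the structure) and weight matrices to a file, then rebuild and validate them later. This sketch uses plain JSON and a list-of-lists weight format purely for illustration. The patent does not specify a file format, and the function names are hypothetical. A real system would more likely use its deep-learning framework's own save/load facilities.

```python
import json
from typing import List


def save_network(path: str, layer_sizes: List[int],
                 weights: List[List[List[float]]],
                 biases: List[List[float]]) -> None:
    """Write the structure (layer sizes) and parameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"layer_sizes": layer_sizes,
                   "weights": weights, "biases": biases}, f)


def load_network(path: str):
    """Read the file back and check that the weights match the stated structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    sizes = data["layer_sizes"]
    for k, (w, b) in enumerate(zip(data["weights"], data["biases"])):
        # Weight matrix k maps layer k (columns) to layer k+1 (rows).
        if len(w) != sizes[k + 1] or any(len(row) != sizes[k] for row in w):
            raise ValueError(f"weight shape mismatch in layer {k}")
        if len(b) != sizes[k + 1]:
            raise ValueError(f"bias shape mismatch in layer {k}")
    return sizes, data["weights"], data["biases"]
```

Checking shapes on load catches a mismatch between the stored structure and the stored weights before the execution device starts issuing instructions.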

The runtime state interpretation unit 42 interprets the event state of one of two logistics environments: the one in real space, or one reproduced by a second simulator unit that has the same configuration as the simulator unit 21. Either of these is referred to below as the logistics environment of the real space/second simulator unit 5. The unit then outputs information indicating the interpreted event state to the runtime next-action determination unit 43. The second simulator unit is provided so that the behavior of the learned logistics rules can be checked, and their performance evaluated, before they are applied to the actual logistics environment. The simulator unit 21 may be reused as-is as the second simulator unit, or the second simulator unit may be configured separately, as shown here.

The runtime next-action determination unit 43 feeds the event state of the logistics environment of the real space/second simulator unit 5, as interpreted by the runtime state interpretation unit 42, into the learned neural network. From the network's output, it determines the next action of each transport device in the logistics environment. For example, when the learned neural network outputs action evaluation function values (Q values), the unit selects the action with the maximum Q value as the transport device's next action. It then instructs the transport device in the logistics environment of the real space/second simulator unit 5 to perform that next action.
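At execution time the decision reduces to a greedy argmax over the learned network's Q values, with no exploration. In this minimal sketch, `q_network` stands for any callable that maps an interpreted state vector to one Q value per candidate destination segment. That interface, and the function name, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def decide_next_action(q_network: Callable[[Sequence[float]], Sequence[float]],
                       state: Sequence[float]) -> int:
    """Return the 1-based destination segment with the largest Q value.

    Python's max() returns the first maximal element, so ties go to the
    lowest segment index.
    """
    q = q_network(state)
    best = max(range(len(q)), key=lambda k: q[k])
    return best + 1  # segments are numbered L01, L02, ... in the embodiment
```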

By executing the logistics rule learning process described below, the logistics control system makes it easy to create logistics rules for a plurality of transport devices that interfere with one another. The following description, with reference to FIG. 2, covers how the logistics control system 1 operates when executing the logistics rule learning process according to an embodiment of the present invention.

〔Logistics Rule Learning Process〕
FIG. 2 is a flowchart showing the flow of the logistics rule learning process according to an embodiment of the present invention. The flowchart in FIG. 2 starts when a command to execute the logistics rule learning process is input, and the process then proceeds to step S1.

The logistics rule learning process is described below through an example: creating, by learning, logistics rules for the logistics environment reproduced by the simulation model 6 shown in FIG. 3. The simulation model 6 is a discrete-event simulation model. Objects to be transported (hereinafter, cargo) are generated at generation locations P1 to P3. Cranes #1, #2, and #3, all traveling on the same track (L01 to L11), carry each cargo to the destination D1 to D4 specified for it. In the logistics rule learning process of the present embodiment, logistics rules are created for cranes #1, #2, and #3 so that all N cargoes are transported correctly. The simulation model 6 operates under the following assumptions.

［Cargo］
・Cargo is generated randomly at generation locations P1 to P3. The generation intervals follow a probability distribution given in advance.
・The destination D1 to D4 of each cargo is also determined randomly.
・A cargo disappears once it has been carried to its destination D1 to D4.
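The cargo-generation assumptions can be sketched as follows. The patent says only that the intervals follow a given probability distribution. The exponential distribution, the mean interval, and the generator name below are assumptions chosen for illustration. Seeding the generator makes an episode reproducible, in line with the random seed set during model initialization in step S5.

```python
import random
from typing import Iterator, Tuple

LOCATIONS = ["P1", "P2", "P3"]
DESTINATIONS = ["D1", "D2", "D3", "D4"]


def cargo_arrivals(seed: int, mean_interval: float = 30.0,
                   count: int = 10) -> Iterator[Tuple[float, str, str]]:
    """Yield (time, generation location, destination) for `count` cargoes.

    Inter-arrival times are exponential with the given mean (an assumed
    choice); the location and destination are drawn uniformly at random.
    """
    rng = random.Random(seed)
    t = 0.0
    for _ in range(count):
        t += rng.expovariate(1.0 / mean_interval)
        yield t, rng.choice(LOCATIONS), rng.choice(DESTINATIONS)
```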

［Crane］
・Each crane travels left and right independently on a track divided into 11 segments.
・Each crane receives a next action and executes it.
・A next action is the crane's destination track segment (L01 to L11). A crane that receives a next action travels to that segment.
・A crane can carry only one cargo at a time.
・The crane travel speed and the times required to load and unload cargo are given in advance.
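Under these rules, executing an instructed next action means traveling to the destination segment, so the time it takes is the crane's travel time to that segment. This sketch assumes equal-length segments and a speed measured in segments per unit time. The patent states only that speed and handling times are given in advance. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Crane:
    position: int               # current segment, 1..11
    speed: float                # segments per unit time (assumed unit)
    load: Optional[str] = None  # destination of the single cargo carried, if any

    def travel_time(self, target: int) -> float:
        """Time to move from the current segment to `target` at constant speed."""
        if not 1 <= target <= 11:
            raise ValueError("target must be one of L01..L11")
        return abs(target - self.position) / self.speed

    def pick_up(self, destination: str, load_time: float) -> float:
        """Load one cargo and return the loading time; only one cargo at a time."""
        if self.load is not None:
            raise RuntimeError("crane already holds a cargo")
        self.load = destination
        return load_time
```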

［Simulation End Conditions］
・The simulation ends successfully when N cargoes have been transported to their correct destinations.
・The simulation ends unsuccessfully when two or more cargoes are waiting at any of the generation locations P1 to P3.
・The simulation ends unsuccessfully when cranes collide with each other.
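The three end conditions can be checked after each discrete event. In this sketch, a collision is approximated as two cranes occupying the same segment or passing each other on the shared track. That definition is an assumption, since the patent does not define its collision test. The parameter `n_target` corresponds to N, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def check_end(delivered: int, n_target: int,
              waiting_per_location: Dict[str, int],
              crane_positions: List[int]) -> Optional[str]:
    """Return 'success', 'failure', or None if the episode continues.

    crane_positions is ordered #1, #2, #3 along the track. Because the cranes
    share one track they can never pass each other, so any pair whose
    positions are not strictly increasing is treated as a collision.
    """
    collided = any(a >= b for a, b in zip(crane_positions, crane_positions[1:]))
    if collided:
        return "failure"
    if any(n >= 2 for n in waiting_per_location.values()):
        return "failure"  # two or more cargoes waiting at one generation location
    if delivered >= n_target:
        return "success"
    return None
```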

In this embodiment, DQN (Deep Q-Network) is used as the learning algorithm. DQN builds on Q-learning, a reinforcement learning method, by representing the action evaluation function value (Q value) with a deep neural network, which makes that representation more expressive. FIG. 4 shows a configuration example of the DQN, in which the environment is the simulation model 6 shown in FIG. 3. At time t, the agent determines the next action a(i, t) (i = 1 to 3) of cranes #1, #2, and #3 and instructs the environment accordingly. It then obtains the environment's event state s(t) and the reward value r(i, t). For each crane, the agent updates the action evaluation function value Q(s, a) of each next action a (a = a_1 to a_M) available from the event state s(t), using equation (1). In equation (1), n denotes each agent. The deep neural network learns by treating the value of equation (2) as the correct (target) value and adjusting the weights of each layer to minimize the value of equation (3). In equation (2), A(s(t+1)) denotes the set of actions available in state s(t+1). In equation (3), D denotes the set of episodes, where one episode is one learning unit.
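Equations (1) to (3) are referenced above but not reproduced in this text. Assuming they take the standard Q-learning update and DQN loss, which fits the surrounding definitions of α, γ, A(s(t+1)), and D, they would read as follows. The per-agent index n is omitted here, and the exact form in the patent may differ.

```latex
\begin{align}
Q(s(t), a) &\leftarrow Q(s(t), a) + \alpha \Bigl[\, r(i,t) + \gamma \max_{a' \in A(s(t+1))} Q(s(t+1), a') - Q(s(t), a) \Bigr] \tag{1}\\
y(t) &= r(i,t) + \gamma \max_{a' \in A(s(t+1))} Q(s(t+1), a') \tag{2}\\
L &= \sum_{D} \bigl( y(t) - Q(s(t), a) \bigr)^{2} \tag{3}
\end{align}
```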

In equation (1), α (0 < α < 1) is the learning rate, which determines the speed of learning. γ (0 < γ < 1) is the discount rate, which attenuates the Q value so that it does not diverge. When logistics rules are learned over a time series, the Q values are updated by tracing the sequence of next actions back into the past.
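The roles of the learning rate α and the discount rate γ can be illustrated with a tabular Q-learning update, the tabular form that DQN replaces with a neural network. This is a sketch, not the patent's implementation. The dictionary-based table, the function name, and the numbers in the usage lines are made up for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_update(q: QTable, s, a: int, r: float, s_next,
             next_actions: Iterable[int], alpha: float, gamma: float,
             terminal: bool = False) -> float:
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a').

    q must be a defaultdict(float) so that unseen (state, action) pairs start at 0.
    """
    best_next = 0.0 if terminal else max(q[(s_next, b)] for b in next_actions)
    target = r + gamma * best_next
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


q = defaultdict(float)
q[("s1", 2)] = 0.5  # current estimate of Q(s(t+1), a = L02)
print(q_update(q, "s0", 1, r=0.2, s_next="s1", next_actions=range(1, 12),
               alpha=0.5, gamma=0.9))
```

Here the target is 0.2 + 0.9 × 0.5 = 0.65. Because α = 0.5, Q("s0", 1) moves halfway from 0 toward the target, to 0.325.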

To produce more efficient logistics rules, the DQN does not always select the action with the highest Q value. It also uses the ε-greedy method, which selects a random action with a fixed probability. In the ε-greedy method, a suitable constant ε is chosen in advance. Each time a next action is selected, a random number between 0 and 1 is generated. If that number is less than or equal to ε, an action is selected at random. If it is greater than ε, the action with the largest Q value is selected.
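The ε-greedy rule described above can be sketched as follows. `q_values` holds one Q value per candidate next action. The function name and the explicitly passed, seedable random generator are illustrative choices, not taken from the patent.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Return a 0-based action index.

    Draw u in [0, 1). If u <= epsilon, pick a uniformly random action;
    otherwise pick the action with the largest Q value.
    """
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda k: q_values[k])
```

With ε = 0 this is the purely greedy choice used at execution time. With ε = 1 every action is random. Values in between trade exploration against exploitation during learning.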

In step S1, the learning device 2 initializes the learning unit 25 and the episode count. Specifically, it sets the weights of the learning unit 25's neural network randomly. An episode is the unit of learning. In this embodiment, one episode is the learning performed during one run of the logistics simulation. Step S1 is then complete, and the process proceeds to step S2.

In step S2, the learning device 2 increments the episode count by 1. Step S2 is then complete, and the process proceeds to step S3.

In step S3, the learning device 2 determines whether the episode count is greater than a predetermined number M. If it is (step S3: Yes), the learning device 2 advances the process to step S4. If the episode count is less than or equal to M (step S3: No), the learning device 2 advances the process to step S5.

In step S4, the learning device 2 stores, as the learning result, information about the structure and weights of the neural network learned by the learning unit 25. It stores this in the learning result storage unit 31 of the common device 3. Step S4 is then complete, and the logistics rule learning process ends.
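Steps S1 to S4 form the outer episode loop. In this minimal sketch, three callables are hypothetical placeholders: `init_network` stands for the initialization, `run_episode` for steps S5 onward, and `store_result` for the write to the learning result storage unit 31. `num_episodes` plays the role of M.

```python
from typing import Any, Callable


def learn_logistics_rule(num_episodes: int,
                         init_network: Callable[[], Any],
                         run_episode: Callable[[Any, int], None],
                         store_result: Callable[[Any], None]) -> Any:
    net = init_network()            # S1: random initial weights
    episode = 0                     # S1: episode count starts at zero
    while True:
        episode += 1                # S2: increment the episode count
        if episode > num_episodes:  # S3: more than M episodes done?
            store_result(net)       # S4: save structure and weights, then stop
            return net
        run_episode(net, episode)   # S5 onward: one simulation run = one episode
```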

In step S5, the learning device 2 initializes the simulation model 6 shown in FIG. 3. Specifically, it sets the following:
- the positions of cranes #1, #2, and #3;
- the cargo generation intervals;
- the random seed used to determine cargo destinations;
- the travel speeds of cranes #1, #2, and #3;
- the times required to load and unload cargo.

Step S5 is then complete, and the process proceeds to step S6.

In step S6, the learning-time next-action determination unit 22 uses the ε-greedy method to determine the next action a(i, t) of the crane being processed (the target crane) in the simulation model 6 shown in FIG. 3, and instructs the simulator unit 21 to execute that next action a(i, t). Step S6 is then complete, and the process proceeds to step S7.
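The ε-greedy rule used in step S6 can be sketched as follows, assuming the network's 11 destination Q-values are available as a list. The function name `epsilon_greedy` and the ε value in the usage line are assumptions; the embodiment does not state the ε value or any decay schedule.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Return the index of the chosen destination.

    With probability epsilon, explore by picking a destination uniformly at random;
    otherwise exploit by picking the destination with the largest Q-value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda k: q_values[k])


rng = random.Random(0)
q = [0.1, -0.2, 0.0, 0.3, 0.9, 0.0, 0.0, -0.5, 0.2, 0.0, 0.1]  # 11 destination Q-values
action = epsilon_greedy(q, epsilon=0.1, rng=rng)
```

With `epsilon=0.0` the choice is purely greedy; larger values make the crane try more destinations during learning.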

In step S7, the simulator unit 21 reproduces the logistics environment by executing the simulation model 6 based on the next action a(i, t) of the target crane instructed by the learning-time next-action determination unit 22, and creates state tables representing the state of the logistics environment, shown in Tables 1 to 3 (model execution processing). The model execution processing is described in detail later with reference to the flowchart in FIG. 5. Step S7 is then complete, and the process proceeds to step S8.

In step S8, the learning-time state interpretation unit 23 and the reward calculation unit 24 use the state table created in step S7 to compute the event state s(t) of the logistics environment and the reward value r(i, t) corresponding to the next action a(i, t) of the target crane. In the simulation model 6 shown in FIG. 3, crane #1 and crane #3 move line-symmetrically with respect to each other, so a single neural network may be used to learn the logistics rules for both. The neural network takes as input the 18 values generated by the learning-time state interpretation unit 23 and outputs 11 Q-values relating to the destination of the target crane, that is, to its next action. Table 4 shows an example input to the neural network for cranes #1 and #3, Table 5 an example input for crane #2, and Table 6 an example output. In this embodiment, the reward value lies in the range -1.0 to 1.0, and the values shown in Table 7 are calculated from the state table shown in Table 1.
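A minimal numpy sketch of the network arrangement described for step S8: 18 inputs, 11 Q-value outputs, and one network object shared by cranes #1 and #3. The two-layer architecture, hidden size, and initialization are assumptions, as is the use of clipping to keep rewards inside [-1.0, 1.0]; the text specifies only the input and output sizes, the sharing, and the reward range. The sketch also assumes the state interpretation unit already expresses crane #3's inputs in mirrored coordinates so that the same weights apply to both cranes.

```python
import numpy as np

N_INPUTS, N_OUTPUTS = 18, 11  # from the text: 18 state values in, 11 Q-values out


class QNetwork:
    """Hypothetical two-layer perceptron mapping a state vector to 11 Q-values."""

    def __init__(self, hidden: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (N_INPUTS, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, N_OUTPUTS))
        self.b2 = np.zeros(N_OUTPUTS)

    def q_values(self, state: np.ndarray) -> np.ndarray:
        h = np.maximum(0.0, state @ self.w1 + self.b1)  # ReLU hidden layer
        return h @ self.w2 + self.b2


# Cranes #1 and #3 mirror each other, so they share one network object;
# crane #2 gets its own network.
shared_13 = QNetwork(seed=0)
networks = {1: shared_13, 2: QNetwork(seed=1), 3: shared_13}


def clip_reward(r: float) -> float:
    """Keep a reward inside the [-1.0, 1.0] range used in the embodiment."""
    return float(np.clip(r, -1.0, 1.0))
```

Because cranes #1 and #3 point to the same object, every weight update made from either crane's experience trains the shared rule.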

In step S9, the learning unit 25 updates Equation (1) using the event state s(t) and the reward value r(i, t) computed in step S8, and then, going back through past time steps, updates the weights of the neural network so that Expression (2) is minimized. Step S9 is then complete, and the process proceeds to step S10.
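Equations (1) and (2) are not reproduced in this section. Assuming they correspond to the standard Q-learning target r + γ·max Q(s', a') and a squared temporal-difference loss, step S9 could look like the sketch below. A linear Q-function stands in for the neural network to keep the gradient explicit, past transitions are visited newest-first as one possible reading of updating the weights by going back in time, and `td_update`, `gamma`, and `lr` are hypothetical names and values.

```python
import numpy as np


def td_update(W, b, transitions, gamma=0.95, lr=0.01):
    """Gradient steps on the squared TD error, visiting past transitions newest-first.

    W has shape (11, 18) and b shape (11,): Q(s) = W @ s + b is a linear stand-in
    for the neural network. Each transition is (s, a, r, s_next, done).
    Returns the mean squared TD error measured before each individual update.
    """
    total = 0.0
    for s, a, r, s_next, done in reversed(transitions):
        q_sa = W[a] @ s + b[a]
        # Assumed target: r on a terminal step, else r + gamma * max_a' Q(s', a')
        target = r if done else r + gamma * np.max(W @ s_next + b)
        err = q_sa - target
        total += err ** 2
        # Gradient of err**2 with respect to W[a] and b[a], holding the target fixed
        W[a] -= lr * 2.0 * err * s
        b[a] -= lr * 2.0 * err
    return total / len(transitions)
```

Visiting the newest transition first lets a terminal reward propagate into earlier state values within a single pass.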

In step S10, the learning device 2 determines whether the event in the logistics environment caused by the target crane's next action is an end event. Specifically, the learning device 2 checks the following flags in Table 1: the collision flag (whether cranes have collided; 0: no collision, 1: collision), the retention flag (whether any load is held up waiting; 0: no retention, 1: retention), and the end flag (whether all N loads have been transported; 0: transport not finished, 1: transport finished). If any of these flags is 1, the learning device 2 judges the event to be an end event (step S10: Yes) and returns the logistics rule learning process to step S2. If all three flags are 0, the learning device 2 judges that the event is not an end event (step S10: No) and returns the process to step S6.
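The end-event test of step S10 and the episode loop of steps S2 to S10 can be sketched as follows. `reset` and `step` are hypothetical callbacks standing in for step S5 and for steps S6 to S9, and the episode counter is assumed to start at zero (step S1 is outside this section).

```python
from typing import Callable, Dict


def is_end_event(flags: Dict[str, int]) -> bool:
    """Step S10: the episode ends if the collision, retention, or end flag is 1."""
    return any(flags[k] == 1 for k in ("collision", "retention", "end"))


def train(M: int, reset: Callable[[], None], step: Callable[[], Dict[str, int]]) -> int:
    """Skeleton of steps S2 to S10; returns the total number of crane actions taken."""
    episode, total_steps = 0, 0
    while True:
        episode += 1                  # S2: increment the episode count
        if episode > M:               # S3: Yes -> S4 (store learning result) and finish
            return total_steps
        reset()                       # S5: initialize the simulation model
        while True:
            flags = step()            # S6-S9: act, simulate, evaluate, update weights
            total_steps += 1
            if is_end_event(flags):   # S10: Yes -> back to S2; No -> back to S6
                break
```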

[Model execution processing]
Next, the model execution processing of step S7 is described in detail with reference to FIG. 5.

FIG. 5 is a flowchart showing the flow of the model execution processing of FIG. 2. The flowchart in FIG. 5 starts when step S6 in FIG. 2 is complete, and the model execution processing proceeds to step S21.

In step S21, the simulator unit 21 initializes to 0 all flags used in the model execution processing (the collision, retention, end, loading, and unloading flags shown in Table 1). Step S21 is then complete, and the model execution processing proceeds to step S22.

In step S22, the simulator unit 21 starts moving the target crane toward the destination specified by the next action determined by the learning-time next-action determination unit 22. Step S22 is then complete, and the model execution processing proceeds to step S23.

In step S23, the simulator unit 21 runs the simulation model 6 until a predetermined event occurs. Step S23 is then complete, and the model execution processing proceeds to step S24.
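Claim 2 describes the simulator unit as a discrete event simulator. A minimal event-queue core for step S23, built on Python's `heapq`, might look like this; the class name, the event kinds, and the `handle` callback are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete-event core; events are (time, seq, kind, payload) tuples."""

    def __init__(self):
        self.now = 0.0
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def schedule(self, delay: float, kind: str, payload=None):
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), kind, payload))

    def run_until(self, stop_kinds: set, handle=lambda kind, payload: None):
        """Advance simulated time event by event until one of stop_kinds occurs.

        Events of other kinds are passed to handle() and the run continues.
        Returns (kind, payload) of the stopping event, or (None, None) if none remain.
        """
        while self._heap:
            t, _, kind, payload = heapq.heappop(self._heap)
            self.now = t
            if kind in stop_kinds:
                return kind, payload
            handle(kind, payload)
        return None, None
```

Time jumps directly from one event to the next, so the learning agent is only consulted at the decision points that matter rather than at every clock tick.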

In step S24, the simulator unit 21 determines whether cranes have collided. If a collision has occurred (step S24: Yes), the simulator unit 21 advances the model execution processing to step S25; otherwise (step S24: No), it advances the processing to step S26.

In step S25, the simulator unit 21 sets the collision flag to 1 (collision occurred). Step S25 is then complete, and the model execution processing proceeds to step S39.

In step S26, the simulator unit 21 determines whether the target crane has arrived at its destination. If it has (step S26: Yes), the simulator unit 21 advances the model execution processing to step S30; if not (step S26: No), it advances the processing to step S27.

In step S27, the simulator unit 21 determines whether a load has been generated at any of the generation locations P1 to P3. If so (step S27: Yes), the simulator unit 21 advances the model execution processing to step S28; if not (step S27: No), it returns the processing to step S23.

In step S28, the simulator unit 21 determines whether two or more loads are waiting at the generation locations P1 to P3. If so (step S28: Yes), the simulator unit 21 advances the model execution processing to step S29; if not (step S28: No), it advances the processing to step S39.

In step S29, the simulator unit 21 sets the retention flag to 1 (retention present). Step S29 is then complete, and the model execution processing proceeds to step S39.

In step S30, the simulator unit 21 determines whether the target crane is carrying a load. If it is (step S30: Yes), the simulator unit 21 advances the model execution processing to step S34; if not (step S30: No), it advances the processing to step S31.

In step S31, the simulator unit 21 determines whether the place where the target crane has arrived is a generation location with a load waiting. If so (step S31: Yes), the simulator unit 21 advances the model execution processing to step S32; if not (step S31: No), it advances the processing to step S39.

In step S32, the simulator unit 21 loads the load waiting at the generation location onto the target crane. Step S32 is then complete, and the model execution processing proceeds to step S33.

In step S33, the simulator unit 21 sets the loading flag (indicating whether loading succeeded) to 1 (loading succeeded). Step S33 is then complete, and the model execution processing proceeds to step S39.

In step S34, the simulator unit 21 determines whether the place where the target crane has arrived is the destination of its load. If so (step S34: Yes), the simulator unit 21 advances the model execution processing to step S35; if not (step S34: No), it advances the processing to step S39.

In step S35, the simulator unit 21 unloads the load from the target crane at its arrival point. Step S35 is then complete, and the model execution processing proceeds to step S36.

In step S36, the simulator unit 21 sets the unloading flag (indicating whether unloading succeeded) to 1 (unloading succeeded). Step S36 is then complete, and the model execution processing proceeds to step S37.

In step S37, the simulator unit 21 determines whether all N loads have been transported. If so (step S37: Yes), the simulator unit 21 advances the model execution processing to step S38; if not (step S37: No), it advances the processing to step S39.

In step S38, the simulator unit 21 sets the end flag to 1 (transport finished). Step S38 is then complete, and the model execution processing proceeds to step S39.

In step S39, the simulator unit 21 creates the state tables shown in Tables 1 to 3 based on the results of steps S21 to S38. Step S39 is then complete, and the series of model execution steps ends.
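The branching of steps S21 to S38 can be condensed into one function that maps the event which stopped step S23 to the flags of Table 1. This sketch models a single crane, represents a load by its destination name, and reads step S28 as "two or more loads waiting at any single generation location"; these are simplifying assumptions, and the function and key names are hypothetical. Events for which step S27 answers No simply return to step S23 and are not passed to this function.

```python
from typing import Dict

FLAG_NAMES = ("collision", "retention", "end", "loaded", "unloaded")


def model_execution(event: str, world: dict) -> Dict[str, int]:
    """Flag logic of steps S21 to S38 for one event.

    event: "collision", "arrived", or "load_generated".
    world: {"location", "load", "waiting", "delivered", "n_total"}, where "load" is the
    destination of the load on board (or None) and "waiting" maps each generation
    location to a list of waiting loads' destinations.
    """
    flags = {name: 0 for name in FLAG_NAMES}                    # S21
    if event == "collision":                                    # S24: Yes
        flags["collision"] = 1                                  # S25
    elif event == "arrived":                                    # S26: Yes
        here = world["location"]
        queue = world["waiting"].get(here, [])
        if world["load"] is None:                               # S30: No load on board
            if queue:                                           # S31: load waiting here
                world["load"] = queue.pop(0)                    # S32: load it
                flags["loaded"] = 1                             # S33
        elif world["load"] == here:                             # S34: at the load's destination
            world["load"] = None                                # S35: unload
            world["delivered"] += 1
            flags["unloaded"] = 1                               # S36
            if world["delivered"] >= world["n_total"]:          # S37
                flags["end"] = 1                                # S38
    elif event == "load_generated":                             # S27: Yes
        if any(len(q) >= 2 for q in world["waiting"].values()): # S28
            flags["retention"] = 1                              # S29
    return flags                                                # S39 builds the state tables
```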

As is clear from the above description, in the logistics control system 1 according to an embodiment of the present invention, the learning device 2 includes: the simulator unit 21, which reproduces the logistics environment including the movement of the transportation equipment; the learning-time next-action determination unit 22, which determines the next action of the transportation equipment in the logistics environment and instructs the transportation equipment accordingly; the learning-time state interpretation unit 23, which interprets the state of the logistics environment obtained when the simulator unit 21 reproduces the instructed next action; the reward calculation unit 24, which calculates a reward value evaluating the next action of the transportation equipment based on the state interpreted by the learning-time state interpretation unit 23; and the learning unit 25, which corrects and learns the logistics rules using as inputs the next action instructed by the learning-time next-action determination unit 22, the state interpreted by the learning-time state interpretation unit 23, and the reward value calculated by the reward calculation unit 24. Logistics rules for multiple pieces of transportation equipment that interfere with one another can therefore be created easily. The learned rules are stored in the learning result storage unit 31 and read into the execution device 4 by its learning result reading unit 41. At execution time, the runtime state interpretation unit 42 determines the state of the real-space logistics environment, the runtime next-action determination unit 43 decides the actual action, and an execution command is sent to the transportation equipment or the like. In this way, multiple pieces of transportation equipment can be controlled according to the logistics rules learned by the learning device 2.
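The hand-off from the learning device to the execution device can be sketched as saving and reloading the learned weights. Here an `.npz` file stands in for the learning result storage unit 31, the weights are those of a linear Q-function, and the runtime choice is purely greedy with no exploration; these are assumptions, and the function names are hypothetical.

```python
import numpy as np


def store_learning_result(path: str, weights: dict) -> None:
    """Learning side (step S4): persist the weights; an .npz file stands in for unit 31."""
    np.savez(path, **weights)


def load_learning_result(path: str) -> dict:
    """Execution side: plays the role of the learning result reading unit 41."""
    with np.load(path) as data:
        return {k: data[k] for k in data.files}


def runtime_next_action(weights: dict, state: np.ndarray) -> int:
    """Plays the role of the runtime next-action determination unit 43 (greedy choice)."""
    q = state @ weights["W"].T + weights["b"]  # 11 destination Q-values
    return int(np.argmax(q))
```

Keeping exploration out of the runtime path means the execution device always issues the destination the learned rule rates highest for the observed state.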

Finally, an example applying the present invention is presented. FIG. 6 shows an example in which logistics rules for transporting 10 loads were learned according to the present invention using the simulation model shown in FIG. 3. In this example, the number of learning iterations is 200,000, and the average reward value after learning is positive. This indicates that operating according to the learned logistics rules succeeded: the loads are transported to their designated locations while interference between the cranes is avoided, demonstrating the effect of applying the present invention.

Embodiments applying the invention made by the present inventors have been described above, but the present invention is not limited by the description and drawings that form part of this disclosure. Other embodiments, examples, operating techniques, and the like made by those skilled in the art on the basis of this embodiment are all included in the scope of the present invention.

1 Logistics control system
2 Learning device
3 Common device
4 Execution device
5 Real space / second simulator unit
6 Simulation model
21 Simulator unit
22 Learning-time next-action determination unit
23 Learning-time state interpretation unit
24 Reward calculation unit
25 Learning unit
31 Learning result storage unit
41 Learning result reading unit
42 Runtime state interpretation unit
43 Runtime next-action determination unit

Claims

1. A logistics rule learning device for creating logistics rules for a plurality of pieces of transportation equipment that transport objects, the device comprising:
a simulator unit that reproduces a logistics environment including the movement of the transportation equipment;
a learning-time next-action determination unit that determines a next action of the transportation equipment in the logistics environment and instructs the transportation equipment;
a learning-time state interpretation unit that interprets the state of the logistics environment obtained when the simulator unit reproduces the next action instructed to the transportation equipment;
a reward calculation unit that calculates a reward value evaluating the next action of the transportation equipment based on the state of the logistics environment interpreted by the learning-time state interpretation unit; and
a learning unit that corrects and learns the logistics rules using, as inputs, the next action of the transportation equipment instructed by the learning-time next-action determination unit, the state of the logistics environment interpreted by the learning-time state interpretation unit, and the reward value calculated by the reward calculation unit,
wherein the learning unit uses the same neural network to learn for a plurality of pieces of transportation equipment whose movements are line-symmetric with each other.

2. The logistics rule learning device according to claim 1, wherein the simulator unit is a discrete event simulator that reproduces the logistics environment including the movement of the transportation equipment.

3. The logistics rule learning device according to claim 2, wherein the state of the logistics environment interpreted by the learning-time state interpretation unit is the state of the logistics environment at the time a discrete event occurs.

4. The logistics rule learning device according to claim 3, wherein the state of the logistics environment at the time a discrete event occurs includes the location and destination of the transportation equipment, including a stopped state; whether an object is loaded on the transportation equipment and the destination of that object; and the location and destination of objects to be loaded onto the transportation equipment.

5. The logistics rule learning device according to any one of claims 1 to 4, wherein the reward calculation unit calculates the reward value based on at least one of a value for correct transport of the object by the transportation equipment, a value for the transport time within one unit of learning, and a value for transport failure within one unit of learning.

6. The logistics rule learning device according to any one of claims 1 to 5, wherein the learning unit performs learning using a neural network.

7. A logistics rule execution device comprising:
a learning result storage unit that stores learning results of the learning unit provided in the logistics rule learning device according to claim 1;
a learning result reading unit that reads the learning results from the learning result storage unit;
a runtime state interpretation unit that interprets the state of a logistics environment in real space or the state of a logistics environment reproduced by a second simulator; and
a runtime next-action determination unit that determines a next action of the transportation equipment in the logistics environment using the learning results read by the learning result reading unit and the state of the logistics environment interpreted by the runtime state interpretation unit, and instructs the transportation equipment accordingly.

8. The logistics rule execution device according to claim 7, wherein the state of the logistics environment interpreted by the runtime state interpretation unit is the state of the logistics environment at the time a discrete event occurs.

9. The logistics rule execution device according to claim 8, wherein the state of the logistics environment at the time a discrete event occurs includes the location and destination of the transportation equipment, including a stopped state; whether an object is loaded on the transportation equipment and the destination of that object; and the location and destination of objects to be loaded onto the transportation equipment.

10. A logistics rule learning method for creating logistics rules for a plurality of pieces of transportation equipment that transport objects, the method comprising:
a first step of reproducing a logistics environment including the movement of the transportation equipment;
a second step of determining a next action of the transportation equipment in the logistics environment and instructing the transportation equipment;
a third step of interpreting the state of the logistics environment obtained by reproducing, in the first step, the next action instructed to the transportation equipment;
a fourth step of calculating a reward value evaluating the next action of the transportation equipment based on the state of the logistics environment interpreted in the third step; and
a fifth step of correcting and learning the logistics rules using, as inputs, the next action of the transportation equipment instructed in the second step, the state of the logistics environment interpreted in the third step, and the reward value calculated in the fourth step,
wherein, in the fifth step, learning is performed using the same neural network for a plurality of pieces of transportation equipment whose movements are line-symmetric with each other.

11. A logistics rule learning program that causes a computer to execute processing for creating logistics rules for a plurality of pieces of transportation equipment that transport objects, the program causing the computer to execute:
a first process of reproducing a logistics environment including the movement of the transportation equipment;
a second process of determining a next action of the transportation equipment in the logistics environment and instructing the transportation equipment;
a third process of interpreting the state of the logistics environment obtained by reproducing, in the first process, the next action instructed to the transportation equipment;
a fourth process of calculating a reward value evaluating the next action of the transportation equipment based on the state of the logistics environment interpreted in the third process; and
a fifth process of correcting and learning the logistics rules using, as inputs, the next action of the transportation equipment instructed in the second process, the state of the logistics environment interpreted in the third process, and the reward value calculated in the fourth process,
wherein, in the fifth process, learning is performed using the same neural network for a plurality of pieces of transportation equipment whose movements are line-symmetric with each other.