JP6663064B1

JP6663064B1 - Order management device, order management method and order management program

Info

Publication number: JP6663064B1
Application number: JP2019093580A
Authority: JP
Inventors: 浩詩末次
Original assignee: ｓｇｌａｂ株式会社
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2020-03-11
Anticipated expiration: 2039-05-17
Also published as: JP2020187681A

Abstract

PROBLEM TO BE SOLVED: To provide an order management device and the like that can order a plurality of articles more efficiently by applying reinforcement learning to the selection of order quantities. SOLUTION: An order management device includes: a first estimation unit that estimates, using a plurality of estimation models set for a plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined quantity of the plurality of articles; a selection unit that selects an action based on the first values; a reward calculation unit that calculates a reward for each of the plurality of articles based on the storage cost of the plurality of articles and the transportation cost of the ordered articles when the action is performed; a second estimation unit that estimates, using the plurality of estimation models, second values of the plurality of evaluation functions for the actions that can be taken in the state after the action is performed; and an update unit that updates the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the reward and the second value multiplied by a discount rate. [Selected drawing] FIG. 1

Description

The present invention relates to an order management device, an order management method, and an order management program.

In businesses that require purchasing, such as retail, goods are purchased from producers and wholesalers to secure inventory and are then sold to consumers. When a plurality of products are sold, the methods described in Non-Patent Document 1 and Non-Patent Document 2, for example, are known as methods of determining purchase quantities so that the inventory of each product is not depleted.

Patent Document 1 below describes an order quantity management device that manages order quantities based on the predicted stock quantity of raw materials in a store. The order quantity management device calculates the predicted usage of the raw materials used in the store, calculates the quantity scheduled to arrive on each delivery date, calculates the predicted stock quantity of the store from the most recent inventory count, the predicted usage, and the delivery quantity, and displays the predicted usage, the delivery quantity, and the predicted stock quantity for each delivery date.

Meanwhile, a machine learning technique called reinforcement learning has been studied in recent years. For example, Non-Patent Document 3 below proposes a neural network architecture called the branching deep Q-network, which is applicable even when the action space is high-dimensional.

Patent Document 1: JP 2018-128862 A

Non-Patent Document 1: J. L. Balintfy, "On a basic class of multi-item inventory problems," Management Science, vol. 10, no. 2, pp. 287-297, 1964.
Non-Patent Document 2: A. Ishigaki and Y. Hirakawa, "Design of a economic order-point system based on forecasted inventory positions," Journal of Japan Industrial Management Association, vol. 59, no. 4, pp. 290-295, 2008.
Non-Patent Document 3: A. Tavakoli, F. Pardo, and P. Kormushev, "Action branching architectures for deep reinforcement learning," CoRR, vol. abs/1711.08946, 2017.

For example, when order quantities for a plurality of articles are determined using the method described in Non-Patent Document 1, the resulting order is not always optimal. For example, when a predetermined number of articles are grouped on one pallet, a plurality of pallets are loaded into a container, and the container is transported, the method described in Non-Patent Document 1 cannot use the container efficiently, and surplus cost may be incurred in purchasing.

Accordingly, the present invention provides an order management device, an order management method, and an order management program that apply reinforcement learning to the selection of order quantities and can thereby order a plurality of articles more efficiently.

An order management device according to one aspect of the present invention includes: a first estimation unit that estimates, using a plurality of estimation models set for a plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined quantity of the plurality of articles; a selection unit that selects an action based on the first values; a reward calculation unit that calculates a reward for each of the plurality of articles based on the storage cost of the plurality of articles and the transportation cost of the ordered articles when the action is performed; a second estimation unit that estimates, using the plurality of estimation models, second values of the plurality of evaluation functions for the actions that can be taken in the state after the action is performed; and an update unit that updates the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the reward and the second value multiplied by a discount rate.

According to this aspect, the reinforcement learning reward is calculated for each of the plurality of articles and the parameters of the plurality of estimation models set for the plurality of articles are updated, so that more appropriate values of the evaluation functions can be estimated and the plurality of articles can be ordered more efficiently.

In the above aspect, the reward calculation unit may calculate the reward for each of the plurality of articles from the sum of the storage cost and the value obtained by dividing the transportation cost by the number of ordered articles.

According to this aspect, rewards can be given so that a plurality of articles are ordered at the same time, and even when a plurality of articles are loaded into a container and transported, the plurality of articles can be ordered in a way that keeps costs down.

In the above aspect, the reward calculation unit may calculate the reward for each of the plurality of articles from the sum of the storage cost, the value obtained by dividing the transportation cost by the number of ordered articles, and a penalty cost incurred when any of the plurality of articles is out of stock.

According to this aspect, rewards can be given so that the plurality of articles do not go out of stock, and the plurality of articles can be ordered so that the probability of their stock running out is reduced.

In the above aspect, the selection unit may select an action at random with a predetermined probability, and select the action that maximizes the first value with a probability obtained by subtracting the predetermined probability from 1.

According to this aspect, orders can be placed more efficiently while balancing the exploration of new actions against the selection of the action that is empirically best.

In the above aspect, each of the plurality of estimation models may include a first model that estimates a state value for the state and a second model that estimates an advantage function of an action in the state.

According to this aspect, of the estimation model that estimates the value of the evaluation function, the part that depends only on the state is estimated by the first model and the part that depends on both the state and the action is estimated by the second model, so that more appropriate values of the evaluation functions can be estimated and the plurality of articles can be ordered more efficiently.
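As a minimal sketch of how the outputs of the two models could be combined for one article, the following assumes the mean-subtracted aggregation commonly used with dueling and branching Q-networks. The text above does not fix the combination rule, so the rule itself, the function name, and the sample numbers are all assumptions for illustration.

```python
def combine_value_and_advantage(state_value, advantages):
    """Return Q_d(s, a) for every candidate order action a of one article d.

    state_value: output of the first model, V(s), which depends on the state only.
    advantages:  outputs of the second model, A_d(s, a), one per candidate action.
    Assumption: Q_d(s, a) = V(s) + A_d(s, a) - mean over a' of A_d(s, a').
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Hypothetical numbers: one state value and three candidate order quantities.
q_values = combine_value_and_advantage(-10.0, [1.0, 2.0, 3.0])
```

Subtracting the mean advantage keeps the split between the state-only part and the action-dependent part identifiable; other aggregation rules would fit the description equally well.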

In the above aspect, the update unit may update the parameters so as to reduce the expected value, taken over the states, actions, rewards, and post-action states recorded in the past for the plurality of articles, of the square of the difference between the first value and the sum of the reward and the second value multiplied by the discount rate.

According to this aspect, instability when updating the parameters of the estimation models can be suppressed, and more appropriate values of the evaluation functions can be estimated.

In the above aspect, the update unit may update the parameters so as to reduce the expected value, taken over the states, actions, rewards, and post-action states recorded in the past for the plurality of articles, of the Huber loss function of the difference between the first value and the sum of the reward and the second value multiplied by the discount rate.

According to this aspect, instability caused by outliers when updating the parameters of the estimation models can be suppressed, and more appropriate values of the evaluation functions can be estimated.

In the above aspect, the state may include the number of articles in transit at the predetermined time point.

According to this aspect, the values of the evaluation functions can be estimated in consideration of not only the stock quantities of the plurality of articles but also the number of articles in transit, and the plurality of articles can be ordered more efficiently.

In the above aspect, the state may include an estimate of the stock quantity after the period required for transportation has elapsed from the predetermined time point.

According to this aspect, the values of the evaluation functions can be estimated in consideration of not only the current stock quantities of the plurality of articles but also their future stock quantities, and the plurality of articles can be ordered more efficiently.

In the above aspect, the state may include an estimate of the demand for the articles from the predetermined time point until the period required for transportation has elapsed.

According to this aspect, the values of the evaluation functions can be estimated in consideration of not only the stock quantities of the plurality of articles but also the demand expected to arise before the articles in transit arrive, and the plurality of articles can be ordered more efficiently.

In the above aspect, the state may include an estimate of the demand for the articles over a predetermined period that begins when the period required for transportation has elapsed from the predetermined time point.

According to this aspect, the values of the evaluation functions can be estimated in consideration of not only the stock quantities of the plurality of articles but also the demand expected to arise after the articles in transit arrive, and the plurality of articles can be ordered more efficiently.

An order management method according to another aspect of the present invention includes: estimating, using a plurality of estimation models set for a plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined quantity of the plurality of articles; selecting an action based on the first values; calculating a reward for each of the plurality of articles based on the storage cost of the plurality of articles and the transportation cost of the ordered articles when the action is performed; estimating, using the plurality of estimation models, second values of the plurality of evaluation functions for the actions that can be taken in the state after the action is performed; and updating the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the reward and the second value multiplied by a discount rate.

An order management program according to another aspect of the present invention causes a computation unit provided in an order management device to function as: a first estimation unit that estimates, using a plurality of estimation models set for a plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined quantity of the plurality of articles; a selection unit that selects an action based on the first values; a reward calculation unit that calculates a reward for each of the plurality of articles based on the storage cost of the plurality of articles and the transportation cost of the ordered articles when the action is performed; a second estimation unit that estimates, using the plurality of estimation models, second values of the plurality of evaluation functions for the actions that can be taken in the state after the action is performed; and an update unit that updates the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the reward and the second value multiplied by a discount rate.

According to the present invention, it is possible to provide an order management device, an order management method, and an order management program that apply reinforcement learning to the selection of order quantities and can thereby order a plurality of articles more efficiently.

FIG. 1 is a diagram showing functional blocks of an order management device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the physical configuration of the order management device according to the present embodiment.
FIG. 3 is a conceptual diagram of the plurality of estimation models of the order management device according to the present embodiment.
FIG. 4 is a diagram showing the total cost achieved by the order management device according to the present embodiment and the total cost of a comparative example when the number of articles is two.
FIG. 5 is a diagram showing changes over time in the stock quantities of the plurality of articles managed by the order management device according to the present embodiment, together with the order timing.
FIG. 6 is a diagram showing the total cost achieved by the order management device according to the present embodiment and the total cost of a comparative example when the number of articles is ten.
FIG. 7 is a flowchart of processing executed by the order management device according to the present embodiment.

An embodiment of the present invention will be described with reference to the accompanying drawings. In the drawings, components denoted by the same reference numerals have the same or similar configurations.

FIG. 1 is a diagram showing functional blocks of an order management device 10 according to an embodiment of the present invention. The order management device 10 includes an acquisition unit 11, a first estimation unit 12, a storage unit 13, a selection unit 14, a reward calculation unit 15, a second estimation unit 16, and an update unit 17.

The acquisition unit 11 acquires, from a management terminal 21, the stock quantities of a plurality of articles at a predetermined time point. In the present embodiment, the plurality of articles are assumed to be packed in containers, transported from a remote location, and stored in a warehouse 20. The plurality of articles need not be stored in the warehouse 20, however, and may be stored at any location.

The acquisition unit 11 may acquire the number of articles in transit at the predetermined time point. The number of articles in transit is the number of articles that have been ordered but for which the period required for transportation has not yet elapsed since the order was placed.

The first estimation unit 12 estimates, using a plurality of estimation models 13a set for the plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined quantity of the plurality of articles. Hereinafter, the stock quantity of article $d$ ($d = 1, \dots, N$) at a predetermined time point $t$ is denoted $I_{d,t}$, the state is denoted $s_t$, the action of ordering a predetermined quantity of article $d$ is denoted $a_d$, and the first value of the evaluation function for article $d$ is denoted $Q_d(s_t, a_d)$.

The first estimation unit 12 may estimate the first value $Q_d(s_t, a_d)$ of the evaluation function for each action $a_d$ that is possible in the state $s_t$, using a plurality of estimation models 13a each composed of a neural network. The plurality of estimation models 13a may be, for example, fully connected neural networks having a plurality of hidden layers, but may be other models.

The state $s_t$ may include the number $OO_{d,t}$ of articles $d$ in transit at the predetermined time point $t$. Including $OO_{d,t}$ in the state $s_t$ makes it possible to estimate the values of the evaluation functions in consideration of not only the stock quantities of the plurality of articles but also the number of articles in transit, so that the plurality of articles can be ordered more efficiently.

The state $s_t$ may include an estimate $I_{d,t,t+LT}$ of the stock quantity of article $d$ after the period $LT$ required for transportation has elapsed from the predetermined time point $t$. Including $I_{d,t,t+LT}$ in the state $s_t$ makes it possible to estimate the values of the evaluation functions in consideration of not only the current stock quantities of the plurality of articles but also their future stock quantities, so that the plurality of articles can be ordered more efficiently.

The state $s_t$ may include an estimate $f_{d,t}^{t:LT}$ of the demand for article $d$ from the predetermined time point $t$ until the period $LT$ required for transportation has elapsed. The estimate $f_{d,t}^{t:LT}$ may be an estimate of the total demand for article $d$ over the period $[t, t+LT]$. Including $f_{d,t}^{t:LT}$ in the state $s_t$ makes it possible to estimate the values of the evaluation functions in consideration of not only the stock quantities of the plurality of articles but also the demand expected to arise before the articles in transit arrive, so that the plurality of articles can be ordered more efficiently.

The state $s_t$ may also include an estimate $f_{d,t}^{t+LT:M}$ of the demand for article $d$ over a predetermined period $M$ that begins when the period $LT$ required for transportation has elapsed from the predetermined time point $t$. The estimate $f_{d,t}^{t+LT:M}$ may be an estimate of the total demand for article $d$ over the period $[t+LT, t+LT+M]$. Including $f_{d,t}^{t+LT:M}$ in the state $s_t$ makes it possible to estimate the values of the evaluation functions in consideration of not only the stock quantities of the plurality of articles but also the demand expected to arise after the articles in transit arrive, so that the plurality of articles can be ordered more efficiently.

The first estimation unit 12 may use, as the state $s_t$, a feature obtained by transforming the following quantities with a conversion model: the stock quantity $I_{d,t}$ of article $d$, the number $OO_{d,t}$ of articles $d$ in transit, the estimate $I_{d,t,t+LT}$ of the stock quantity of article $d$ at time $t+LT$, the estimate $f_{d,t}^{t:LT}$ of the demand for article $d$ from $t$ to $t+LT$, and the estimate $f_{d,t}^{t+LT:M}$ of the demand for article $d$ from $t+LT$ to $t+LT+M$. The conversion model may be, for example, a fully connected neural network, but may be another model.
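The following sketch illustrates how the five per-article quantities could be gathered into one flat input vector before it is passed to the conversion model. The conversion model itself is omitted, and the class name, field names, and sample numbers are hypothetical.

```python
from dataclasses import dataclass, astuple


@dataclass
class ArticleInputs:
    stock: float               # I_{d,t}: stock quantity at time t
    in_transit: float          # OO_{d,t}: ordered but not yet arrived
    stock_after_lead: float    # I_{d,t,t+LT}: estimated stock at t + LT
    demand_during_lead: float  # f_{d,t}^{t:LT}: estimated demand over [t, t+LT]
    demand_after_lead: float   # f_{d,t}^{t+LT:M}: estimated demand over [t+LT, t+LT+M]


def build_raw_state(articles):
    """Concatenate the five quantities of every article into one flat vector."""
    state = []
    for article in articles:
        state.extend(astuple(article))
    return state


# Two hypothetical articles; the result would be the input of the conversion model.
raw_state = build_raw_state([
    ArticleInputs(40, 20, 35, 25, 30),
    ArticleInputs(5, 0, 0, 8, 12),
])
```

Because every article contributes the same five fields in the same order, the vector length is five times the number of articles, which keeps the input layout of a fully connected conversion model fixed.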

The storage unit 13 stores the plurality of estimation models 13a. The plurality of estimation models 13a may be set for each of the plurality of articles.

The selection unit 14 selects the action $a_d$ based on the first value $Q_d(s_t, a_d)$ of the evaluation function. The selection unit 14 may select the action $a_d$ at random with a predetermined probability $\varepsilon$ ($0 \le \varepsilon \le 1$), and select the action $a_d$ that maximizes the first value $Q_d(s_t, a_d)$ with the probability $1 - \varepsilon$ obtained by subtracting the predetermined probability from 1. That is, with probability $1 - \varepsilon$, the selection unit 14 may select the action by $a_d = \operatorname{argmax}_a Q_d(s_t, a)$. In this way, orders can be placed more efficiently while balancing the exploration of new actions against the selection of the action that is empirically best.
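The selection rule can be sketched as follows. The sketch assumes that the random draw is made independently for each article and that actions are indexed from 0; neither detail is stated in the text, and the function names are hypothetical.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick an action index for one article from its estimated first values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among the candidate actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest estimated first value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def select_order_actions(q_per_article, epsilon, rng=random):
    """Apply the rule to every article d (assumption: one draw per article)."""
    return [select_action(q, epsilon, rng) for q in q_per_article]


# With epsilon = 0 the choice is purely greedy.
greedy = select_order_actions([[0.1, 0.9, 0.3], [2.0, -1.0]], epsilon=0.0)
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible, which is convenient when comparing training runs.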

The reward calculation unit 15 calculates the reward $r_d$ for each of the plurality of articles based on the storage cost of the plurality of articles and the transportation cost of the ordered articles when the action $a_d$ is performed. When the storage cost of article $d$ at the predetermined time point $t$ is denoted $C^{\mathrm{hold}}_{d,t}$, the transportation cost at the predetermined time point $t$ is denoted $C^{\mathrm{trans}}_{t}$, and the reward for article $d$ is denoted $r_d$, the reward may be $r_d = -\left(C^{\mathrm{hold}}_{d,t} + C^{\mathrm{trans}}_{t} / N\right)$. Here, $N$ is the number of the plurality of articles (the number of article types). That is, the reward calculation unit 15 may calculate the reward for each of the plurality of articles from the sum of the storage cost and the value obtained by dividing the transportation cost by the number of ordered articles.

The transportation cost is determined by the number of containers and may not depend on how many articles each container holds. Therefore, when articles are grouped on pallets and transported in containers, combining several types of articles in one container can cost less than shipping each type in its own container. The reward calculation unit 15 according to the present embodiment distributes the transportation cost across the rewards for the plurality of articles, so that rewards are given in a way that encourages the plurality of articles to be ordered at the same time. Even when a plurality of articles are loaded into a container and transported, the plurality of articles can thus be ordered in a way that keeps costs down.
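A small worked example shows why consolidation pays off when the cost depends only on the number of containers. The cost model (a fixed price per container and a fixed pallet capacity) and all numbers are assumptions chosen for illustration.

```python
import math


def transport_cost(pallets, pallets_per_container, cost_per_container):
    """Cost that depends only on how many containers are needed."""
    containers = math.ceil(pallets / pallets_per_container)
    return containers * cost_per_container


# Hypothetical figures: 20 pallets fit in a container that costs 1000.0 to ship.
separate = transport_cost(6, 20, 1000.0) + transport_cost(8, 20, 1000.0)
combined = transport_cost(6 + 8, 20, 1000.0)
```

Shipping 6 and 8 pallets separately needs two containers, while the combined 14 pallets fit in one, so the combined order costs half as much under these assumptions.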

The reward calculation unit 15 may calculate the reward for each of the plurality of articles from the sum of the storage cost, the value obtained by dividing the transportation cost by the number of ordered articles, and a penalty cost incurred when any of the plurality of articles is out of stock. When the penalty cost incurred when article $d$ is out of stock at the predetermined time point $t$ is denoted $C^{\mathrm{pel}}_{d,t}$, the reward may be $r_d = -\left(C^{\mathrm{hold}}_{d,t} + C^{\mathrm{pel}}_{d,t} + C^{\mathrm{trans}}_{t} / N\right)$. In this way, rewards can be given so that the plurality of articles do not go out of stock, and the plurality of articles can be ordered so that the probability of their stock running out is reduced.
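The per-article reward can be sketched directly from the formula. The function and argument names are hypothetical, and the sketch takes $N$ to be the number of article types, which is equal to the length of the storage cost list.

```python
def article_rewards(holding_costs, transport_cost, penalty_costs=None):
    """Return r_d = -(C_hold[d] + C_pel[d] + C_trans / N) for every article d.

    holding_costs: storage cost of each article at time t.
    transport_cost: total transportation cost at time t, shared by all articles.
    penalty_costs: stockout penalty of each article; omitted means no penalty term.
    """
    n = len(holding_costs)
    if penalty_costs is None:
        penalty_costs = [0.0] * n
    share = transport_cost / n  # equal share of the transportation cost
    return [-(h + p + share) for h, p in zip(holding_costs, penalty_costs)]


# Hypothetical costs for two articles; the second one is out of stock.
rewards = article_rewards([2.0, 3.0], 10.0, penalty_costs=[0.0, 5.0])
```

Every article carries the same share of the transportation cost whether or not it was ordered, so an order that fills an already paid container adds no transport penalty to the reward.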

The second estimation unit 16 estimates, using the plurality of estimation models 13a, second values $Q_d(s', a')$ of the plurality of evaluation functions for the actions $a'$ that can be taken in the state $s'$ after the action $a_d$ is performed. When the update unit 17 updates the parameters of the plurality of estimation models 13a, the second estimation unit 16 may estimate the second values $Q_d^{-}(s', a')$ of the plurality of evaluation functions using estimation models with the old parameters until the update has been performed a predetermined number of times.
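Keeping the old parameters for a fixed number of updates can be sketched as a frozen copy that is refreshed periodically. The class name, the dictionary representation of the parameters, and the refresh interval are hypothetical.

```python
import copy


class FrozenParameters:
    """Holds the old parameters used when estimating the second values."""

    def __init__(self, online_params, sync_every):
        self.sync_every = sync_every
        self.updates = 0
        self.frozen = copy.deepcopy(online_params)

    def after_update(self, online_params):
        """Call once per parameter update; refresh the copy every sync_every calls."""
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.frozen = copy.deepcopy(online_params)


# Hypothetical single-weight model that is refreshed every 3 updates.
params = {"w": 0.0}
holder = FrozenParameters(params, sync_every=3)
history = []
for _ in range(4):
    params["w"] += 1.0            # stand-in for one gradient step
    holder.after_update(params)
    history.append(holder.frozen["w"])
```

The deep copy matters: without it the frozen parameters would alias the live ones and change on every update.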

The update unit 17 updates the parameters of the plurality of estimation models 13a so as to reduce the difference between the first value of the evaluation function and the sum of the reward and the second value of the evaluation function multiplied by the discount rate. Specifically, when the discount rate is denoted $\gamma$ and $y_d = r_d + \gamma \max_{a'} Q_d^{-}(s', a')$, the update unit 17 may update the parameters of the plurality of estimation models 13a so as to minimize $\mathbb{E}_{(s, a_d, r_d, s') \sim D}\left[L(y_d, Q_d(s, a_d))\right]$. Here, $L(y_d, Q_d(s, a_d))$ is a loss function that evaluates the difference between $y_d$ and $Q_d(s, a_d)$, and $\mathbb{E}_{(s, a_d, r_d, s') \sim D}[\cdot]$ denotes the expected value over the states $s$, actions $a_d$, rewards $r_d$, and post-action states $s'$ recorded in the past for the plurality of articles. The update unit 17 may, for example, take the partial derivatives of the loss function with respect to the parameters of the plurality of estimation models 13a and update those parameters by backpropagation.
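The target computation can be sketched as follows, reading the target for each article as its reward plus the discounted largest second value over the actions available in the next state. That reading, the function names, and the sample numbers are assumptions; the gradient step itself is left out.

```python
def td_targets(rewards, next_q_old_params, gamma):
    """Return y_d = r_d + gamma * (largest second value of article d) for every d.

    next_q_old_params[d] lists the second values of article d for each action a'
    in the post-action state s', estimated with the old parameters.
    """
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_old_params)]


def td_errors(targets, q_taken):
    """Differences y_d - Q_d(s, a_d) that the parameter update tries to shrink."""
    return [y - q for y, q in zip(targets, q_taken)]


# Hypothetical values for two articles with two candidate actions each.
targets = td_targets([-7.0, -13.0], [[-1.0, -2.0], [-4.0, -3.0]], gamma=0.5)
errors = td_errors(targets, [-8.0, -14.0])
```

In a training loop these errors would be averaged over a batch of recorded transitions, fed to the loss function, and backpropagated through the estimation models.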

According to the order management device 10 of the present embodiment, the reinforcement learning reward is calculated for each of the plurality of articles, and the parameters of the plurality of estimation models 13a set for the plurality of articles are updated accordingly. This makes it possible to estimate more appropriate values of the evaluation functions and to order the plurality of articles more efficiently.

The update unit 17 may update the parameters of the plurality of estimation models 13a so as to reduce the expected value, over the states, actions, rewards, and post-action states recorded in the past for the plurality of articles, of the square of the difference between the first value of the evaluation function and the sum of the reward and the second value of the evaluation function multiplied by the discount rate. That is, the loss function may satisfy $L(y_d, Q_d(s, a_d)) \propto \sum_{d=1}^{N} \left(y_d - Q_d(s, a_d)\right)^2$. Updating the parameters of the plurality of estimation models 13a so as to reduce the expected squared error over the recorded states, actions, rewards, and post-action states suppresses instability during the parameter update and allows more appropriate values of the evaluation functions to be estimated.

The update unit 17 may also update the parameters so as to reduce the expected value, over the states, actions, rewards, and post-action states recorded in the past for the plurality of articles, of a Huber loss function of the difference between the first value and the sum of the reward and the second value multiplied by the discount rate. Here, the Huber loss function is proportional to $\sum_{d=1}^{N} \left(y_d - Q_d(s, a_d)\right)^2$ when $\sum_{d=1}^{N} \left(y_d - Q_d(s, a_d)\right)^2 \le \delta^2$ (where $\delta$ is a predetermined parameter), and proportional to $\sum_{d=1}^{N} \left|y_d - Q_d(s, a_d)\right|$ when $\sum_{d=1}^{N} \left(y_d - Q_d(s, a_d)\right)^2 > \delta^2$. This suppresses instability caused by outliers when the parameters of the estimation models are updated and allows more appropriate values of the evaluation functions to be estimated.
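The squared-error loss and the Huber-type loss described above can be sketched as follows. The text only fixes the losses up to proportionality, so the factor 0.5, the factor `delta`, and the constant offset in the linear branch are assumptions borrowed from the commonly used form of the Huber loss, chosen so that the two branches have comparable scale. The function names are hypothetical.

```python
def squared_loss(targets, q_values):
    """Sum over articles of (y_d - Q_d(s, a_d))^2."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values))


def huber_type_loss(targets, q_values, delta):
    """Quadratic when the summed squared error is within delta^2, linear otherwise."""
    sq = squared_loss(targets, q_values)
    if sq <= delta ** 2:
        return 0.5 * sq                      # assumed proportionality constant
    abs_sum = sum(abs(y - q) for y, q in zip(targets, q_values))
    return delta * abs_sum - 0.5 * delta ** 2  # assumed scale and offset


small = huber_type_loss([1.0, 2.0], [1.0, 1.0], delta=2.0)   # quadratic branch
large = huber_type_loss([1.0, 2.0], [1.0, 1.0], delta=0.5)   # linear branch
```

In the linear branch the loss grows with the absolute error rather than its square, which is what limits the influence of outliers on the update.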

FIG. 2 is a diagram showing the physical configuration of the order management device 10 according to the present embodiment. The order management device 10 includes a CPU (Central Processing Unit) 10a corresponding to a calculation unit, a RAM (Random Access Memory) 10b corresponding to a storage unit, a ROM (Read Only Memory) 10c corresponding to a storage unit, a communication unit 10d, an input unit 10e, and a display unit 10f. These components are connected to one another via a bus so that they can exchange data. Although this example describes the case where the order management device 10 is a single computer, the order management device 10 may be realized by combining a plurality of computers. The configuration shown in FIG. 2 is an example; the order management device 10 may have other components or may lack some of these components.

The CPU 10a is a control unit that controls the execution of programs stored in the RAM 10b or the ROM 10c and that calculates and processes data. The CPU 10a is a calculation unit that executes a program (order management program) that applies reinforcement learning to manage the order quantities of a plurality of articles. The CPU 10a receives various data from the input unit 10e and the communication unit 10d, and displays the calculation results on the display unit 10f or stores them in the RAM 10b.

The RAM 10b is the rewritable part of the storage unit and may be composed of, for example, a semiconductor memory element. The RAM 10b may store data such as the program executed by the CPU 10a and the stock quantities of the plurality of articles. These are examples; the RAM 10b may store other data or may omit some of these.

The ROM 10c is the readable part of the storage unit and may be composed of, for example, a semiconductor memory element. The ROM 10c may store, for example, the order management program and data that is not rewritten.

The communication unit 10d is an interface that connects the order management device 10 to other devices. The communication unit 10d may be connected to a communication network N such as the Internet.

The input unit 10e accepts data input from a user and may include, for example, a keyboard and a touch panel.

The display unit 10f visually displays the calculation results of the CPU 10a and may be composed of, for example, an LCD (Liquid Crystal Display). The display unit 10f may display changes over time in the stock quantities and order quantities of the plurality of articles.

The order management program may be provided stored in a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via a communication network connected through the communication unit 10d. In the order management device 10, the CPU 10a executes the order management program to realize the various operations described with reference to FIG. 1. These physical components are examples and need not be independent components. For example, the order management device 10 may include an LSI (Large-Scale Integration) in which the CPU 10a is integrated with the RAM 10b and the ROM 10c.

FIG. 3 is a conceptual diagram of the plurality of estimation models 13a of the order management device 10 according to the present embodiment. Each of the plurality of estimation models 13a includes a first model 13b that estimates a state value for a state and a second model 13c that estimates an advantage function of an action in that state. In the following, the state is denoted by $s$, the state value for article $d$ by $V_d(s)$, and the advantage function for article $d$ by $A_d(s, a_d)$.

The plurality of estimation models 13a estimate the state value $V_d(s)$ for the state $s$ with the first model 13b set for each of the plurality of articles. Here, the state $s$ may be a feature (shared representation) obtained by transforming the following quantities with a conversion model: the stock quantity $I_{d,t}$ of article $d$, the number $OO_{d,t}$ of units of article $d$ in transit, the estimated stock quantity $I_{d,t,t+LT}$ of article $d$ at time $t + LT$, the estimated demand $f_{d,t}^{t:LT}$ for article $d$ from $t$ to $t + LT$, and the estimated demand $f_{d,t}^{t+LT:M}$ for article $d$ from $t + LT$ to $t + LT + M$.

The plurality of estimation models 13a also estimate the advantage function $A_d(s, a_d)$ of the action $a_d$ in the state $s$ with the second model 13c set for each of the plurality of articles.

The plurality of estimation models 13a may then estimate the value of the evaluation function as the sum of the state value $V_d(s)$ and the advantage function $A_d(s, a_d)$ estimated for each of the plurality of articles. That is, the value $Q_d(s, a_d)$ of the evaluation function may be estimated by $Q_d(s, a_d) = V_d(s) + A_d(s, a_d)$. Estimating the part of the evaluation function that depends only on the state with the first model 13b, and the part that depends on both the state and the action with the second model 13c, makes it possible to estimate more appropriate values of the evaluation function and to order the plurality of articles more efficiently.
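The combination of the two model outputs can be illustrated with a short sketch. The state values and advantages below are placeholder numbers standing in for the outputs of the first and second models; the function names are hypothetical.

```python
def q_values(state_value, advantages):
    """Q_d(s, a) = V_d(s) + A_d(s, a) for every possible action a of one article."""
    return [state_value + a for a in advantages]


def estimate_all(state_values, advantages_per_article):
    """Apply the sum article by article."""
    return [q_values(v, adv) for v, adv in zip(state_values, advantages_per_article)]


# Two articles, four possible order quantities each (illustrative numbers only).
state_values = [-2.0, -1.0]
advantages = [[0.0, 0.5, -0.5, -1.0], [0.25, 0.0, -0.25, -0.5]]
q = estimate_all(state_values, advantages)
```

Because $V_d(s)$ is shared by all actions of an article, only the advantage term changes which action has the largest value.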

As shown in FIG. 3, the reward calculation unit 15 according to the present embodiment distributes a reward that arises regardless of the type of article (global reward) among the plurality of articles when calculating the reward for the action of ordering each article. For example, the reward calculation unit 15 may calculate the reward $r_d$ for article $d$ by $r_d = -\left(C^{\mathrm{hold}}_{d,t} + C^{\mathrm{pel}}_{d,t} + C^{\mathrm{trans}}_{t} / N\right)$. Here, $C^{\mathrm{trans}}_{t} / N$ is the reward distributed among the plurality of articles.
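A minimal sketch of this per-article reward follows, assuming the three cost terms for the current step are already known. The function name and the sample costs are hypothetical.

```python
def article_rewards(holding_costs, penalty_costs, transport_cost):
    """r_d = -(C_hold_d + C_pel_d + C_trans / N) for each of the N articles."""
    n = len(holding_costs)
    shared = transport_cost / n   # global transport cost split evenly over articles
    return [-(h + p + shared) for h, p in zip(holding_costs, penalty_costs)]


# Two articles: the second one incurs a penalty cost (illustrative numbers only).
rewards = article_rewards(
    holding_costs=[0.5, 0.25],
    penalty_costs=[0.0, 1.0],
    transport_cost=1.0,
)
```

The article-specific terms differ between articles, while the shared term is the same for every article in a given step.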

FIG. 4 is a diagram showing the total cost achieved by the order management device 10 according to the present embodiment and the total cost of comparative examples when the number of articles is two. The total cost is expressed as a value with the sign of the reward made negative; the closer it is to 0, the better the performance. In the figure, the horizontal axis shows the number of reinforcement learning episodes and the vertical axis shows the total cost.

For the case where the expected demand for two types of articles increases linearly with time, the first graph G1 in the figure shows the transition of the total cost (reward) when actions are selected based on the evaluation functions estimated by the plurality of estimation models 13a composed of the first model 13b and the second model 13c, and the transportation cost of the articles is distributed among the rewards for the plurality of articles. In the simulation, the stationary part of the demand for the two types of articles fluctuates according to a Gaussian distribution with predetermined parameters. The second graph G2 shows the transition of the total cost (reward) when actions are selected based on the evaluation functions estimated by a plurality of estimation models 13a that each consist of a single model and do not include the first model 13b and the second model 13c, with the transportation cost likewise distributed among the rewards for the plurality of articles. As a comparative example, the third graph G3 shows the transition of the total cost (reward) for the method using reinforcement learning proposed in Non-Patent Document 3. As a further comparative example, the total cost (reward) for the method without reinforcement learning proposed in Non-Patent Document 2 is shown as the reference value Ref.

The first graph G1 (solid line) and the second graph G2 (broken line) confirm that, for ordering two types of articles, the order management device 10 according to the present embodiment achieves a total cost closer to 0 than the reference value Ref (a reward larger than the reference value Ref). In contrast, the third graph G3, shown as a comparative example, not only yields a total cost more negative than the reference value Ref, but its learning is also unstable and the total cost does not converge. Thus, with the order management device 10 according to the present embodiment, reinforcement learning can proceed stably even when the demand for the plurality of articles changes over time, and the plurality of articles can be ordered more efficiently.

FIG. 5 is a diagram showing the change over time in the stock quantities of the plurality of articles managed by the order management device 10 according to the present embodiment, together with the order timing. In the figure, the stock quantity I1 of the first article (Product1) and the stock quantity I2 of the second article (Product2) are shown by solid lines, the order quantity O1 of the first article and the order quantity O2 of the second article by broken lines, and the demand d1 for the first article and the demand d2 for the second article by dash-dot lines. The total order quantity (total order) T is also shown by a broken line. The horizontal axis is the number of simulation steps (Step), which corresponds to time. The demand d1 for the first article and the demand d2 for the second article are not quantities that the reinforcement learning agent can observe directly; they are generated for the simulation so as to follow a normal distribution whose mean increases over time.
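Demand of this kind can be generated as in the following sketch. The base mean, slope, standard deviation, and seed are hypothetical parameters that the text does not specify, and clipping negative samples to zero is an added assumption.

```python
import random


def generate_demand(num_steps, base_mean, slope, std_dev, seed=0):
    """Sample demand per step from a normal distribution whose mean grows linearly."""
    rng = random.Random(seed)
    demand = []
    for t in range(num_steps):
        mean = base_mean + slope * t
        # Assumption: demand cannot be negative, so samples are clipped at zero.
        demand.append(max(0.0, rng.gauss(mean, std_dev)))
    return demand


d1 = generate_demand(num_steps=100, base_mean=1.0, slope=0.01, std_dev=0.3, seed=1)
d2 = generate_demand(num_steps=100, base_mean=0.5, slope=0.02, std_dev=0.2, seed=2)
```

The agent would not receive `d1` or `d2` directly; it would see only their effect on the stock quantities and whatever demand estimates are included in the state.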

The results shown in the figure were obtained by simulating a setting in which the expected demand for two types of articles increases linearly with time and the articles are transported in a container that can hold up to 20 pallets. The period required for transportation is 3 steps, and the possible actions are ordering 0 pallets (no order), 1 pallet, 2 pallets, or 3 pallets. The storage cost is set to 0.02, the stockout cost to 1.0, and the transportation cost to 1.0.

The figure confirms that the first article and the second article are ordered appropriately so that the stock quantity I1 of the first article and the stock quantity I2 of the second article do not reach 0. In addition, the order quantity O1 of the first article and the order quantity O2 of the second article often rise at the same time, which confirms that the articles are ordered together and the transportation cost is kept down.

FIG. 6 is a diagram showing the total cost achieved by the order management device 10 according to the present embodiment and the total cost of comparative examples when the number of articles is 10. The total cost is expressed as a value with the sign of the reward made negative; the closer it is to 0, the better the performance. In the figure, the horizontal axis shows the number of reinforcement learning episodes and the vertical axis shows the total cost.

For the case where the expected demand for 10 types of articles increases linearly with time, the fifth graph G5 in the figure shows the transition of the total cost (reward) when actions are selected based on the evaluation functions estimated by the plurality of estimation models 13a composed of the first model 13b and the second model 13c, and the transportation cost of the articles is distributed among the rewards for the plurality of articles. In the simulation, the stationary part of the demand for the 10 types of articles fluctuates according to a Gaussian distribution with predetermined parameters. The sixth graph G6 shows the transition of the total cost (reward) when actions are selected based on the evaluation functions estimated by a plurality of estimation models 13a that each consist of a single model and do not include the first model 13b and the second model 13c, with the transportation cost likewise distributed among the rewards for the plurality of articles. As a comparative example, the seventh graph G7 shows the transition of the total cost (reward) for the method using reinforcement learning proposed in Non-Patent Document 2. As a further comparative example, the total cost (reward) for the method without reinforcement learning proposed in Non-Patent Document 3 is shown as the reference value Ref.

The fifth graph G5 (solid line) and the sixth graph G6 (broken line) confirm that, for ordering 10 types of articles, the order management device 10 according to the present embodiment achieves a total cost closer to 0 than the reference value Ref (a reward larger than the reference value Ref). Because the sixth graph G6 shows some instability, the method that selects actions based on the evaluation functions estimated by the plurality of estimation models 13a composed of the first model 13b and the second model 13c, and that distributes the transportation cost of the articles among the rewards for the plurality of articles, is considered the best.

In contrast, the seventh graph G7, shown as a comparative example, not only yields a total cost more negative than the reference value Ref, but its learning is also unstable and the total cost does not converge. Thus, with the order management device 10 according to the present embodiment, reinforcement learning can proceed stably even when the demand for the plurality of articles changes over time, and the plurality of articles can be ordered more efficiently.

FIG. 7 is a flowchart of the processing executed by the order management device 10 according to the present embodiment. First, the order management device 10 acquires the stock quantities of the plurality of articles and the number of articles in transit (S10).

The order management device 10 also calculates the estimated stock quantity after the period required for transportation has elapsed, the estimated demand for the articles until the period required for transportation has elapsed, and the estimated demand for the articles from the end of the period required for transportation until a predetermined period has elapsed (S11). The order management device 10 uses the following as the reinforcement learning state: the stock quantities of the plurality of articles, the number of articles in transit, the estimated stock quantity after the period required for transportation, the estimated demand until the period required for transportation has elapsed, and the estimated demand from the end of that period until a predetermined period has elapsed.

The order management device 10 then estimates the first values of the evaluation functions for each of the plurality of articles with the plurality of estimation models (S12). For the state identified in steps S10 and S11, the order management device 10 estimates the values of the evaluation functions for the possible actions with the estimation models set for each of the plurality of articles.

The order management device 10 selects an action at random with a predetermined probability and, with the probability obtained by subtracting the predetermined probability from 1, selects the action that maximizes the first value of the evaluation function (S13). The order management device 10 may select the action by another method.
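The selection rule of step S13 corresponds to what is commonly called epsilon-greedy selection and can be sketched as follows for one article. The function name and the parameter `epsilon` (the predetermined probability) are hypothetical.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest first value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# First values for ordering 0, 1, 2, or 3 pallets (illustrative numbers only).
first_values = [-3.0, -1.0, -2.0, -2.5]
greedy_action = select_action(first_values, epsilon=0.0)
explored_action = select_action(first_values, epsilon=1.0, rng=random.Random(0))
```

With `epsilon=0.0` the choice is always greedy, and with `epsilon=1.0` it is always random, so intermediate values trade off exploration against exploitation.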

The order management device 10 calculates the reward for each of the plurality of articles from the sum of the storage cost, the value obtained by dividing the transportation cost of the ordered articles by the number of articles, and the penalty cost incurred when articles are out of stock (S14).

For the state reached after the selected action is performed, the order management device 10 estimates the second values of the evaluation functions for the possible actions with the plurality of estimation models (S15). The order management device 10 then updates the parameters of the plurality of estimation models so as to reduce the difference between the first value of the evaluation function and the sum of the reward and the second value of the evaluation function multiplied by the discount rate (S16).

If the processing is not to be ended (S17: NO), the order management device 10 repeats steps S10 to S16 to perform reinforcement learning. The condition for ending the processing may be, for example, that the values of the loss functions of the plurality of estimation models remain at or below a predetermined value over a predetermined period, or that the number of reinforcement learning episodes reaches a predetermined number.
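The two example termination conditions for step S17 can be sketched as a single check. The function name, the use of a list of recent loss values, and the parameter names are hypothetical.

```python
def should_stop(loss_history, episode, loss_threshold, window, max_episodes):
    """Stop when the episode budget is reached or the loss has stayed low for `window` steps."""
    if episode >= max_episodes:
        return True
    recent = loss_history[-window:]
    return len(recent) == window and all(loss <= loss_threshold for loss in recent)


# The last two recorded losses are below the threshold, so learning may end.
done = should_stop(
    loss_history=[0.5, 0.01, 0.02],
    episode=3,
    loss_threshold=0.05,
    window=2,
    max_episodes=100,
)
```

Requiring the loss to stay low over a whole window, rather than at a single step, avoids stopping on a momentary dip.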

The embodiments described above are intended to facilitate understanding of the present invention and are not intended to limit its interpretation. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and may be changed as appropriate. Configurations shown in different embodiments may also be partially replaced or combined with one another.

10: order management device, 10a: CPU, 10b: RAM, 10c: ROM, 10d: communication unit, 10e: input unit, 10f: display unit, 11: acquisition unit, 12: first estimation unit, 13: storage unit, 13a: plurality of estimation models, 14: selection unit, 15: reward calculation unit, 16: second estimation unit, 17: update unit

Claims

An order management device comprising:
a first estimation unit that estimates, by a plurality of estimation models set for a plurality of articles, first values of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined number of the plurality of articles;
a selection unit that selects the action based on the first values;
a reward calculation unit that calculates, based on a storage cost of the plurality of articles and a transportation cost of the ordered articles when the action is performed, a reward unique to each of the plurality of articles and a reward independent of the type of the plurality of articles;
a second estimation unit that estimates, by the plurality of estimation models, second values of the plurality of evaluation functions for the possible actions in the state after the action is performed; and
an update unit that updates parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the second value multiplied by a discount rate, the reward unique to each of the plurality of articles, and the reward independent of the type of the plurality of articles.

The reward calculation unit calculates the reward unique to each of the plurality of articles from the storage cost, and calculates the reward independent of the type of the plurality of articles from a value obtained by dividing the transportation cost by the number of ordered articles,
The order management device according to claim 1.

The reward calculation unit calculates a reward unique to each of the plurality of articles from the storage cost, calculates the reward independent of the type of the plurality of articles from a value obtained by dividing the transportation cost by the number of ordered articles, and calculates a reward unique to each of the plurality of articles from a penalty cost incurred when the plurality of articles are out of stock,
The order management device according to claim 1.

The selecting unit randomly selects the action with a predetermined probability, and selects the action in which the first value is maximized at a probability obtained by subtracting the predetermined probability from 1;
The order management device according to claim 1.

The plurality of estimation models each include a first model that estimates a state value related to the state, and a second model that estimates an advantage function of the action in the state.
The order management device according to claim 1.

The update unit updates the parameters so as to reduce the expected value, over the states, the actions, the rewards unique to each of the plurality of articles, the rewards independent of the type of the plurality of articles, and the states after the actions recorded in the past for the plurality of articles, of the square of the difference between the first value and the sum of the second value multiplied by the discount rate, the reward unique to each of the plurality of articles, and the reward independent of the type of the plurality of articles,
The order management device according to claim 1.

The update unit updates the parameters so as to reduce the expected value, over the states, the actions, the rewards unique to each of the plurality of articles, the rewards independent of the type of the plurality of articles, and the states after the actions recorded in the past for the plurality of articles, of a Huber loss function of the difference between the first value and the sum of the second value multiplied by the discount rate, the reward unique to each of the plurality of articles, and the reward independent of the type of the plurality of articles,
The order management device according to claim 1.

The state includes the number of items in transit at the predetermined time,
The order management device according to any one of claims 1 to 7.

The state includes an estimated value of the stock quantity after a period required for transportation has elapsed from the predetermined time point,
The order management device according to any one of claims 1 to 8.

The state includes an estimated value of the demand quantity of the articles from the predetermined time point until the period required for transportation has elapsed,
The order management device according to any one of claims 1 to 9.

The state includes an estimated value of the demand quantity of the articles during a predetermined period that begins after the period required for transportation has elapsed from the predetermined time point,
The order management device according to any one of claims 1 to 10.
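
The state components named in the preceding dependent claims can be gathered into one feature vector per article. The sketch below is illustrative only. The class name, the field names, and in particular the formula for the projected stock (current stock plus in-transit quantity minus forecast demand, floored at zero) are assumptions and are not taken from the claims.

```python
from dataclasses import dataclass


@dataclass
class ItemState:
    stock: int                      # stock quantity at the time point
    in_transit: int                 # quantity currently in transit
    demand_during_lead_time: float  # forecast demand until delivery
    demand_after_lead_time: float   # forecast demand in the period after

    def projected_stock(self) -> float:
        # Assumed estimate of the stock once the lead time has elapsed.
        projected = self.stock + self.in_transit - self.demand_during_lead_time
        return max(0.0, projected)

    def as_features(self) -> list:
        return [
            float(self.stock),
            float(self.in_transit),
            self.projected_stock(),
            self.demand_during_lead_time,
            self.demand_after_lead_time,
        ]
```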

An order management method in which a calculation unit provided in an order management device performs:
Estimating, by a plurality of estimation models set for a plurality of articles, a first value of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined number of the plurality of articles;
Selecting the action based on the first value;
Calculating a reward unique to each of the plurality of articles and a reward independent of the type of the plurality of articles, based on a storage cost of the plurality of articles and a transport cost of the ordered articles when the action is performed;
Estimating, by the plurality of estimation models, a second value of the plurality of evaluation functions for the possible actions in the state after performing the action; and
Updating the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the second value multiplied by a discount rate, the reward unique to each of the plurality of articles, and the reward independent of the type of the plurality of articles.
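
The reward and update steps of the method can be sketched with one lookup table per article standing in for each estimation model. This is a simplified illustration, not the claimed implementation: the tabular representation, the toy cost model (a per-unit storage cost and a flat transport cost shared by all articles), the use of the maximum over next actions as the second value, and all names and default values are assumptions.

```python
from collections import defaultdict

ACTIONS = (0, 1, 2)  # hypothetical candidate order quantities per article


def compute_rewards(stocks_after, orders, storage_cost=1.0, truck_cost=5.0):
    """Return (rewards unique to each article, reward shared by all)."""
    item_rewards = [-storage_cost * s for s in stocks_after]
    common_reward = -truck_cost if sum(orders) > 0 else 0.0
    return item_rewards, common_reward


def update(q_tables, states, orders, item_rewards, common_reward,
           next_states, discount=0.9, learning_rate=0.5):
    """Move each article's first value toward its target."""
    for i, q in enumerate(q_tables):
        first = q[(states[i], orders[i])]
        # Second value: best estimate over possible actions in the next state.
        second = max(q[(next_states[i], a)] for a in ACTIONS)
        target = item_rewards[i] + common_reward + discount * second
        q[(states[i], orders[i])] = first + learning_rate * (target - first)


# One step for two articles, starting from empty tables.
q_tables = [defaultdict(float) for _ in range(2)]
states, orders, next_states = (3, 0), (0, 2), (2, 1)
item_rewards, common_reward = compute_rewards(next_states, orders)
update(q_tables, states, orders, item_rewards, common_reward, next_states)
```

Because the transport cost enters every article's target through the shared reward, each table is pushed away from actions that trigger a shipment unless the article's own storage savings justify it.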

An order management program that causes a calculation unit provided in an order management device to function as:
A first estimation unit that estimates, by a plurality of estimation models set for a plurality of articles, a first value of a plurality of evaluation functions set for the plurality of articles, each evaluation function being a function of a state including the stock quantities of the plurality of articles at a predetermined time point and an action of ordering a predetermined number of the plurality of articles;
A selection unit that selects the action based on the first value;
A reward calculation unit that calculates a reward unique to each of the plurality of articles and a reward independent of the type of the plurality of articles, based on a storage cost of the plurality of articles and a transport cost of the ordered articles when the action is performed;
A second estimation unit that estimates, by the plurality of estimation models, a second value of the plurality of evaluation functions for the possible actions in the state after performing the action; and
An update unit that updates the parameters of the plurality of estimation models so as to reduce the difference between the first value and the sum of the second value multiplied by a discount rate, the reward unique to each of the plurality of articles, and the reward independent of the type of the plurality of articles.