JP2020187416A

JP2020187416A - Physical distribution management system

Info

Publication number: JP2020187416A
Application number: JP2019089900A
Authority: JP
Inventors: 和弘小池; Kazuhiro Koike
Original assignee: Askul Corp
Current assignee: Askul Corp
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2020-11-19
Anticipated expiration: 2039-05-10
Also published as: JP7308073B2

Abstract

To provide a technique that supports the overall optimization of physical distribution management. SOLUTION: A prediction device has: an acquisition unit that acquires product information on e-commerce products; and a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and environmental state variables for a plurality of values of a behavior variable allowed in an environment related to the physical distribution of the products, and that predicts an optimal value of the behavior variable in the environment using the third behavior model. SELECTED DRAWING: Figure 1

Description

The present invention relates to a physical distribution management system.

In logistics management of a supply chain, the so-called Bullwhip Effect (hereinafter "BE") is known. BE is a phenomenon in which, as a result of demand forecasting and decision making downstream in the supply chain, demand is amplified as it propagates from the downstream to the upstream of the supply chain. Because this phenomenon leads to excess inventory and stockouts, its mechanism and methods for suppressing it have been studied for many years. Factors that cause BE include price lists, ordering frequency, return policies, the frequency and depth of price promotions, the degree of information sharing, demand forecasting methods, and allocation rules during stockouts.

The Beer Game (hereinafter "BG") is known as one method for examining BE. The BG is a simulation game of a serially connected beer supply chain in which each of four players (retailer, wholesaler, distributor, and manufacturer) competes to minimize cost within a fixed period.

The BG process is known to be a Markov Decision Process (MDP). Because the only information each player can observe in the BG is the exchange of orders and goods with adjacent players and the player's own inventory level, it is a so-called Partially Observable Markov Decision Process (POMDP). In the BG, each player selects the action that minimizes cost based on the observable information. However, the BG has large observation and action spaces and deals with non-stationary time series, which makes it a complex problem. It has therefore been shown that applying deep reinforcement learning to the BG may make it possible to achieve overall optimization of the logistics process in a supply chain (Non-Patent Documents 1 to 3).

Non-Patent Document 1: V. Mnih et al. Human-level control through deep reinforcement learning, doi 10.1038/nature14236 (2015).
Non-Patent Document 2: Afshin Oroojlooyjadid et al. A Deep Q-Network for the Beer Game: A Reinforcement Learning Algorithm to Solve Inventory Optimization Problems, arXiv:1708.05924v2 (2018).
Non-Patent Document 3: Taiki Fuji et al. Deep Multi-Agent Reinforcement Learning using DNN-Weight Evolution to Optimize Supply Chain Performance, doi 10.24251/HICSS.2018.157 (2018).

In mail-order sales over the Internet (online shopping), the gap between the efficiency of information flow in cyberspace and the efficiency of goods flow in physical space is becoming pronounced, and this gap may become a new cause of BE. For example, highly efficient sales measures on an Electronic Commerce (EC) site may amplify demand fluctuations and result in delivery delays, stockouts, excess inventory, and the like. Consequently, even with the conventional techniques described above, it may not be possible to achieve overall optimization of the logistics process in a supply chain for online shopping, and no established method for achieving such overall optimization has been proposed.

The technology of the present disclosure was made in view of the above circumstances, and its object is to provide a technique that supports the overall optimization of physical distribution management.

The prediction device of the present disclosure has: an acquisition unit that acquires product information on e-commerce products; and a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards using the acquired product information and environmental state variables for a plurality of values of a behavior variable allowed in an environment related to the distribution of the products, and that predicts the optimal value of the behavior variable in the environment using the third behavior model. With this prediction device, a balanced behavior model that serves overall optimization can be generated from two behavior models that are given conflicting goals in product distribution management, and the optimal behavior variable can be predicted.

In the above prediction device, the behavior variable may be the quantity of the product ordered by the seller of the product; the product information acquired by the acquisition unit may be information indicating changes in the order quantity, shipment quantity, and receipt quantity of the product over a certain period; the environmental state variables may be the sales of the product, the purchase price of the product, the shortage cost related to stockouts of the product, the sales promotion cost related to promotion of the product, the inventory cost related to excess inventory of the product, and the delivery cost related to delivery of the product; and the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the environmental state variables for each order quantity corresponding to the plurality of values of the behavior variable, and predict the optimal order quantity using the third behavior model. As a result, the Bullwhip Effect Index (BEI) is controlled so as not to exceed a threshold and shipments are made so as to satisfy demand as far as possible, which balances the conflicting goals of minimizing inventory cost and maximizing sales and optimizes the inventory of the product.

In the above prediction device, the behavior variable may be the man-hours related to storing the product in a warehouse; the product information acquired by the acquisition unit may be information indicating the date on which a sales promotion of the product is carried out; the environmental state variable may be the sales promotion cost determined according to the day of the week of the promotion date; and the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the environmental state variable for each man-hour value corresponding to the plurality of values of the behavior variable, and predict the optimal man-hours using the third behavior model. As a result, shipments are made so as to satisfy the demand that fluctuates with sales promotion while the man-hours in the warehouse storing the product are kept low, which balances the conflicting goals of maximizing sales and minimizing shipping cost and optimizes the man-hours assigned to shipping the product on the promotion date set in the simulation.

In the above prediction device, the behavior variable may be a value indicating how readily the product is recommended on the EC site; the product information acquired by the acquisition unit may be information indicating the number of clicks on and the number of impressions of the product on the EC site over a certain period; the environmental state variables may be the size of the product, the sales of the product, the purchase price of the product, and the inventory cost determined according to the size of the product; and the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the environmental state variables for each recommendation value corresponding to the plurality of values of the behavior variable, and predict the optimal recommendation value using the third behavior model. As a result, products that do not sell because they are not recommended are distinguished from products that do not sell even when recommended, so that products more likely to sell are recommended, which optimizes product recommendation on the EC site.

In the above prediction device, the behavior variable may be the incentive given to a user for changing the designated delivery date of the product on the EC site; the product information acquired by the acquisition unit may be information indicating the delivery destination and the delivery date and time of the product, the delivery area of the carrier and the number of destinations it can serve, the time required to deliver the product, and the delivery distance; and the control unit may generate the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards using the acquired product information and the environmental state variables for each incentive corresponding to the plurality of values of the behavior variable, and predict the optimal incentive using the third behavior model. As a result, the two goals of minimizing delivery cost and minimizing incentive cost are balanced while the size of the incentive offered when prompting a user to change the delivery date of the product on the EC site is optimized.

According to the technology of the present disclosure, a technique that supports the overall optimization of physical distribution management can be provided.

FIG. 1 shows an example of a prediction device according to the first embodiment.
FIG. 2A shows the relationship between the players and the supply chain in the BG, and FIG. 2B shows the relationship between the players and the supply chain in the Netshop Game executed in the first embodiment.
FIG. 3 shows an example of the algorithm of the simulation executed in the first embodiment.
FIG. 4 shows an example of the environment parameters set in the first embodiment.
FIG. 5 shows an example of the Netshop Game rewards set in the first embodiment.
FIG. 6 shows an example of the update of each state variable executed in the first embodiment.
FIG. 7 shows an example of the goal conditions for the meta player in the first embodiment.
FIG. 8 is a graph showing an example of how the time steps and the reward change for the meta player in the first embodiment.
FIG. 9 is a graph showing an example of the transition of the reward earned by the cyber player in the first embodiment.
FIG. 10 is a graph showing an example of the transition of the reward earned by the physical player in the first embodiment.
FIG. 11 is a graph showing an example of the transition of the reward earned by the meta player in the first embodiment.
FIG. 12 shows an example of the simulation results of each player in the first embodiment.
FIG. 13A shows an example of the simulation reward in Modification 1, FIG. 13B shows an example of the update of the simulation state variables in Modification 1, and FIG. 13C shows an example of the simulation goal conditions in Modification 1.
FIG. 14A shows an example of the simulation reward in Modification 2, FIG. 14B shows an example of the update of the simulation state variables in Modification 2, and FIG. 14C shows an example of the simulation goal conditions in Modification 2.
FIG. 15A shows an example of the simulation reward in Modification 3, FIG. 15B shows an example of the update of the simulation state variables in Modification 3, and FIG. 15C shows an example of the simulation goal conditions in Modification 3.

Preferred embodiments of the technology of the present disclosure are described below with reference to the drawings. The configurations of the components described below should be changed as appropriate according to the configuration of the device to which the technology is applied and to various conditions. The following description is therefore not intended to limit the technical scope of the technology of the present disclosure.

(First Embodiment)
FIG. 1 is a diagram showing a schematic configuration of a prediction device according to the first embodiment. As shown in FIG. 1, the prediction device 1 includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and a communication unit 15.

In online shopping, processes that are completed in cyberspace, such as EC sites and online transactions, differ in character from processes carried out in physical space, such as distribution warehouses and delivery centers. In cyberspace, for example, 100 units of a product are merely numerical data, so increasing the number of units sold tenfold to 1,000 through aggressive sales measures is largely free of physical constraints. In physical space, on the other hand, 100 units of a product are objects with volume and weight, so increasing the number of units sold to 1,000 is strongly affected by physical constraints such as warehouse capacity and shipping capacity. In addition, in the BG described above there is a lead time for orders placed with upstream players in the supply chain, whereas in online shopping transactions the order lead time can be ignored. For these reasons, when dealing with problems in the logistics process of online shopping, the BG model described above may fail to yield a solution.

FIG. 2 shows the relationship between the players and the supply chain in the BG (FIG. 2A) and in the Netshop Game executed in the first embodiment (FIG. 2B). In the first embodiment, the prediction device 1 executes a simulation called the Netshop Game, in which the environment illustrated in FIG. 2B is set. As shown in FIG. 2A, in the BG the four players (retailer, wholesaler, distributor, and manufacturer) are connected in series. In the Netshop Game, by contrast, the retailer is the only player, and the three players located upstream of the retailer in the supply chain (wholesaler, distributor, and manufacturer) are merged into a single supplier. The first embodiment assumes that the Netshop Game handles one type of e-commerce product. Furthermore, the Netshop Game divides the retailer into a cyber player, which manages the processes that occur in cyberspace, and a physical player, which manages the processes that occur in physical space.

The cyber player acts so that the fill rate (Fill Rate, also referred to as FR), which indicates how fully customer demand was supplied without stockouts, becomes larger than a preset threshold. The cyber player aims to maximize reward in cyberspace. For example, the cyber player actively runs sales promotions such as sales on the EC site. The cyber player also ignores the cost incurred by growing inventory and acts to minimize the opportunity loss caused by stockouts.

The physical player acts so that the Bullwhip Effect Index (also referred to as BEI) becomes smaller than a preset threshold. The physical player aims to minimize logistics cost. The physical player ignores the opportunity loss caused by stockouts and, in order to minimize the costs related to warehousing and delivery, acts to keep inventory as low as possible and to keep BE from growing.

In the first embodiment, the environment shown in FIG. 2B is implemented in the prediction device 1 using OpenAI Gym, and the optimal actions for the cyber player and the physical player to achieve their respective objectives are learned and evaluated with a Deep Q-Network (DQN). The first embodiment further introduces a meta player that acts to balance the two, taking into account the rewards of the cyber player and the physical player and the goal conditions of the Netshop Game. The cyber player and the physical player are examples of the first behavior model and the second behavior model, which calculate mutually different rewards using the product information and the environmental state variables. The meta player is an example of the third behavior model, which is generated based on reinforcement learning of the first behavior model and the second behavior model.

Next, the settings for each player in the Netshop Game are described in detail. First, in the Netshop Game, the observable state variable o_t at time step t is defined by equation (1).

Here, IL_t is the inventory level at time step t, OO_t is the number of units that have been ordered from the supplier but not yet received, d_t is the customer demand for the product, RS_t is the number of units received from the supplier, SS_t is the number of units shipped to customers, and a_t is the action taken at time step t, that is, the number of units ordered from the supplier. The first embodiment thus uses regression analysis on these six variables.

In the Netshop Game, one episode runs from the start of the game until a goal condition is achieved, and the state variables ho_t of all time steps in one episode are stored as shown in equation (2).
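The observation and the per-episode history described above could be represented as follows. This is only a sketch: equations (1) and (2) are not reproduced in this text, so the layout assumed here (one record of the six named variables per time step, appended to a list for the episode) is an assumption, and the names `Observation` and `EpisodeHistory` are hypothetical.

```python
from typing import List, NamedTuple


class Observation(NamedTuple):
    """Observable state at one time step (field names follow the text)."""
    IL: int  # inventory level
    OO: int  # units ordered from the supplier but not yet received
    d: int   # customer demand
    RS: int  # units received from the supplier
    SS: int  # units shipped to customers
    a: int   # action: units ordered from the supplier


class EpisodeHistory:
    """Stores the observation of every time step in one episode (assumed layout)."""

    def __init__(self) -> None:
        self.steps: List[Observation] = []

    def record(self, obs: Observation) -> None:
        self.steps.append(obs)

    def reset(self) -> None:
        # Intended to be called when a goal condition ends the episode.
        self.steps.clear()


history = EpisodeHistory()
history.record(Observation(IL=5, OO=3, d=4, RS=2, SS=4, a=3))
history.record(Observation(IL=3, OO=4, d=6, RS=2, SS=5, a=4))
```

A learning agent would typically flatten the most recent records into a feature vector before passing them to its network; how many past steps are used is not stated in the text.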

Next, the action space in the Netshop Game is described. An action can be regarded as a behavior variable allowed in the environment related to the distribution of the product, and each value of the action is a value of the behavior variable. As noted above, the action in this embodiment is the number of units ordered from the supplier. If the action space, that is, the range from the lower limit to the upper limit of the order quantity a player is allowed to choose, is too wide, the processing efficiency of the prediction device 1 may decrease. In this embodiment, therefore, the Netshop Game is run with the action space set, as an example, to the discrete set of values from 0 to 20, [0, 1, 2, ..., 20].
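As a small illustration of the discrete action space described above, the sketch below enumerates the 21 allowed order quantities and maps an action index to an order quantity. The names `ACTION_SPACE` and `action_to_order_quantity` are hypothetical; in an OpenAI Gym environment this would usually be expressed with a discrete space object instead of a plain list.

```python
# The 21 allowed order quantities: 0, 1, 2, ..., 20.
ACTION_SPACE = list(range(21))


def action_to_order_quantity(action_index: int) -> int:
    """Map an action index chosen by the agent to an order quantity."""
    if not 0 <= action_index < len(ACTION_SPACE):
        raise ValueError(f"action index {action_index} is outside the action space")
    return ACTION_SPACE[action_index]


largest_order = action_to_order_quantity(20)
```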

Next, rewards in the Netshop Game are described. In the BG, as shown in equation (3), the reward is determined by the inventory level at time step t.

Here, for a variable x, (x)^+ = max(0, x) and (x)^- = max(0, -x). When the inventory level is positive, it is multiplied by the inventory cost c_h, and when the inventory level is negative, it is multiplied by the opportunity-loss cost of stockouts c_p. In addition, i is an integer from 1 to 4, and each value corresponds to one player. Equation (3) therefore yields the total reward of all players.
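The BG cost described above can be sketched for a single time step as follows. Equation (3) itself is not reproduced in this text, so this follows only the verbal description: each player's positive inventory is charged at c_h and negative inventory at c_p, summed over the four players. The function names and the numeric values in the example are illustrative assumptions, as is treating the result as a cost (a reward would be its negative).

```python
from typing import Sequence


def positive_part(x: float) -> float:
    """(x)^+ = max(0, x)"""
    return max(0.0, x)


def negative_part(x: float) -> float:
    """(x)^- = max(0, -x)"""
    return max(0.0, -x)


def beer_game_step_cost(
    inventory_levels: Sequence[float],
    holding_costs: Sequence[float],
    shortage_costs: Sequence[float],
) -> float:
    """Total cost over all players at one time step."""
    return sum(
        c_h * positive_part(il) + c_p * negative_part(il)
        for il, c_h, c_p in zip(inventory_levels, holding_costs, shortage_costs)
    )


# Four players: retailer, wholesaler, distributor, manufacturer (illustrative numbers).
step_cost = beer_game_step_cost(
    inventory_levels=[4, -2, 0, 1],
    holding_costs=[0.5, 0.5, 0.5, 0.5],
    shortage_costs=[1.0, 1.0, 1.0, 1.0],
)
```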

In the Netshop Game, the selling price, purchase price, sales promotion cost, and delivery cost are added to the reward calculation. Specifically, the reward of the cyber player is calculated by equation (4), the reward of the physical player by equation (5), and the reward of the meta player by equation (6).

Here, s_p is the selling price, c_r is the purchase price, c_s is the sales promotion cost, and c_d is the delivery cost. In the Netshop Game, the players cannot observe these values.

As equation (4) shows, the cyber player adds to its reward the opportunity-loss cost of stockouts and, when the number of units ordered from the supplier exceeds demand, the difference between the two as a sales promotion cost. As equation (5) shows, the physical player adds to its reward the inventory cost for the units in stock and the delivery cost for the units shipped to customers. The rewards are deliberately biased, so that the cyber player does not care about excess inventory and the physical player does not care about stockouts, in order to create situations in which each seeks a locally optimal solution. By contrast, as equation (6) shows, the meta player adds all of the above costs to its reward so that it seeks an overall optimal solution that balances these two orientations.
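The cost terms that distinguish the three players can be sketched as below. Equations (4) to (6) are not reproduced in this text, so only the cost terms named in the paragraph above are modeled; the revenue and purchase terms are left out, and treating each term as a charge against the reward is an assumption. The class `CostParams` and the function names are hypothetical, and the default unit costs mirror the example environment parameters of this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    stock_cost: float = 0.5
    shortage_cost: float = 1.0
    promotion_cost: float = 0.01
    delivery_cost: float = 0.01


def shortage_term(IL: int, p: CostParams) -> float:
    # Charged only when inventory is negative (stockout).
    return abs(IL) * p.shortage_cost if IL < 0 else 0.0


def promotion_term(action: int, d: int, p: CostParams) -> float:
    # Charged only when the order quantity exceeds demand.
    return (action - d) * p.promotion_cost if action > d else 0.0


def stock_term(IL: int, p: CostParams) -> float:
    # Charged only on positive inventory.
    return IL * p.stock_cost if IL > 0 else 0.0


def delivery_term(SS: int, p: CostParams) -> float:
    # Charged per unit shipped to customers.
    return SS * p.delivery_cost if SS > 0 else 0.0


def cyber_cost(IL: int, d: int, action: int, p: CostParams) -> float:
    """Cyber player: stockouts and promotion only; ignores inventory."""
    return shortage_term(IL, p) + promotion_term(action, d, p)


def physical_cost(IL: int, SS: int, p: CostParams) -> float:
    """Physical player: inventory and delivery only; ignores stockouts."""
    return stock_term(IL, p) + delivery_term(SS, p)


def meta_cost(IL: int, d: int, SS: int, action: int, p: CostParams) -> float:
    """Meta player: all four cost terms."""
    return cyber_cost(IL, d, action, p) + physical_cost(IL, SS, p)


params = CostParams()
example = meta_cost(IL=-2, d=5, SS=3, action=8, p=params)
```

Splitting the terms this way makes the bias explicit: a state with large positive inventory costs the cyber player nothing, while a stockout costs the physical player nothing.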

Next, the goal conditions of each player in the Netshop Game are described. In the BG, the players compete on the total reward earned within a predetermined number of time steps, whereas in the Netshop Game the following two indicators are set as goal conditions. Each player ends an episode when it achieves either indicator.

One indicator of the goal conditions in the Netshop Game is the BEI given by equation (7), and the other is the FR given by equation (8).

Here, demand is the array of customer demand for the most recent p periods at time step t (p is a positive integer), and shipped is the array of the number of units shipped for the most recent p periods at time step t. Var(x) is the variance of the variable x, and Mean(x) is the mean of the variable x. In the Netshop Game, whether each player has achieved its goal condition is judged by equations (9) and (10).

Here, equation (9) is used for the goal condition of the physical player, and equation (10) is used for the goal condition of the cyber player. Both equation (9) and equation (10) are used for the goal condition of the meta player.
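The goal checks could look roughly like the sketch below. Equations (7) to (10) are not reproduced in this text, so the formulas here are assumptions rather than the patent's definitions: BEI is taken as the ratio of variance-to-mean of shipments to variance-to-mean of demand over the window, FR as total shipped divided by total demand, the physical goal as BEI below its threshold, the cyber goal as FR above its threshold, and the meta goal as both holding together. The 0.95 thresholds are the example values of this embodiment.

```python
from statistics import fmean, pvariance
from typing import Sequence


def dispersion(xs: Sequence[float]) -> float:
    """Variance divided by mean (assumed building block of the BEI)."""
    mean = fmean(xs)
    return pvariance(xs) / mean if mean else 0.0


def bullwhip_effect_index(demand: Sequence[float], shipped: Sequence[float]) -> float:
    """Assumed form: dispersion of shipments relative to dispersion of demand."""
    base = dispersion(demand)
    if base == 0:
        return 0.0 if dispersion(shipped) == 0 else float("inf")
    return dispersion(shipped) / base


def fill_rate(demand: Sequence[float], shipped: Sequence[float]) -> float:
    """Assumed form: share of demand over the window that was shipped."""
    total = sum(demand)
    return sum(shipped) / total if total else 1.0


def physical_goal(demand, shipped, bei_threshold: float = 0.95) -> bool:
    return bullwhip_effect_index(demand, shipped) < bei_threshold


def cyber_goal(demand, shipped, fillrate_threshold: float = 0.95) -> bool:
    return fill_rate(demand, shipped) > fillrate_threshold


def meta_goal(demand, shipped) -> bool:
    # Assumption: the meta player must satisfy both conditions.
    return physical_goal(demand, shipped) and cyber_goal(demand, shipped)


demand_window = [4, 6, 5, 5]
smooth_shipments = [5, 5, 5, 5]
reached = meta_goal(demand_window, smooth_shipments)
```

Under these assumed formulas, shipping a steady quantity against fluctuating demand drives the BEI toward zero, while matching demand exactly gives a BEI of one.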

Next, the customer demand generated in the Netshop Game is described. In the Netshop Game, the customer demand D_t at time step t is generated, as shown in equation (11), by adding a normally distributed random variable x to the mean demand over the most recent p periods before time step t.

The initial value is chosen at random from the natural numbers 1 to 10.
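The demand process described above can be sketched as follows: each new value is the mean of the last p values plus normally distributed noise, seeded with a random natural number from 1 to 10. Equation (11) is not reproduced in this text, so the zero mean of the noise, the `sigma` parameter, and the absence of rounding or clamping are assumptions; the function names are hypothetical.

```python
import random
from typing import List


def initial_demand(rng: random.Random) -> int:
    """Random natural number from 1 to 10, inclusive."""
    return rng.randint(1, 10)


def next_demand(history: List[float], p: int, sigma: float, rng: random.Random) -> float:
    """Mean of the most recent p demands plus N(0, sigma^2) noise (assumed form)."""
    window = history[-p:]
    return sum(window) / len(window) + rng.gauss(0.0, sigma)


def generate_demand(steps: int, p: int, sigma: float, seed: int) -> List[float]:
    rng = random.Random(seed)
    demand = [float(initial_demand(rng))]
    while len(demand) < steps:
        demand.append(next_demand(demand, p, sigma, rng))
    return demand


series = generate_demand(steps=30, p=5, sigma=1.0, seed=0)
```

Because each value is anchored to a moving average of its predecessors, the series drifts over time, which is one source of the non-stationarity the players have to cope with.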

FIG. 3 shows an example of the algorithm of the simulation executed with OpenAI Gym in this embodiment. As shown in FIG. 3, the Netshop Game uses the so-called epsilon-greedy algorithm to balance accumulating experience from the results of actions against exploiting that experience, and evaluates states using a DNN (Deep Neural Network). Because the algorithm of FIG. 3 is well known and is also described in the non-patent documents cited above, a detailed description is omitted here.
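For readers unfamiliar with epsilon-greedy selection, the following minimal sketch shows the idea. It is not the algorithm of FIG. 3: in a DQN the action values would come from a neural network, whereas here a plain list of numbers stands in for them, and the function name `epsilon_greedy` is hypothetical.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Return an action index: random with probability epsilon, otherwise the best one."""
    if rng.random() < epsilon:
        # Explore: try any action to gather new experience.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda i: q_values[i])


rng = random.Random(0)
q_estimates = [0.1, 0.7, 0.3]
greedy_action = epsilon_greedy(q_estimates, epsilon=0.0, rng=rng)
```

Implementations commonly start with a large epsilon and reduce it during training so that exploration gives way to exploitation; whether and how this embodiment decays epsilon is not stated in the text.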

Next, the parameters of each player and the definition of the reward calculation in this embodiment will be described. FIG. 4 shows an example of the environment parameters set in OpenAI Gym. As shown in FIG. 4, in the simulation executed in this embodiment, the BEI threshold used for the goal condition ("bei_threshold") is 0.95, the FR threshold ("fillrate_threshold") is 0.95, the inventory cost ("stock_cost") is 0.5, the shortage cost ("shortage_cost") is 1.0, the sales promotion cost ("promotion_cost") is 0.01, and the delivery cost ("delivery_cost") is 0.01. The values "demand_volatility" (demand fluctuation), "demand_min" (minimum value), and "demand_max" (maximum value) in the figure are used to generate the number of demands from customers. The selling price of the product ("sales_price") is 2.0, and the purchase price of the product ("purchase_price") is 1.2.
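The parameters listed above could be grouped as in the following sketch. The class name `NetshopParams` is hypothetical. Because the text does not state values for the three demand parameters, they are left as required arguments instead of being given defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetshopParams:
    # Demand generation parameters: values are not given in the text,
    # so the caller must supply them.
    demand_volatility: float
    demand_min: int
    demand_max: int
    # Goal-condition thresholds stated in the text.
    bei_threshold: float = 0.95
    fillrate_threshold: float = 0.95
    # Per-unit costs stated in the text.
    stock_cost: float = 0.5
    shortage_cost: float = 1.0
    promotion_cost: float = 0.01
    delivery_cost: float = 0.01
    # Per-unit prices stated in the text.
    sales_price: float = 2.0
    purchase_price: float = 1.2
```

With these prices, each unit sold yields a gross margin of 0.8 before the costs are deducted.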

FIG. 5 shows an example of the Netshop Game rewards set in OpenAI Gym. As shown in FIG. 5, a reward is set for each of the cyber player, the physical player, and the meta player.

To describe the reward of the meta player more specifically: when the product is out of stock, a shortage cost proportional to the size of the shortfall is added ("cost += abs(IL) * self.shortage_cost if IL < 0 else 0" in the figure). When the number of orders exceeds the number of demands from customers, the sales promotion cost is added ("cost += (action - d) * self.promotion_cost if action > d else 0" in the figure). For the current inventory, an inventory cost proportional to the number of units in stock is added ("cost += IL * self.stock_cost if IL > 0 else 0" in the figure). For the number of shipments, the delivery cost is added ("cost += SS * self.delivery_cost if SS > 0 else 0" in the figure). The sales amount is the number of shipments multiplied by the unit selling price ("sales_price = SS * self.sales_price" in the figure), and the purchase amount is the number of arrivals multiplied by the unit purchase price ("purchase_price = RS * self.purchase_price" in the figure). The reward is the value obtained by subtracting the purchase amount and each of the above costs from the sales amount ("reward = abs(sales_price)-abs(purchase_price)-abs(cost)" in the figure). The meta player aims to achieve the goal condition described later based on the reward set in this way.
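The reward expressions quoted from FIG. 5 can be collected into one function as in the sketch below. The function name and the keyword defaults (taken from the parameter values of FIG. 4) are assumptions of this illustration. The variable names IL, SS, RS, action, and d follow the quoted expressions.

```python
def meta_reward(IL, action, d, SS, RS, *,
                shortage_cost=1.0, promotion_cost=0.01,
                stock_cost=0.5, delivery_cost=0.01,
                sales_price=2.0, purchase_price=1.2):
    """Reward of the meta player for one time step."""
    cost = 0.0
    # Shortage: charged on the size of a negative inventory level.
    cost += abs(IL) * shortage_cost if IL < 0 else 0
    # Promotion: charged on orders placed in excess of demand.
    cost += (action - d) * promotion_cost if action > d else 0
    # Holding: charged on units in stock.
    cost += IL * stock_cost if IL > 0 else 0
    # Delivery: charged on units shipped.
    cost += SS * delivery_cost if SS > 0 else 0
    sales = SS * sales_price          # revenue from shipped units
    purchase = RS * purchase_price    # amount paid for received units
    return abs(sales) - abs(purchase) - abs(cost)
```

For example, with IL = 3, action = 5, d = 4, SS = 4, and RS = 5, the cost is 0.01 + 1.5 + 0.04 = 1.55 and the reward is 8.0 - 6.0 - 1.55 = 0.45.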

Next, FIG. 6 shows an example of the update of each of the above state variables executed in OpenAI Gym. As shown in FIG. 6, one time step uses the current inventory quantity ("IL = self.state[0]" in the figure), the number of ordered units not yet received ("OO = self.state[1]" in the figure), and the number of orders placed in the previous time step ("a = self.state[5]" in the figure). The number of demands from customers is generated with random numbers by adding a normally distributed fluctuation to the moving average of demand ("d = max(self.demand_min,int(dave + np.random.randn() * self.demand_volatility))" in the figure). In addition, products corresponding to the number of orders are assumed to arrive with an arrival lead time of 1 ("RS = a # received shipment = last order ( shipping lead time = 1 )" in the figure).

Then, the number of products received from the supplier is first added to the inventory quantity ("IL += RS" in the figure). After that, if the number of demands from customers is smaller than the inventory quantity ("if d < IL:" in the figure), products equal to the number of demands are shipped to the customer ("SS = d" in the figure); if the number of demands is greater than or equal to the inventory quantity ("else: # d >= IL" in the figure), the inventory quantity becomes the number of shipments to the customer ("SS = IL" in the figure), and the inventory quantity is updated ("IL -= SS" in the figure). Then, the value obtained by subtracting the number of arrivals from the number of orders placed this time is added to the number of ordered units not yet received ("OO += action - RS" in the figure). Each state variable is changed in this way, and the changed state variables are used for ordering, receiving, and shipping products in the next time step.
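One time step of the state update can be sketched as follows. This is an illustration under stated assumptions: the function name and signature are hypothetical, `random.gauss` stands in for `np.random.randn()`, and the stock decrement `IL -= SS` is assumed to apply after both branches, since the text leaves its indentation in the figure unclear.

```python
import random


def step_state(IL, OO, last_order, action, moving_average, *,
               demand_min, demand_volatility, rng):
    """Advance inventory (IL) and open orders (OO) by one time step."""
    # Demand: moving average plus scaled standard-normal noise, clamped below.
    d = max(demand_min,
            int(moving_average + rng.gauss(0.0, 1.0) * demand_volatility))
    RS = last_order              # lead time of 1: the last order arrives now
    IL += RS                     # receive the shipment into stock
    SS = d if d < IL else IL     # ship the demand, or all stock if short
    IL -= SS                     # assumed to apply in both branches
    OO += action - RS            # new order placed, last order received
    return IL, OO, d, SS, RS
```

Under this assumption the inventory level never goes below zero within a step, because at most the available stock is shipped.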

FIG. 7 shows an example of the goal condition for the meta player. As shown in FIG. 7, the variance of the number of shipments ("vars4 = np.var(shipped)" in the figure) and the variance of the number of demands ("vars2 = np.var(demand)" in the figure) over the most recent fixed period (for example, 100 days) up to the current time step are calculated, and the BEI is calculated from these variances ("bei = vars4 / vars2 if vars2 != 0 else bei_threshold" in the figure). If the calculated BEI is smaller than the above BEI threshold ("if bei < bei_threshold:" in the figure), the BEI condition is regarded as satisfied ("bei_flag = True" in the figure). In addition, the average ratio of the number of shipments to the number of demands over the most recent fixed period (for example, 100 days) up to the current time step is calculated ("fillrate = np.mean(shipped / demand)" in the figure). If the calculated average is larger than the above FR threshold ("if fillrate > self.fillrate_threshold:" in the figure), the fill rate condition is regarded as satisfied ("fill_flag = True" in the figure). The goal condition is regarded as achieved when both of the above two conditions are satisfied ("done = fill_flag & sum_flag" in the figure).
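The goal check can be sketched with the standard library as below. The function name is hypothetical, and `statistics.pvariance` is used because it computes the population variance, as `np.var` does by default. The sketch combines the BEI flag and the fill rate flag, following the statement that both of the two conditions described must hold, and it assumes every demand value is nonzero.

```python
from statistics import fmean, pvariance


def goal_reached(demand, shipped, bei_threshold=0.95, fillrate_threshold=0.95):
    """Return (bei, fillrate, done) for the recent window of observations."""
    var_shipped = pvariance(shipped)
    var_demand = pvariance(demand)
    # BEI: ratio of shipment variance to demand variance; falls back to the
    # threshold (so the flag stays False) when demand does not vary.
    bei = var_shipped / var_demand if var_demand != 0 else bei_threshold
    bei_flag = bei < bei_threshold
    # Fill rate: mean of per-period shipped/demand ratios.
    fillrate = fmean([s / d for s, d in zip(shipped, demand)])
    fill_flag = fillrate > fillrate_threshold
    return bei, fillrate, bei_flag and fill_flag
```

For demand [4, 6, 4, 6] and shipments [4, 5, 4, 6], the BEI is 0.6875 and the fill rate is about 0.958, so both conditions hold.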

Each player repeats the update of the state variables shown in FIG. 6 at every time step, aiming to achieve the above goal condition based on the reward set as shown in FIG. 5. FIG. 8 is a graph showing an example of how the reward of the meta player changes with the time step when reinforcement learning is performed in Netshop Game with the upper limit of time steps set to 50,000. The horizontal axis of the graph represents the time step, and the vertical axis represents the reward obtained by the meta player as described above. As the graph of FIG. 8 shows, the reward of the meta player rises toward a constant value as the time steps progress. It can therefore be said that the learning of the meta player accumulates as the time steps progress.

FIGS. 9 to 11 are graphs showing an example of changes in the number of demands, the number of orders, and the inventory quantity in a test in which Netshop Game was executed again for 100 episodes using the behavior model of each player trained by the above Netshop Game. The horizontal axis of each graph represents the time step ("step" in the figure), and the vertical axis represents the number of products ("items" in the figure). FIG. 9 shows the transition of the reward earned by the cyber player, FIG. 10 shows that of the physical player, and FIG. 11 shows that of the meta player. Here, a player is regarded as trained when its learning has accumulated to the point where the reward it earns remains substantially constant, as shown in FIG. 9. The upper limit of the number of time steps in one episode is 1000, and an episode ends when the player completes the action of the 1000th time step without achieving the above goal condition. The graph in each figure shows the transition of the earned reward over the last 100 steps of each episode, that is, the 100 steps preceding the end of the episode. In each figure, "stock" represents the inventory quantity, "demand" represents the number of demands, and "shipped" represents the number of shipments.

As the transition of the inventory quantity in FIG. 9 shows, the behavior of the cyber player in Netshop Game tends to make excess inventory grow and stay high. In FIG. 10, the inventory quantity remains at 0 over multiple consecutive steps, which suggests that the behavior of the physical player tends to cause frequent shortages. In FIG. 11, on the other hand, the graph of the inventory quantity is roughly sawtooth-shaped. This can be interpreted as follows: even when the inventory quantity falls to 0, it increases again in the next step or within a few steps, and even when the inventory quantity grows too large, the number of demands and the number of shipments act to push it back down. The behavior of the meta player may therefore be able to achieve a balanced inventory quantity.

FIG. 12 shows, for each player in the above test, the 100-episode averages of the number of steps required to achieve the goal condition, the BEI value, the FR value, and the earned reward. In the graphs of FIGS. 9 to 11, the smaller the number of steps required to achieve the goal condition ("steps" in the figure), the better. A BEI value below 1.0 indicates that the influence of BE is suppressed. The FR value indicates the proportion of customer demand that can be shipped, and the closer it is to 1.0, the better. As FIG. 12 shows, the cyber player is optimal if maximizing the reward is the only purpose. However, as the transition of the inventory quantity in FIG. 9 shows, with the cyber player the inventory quantity stays large throughout the episode. One reason for this is that the reward set for the cyber player includes no element that suppresses excess inventory.

As shown in FIG. 12, with the physical player the inventory quantity is kept small, so the BEI is also lower than with the cyber player. However, as the transition of the inventory quantity in FIG. 10 shows, the inventory quantity can remain at 0 over multiple steps, that is, the out-of-stock state can persist, and as a result the FR value is also low.

As also shown in FIG. 12, the meta player earns a lower reward than the cyber player and the physical player, but its BEI value is smaller than that of either, and its FR value lies between that of the cyber player, which produces more excess inventory, and that of the physical player, which produces more shortages. Furthermore, the graph of the inventory quantity in FIG. 11 contains portions where the period and amplitude of the inventory changes are almost constant over multiple steps. This suggests that the inventory quantity, the number of arrivals, and the number of shipments are balanced, that is, that the player has learned appropriate inventory management that stabilizes the inventory.

In the first embodiment, the control unit 11 of the prediction device 1 controls the communication unit 15, which serves as the acquisition unit, to communicate with the product information DB (database) 2. The communication unit 15 acquires the information stored in the product information DB 2 on the number of orders, the number of shipments, and the number of arrivals of the product over a certain period. The information acquired from the product information DB 2 here is an example of product information indicating fluctuations in the number of orders, shipments, and arrivals of the product over a certain period. Based on the information acquired by the communication unit 15, the control unit 11 generates the above behavior model by reinforcement learning in which the reward determined from each state variable is calculated while the number of orders is varied. The control unit 11 then uses this behavior model to simulate the behavior of the meta player in Netshop Game.

Specifically, the control unit 11 repeats the simulation while varying the number of products ordered by the meta player between the lower limit and the upper limit, for example by increasing the number of orders from 0 to 20 in increments of 1 in the above example. As an example, the control unit 11 sets the upper limit of the number of time steps in one episode to 1000 as described above, and ends an episode when the meta player completes the action of the 1000th time step without achieving the above goal condition. Based on the results of executing 100 episodes for each number of orders, the control unit 11 calculates the 100-episode averages of the number of steps until the goal condition is achieved, the BEI value, the FR value, and the earned reward for each number of orders. The control unit 11 outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device (not shown). The calculation result of the prediction device 1 is an example of the optimum number of orders predicted by the prediction device.
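The evaluation sweep described above can be sketched as follows. The function `sweep_order_quantities` and the callable `run_episode` are hypothetical: `run_episode(quantity, max_steps)` is assumed to play one episode with the trained behavior model and return a dictionary with the keys "steps", "bei", "fillrate", and "reward".

```python
from statistics import fmean


def sweep_order_quantities(run_episode, quantities=range(0, 21),
                           episodes=100, max_steps=1000):
    """Average episode metrics for each candidate order quantity."""
    metrics = ("steps", "bei", "fillrate", "reward")
    results = {}
    for quantity in quantities:
        # Run the configured number of episodes for this order quantity.
        runs = [run_episode(quantity, max_steps) for _ in range(episodes)]
        # Average each metric over those episodes.
        results[quantity] = {
            key: fmean([run[key] for run in runs]) for key in metrics
        }
    return results
```

The resulting table, one row per order quantity, corresponds to the output that is stored, displayed, or transmitted.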

The user can check the calculation results of the prediction device 1 and decide the actual number of orders for the product that was the subject of the simulation. Thus, according to the first embodiment, the number of products ordered by the seller is adjusted based on the results calculated by the simulation of the prediction device 1. As a result, the BEI is controlled so as not to exceed the threshold and shipments are made so as to satisfy demand as far as possible, which balances the conflicting goals of minimizing inventory cost and maximizing sales and optimizes the inventory of the product.

The above is the description of the present embodiment. However, the configuration and processing of the prediction device 1 of the present embodiment are not limited to the above description, and various changes can be made without departing from the technical idea of the present invention. Modifications of the above embodiment will be described below. In the following description, configurations and processes that are the same as those described above are given the same reference numerals, and detailed descriptions of them are omitted.

(Modification 1)
In Modification 1, the man-hours required for one day of shipping are determined for the workers of a warehouse that stores the inventory of a product, in line with the sales promotion strategy for one product sold on an EC site. In the simulation executed in this modification, one time step is one day, a day of the week is assigned to each day, and the days on which sales promotion for the product is carried out are set in advance. Examples of sales promotion days include days containing a 5 (the 5th, 15th, and 25th) and gotobi (days of the month ending in 5 or 0). The information indicating the sales promotion days may, for example, be stored in the product information DB 2 and acquired from it by the prediction device 1, or may be input by the user operating the operation unit 13. FIG. 13 shows an example of the reward of this simulation set in OpenAI Gym in the prediction device 1 (FIG. 13A) and an example of the update of each state variable executed in OpenAI Gym (FIG. 13B). The action (behavior variable) in this simulation is the man-hours required to ship the product in one day, and a value from 0 to a predetermined upper limit is selected.

As shown in FIG. 13A, in the simulation executed in this modification, the absolute value of the cumulative difference between the number of demands and the quantity of stock that can be shipped is calculated for the meta player ("cost += abs(b)" in the figure). It is then determined whether the day of the week in the current step is Saturday or Sunday ("if self.sat_sun_day_flag(y):" in the figure). If it is Saturday or Sunday, the difference between the required man-hours and the man-hours available for shipping is multiplied by the coefficient for Saturdays and Sundays (weekend operation coefficient) and added to the cost ("cost += abs(f) * self.manpower_cost_holiday" in the figure). If it is a weekday, the difference between the required man-hours and the man-hours available for shipping is multiplied by the coefficient for weekdays (weekday operation coefficient) and added to the cost ("cost += abs(f) * self.manpower_cost_weekday" in the figure). These costs are an example of a sales promotion cost determined according to the day of the week on which the sales promotion of the product is carried out. The negative of the calculated cost is used as the reward. The meta player aims to achieve the goal condition described later based on the reward set in this way.
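The reward of this modification can be sketched as follows. The function name and argument names are assumptions, and the two coefficients are passed in as arguments because the text does not state their values. Here b is the cumulative difference between demand and shippable quantity, and f is the difference between required and selected man-hours.

```python
def labor_reward(b, f, is_weekend,
                 manpower_cost_weekday, manpower_cost_holiday):
    """Negative cost for one day of the man-hour simulation."""
    cost = 0.0
    # Penalty for the accumulated gap between demand and shippable quantity.
    cost += abs(b)
    # Penalty for the gap between required and selected man-hours,
    # weighted by the weekend or weekday coefficient.
    rate = manpower_cost_holiday if is_weekend else manpower_cost_weekday
    cost += abs(f) * rate
    return -cost
```

Because both gaps enter through their absolute values, overstaffing and understaffing are penalized alike.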

As shown in FIG. 13B, in one time step the date is first advanced by one day ("self.step_date += datetime.timedelta(days=1)" in the figure). Next, it is determined whether the current date is a sales promotion day ("t = self.five_six_day_flag(self.step_date.day)" in the figure), and the day of the week of the current date is identified ("y = self.step_date.weekday()" in the figure). The number of demands for the product is then determined from the sales promotion flag (t), the day of the week (y), and the moving average (dave) ("d = self.demand_by_date(t,y,dave)" in the figure). Next, the number of shipments per hour is determined ("r = self.avg_productivity" in the figure), and the man-hours required to ship the demanded quantity are calculated ("n = int( d / r )" in the figure). Next, the man-hours selected by the action are identified ("m = action" in the figure), and the quantity of stock that can be shipped with those man-hours is calculated ("p = m * r" in the figure). The difference (f) between the man-hours (n) required to ship the demanded quantity and the man-hours (m) available for shipping is also calculated ("f = n - m" in the figure). As can be seen from FIGS. 13A and 13B, the reward is calculated using this difference (f). Then, the cumulative difference between the number of demands and the quantity of stock that can be shipped is calculated ("b = self.state[B] + (d - p)" in the figure). Each state variable is changed in this way, and in the next time step the product is shipped based on the generated demand using the changed state variables.
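The per-day update can be sketched as below. The function name and signature are hypothetical. Since the text does not describe how `demand_by_date` computes demand, the demand d is passed in directly instead of being derived from the promotion flag and the day of the week.

```python
import datetime


def labor_step(step_date, backlog, action, d, avg_productivity):
    """Advance one day and return the updated quantities."""
    step_date += datetime.timedelta(days=1)  # move to the next day
    y = step_date.weekday()                  # Monday is 0, Sunday is 6
    r = avg_productivity                     # units shipped per man-hour
    n = int(d / r)                           # man-hours needed for the demand
    m = action                               # man-hours selected by the action
    p = m * r                                # units shippable with m man-hours
    f = n - m                                # man-hour gap used in the reward
    b = backlog + (d - p)                    # cumulative demand/shipment gap
    return step_date, y, n, p, f, b
```

Starting from Friday 2019-05-10 with a demand of 45 units, a productivity of 10 units per man-hour, and an action of 5 man-hours, the step lands on a Saturday with n = 4, p = 50, f = -1, and b = -5.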

FIG. 13C shows an example of the goal condition in the simulation of this modification. As shown in FIG. 13C, the average ratio of the number of shipments to the number of demands over the most recent fixed period (for example, 100 days) up to the current time step is calculated ("fillrate = np.mean(shipped / demand)" in the figure). If the calculated average is larger than the above FR threshold ("if fillrate > self.fillrate_threshold:" in the figure), the fill rate condition is regarded as satisfied ("fill_flag = True" in the figure). In addition, the totals of the daily man-hours n and m over the most recent fixed period (for example, 100 days) up to the current time step are calculated ("ssum = np.sum(history,axis=0)" in the figure). The monthly total of the man-hours n required to ship the demanded quantity is calculated ("nsum = int(ssum[N]/self.months)" in the figure), as is the monthly total of the man-hours m selected by the action ("msum = int(ssum[M]/self.months)" in the figure). If the monthly total of man-hours m is less than or equal to the monthly total of man-hours n ("if nsum >= msum:" in the figure), cost minimization is regarded as achieved and the man-hour condition is regarded as satisfied ("sum_flag = True" in the figure). The goal condition is regarded as achieved when both of the above two conditions are satisfied ("done = fill_flag & sum_flag" in the figure).
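The goal check of this modification can be sketched as follows. The function name is hypothetical, `history` is assumed to be a list of (n, m) pairs, one per day, and `months` is assumed to be the number of months covered by the window.

```python
from statistics import fmean


def labor_goal(demand, shipped, history, months, fillrate_threshold=0.95):
    """Return True when both the fill rate and man-hour conditions hold."""
    # Fill rate condition: mean shipped/demand ratio above the threshold.
    fillrate = fmean([s / d for s, d in zip(shipped, demand)])
    fill_flag = fillrate > fillrate_threshold
    # Column sums of the (n, m) history, converted to monthly totals.
    nsum = int(sum(n for n, _ in history) / months)
    msum = int(sum(m for _, m in history) / months)
    # Man-hour condition: selected man-hours do not exceed required ones.
    sum_flag = nsum >= msum
    return fill_flag and sum_flag
```

The second condition rewards policies that do not allocate more man-hours than the demand requires, which is how cost minimization enters the goal.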

In this modification, the control unit 11 generates a behavior model by reinforcement learning in which the above reward is calculated for each of a plurality of man-hour values, and performs a simulation using the generated behavior model. Specifically, as an example, the control unit 11 repeats the simulation while varying the man-hours required for shipping in one day (one step) from the lower limit to the upper limit, for example by increasing them from 0 to the upper limit selectable by the action in increments of a predetermined number of man-hours. As described above, the control unit 11 sets, as an example, the upper limit of the number of time steps in one episode to 1000 and ends an episode when the meta player completes the action of the 1000th time step without achieving the above goal condition. Based on the results of executing 100 episodes for each man-hour value, it calculates the 100-episode averages of the number of time steps until the goal condition is achieved and of the FR value for each man-hour value. The control unit 11 then outputs the calculation results by storing them in the storage unit 12, displaying them on the display unit 14, or transmitting them from the communication unit 15 to an external device.

The user can check the calculation results of the prediction device 1 and decide the man-hours to allocate to shipping on the sales promotion days for the product that was the subject of the simulation. Thus, according to the first embodiment, the man-hours assigned to shipping the product are adjusted based on the results calculated by the simulation of the prediction device 1. Shipping so as to satisfy the demand that fluctuates with sales promotion, while keeping the man-hours in the warehouse where the product is stored low, balances the conflicting goals of maximizing sales and minimizing shipping cost, and optimizes the man-hours assigned to shipping the product on the sales promotion days set in this simulation.

(Modification 2)
In Modification 2, for products sold on an EC site, dead stock whose inventory quantity does not change is identified, and whether the product needs to be recommended is determined. In the simulation executed in this modification, one time step is one day, and a user is assumed to use an information processing terminal to search the Internet for information indicating a product page of the EC site and to move from the search results page to that product page. The number of times information on the product page is displayed in the search results is taken as the display count of the product, and the number of times users move to the product page is taken as the number of clicks. It is also assumed that a size serving as an index of the space one unit of the product occupies in the warehouse is determined in advance.

FIG. 14A shows an example of the reward of the simulation set up with OpenAI Gym in the prediction device 1, and FIG. 14B shows an example of the update of each state variable executed in OpenAI Gym. The action (behavior variable) in this simulation is a threshold for determining whether the product needs to be recommended, and, for example, one of the values from 0.0 to 1.0 in steps of 0.1 is selected. This threshold is an example of a value indicating how readily a product is recommended on the EC site.

As shown in FIG. 14A, in the simulation executed in this modification, the click-through rate of the product page over a fixed period, that is, over a predetermined number of consecutive steps, is calculated for the meta player from the number of clicks on the product page ("click_num" in the figure) and the number of times the product page was displayed ("view_num" in the figure) ("click_through_rate = click_num / view_num" in the figure). The reward is the click-through rate multiplied by the profit and the size ("reward = click_through_rate * profit * size" in the figure). That is, the larger the profit or the size, the larger the reward. Based on the reward set in this way, the meta player aims to achieve the goal condition described later.
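
The reward calculation quoted from FIG. 14A can be sketched in Python as follows. This is only an illustrative sketch: the function names and the handling of a zero display count are assumptions that do not appear in the figure, while click_num, view_num, profit and size follow the names quoted from it.

```python
def click_through_rate(click_num: int, view_num: int) -> float:
    """Click-through rate of the product page over the period."""
    # Assumption: a page that was never displayed gets a rate of 0.0 instead
    # of raising ZeroDivisionError; the source does not cover this case.
    if view_num == 0:
        return 0.0
    return click_num / view_num


def reward(click_num: int, view_num: int, profit: float, size: float) -> float:
    """reward = click_through_rate * profit * size, as quoted from FIG. 14A."""
    return click_through_rate(click_num, view_num) * profit * size


if __name__ == "__main__":
    # 25 clicks out of 100 displays, profit of 200 per unit, size index of 2.
    print(reward(click_num=25, view_num=100, profit=200.0, size=2.0))
```

Because the three factors are multiplied, a product with a large profit or a large size index earns a proportionally larger reward at the same click-through rate.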

As shown in FIG. 14B, the state variables in one time step are: the profit per unit of the product (for example, the selling price minus the purchase price and the inventory cost) ("profit" in the figure); the number of clicks on the product page in one day ("click_num_one_day" in the figure); the number of times the product page is displayed in one day ("view_num_one_day" in the figure); the inventory days, that is, the number of days the product has been stored in the warehouse ("stock_days" in the figure); the number of units in stock ("stock_num" in the figure); and the threshold selected by the action described above ("boost_value" in the figure). In this modification, information indicating the values other than "boost_value" is stored in the product information DB 2. Each of these values is assumed to be stored in the product information DB 2 on a daily basis over a fixed past period, for example the past 30 days. The prediction device 1 acquires the information on each value over the fixed period from the product information DB 2 and, each time the time step advances by one, identifies the next day's values from the acquired information and updates each state variable with the identified values. The state variables are changed in this way, and the updated state variables are used in the next time step to determine whether the product needs to be recommended.
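
The state update described for FIG. 14B might be sketched as below. The State class, the state_for_step function and the list-of-records layout standing in for the product information DB 2 are all hypothetical. The sketch also assumes that the record stored for a given day supplies that day's values directly, whereas the source only states that the next day's values are identified from the acquired information.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class State:
    """State variables named after the identifiers quoted from FIG. 14B."""
    profit: float
    click_num_one_day: int
    view_num_one_day: int
    stock_days: int
    stock_num: int
    boost_value: float


def state_for_step(
    history: Sequence[Mapping[str, float]], step: int, boost_value: float
) -> State:
    """Build the state for time step `step` (0-based) from the daily records."""
    # Assumption: history[step] holds the values for that day of the period.
    record = history[step]
    return State(
        profit=record["profit"],
        click_num_one_day=record["click_num_one_day"],
        view_num_one_day=record["view_num_one_day"],
        stock_days=record["stock_days"],
        stock_num=record["stock_num"],
        # boost_value is not stored in the DB; it comes from the chosen action.
        boost_value=boost_value,
    )
```

With 30 daily records, calling state_for_step for steps 0 through 29 would cover one episode.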

FIG. 14C shows an example of the goal condition of the simulation in this modification. As shown in FIG. 14C, when the click-through rate over the fixed period described above is greater than or equal to a preset threshold ("if click_through_rate >= click_through_rate_threshold:" in the figure), the click-through rate condition is regarded as satisfied ("click_through_rate_flag = True" in the figure). Likewise, when the average inventory days of the product ("stock_days") over the most recent fixed period (for example, 100 days) up to the current time step is greater than or equal to a preset threshold ("if stock_days >= stock_days_threshold:" in the figure), the average inventory days condition is regarded as satisfied ("stock_days_flag = True" in the figure). The goal condition is regarded as achieved when both conditions are satisfied ("done = click_through_rate_flag & stock_days_flag" in the figure).
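
The goal condition quoted from FIG. 14C can be sketched as follows. The function name goal_reached, the window parameter and the use of a plain list of daily inventory days are assumptions made for illustration; the flag names and comparisons follow the figure text.

```python
from statistics import fmean
from typing import Sequence


def goal_reached(
    click_through_rate: float,
    stock_days_history: Sequence[float],
    click_through_rate_threshold: float,
    stock_days_threshold: float,
    window: int = 100,
) -> bool:
    """Return True when both conditions quoted from FIG. 14C hold."""
    # Average inventory days over the most recent `window` days.
    # Assumption: stock_days_history is non-empty.
    stock_days = fmean(stock_days_history[-window:])

    click_through_rate_flag = click_through_rate >= click_through_rate_threshold
    stock_days_flag = stock_days >= stock_days_threshold

    # done = click_through_rate_flag & stock_days_flag
    return click_through_rate_flag and stock_days_flag
```

An episode would end at the first time step for which this function returns True.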

In this modification, the control unit 11 generates a behavior model by reinforcement learning in which a reward, determined from the acquired product information, the size of the product, the sales of the product, the purchase price of the product, and the inventory cost determined according to the size of the product, is calculated for each of a plurality of values indicating how readily the product is recommended, and runs a simulation using the generated behavior model. Specifically, the control unit 11 repeats the simulation while changing the threshold for determining whether the product needs to be recommended from its lower limit to its upper limit, for example increasing it from 0.0 to 1.0 in steps of 0.1. Here, the control unit 11 sets the number of time steps in one episode to the number of days in the period covered by the values acquired from the product information DB 2. For example, when values for the past 30 days are acquired from the product information DB 2, one episode has 30 time steps. The control unit 11 runs 100 episodes for each threshold and, from the results, calculates for each threshold the averages over the 100 episodes of the number of time steps until the goal condition is achieved, the click-through rate, and the average inventory days. The control unit 11 outputs the calculation result by storing it in the storage unit 12, displaying it on the display unit 14, or transmitting it from the communication unit 15 to an external device.
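
The threshold sweep performed by the control unit 11 could be organized roughly as in the sketch below. The names sweep_thresholds, run_episode and EpisodeResult are hypothetical, and run_episode stands in for one simulated episode of the learned behavior model, which is not reproduced here.

```python
from statistics import fmean
from typing import Callable, Dict, Tuple

# (time steps until the goal, click-through rate, average inventory days)
EpisodeResult = Tuple[int, float, float]


def sweep_thresholds(
    run_episode: Callable[[float], EpisodeResult], episodes: int = 100
) -> Dict[float, Tuple[float, float, float]]:
    """Average the episode results for each threshold from 0.0 to 1.0."""
    results: Dict[float, Tuple[float, float, float]] = {}
    for i in range(11):
        # round() avoids keys such as 0.30000000000000004.
        threshold = round(i * 0.1, 1)
        outcomes = [run_episode(threshold) for _ in range(episodes)]
        steps, rates, stock_days = zip(*outcomes)
        results[threshold] = (fmean(steps), fmean(rates), fmean(stock_days))
    return results
```

The returned dictionary holds one row of averages per threshold, which corresponds to the calculation result that is stored, displayed or transmitted.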

The user can review the calculation result from the prediction device 1 and decide whether the simulated product needs to be recommended. Thus, according to Modification 2, whether a product needs to be recommended on the EC site is determined based on the result of the simulation by the prediction device 1. As a result, a product whose inventory days exceed the threshold is recommended on the EC site, that is, displayed on the user's information processing terminal. If the user continues not to act on the recommended product, for example by not clicking the displayed product, the product is no longer recommended. In this way, products that do not sell because they are not recommended are distinguished from products that do not sell even when recommended, so that products more likely to sell are recommended and product recommendation on the EC site can be optimized.

(Modification 3)
In Modification 3, for products sold on an EC site, the incentive granted to a user who changes the designated delivery date when ordering a product is determined. Here, an incentive is, for example, points usable on the EC site that are returned to the user for an order. Incentives are well known, so a detailed description is omitted. The action (behavior variable) in this modification is the incentive, and, for example, one of the values from 0.0 to 1.0 in steps of 0.1 is selected. In the simulation executed in this modification, one time step is one day, and it is assumed that the user can change the designated delivery date when ordering a product on the EC site.

FIG. 15A shows an example of the reward of the simulation set up with OpenAI Gym in the prediction device 1, and FIG. 15B shows an example of the update of each state variable executed in OpenAI Gym.

As shown in FIG. 15A, in the simulation executed in this modification, the reward is the man-hour value required to deliver the simulated product ("reward = click_through_rate * profit * size" in the figure). That is, the larger the profit or the size, the larger the reward. Based on the reward set in this way, the meta player aims to achieve the goal condition.

As shown in FIG. 15B, the state variables in one time step are: the order information of the product, comprising the order ID, the latitude and longitude of the delivery destination, the currently designated delivery date and delivery time slot, and the number of parcels ("order_info : (order_id, latitude, longitude, date, time, parcel_num)" in the figure); the spatiotemporal cluster information for delivery, comprising the current designated delivery date and delivery time slot, the latitude and longitude of the center of the cluster that delivers the product, the radius of the cluster, the number of destinations the cluster can deliver to, and the number of destinations currently scheduled for delivery within the cluster ("time_space_cluster_info : [ (date,time,latitude,longitude,radius,max,num) , ….]" in the figure); the total man-hours required for delivery ("man_hour" in the figure); the total number of delivery staff required for delivery ("deivers_num" in the figure); the total number of clusters ("cluster_num" in the figure); the total waiting time, which is time required in connection with delivery ("idle_time" in the figure); the total delivery distance ("distance" in the figure); the difference between the designated delivery date and time and the optimized date and time ("difference" in the figure); and the incentive granted to the user when the designated delivery date is changed (for example, a number of points; "incentive : basic_point * action (0.0〜1.0)" in the figure). In this modification, information indicating the values other than "incentive : basic_point * action (0.0〜1.0)" is stored in the product information DB 2. A cluster is an example of a delivery area of a delivery company. The prediction device 1 acquires the information on each value over a fixed period from the product information DB 2 and, each time the time step advances by one, identifies the next day's values from the acquired information and updates each state variable with the identified values. The state variables are changed in this way, and the updated state variables are used in the next time step to calculate the total man-hours required for delivery.

FIG. 15C shows an example of the goal condition of the simulation in this modification. As shown in FIG. 15C, the total man-hours required for delivery ("man_hour_sum" in the figure) and the maximum man-hours that can be allocated to the delivery ("max_man_hour_sum" in the figure) are calculated over the most recent fixed period (for example, 100 days) up to the current time step. From these values, the utilization rate of the delivery staff over the period is calculated ("man_hour_rate = man_hour / max_man_hour_sum" in the figure). When the calculated utilization rate is less than or equal to a preset threshold ("if man_hour_rate <= man_hour_rate_threshold:" in the figure), the utilization rate condition is regarded as satisfied ("man_hour_rate_flag = True" in the figure). Likewise, when the value of the incentive ("incentive") for the period is less than or equal to a preset threshold ("if incentive <= incentive_threshold:" in the figure), the incentive condition is regarded as satisfied ("incentive_flag = True" in the figure). The goal condition is regarded as achieved when both conditions are satisfied ("done = man_hour_rate_flag & incentive_flag" in the figure).
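
The incentive and goal condition of this modification can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the period total man_hour_sum is the numerator of the utilization rate, whereas the figure text quotes "man_hour" there. The incentive formula follows "incentive : basic_point * action (0.0〜1.0)" as quoted from FIG. 15B.

```python
def incentive_points(basic_point: float, action: float) -> float:
    """Incentive granted to the user: basic_point * action."""
    if not 0.0 <= action <= 1.0:
        raise ValueError("action must be between 0.0 and 1.0")
    return basic_point * action


def delivery_goal_reached(
    man_hour_sum: float,
    max_man_hour_sum: float,
    incentive: float,
    man_hour_rate_threshold: float,
    incentive_threshold: float,
) -> bool:
    """Return True when both conditions quoted from FIG. 15C hold."""
    # Assumption: the utilization rate uses the period total as numerator,
    # and max_man_hour_sum is greater than zero.
    man_hour_rate = man_hour_sum / max_man_hour_sum

    man_hour_rate_flag = man_hour_rate <= man_hour_rate_threshold
    incentive_flag = incentive <= incentive_threshold

    # done = man_hour_rate_flag & incentive_flag
    return man_hour_rate_flag and incentive_flag
```

Both comparisons are upper bounds, so the goal rewards keeping the delivery staff utilization and the granted incentive low at the same time.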

In this modification, the control unit 11 generates a behavior model by reinforcement learning in which a reward, determined from the total man-hours required for delivery as calculated from the acquired product information, is calculated for each of a plurality of incentives, and runs a simulation using the generated behavior model. Specifically, the control unit 11 repeats the simulation while changing the value indicating the incentive ratio from its lower limit to its upper limit, for example increasing it from 0.0 to 1.0 in steps of 0.1. Here, the control unit 11 sets the number of time steps in one episode to the number of days in the period covered by the values acquired from the product information DB 2. For example, when values for the past 30 days are acquired from the product information DB 2, one episode has 30 time steps. The control unit 11 runs 100 episodes for each incentive ratio and, from the results, calculates for each incentive ratio the averages over the 100 episodes of the number of time steps until the goal condition is achieved and the utilization rate of the delivery staff. The control unit 11 outputs the calculation result by storing it in the storage unit 12, displaying it on the display unit 14, or transmitting it from the communication unit 15 to an external device.

The user can review the calculation result from the prediction device 1 and determine the size of the incentive granted to a user who changes the designated delivery date when ordering the simulated product on the EC site. Thus, according to Modification 3, the size of the incentive on the EC site is adjusted based on the result of the simulation by the prediction device 1. This balances the two goals of minimizing delivery cost and minimizing incentive cost while optimizing the size of the incentive granted when the EC site prompts the user to change the delivery date of the product.

1 Prediction device
11 Control unit
2 Product information DB

Claims

A prediction device comprising:
an acquisition unit that acquires product information on products sold by electronic commerce; and
a control unit that generates a third behavior model based on reinforcement learning of a first behavior model and a second behavior model, which calculate mutually different rewards for each of a plurality of values of a behavior variable permitted in an environment related to the distribution of the products, using the acquired product information and state variables of the environment, and that predicts an optimum value of the behavior variable in the environment using the third behavior model.

The prediction device according to claim 1, wherein
the behavior variable is the number of orders for the product placed by the seller of the product,
the product information acquired by the acquisition unit is information indicating changes over a fixed period in the number of orders, the number of shipments, and the number of arrivals of the product,
the state variables of the environment are the sales of the product, the purchase price of the product, the shortage cost related to shortages of the product, the sales promotion cost related to promotion of the product, the inventory cost related to excess inventory of the product, and the delivery cost related to delivery of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards for each of the numbers of orders corresponding to the plurality of values of the behavior variable, using the acquired product information and the state variables of the environment, and predicts an optimum number of orders using the third behavior model.

The prediction device according to claim 1, wherein
the behavior variable is the man-hours associated with storage of the product in a warehouse,
the product information acquired by the acquisition unit is information indicating the date on which a sales promotion of the product is carried out,
the state variable of the environment is a sales promotion cost determined according to the day of the week on which the sales promotion of the product is carried out, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards for each of the man-hours corresponding to the plurality of values of the behavior variable, using the acquired product information and the state variable of the environment, and predicts optimum man-hours using the third behavior model.

The prediction device according to claim 1, wherein
the behavior variable is a value indicating how readily the product is recommended on an EC site,
the product information acquired by the acquisition unit is information indicating the number of clicks and the number of impressions of the product on the EC site over a fixed period,
the state variables of the environment are the size of the product, the sales of the product, the purchase price of the product, and the inventory cost determined according to the size of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards for each of the values indicating how readily the product is recommended that correspond to the plurality of values of the behavior variable, using the acquired product information and the state variables of the environment, and predicts an optimum value indicating how readily the product is recommended using the third behavior model.

The prediction device according to claim 1, wherein
the behavior variable is an incentive granted to a user for changing the designated delivery date of the product on an EC site,
the product information acquired by the acquisition unit is information indicating the delivery destination and the delivery date and time of the product, the delivery area and the number of deliverable destinations of the delivery company for the product, the time required for delivery of the product, and the delivery distance in the delivery of the product, and
the control unit generates the third behavior model based on reinforcement learning of the first behavior model and the second behavior model, which calculate mutually different rewards for each of the incentives corresponding to the plurality of values of the behavior variable, using the acquired product information and the state variables of the environment, and predicts an optimum incentive using the third behavior model.