CN113592240B - MTO enterprise order processing method and system
Classifications
- G06Q 10/063: Operations research, analysis or management (administration; management; enterprise planning)
- G06N 3/045: Combinations of networks (neural network architectures)
- Y02P 90/30: Computing systems specially adapted for manufacturing
Abstract
The embodiment of the invention provides an MTO enterprise order processing method and system, comprising the following steps: when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory, used to determine an optimal policy, for the MTO enterprise's current order queue and the currently arriving order; converting the MDP-based order acceptance policy model according to the post-state concept of the reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the difficulty of solving the post-state-based MDP model through a learning process over a parameter vector of the reinforcement learning algorithm, obtaining a post-state-based MDP optimization model; and solving the post-state-based MDP optimization model with a three-layer artificial neural network to determine the optimal policy on whether to accept the currently arriving order. The modeling of dynamically arriving orders is more consistent with how orders actually arrive in practice.
Description
Technical Field
The invention relates to the field of order acceptance optimization, and in particular to an MTO enterprise order processing method and system.
Background
As the personalization of customer demand continues to rise, more and more enterprises adopt the make-to-order (MTO) production mode so as to reach end customers more directly and satisfy personalized demand as far as possible. In the MTO mode, an enterprise produces according to customer orders: different customers place different kinds of orders, and the MTO enterprise organizes production around the requirements stated in those orders. In general, the capacity of an MTO enterprise is limited and, once various cost factors are taken into account, the enterprise cannot accept every randomly arriving customer order, so the MTO enterprise must formulate a corresponding order acceptance policy. Studying how an MTO enterprise makes order selection decisions under limited resources therefore plays a major role in making full use of those resources and maximizing the enterprise's long-term profit.
In carrying out the present invention, the applicant found that the prior art has at least the following problem: the policy models are too simplified to approximate reality.
Disclosure of Invention
The embodiment of the invention provides an MTO enterprise order processing method and system in which the modeling of dynamically arriving orders is more consistent with how orders actually arrive.

In order to achieve the above objective, in one aspect, an embodiment of the present invention provides an MTO enterprise order processing method, including:

when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory for the MTO enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or to reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;

converting the MDP-based order acceptance policy model according to the post-state concept of the reinforcement learning algorithm, obtaining a post-state-based MDP model; reducing the difficulty of solving the post-state-based MDP model through a learning process over a parameter vector of the reinforcement learning algorithm, obtaining a post-state-based MDP optimization model;

and solving the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution, and determining from the solution the optimal policy on whether to accept the currently arriving order.
In another aspect, an embodiment of the present invention provides an MTO enterprise order processing system, including:
a model construction unit, configured to establish, after a current order arrives at a make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the MTO enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or to reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;

a model conversion unit, configured to convert the MDP-based order acceptance policy model according to the post-state concept of the reinforcement learning algorithm, obtaining a post-state-based MDP model;

a model optimization unit, configured to reduce the difficulty of solving the post-state-based MDP model through a learning process over a parameter vector of the reinforcement learning algorithm, obtaining a post-state-based MDP optimization model;

and a solving unit, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution, and to determine from the solution the optimal policy on whether to accept the currently arriving order.

The above technical solution has the following beneficial effect: the modeling of dynamically arriving orders is more consistent with how orders actually arrive in practice.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an MTO enterprise order processing method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an MTO enterprise order processing system according to an embodiment of the present invention;
FIG. 3 shows the three-layer neural network architecture;
FIG. 4 shows the sample learning efficiency;
FIG. 5 shows results under different unit production capacities;
FIG. 6 shows the average profit at different order arrival rates;
FIG. 7 shows the order acceptance rate at different order arrival rates;
FIG. 8 shows the average profit at different inventory costs;
FIG. 9 shows the order acceptance rate at different inventory costs;
FIG. 10 shows results for different customer priority factors.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, and not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided an MTO enterprise order processing method, including:
s101: when the current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current arriving order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the current arriving order represents an order received by the MTO corporation but not yet decided whether to accept; the policy is to accept the current arriving order or reject the current arriving order, and the optimal policy is that the benefit of the MTO enterprise is optimal when the policy is selected;
s102: converting the order receiving strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion;
s103: the difficulty in solving the MDP model based on the rear state is reduced through the learning process of the learning parameter vector of the reinforcement learning algorithm, and the MDP optimization model based on the rear state is obtained;
s104: and solving the MDP optimization model based on the rear state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether an optimal strategy is accepted for the current arriving order or not according to the solving result.
Preferably, step 101 specifically includes:
assuming that each order is not split, that an order is delivered to the customer in a single shipment after production is completed, and that once accepted by the MTO enterprise an order cannot be altered or canceled;

determining, from the current order queue and the currently arriving order, the information of each order in the queue and of the currently arriving order, the order information comprising: the customer priority μ, the unit product price pr, the demanded product quantity q, the lead time lt and the latest delivery period dt of the order; the orders in the current order queue are accepted orders, orders arrive according to a Poisson process with parameter λ, and the unit product price and the demanded quantity in an order each follow a uniform distribution;

determining, from the information of each order in the current order queue and of the currently arriving order, the return components faced by the enterprise, the return components comprising: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:

if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J is the rejection cost when customer priority is not considered;

if the currently arriving order is accepted, the profit I of the order is obtained, I = pr × q, while a production cost C is consumed, C = c × q, where c is the unit product production cost;

for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, its delivery time falls within the latest delivery period, and the MTO enterprise incurs a delay penalty cost Y, paid to the customers of the delayed orders and accrued per unit product and per unit time of delay beyond the lead time, where t is the production time still required for the orders already accepted, b is the unit production capacity of the MTO enterprise, and u is the delay penalty cost of the MTO enterprise per unit product per unit time;

if the product of an order in the current order queue is completed within the lead time, the product is temporarily stored in the MTO enterprise's warehouse, generating an inventory cost N accrued per unit product and per unit time of storage, where h is the inventory cost per unit product per unit time;
establishing, according to Markov decision process (MDP) theory and the information and return components of each order in the current order queue and of the currently arriving order, the MDP-based order acceptance policy model of the MTO enterprise as a quadruple (S, A, f, R), namely a state space S, an action space A, a state transition function f and a reward function R, wherein:

the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space is an n × 6-dimensional vector, where n is the number of order types and 6 corresponds to the 6 items of order information: the customer priority μ, the unit product price pr, the demanded quantity q, the lead time lt, the latest delivery period dt, and the production time t still required to complete the orders in the current order queue, where t has a preset maximum upper limit;

the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise must decide to accept or reject it, and the accept/reject actions form the action space A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it;
the state transition function f describes the transition from the state at the current decision time m to the state at the next decision time, a decision time being a time at which an action is taken on a currently arriving order; the state transition function f is obtained as follows:

assuming the order information μ, pr, q, lt, dt is independent and identically distributed, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density of the state at the next decision time m+1, given the state s and the action a already taken at decision time m, is written f(·|(s, a));

the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); t_{m+1}, however, is expressed as:

t_{m+1} = max(t_m + a_m × q_m / b - AT_{m→m+1}, 0)    (1)

formula (1) shows that t_{m+1} depends on (s_m, a_m): different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order inter-arrival time, where AT_{m→m+1} is the time interval between the arrivals of the two orders, which follows from each order arriving according to a Poisson process with parameter λ;

according to the current state s and action a, and the fact that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s' at the next decision time m+1 is obtained as:

f(s'|s, a) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|s, a)

where f_T(t'|s, a) is the density of the production time still required for the accepted orders at the next decision time m+1 after action a is taken in the current state s; the specific form of f_T(t'|s, a) is determined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at time m, the return obtained for the action taken on the currently arriving order is given by the reward function R, expressed as:

R(s_m, a_m) = I - C - Y - N, if a_m = 1;  R(s_m, a_m) = -μ × J, if a_m = 0

where a_m = 1 means the MTO enterprise accepts the currently arriving order, in which case the reward function R equals I - C - Y - N, and a_m = 0 means the MTO enterprise rejects the currently arriving order, in which case the reward function R equals -μ × J;

for any policy within the MDP-based order acceptance policy model, a corresponding value function is defined from the reward function, and the average long-term profit of the policy is represented by this value function, expressed as:

V^π(s) = E[ Σ_{m=0}^{n} γ^m × R(s_m, π(s_m)) | s_0 = s ]

where π is any policy, γ is the future reward discount with 0 < γ ≤ 1 (γ ensures that the summation term defined by this formula is meaningful), and n is the total number of decision times, each currently arriving order corresponding to one decision time;

the optimal policy π* is determined from the average long-term profit of the policies π; the optimal policy π* guarantees that the enterprise's long-term profit is maximized, i.e. the MTO enterprise's benefit is optimal; the optimal policy π* is expressed as:

π* = argmax_{π ∈ Π} V^π(s)

where Π is the set of all policies.
Preferably, step 102 specifically includes:
after the currently arriving order is received at time m and the decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action a_m is selected at decision time m; the post-state is an intermediate variable between two successive states;

according to the current state s_m and the action a_m, determining the post-state variable p_m, the post-state variable p_m being expressed as:

p_m = σ(s_m, a_m) = t_m + a_m × q_m / b

according to p_m, the production time t_{m+1} still required for the accepted orders at the next decision time m+1 is expressed as:

t_{m+1} = (p_m - AT)^+    (5)

wherein in formula (5), (x)^+ = max(x, 0) denotes the larger of x and 0, and AT is the inter-arrival time between two orders; the conditional probability density of the next decision time state s' = (μ', pr', q', lt', dt', t') is expressed as:

f(s'|p) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|p)    (6)

wherein the conditional probability density function f_T(·|p) is determined by formula (5) and the related random variables;

after setting the post-state variable, rewriting the conditional expectation E[·] in the MDP theory, and defining the post-state value function after the rewriting, the post-state value function being expressed as:

J*(p) = γ E[ V*(s') | p ]    (7)

constructing the optimal policy π* from the post-state value function, so that the optimal policy π* is converted into a one-dimensional state space, the optimal policy π* being expressed as:

π*(s) = argmax_{a ∈ A} { R(s, a) + J*(σ(s, a)) }    (8)

where σ(s, a) denotes the post-state reached from state s under action a.
preferably, step 103 specifically includes:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state value function J*, but J* is not computed directly when solving; instead, J* is solved by a learning process over a parameter vector; solving J* by learning a parameter vector specifically comprises:

determining, from a given parameter vector θ, the approximating function Ĵ(·; θ) of the post-state value function; and

performing parameter learning on the parameter vector θ from data samples with the reinforcement learning algorithm to obtain the learned parameter vector θ*, using the function Ĵ(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal policy π* from the approximation of J*.
Preferably, step 104 specifically includes:
solving J*(p) by a three-layer artificial neural network (ANN), which can approximate J*(p) to any precision; the approximating function Ĵ(p; θ) realized by the three-layer ANN is:

Ĵ(p; θ) = Σ_{i=1}^{N} u_i × Φ_H(w_i × p + α_i) + β    (11)

where the parameter vector can be expressed as:

θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β]

Φ_H(x) = 1 / (1 + e^{-x})

the function Ĵ(p; θ) of formula (11) is a three-layer single-input single-output neural network: it has a single-node input layer whose output represents the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th hidden node being the sum of the weighted post-state value w_i × p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), where Φ_H(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p; θ).
As shown in fig. 2, in connection with an embodiment of the present invention, there is provided an MTO enterprise order processing system comprising:
a model construction unit 21, configured to establish, after a current order arrives at a make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the MTO enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or to reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;

a model conversion unit 22, configured to convert the MDP-based order acceptance policy model according to the post-state concept of the reinforcement learning algorithm, obtaining a post-state-based MDP model;

a model optimization unit 23, configured to reduce the difficulty of solving the post-state-based MDP model through a learning process over a parameter vector of the reinforcement learning algorithm, obtaining a post-state-based MDP optimization model;

and a solving unit 24, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution, and to determine from the solution the optimal policy on whether to accept the currently arriving order.
Preferably, the model construction unit 21 is specifically configured to:
assuming that each order is not split, that an order is delivered to the customer in a single shipment after production is completed, and that once accepted by the MTO enterprise an order cannot be altered or canceled;

determining, from the current order queue and the currently arriving order, the information of each order in the queue and of the currently arriving order, the order information comprising: the customer priority μ, the unit product price pr, the demanded product quantity q, the lead time lt and the latest delivery period dt of the order; the orders in the current order queue are accepted orders, orders arrive according to a Poisson process with parameter λ, and the unit product price and the demanded quantity in an order each follow a uniform distribution;

determining, from the information of each order in the current order queue and of the currently arriving order, the return components faced by the enterprise, the return components comprising: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:

if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J is the rejection cost when customer priority is not considered;

if the currently arriving order is accepted, the profit I of the order is obtained, I = pr × q, while a production cost C is consumed, C = c × q, where c is the unit product production cost;

for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, its delivery time falls within the latest delivery period, and the MTO enterprise incurs a delay penalty cost Y, paid to the customers of the delayed orders and accrued per unit product and per unit time of delay beyond the lead time, where t is the production time still required for the orders already accepted, b is the unit production capacity of the MTO enterprise, and u is the delay penalty cost of the MTO enterprise per unit product per unit time;

if the product of an order in the current order queue is completed within the lead time, the product is temporarily stored in the MTO enterprise's warehouse, generating an inventory cost N accrued per unit product and per unit time of storage, where h is the inventory cost per unit product per unit time;
establishing, according to Markov decision process (MDP) theory and the information and return components of each order in the current order queue and of the currently arriving order, the MDP-based order acceptance policy model of the MTO enterprise as a quadruple (S, A, f, R), namely a state space S, an action space A, a state transition function f and a reward function R, wherein:

the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space is an n × 6-dimensional vector, where n is the number of order types and 6 corresponds to the 6 items of order information: the customer priority μ, the unit product price pr, the demanded quantity q, the lead time lt, the latest delivery period dt, and the production time t still required to complete the orders in the current order queue, where t has a preset maximum upper limit;

the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise must decide to accept or reject it, and the accept/reject actions form the action space A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it;
the state transition function f describes the transition from the state at the current decision time m to the state at the next decision time, a decision time being a time at which an action is taken on a currently arriving order; the state transition function f is obtained as follows:

assuming the order information μ, pr, q, lt, dt is independent and identically distributed, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density of the state at the next decision time m+1, given the state s and the action a already taken at decision time m, is written f(·|(s, a));

the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); t_{m+1}, however, is expressed as:

t_{m+1} = max(t_m + a_m × q_m / b - AT_{m→m+1}, 0)    (1)

formula (1) shows that t_{m+1} depends on (s_m, a_m): different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order inter-arrival time, where AT_{m→m+1} is the time interval between the arrivals of the two orders, which follows from each order arriving according to a Poisson process with parameter λ;

according to the current state s and action a, and the fact that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s' at the next decision time m+1 is obtained as:

f(s'|s, a) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|s, a)

where f_T(t'|s, a) is the density of the production time still required for the accepted orders at the next decision time m+1 after action a is taken in the current state s; the specific form of f_T(t'|s, a) is determined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at time m, the return obtained for the action taken on the currently arriving order is given by the reward function R, expressed as:

R(s_m, a_m) = I - C - Y - N, if a_m = 1;  R(s_m, a_m) = -μ × J, if a_m = 0

where a_m = 1 means the MTO enterprise accepts the currently arriving order, in which case the reward function R equals I - C - Y - N, and a_m = 0 means the MTO enterprise rejects the currently arriving order, in which case the reward function R equals -μ × J;

for any policy within the MDP-based order acceptance policy model, a corresponding value function is defined from the reward function, and the average long-term profit of the policy is represented by this value function, expressed as:

V^π(s) = E[ Σ_{m=0}^{n} γ^m × R(s_m, π(s_m)) | s_0 = s ]

where π is any policy, γ is the future reward discount with 0 < γ ≤ 1 (γ ensures that the summation term defined by this formula is meaningful), and n is the total number of decision times, each currently arriving order corresponding to one decision time;

the optimal policy π* is determined from the average long-term profit of the policies π; the optimal policy π* guarantees that the enterprise's long-term profit is maximized, i.e. the MTO enterprise's benefit is optimal; the optimal policy π* is expressed as:

π* = argmax_{π ∈ Π} V^π(s)

where Π is the set of all policies.
Preferably, the model transformation unit 22 is specifically configured to:
after the currently arriving order is received at time m and the decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action a_m is selected at decision time m; the post-state is an intermediate variable between two successive states;

according to the current state s_m and the action a_m, determining the post-state variable p_m, the post-state variable p_m being expressed as:

p_m = σ(s_m, a_m) = t_m + a_m × q_m / b

according to p_m, the production time t_{m+1} still required for the accepted orders at the next decision time m+1 is expressed as:

t_{m+1} = (p_m - AT)^+    (5)

wherein in formula (5), (x)^+ = max(x, 0) denotes the larger of x and 0, and AT is the inter-arrival time between two orders; the conditional probability density of the next decision time state s' = (μ', pr', q', lt', dt', t') is expressed as:

f(s'|p) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|p)    (6)

wherein the conditional probability density function f_T(·|p) is determined by formula (5) and the related random variables;

after setting the post-state variable, rewriting the conditional expectation E[·] in the MDP theory, and defining the post-state value function after the rewriting, the post-state value function being expressed as:

J*(p) = γ E[ V*(s') | p ]    (7)

constructing the optimal policy π* from the post-state value function, so that the optimal policy π* is converted into a one-dimensional state space, the optimal policy π* being expressed as:

π*(s) = argmax_{a ∈ A} { R(s, a) + J*(σ(s, a)) }    (8)

where σ(s, a) denotes the post-state reached from state s under action a.
preferably, the model optimization unit 23 is specifically configured to:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state value function J*, but J* is not computed directly when solving; instead, J* is solved by a learning process over a parameter vector; solving J* by learning a parameter vector specifically comprises:

determining, from a given parameter vector θ, the approximating function Ĵ(·; θ) of the post-state value function; and

performing parameter learning on the parameter vector θ from data samples with the reinforcement learning algorithm to obtain the learned parameter vector θ*, using the function Ĵ(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal policy π* from the approximation of J*.
Preferably, the solving unit 24 is specifically configured to:
solving J*(p) by a three-layer artificial neural network (ANN), which can approximate J*(p) to any precision; the approximating function Ĵ(p; θ) realized by the three-layer ANN is:

Ĵ(p; θ) = Σ_{i=1}^{N} u_i × Φ_H(w_i × p + α_i) + β    (11)

where the parameter vector can be expressed as:

θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β]

Φ_H(x) = 1 / (1 + e^{-x})

the function Ĵ(p; θ) of formula (11) is a three-layer single-input single-output neural network: it has a single-node input layer whose output represents the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th hidden node being the sum of the weighted post-state value w_i × p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), where Φ_H(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p; θ).
The foregoing technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples; for implementation details not described here, reference may be made to the preceding description.
The invention relates to an MTO enterprise order acceptance policy method and system based on post-state reinforcement learning, intended to address the incomplete modeling factors and high solution complexity found in existing research. A low-complexity solution algorithm combining the post-state with a neural network is proposed to solve the MTO enterprise order acceptance problem.
As the diversity of customer demand continues to rise, the make-to-order (MTO) mode, which organizes production around customer orders, is becoming increasingly important in enterprise production activities. Because enterprise production capacity is limited, an MTO enterprise must formulate a reasonable order acceptance policy, i.e. decide whether to accept an arriving order according to its production capacity and order state, so as to improve its production benefit.
Building on the traditional order acceptance problem, the invention proposes a more complete model of the MTO enterprise order acceptance problem: on top of the traditional model elements of delayed-delivery cost, rejection cost and production cost, it further considers order inventory cost and diverse customer priority factors, and models the optimal order acceptance problem as a Markov decision process (MDP). Furthermore, since classical MDP solution methods rely on solving and estimating a high-dimensional state value function, their computational complexity is high. To reduce this complexity, the invention proposes to replace the high-dimensional state value function by a one-dimensional post-state value function, combined with a neural network that approximates the post-state value function. Finally, the applicability and superiority of the proposed order acceptance policy model and algorithm are verified through simulation.
1. Description of the problem to be solved by the invention and model assumption
The present invention assumes an MTO enterprise with limited capacity that produces on a single production line. Suppose there are n types of customer orders on the market; the order-related information comprises the customer priority μ, the unit product price pr (pr being an abbreviation of price), the quantity q, the lead time lt, and the latest delivery period dt. The lead time lt is the agreed delivery time, i.e. the working period of order production under normal circumstances, from the start of production to its completion; when the enterprise cannot fully guarantee delivery within the agreed time, a preset period is appended after the agreed delivery time, and this extended deadline is the latest delivery period dt. Customer orders arrive according to a Poisson process with parameter λ. The unit price of the products within an order and the demanded quantity of the corresponding products each follow a uniform distribution.
When an order arrives, the enterprise must judge, according to its own production capacity, whether to accept it. If the order is rejected, a rejection cost μ × J is incurred; the higher the customer priority, the higher the rejection cost, where μ is the customer priority coefficient and J is the rejection cost when customer priority is not considered (J is also part of the order-related information).
If an order is accepted, the profit of the order is obtained, I = pr × q, while a production cost is consumed, C = c × q, where c is the unit production cost. The MTO enterprise produces accepted orders on a first-come-first-served basis. If an order cannot be delivered within the lead time requested by the customer, i.e. when t + q/b > lt, the enterprise must pay a delay penalty cost Y = μ × u × q × (t + q/b - lt), where t is the production time still required for the orders accepted before the current order, b is the enterprise's unit production capacity, and u is the delay penalty cost per unit product per unit time; the higher the customer priority, the higher the delay penalty cost. Customers do not pick up products produced before the lead time, i.e. when t + q/b < lt the products are temporarily stored in the MTO enterprise's warehouse, incurring an inventory cost N = h × q × (lt - t - q/b), where h is the inventory cost per unit product per unit time. Orders are not split for production, an order is delivered to the customer in a single shipment after production is completed, and once an order is accepted by the MTO enterprise the customer cannot change or cancel it.
The problem the invention aims to solve is: when customer orders arrive randomly, the MTO enterprise decides whether to accept the currently arriving order, taking into account the current order queue, delayed-delivery cost, rejection cost, production cost, inventory cost and diverse customer priority factors, so as to maximize the enterprise's long-term average profit.
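To make the assumptions above concrete, the following Python sketch samples an arriving order and evaluates the immediate return components, using the parameter values reported later in Section 5 as illustrative defaults. It is only a sketch: the exponential inter-arrival time follows from the Poisson arrival assumption, the relation producing the latest delivery period dt and the exact delay-penalty and inventory-cost expressions are assumed reconstructions of the verbal definitions, and all names are hypothetical.

```python
import random

def sample_order(lam=0.3, e1=30, l1=50, e2=300, l2=500, delta=36, beta=0.4, phi=0.8):
    """Sample one arriving order (defaults mirror the Section 5 experiment settings)."""
    mu = random.uniform(0.0, 1.0)          # customer priority coefficient, mu ~ U(0, 1]
    pr = random.uniform(e1, l1)            # unit product price,  pr ~ U(e1, l1)
    q = random.uniform(e2, l2)             # demanded quantity,   q  ~ U(e2, l2)
    lt = delta - beta * pr                 # lead time, linearly decreasing in price
    dt = int(lt / phi) + 1                 # latest delivery period (assumed form of the omitted relation)
    at = random.expovariate(lam)           # inter-arrival time of the Poisson arrival process
    return dict(mu=mu, pr=pr, q=q, lt=lt, dt=dt, at=at)

def return_components(order, t, a, b=20, c=15, J=200, u=4, h=4):
    """Immediate return of accepting (a=1) or rejecting (a=0) the arriving order, given the
    production time t still required for previously accepted orders.
    The delay-penalty and inventory-cost forms are assumed reconstructions of the text."""
    mu, pr, q, lt = order["mu"], order["pr"], order["q"], order["lt"]
    if a == 0:
        return -mu * J                                  # rejection cost
    I = pr * q                                          # order profit
    C = c * q                                           # production cost
    finish = t + q / b                                  # completion time of this order
    Y = mu * u * q * max(finish - lt, 0.0)              # delay penalty if completed after the lead time
    N = h * q * max(lt - finish, 0.0)                   # inventory cost if completed before the lead time
    return I - C - Y - N
```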
2. Order acceptance strategy modeling based on MDP theory
It can be seen that the MTO enterprise order acceptance decision problem is a class of stochastic sequential decision problems (also called multi-stage decision problems of stochastic systems). After the MTO enterprise's decision maker decides to accept or reject an order, the system state changes, but the evolution of the decision stages after the current one is not influenced by the states before it, i.e. the process has no after-effect. The problem can therefore be abstracted into an MDP model according to Markov decision process (MDP) theory. The MDP model is defined as a quadruple (S, A, f, R), representing the state space S, the action space A, the state transition function f and the reward function R:
1) System state: assuming there are n order types in the order acceptance system, the system state can be represented by the vector s = (μ, pr, q, lt, dt, t), where t is the production time still required to complete the accepted orders; since the MTO enterprise has limited capacity, t has a maximum upper limit.

2) System action set: at time m, when a customer order arrives, the MTO enterprise must make an accept-or-reject decision; the action set in the model is represented by the vector A = (a_1, a_2), where a_1 denotes accepting the order and a_2 denotes rejecting it. The actions in A apply only to the currently arriving order, not to the orders already in the queue.
3) State transition model: given an initial state s (the initial state representing the first order arriving at the order acceptance system) and an action a already taken, the probability density function of the next state is denoted f(·|(s, a)). Assuming the order information μ, pr, q, lt, dt is independent and identically distributed, each item can be represented by a probability density function; the densities corresponding to the distributions of μ, pr, q, lt, dt are f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x). The order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is therefore independent of (s_m, a_m), i.e. independent of the current state s_m and the action a_m taken in it. However, t_{m+1} depends on (s_m, a_m), because different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order inter-arrival time; that is, t_{m+1} can be expressed as:

t_{m+1} = max(t_m + a_m × q_m / b - AT_{m→m+1}, 0)    (1)

where AT_{m→m+1} is the time interval between the arrivals of the two orders, which follows from each order arriving according to a Poisson process with parameter λ.
Since the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), given the current state s and action a, the conditional probability density of the next decision time state s' can be obtained as:

f(s'|s, a) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|s, a)

where f_T(t'|s, a) denotes the density of the production time still required to complete the accepted orders at the next decision time after action a is taken in the current state s; its specific form is determined by (1) and the related random variables.
4) Reward function: at decision time m, after the decision whether to accept the order is made, the MTO enterprise obtains the immediate reward:

R(s_m, a_m) = I - C - Y - N, if a_m = 1 (accept the order);  R(s_m, a_m) = -μ × J, if a_m = 0 (reject the order)

where I is the profit of the order (unit product price × product quantity); C is the production cost; Y is the delay penalty cost (incurred if the order is completed after the lead time); and N is the inventory cost (if the order is completed before the lead time, the customer does not pick it up early, so an inventory cost is incurred).

In the current state, after the decision maker takes the corresponding action (rejecting or accepting the order), the reward function evaluates the quality of the action taken by the system.
5) Optimal policy: for the MTO enterprise order acceptance problem, the goal is to find an optimal order acceptance policy π* that maximizes the enterprise's long-term profit. Each policy π is a mapping from system states to actions that determines how the enterprise chooses whether to accept an order based on the current state information. For any policy π, its value function is defined as its average long-term profit (the value function represents the cumulative discounted reward obtained by following the policy from the current state onwards). The value function is:

V^π(s) = E[ Σ_{m=0}^{n} γ^m × R(s_m, π(s_m)) | s_0 = s ]    (2)

where 0 < γ ≤ 1 is the future reward discount (which guarantees that the summation defined in (2) is meaningful). Among all policies, we are interested in the optimal policy π*, defined as:

π* = argmax_{π ∈ Π} V^π(s)

where Π is the set of all policies.
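As a concrete illustration of the value function (2), the discounted long-term profit of an arbitrary policy can be estimated by Monte Carlo simulation of the order-arrival process. The sketch below reuses the illustrative sample_order and return_components helpers given earlier; the discount factor, horizon and episode count are placeholder values, and the policy interface is an assumption for the example.

```python
def estimate_value(policy, gamma=0.98, n_episodes=100, horizon=300, b=20):
    """Monte Carlo estimate of V^pi in formula (2); `policy` maps (order, t) to a in {0, 1}."""
    total = 0.0
    for _ in range(n_episodes):
        t, g, discount = 0.0, 0.0, 1.0                 # remaining work, return, running discount
        for _ in range(horizon):
            order = sample_order()
            t = max(t - order["at"], 0.0)              # production progresses until the next arrival
            a = policy(order, t)
            g += discount * return_components(order, t, a, b=b)
            if a == 1:
                t += order["q"] / b                    # the accepted order joins the workload
            discount *= gamma
        total += g
    return total / n_episodes

# Example policy in the FCFS spirit of Section 5: accept whenever production can
# finish within the latest delivery period (unit capacity b = 20 assumed).
fcfs_policy = lambda order, t: 1 if t + order["q"] / 20 <= order["dt"] else 0
```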
The standard MDP theory is introduced as follows:

In MDP theory, the optimal policy π* can be constructed from the state value function V*:

π*(s) = argmax_{a ∈ A} { R(s, a) + γ E[ V*(s') | s, a ] }    (3)

where E[·|s, a] denotes the expectation over the next random state s' given the current state s and action a. Meanwhile, V* is the solution of the Bellman equation:

V*(s) = max_{a ∈ A} { R(s, a) + γ E[ V*(s') | s, a ] }    (4)

and V* can be solved by the value iteration method.

It can be seen that classical MDP theory provides an optimal policy solution method based on the state value function V*. However, solving for the state value function V* is difficult, because the system state of the problem addressed by the invention is high-dimensional and continuous, so the computational complexity of value iteration, which relies on characterizing the expectation E[·], is unaffordable. To address this, the invention proposes an optimal policy construction method based on a post-state value function.
3. Post-state based MDP model conversion
The post-state is an intermediate variable between two successive states and can be used to simplify the optimal control of certain MDPs. The concept of the post-state is a technique in reinforcement learning that is often used in learning tasks for board games. For example, an agent playing chess with a reinforcement learning algorithm can control its own moves with certainty, while the opponent's moves appear random to it. Before deciding on an action, the agent faces a specific piece configuration on the board, which is the same as a state in the classical MDP model. The post-state of each of the agent's moves is defined as the configuration of the board after that move but before the opponent moves. If the agent can learn the winning probability of every distinct post-state, these known probabilities can be used to act optimally, i.e. simply select the action whose corresponding post-state has the greatest winning probability.
The present invention employs a similar post-state approach to transform the order acceptance problem. Specifically, in the order acceptance problem considered here, the post-state variable p_m is defined as the production time still required for the accepted orders after action a_m is selected at decision time m. Thus, given the current state s_m and the action a_m, the post-state can be expressed as:

p_m = σ(s_m, a_m) = t_m + a_m × q_m / b

It can be seen that, given p_m, the production time t_{m+1} still required at the next decision time m+1 can be expressed as:

t_{m+1} = (p_m - AT)^+    (5)

where (x)^+ = max(x, 0) denotes the larger of x and 0, and AT is the inter-arrival time between two orders. Therefore, given the current post-state p, the conditional probability density of the next decision time state s' = (μ', pr', q', lt', dt', t') can be expressed as:

f(s'|p) = f_M(μ') × f_PR(pr') × f_Q(q') × f_LT(lt') × f_DT(dt') × f_T(t'|p)    (6)

where the conditional probability density function f_T(·|p) is determined by (5) and the related random variables.
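A minimal sketch of the post-state mapping and the transition (5), using the same illustrative variable names as the earlier sketches:

```python
def post_state(t, q, a, b=20):
    """sigma(s, a): production time still required immediately after the accept/reject decision."""
    return t + (q / b if a == 1 else 0.0)

def next_remaining_time(p, arrival_interval):
    """Formula (5): t' = (p - AT)^+, the remaining production time at the next arrival."""
    return max(p - arrival_interval, 0.0)
```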
The conditional expectation E[V(s')|s, a] appearing in formulae (3) and (4) can therefore be rewritten as the conditional expectation E[V(s')|σ(s, a)] represented by (6), so that π* can be redefined as follows. First, the post-state value function is defined as:

J*(p) = γ E[ V*(s') | p ]    (7)

Substituting (7) into (3), the optimal policy π* can be constructed from the post-state value function J* as:

π*(s) = argmax_{a ∈ A} { R(s, a) + J*(σ(s, a)) }    (8)

Further, substituting (7) into (4) yields:

V*(s) = max_{a ∈ A} { R(s, a) + J*(σ(s, a)) }

and therefore:

J*(p) = γ E[ V*(s') | p ] = γ E[ max_{a ∈ A} { R(s', a) + γ E[ V*(s'') | s', a ] } | p ]

Since, by (7), γ E[ V*(s'') | s', a ] is exactly J*(σ(s', a)), it can be derived that:

J*(p) = γ E[ max_{a ∈ A} { R(s', a) + J*(σ(s', a)) } | p ]

Finally, we apply the value iteration algorithm of reinforcement learning to solve J* [19]: with J_0 an arbitrarily initialized function,

J_{k+1}(p) = γ E[ max_{a ∈ A} { R(s', a) + J_k(σ(s', a)) } | p ]    (9)

and as k → ∞, J_k converges to J*.

From equation (3), computing the optimal policy requires the expectation E[V*(s')|s, a], whereas the optimal policy in formula (8) involves no such expectation: it directly uses J* in place of E[V*(s')|s, a], reducing the high-dimensional state space to a one-dimensional one and thereby greatly lowering the solution complexity.
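The one-dimensional decision rule (8) can then be evaluated directly for each arriving order, given any post-state value function J (for example the neural-network approximation described in the next section). The following sketch reuses the illustrative return_components and post_state helpers given earlier.

```python
def decide(order, t, J_value, b=20):
    """Formula (8): choose a in {reject=0, accept=1} maximizing R(s, a) + J(sigma(s, a)).
    J_value is any callable approximating the post-state value function J*."""
    best_a, best_score = 0, float("-inf")
    for a in (0, 1):
        score = return_components(order, t, a, b=b) + J_value(post_state(t, order["q"], a, b=b))
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```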
4. Optimal control based on neural network
It has been shown above that π* can be constructed from J*, and that J* can be solved by the value iteration of formula (9). However, implementing (9) presents two difficulties: first, if f_M(·), f_PR(·), f_Q(·), f_LT(·), f_DT(·) and f_T(·|σ(s, a)) are not available, then E[·|p] cannot be computed; second, since the post-state varies continuously, each iteration of (9) would have to be computed over an infinite number of p values. Reinforcement learning provides an effective way to overcome both difficulties: it does not compute J* directly, but approximates J* by learning a parameter vector, and the learning process uses data samples. In other words, the design of the reinforcement learning (RL) algorithm comprises:

1) Parameterization: this determines how the function Ĵ(·; θ) is obtained from a given parameter vector θ, where θ is the approximation parameter vector of the post-state value function.

2) Parameter learning: the parameter vector θ* is learned from a batch of data samples, and Ĵ(·; θ*) is used to approximate J*, i.e. the resulting policy can be expressed as:

π̂*(s) = argmax_{a ∈ A} { R(s, a) + Ĵ(σ(s, a); θ*) }    (10)

Comparing (10) with (8), if Ĵ(·; θ*) approximates J*(p), then π̂* approximates the optimal policy π*.
4.1 neural network approximation
The universal approximation theorem shows that a three-layer artificial neural network (ANN) can approximate a continuous function to any accuracy, so an ANN is a good choice for the J*(p) to be solved in the invention. Ĵ can therefore be expressed with a neural network as:

Ĵ(p; θ) = Σ_{i=1}^{N} u_i × Φ_H(w_i × p + α_i) + β    (11)

where the parameter vector can be expressed as:

θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β],

Φ_H(x) = 1 / (1 + e^{-x}).

The function Ĵ(p; θ) in equation (11), shown in FIG. 3, is in fact a three-layer single-input single-output neural network. Specifically, there is a single-node input layer whose output represents the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i × p and the hidden-layer bias α_i. The input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function. Finally, the output layer has one node, whose output represents the finally approximated function value Ĵ(p; θ); its input is the sum of the weighted hidden-layer outputs and the output-layer bias β.
4.2 Iterative training of the three-layer artificial neural network (ANN)

In order to achieve optimal control (i.e. to obtain the optimal policy), the parameters of the three-layer artificial neural network are trained by a value iteration method. The specific training procedure is as follows.
1) Acquisition of training data: the invention requires a batch of training samples Γ = {(p_m, μ_m, pr_m, q_m, lt_m, dt_m, t_m)}, where for each m the values p_m, μ_m, pr_m, q_m, lt_m, dt_m are sampled, with p_m uniformly distributed and μ_m ~ f_M(·), pr_m ~ f_PR(·), q_m ~ f_Q(·), lt_m ~ f_LT(·), dt_m ~ f_DT(·). Further, t_m is generated from p_m as t_m = (p_m - AT)^+, where AT is the random order inter-arrival time determined by the Poisson arrival process with parameter λ.
2) Iterative fitting (see Algorithm 1 in Table 1): following (9), at the k-th iteration the current ANN parameter vector θ_k is held fixed, and the function Ĵ(·; θ_k) defined by (11) is the current value function; the updated value function is then:

J_{k+1}(p) = γ E[ max_{a ∈ A} { R(s', a) + Ĵ(σ(s', a); θ_k) } | p ]    (12)

The updated parameters should give a new function Ĵ(·; θ_{k+1}) that approaches J_{k+1}(p). Therefore, from Γ and θ_k, a batch of training data is constructed:

Y_k = { (p_m, o_m) }    (13)

where o_m is the target value of J_{k+1}(p_m) given Ĵ(·; θ_k), computed as in formula (12).

Based on the training data Y_k, the parameter update can be expressed as:

θ_{k+1} = argmin_θ L(θ | Y_k)    (14)

where L(θ | Y_k) is the training error:

L(θ | Y_k) = Σ_m ( Ĵ(p_m; θ) - o_m )²    (15)

3) Training parameters: the ANN parameters θ_{k+1} are solved by gradient descent so that Ĵ(·; θ_{k+1}) has minimum error on Y_k, as in formula (14). Gradient descent searches the parameter space iteratively: the initial parameter θ^(0) of the gradient iteration is set to θ_k, and the parameters are updated during the iteration as:

θ^(z+1) = θ^(z) - α ∇_θ L(θ^(z) | Y_k)    (16)

where α is the update step-size parameter and ∇_θ L(θ^(z) | Y_k) is the gradient at θ^(z) of the L defined in formula (15). Thus, given a sufficient number of iterations Z, θ_{k+1} = θ^(Z) is used as an approximate solution of (14). Finally, Ĵ_{k+1} is expressed as:

Ĵ_{k+1}(p) = Ĵ(p; θ_{k+1})
Table 1. Algorithm 1: approximating J*(p)
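The body of Table 1 is reproduced as an image in the source; the sketch below reconstructs the iterative fitting of Section 4.2 under stated assumptions: the training set Γ is drawn as in step 1), the expectation in (12) is replaced by the single sampled next state attached to each p_m, the gradient of the squared error (15) is written analytically for the ANN of (11), and the values of K, Z, N, γ and the post-state bound p_max are placeholders. It reuses the sample_order, return_components, post_state and ann_value helpers (and their imports) from the earlier sketches.

```python
def sample_training_set(M, p_max=50.0, lam=0.3):
    """Step 1): Gamma = {(p_m, order_m, t_m)} with p_m uniform on [0, p_max] (assumed bound),
    order fields drawn from their distributions, and t_m = (p_m - AT)^+."""
    samples = []
    for _ in range(M):
        p = random.uniform(0.0, p_max)
        order = sample_order(lam)
        t = max(p - order["at"], 0.0)
        samples.append((p, order, t))
    return samples

def fit_value_iteration(samples, K=50, N=20, Z=200, gamma=0.98, step=0.001, b=20):
    """Steps 2)-3): fitted value iteration for J*(p) with the three-layer ANN of (11)."""
    rng = np.random.default_rng(0)
    theta = (rng.normal(scale=0.1, size=N), rng.normal(scale=0.1, size=N),
             rng.normal(scale=0.1, size=N), 0.0)                  # (w, alpha, u, beta)
    P = np.array([p for p, _, _ in samples])

    def batch_forward(theta, p_vec):
        w, al, u, beta = theta
        h = 1.0 / (1.0 + np.exp(-(np.outer(p_vec, w) + al)))      # hidden activations, (M, N)
        return h @ u + beta, h

    for _ in range(K):
        # Targets o_m per (12), expectation replaced by the sampled next state of each p_m.
        o = np.array([gamma * max(return_components(order, t, a, b=b)
                                  + ann_value(post_state(t, order["q"], a, b=b), theta)
                                  for a in (0, 1))
                      for _, order, t in samples])
        for _ in range(Z):                                         # gradient descent on (15), update (16)
            w, al, u, beta = theta
            pred, h = batch_forward(theta, P)
            err = pred - o
            common = 2.0 * (err[:, None] * u) * h * (1.0 - h)
            theta = (w - step * (common * P[:, None]).sum(axis=0),
                     al - step * common.sum(axis=0),
                     u - step * 2.0 * (h.T @ err),
                     beta - step * 2.0 * err.sum())
    return theta
```

A trained result would then be plugged into the decision rule (8)/(10), for example decide(order, t, lambda p: ann_value(p, theta)) using the decide sketch from Section 3.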
In summary, the beneficial effects obtained by the invention are as follows:
1) Starting from the idea of revenue management, the invention considers the MTO enterprise order acceptance problem in a stochastic dynamic environment; besides the enterprise's production cost, delay penalty cost and rejection cost factors, it also considers the inventory cost of orders completed before the lead time and diverse customer priority factors, and on this basis constructs an MDP (Markov decision process) order acceptance model.

2) The invention converts the solution of the optimal policy in the traditional MDP by the post-state method, and shows that the optimal policy based on the state value function in the classical MDP problem can be equivalently defined and constructed using the post-state value function, converting a multi-dimensional control problem into a one-dimensional one and thereby greatly simplifying the solution process.

3) Traditional algorithms such as SARSA and SMART are tabular reinforcement learning methods and can only handle optimal decision problems over discrete state spaces. To learn the order acceptance policy over a continuous state space, the invention parameterizes the post-state value function with a neural network and designs a corresponding training algorithm, enabling estimation of the post-state value function and fast solution of the order acceptance policy.
5. Numerical simulation experiment
The order information required in the simulation is generated according to the following rules: the order price pr obeys the uniform distribution U(e_1, l_1); the order quantity q obeys the uniform distribution U(e_2, l_2); order arrivals obey a Poisson distribution with parameter λ; a linearly decreasing relationship exists between the order lead time lt and the order price, namely lt = δ − β*pr; and the latest acceptable delivery time dt is taken as an integer and is set to satisfy a relation in which φ is the early elasticity coefficient.
Here we choose pr ~ U[30, 50], q ~ U[300, 500], λ = 0.3, δ = 36, β = 0.4, φ = 0.8. The unit production capacity, unit production cost and rejection cost of the enterprise are b = 20, c = 15 and J = 200, respectively. The delay penalty cost per unit quantity per unit time is u = 4, the customer level obeys the uniform distribution μ ~ U(0, 1], and the inventory cost per unit quantity per unit time is h = 4. Finally, the initial learning rate of the algorithm is α = 0.001 and the exploration rate is ε = 0.1.
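For illustration, the order stream of this basic scenario could be generated by a sketch such as the following; the dictionary layout and the omission of the latest delivery time dt (whose rounding rule is given by the relation involving φ above) are simplifications of the sketch:

```python
import numpy as np

def generate_order(rng, lam=0.3, delta=36, beta=0.4):
    pr = rng.uniform(30, 50)            # unit product price, pr ~ U[30, 50]
    q = rng.uniform(300, 500)           # order quantity, q ~ U[300, 500]
    mu = 1.0 - rng.random()             # customer level in (0, 1]
    lt = delta - beta * pr              # lead time, lt = delta - beta * pr
    gap = rng.exponential(1.0 / lam)    # time since the previous arrival (Poisson arrivals)
    return {"mu": mu, "pr": pr, "q": q, "lt": lt, "arrival_gap": gap}

rng = np.random.default_rng(7)
orders = [generate_order(rng) for _ in range(1000)]   # a simulated order stream
```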
The simulation experiments are programmed in Python, and the effectiveness of the MTO enterprise order acceptance policy obtained by the proposed reinforcement-learning value iteration with neural network algorithm over the one-dimensional post-state space (the AFVINN algorithm) is analyzed.
The simulation experiment consists of two parts. In the first part, the learning efficiency of the AFVINN algorithm is first compared with that of the traditional Q-learning algorithm, the comparison being evaluated through sample utilization efficiency; secondly, following the comparison scheme of documents [13, 16], the long-term average profit and the order acceptance rate of the proposed algorithm are compared with those of the FCFS method, where the FCFS method accepts an arriving order directly if the enterprise is able to finish its production within the latest delivery period, and the order acceptance rate is the number of accepted orders divided by the total number of arrived orders. In the second part, the influence of the AFVINN algorithm on the average profit and order acceptance rate of the MTO enterprise is first examined under the two scenarios of considering and not considering inventory cost; secondly, the influence of the AFVINN algorithm on the average profit and order acceptance rate of the MTO enterprise is analyzed by adjusting the delay penalty cost and rejection cost factors related to customer priority; finally, the influence of the AFVINN algorithm on the average profit of the MTO enterprise is examined under the three scenarios of multiple customer priorities, three customer priorities, and no customer priority.
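For illustration, the FCFS benchmark just described can be expressed as the small rule below; the precise capacity check (remaining production time t plus the new order's processing time q/b compared against the latest delivery date dt) is an assumption consistent with the model notation rather than a quotation of the original text:

```python
def fcfs_accept(t, q, dt, b=20):
    """FCFS benchmark: accept an arriving order iff the enterprise can finish
    it within the latest delivery period, i.e. the remaining production time t
    of accepted orders plus the new order's processing time q / b fits in dt."""
    return t + q / b <= dt
```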
5.1 Algorithm comparison
Prior research on reinforcement-learning order acceptance policies for MTO enterprises models and solves the problem in the traditional multidimensional state space. In contrast, the invention compares the learning efficiency of the AFVINN algorithm with that of the traditional Q-learning algorithm, where each iteration of the comparison consumes 200 data samples and learning efficiency is evaluated by the number of data samples consumed.
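For context, the tabular Q-learning baseline mentioned here relies on the standard update sketched below; the discretised state key s, the step size and the discount value are illustrative choices of this sketch, not parameters reported by the invention:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: Q(s, a) <- Q(s, a) + alpha * (TD error).
    Q is a dict keyed by (state, action); states must be discretised,
    which is the limitation the AFVINN algorithm is designed to avoid."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```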
As can be seen from fig. 4: (1) after enough data samples, the AFVINN algorithm and the traditional Q-learning algorithm converge to the same value; (2) the learning efficiency of the AFVINN algorithm is far higher than that of the traditional Q-learning algorithm, by a factor of roughly 1000. Therefore, the AFVINN algorithm proposed by the invention not only converts the high-dimensional control problem into a one-dimensional control problem and simplifies the solving process, but also keeps the long-term average profit of the MTO enterprise at a high level.
Table 2 shows that: (1) the AFVINN algorithm is superior to the FCFS method in maximizing the long-term average profit of the MTO enterprise; (2) the AFVINN algorithm maintains a higher average profit even though its order acceptance rate is lower than that of the FCFS method. Therefore, on the order acceptance policy, the AFVINN algorithm accepts higher-profit orders with higher probability, so as to maximize the enterprise's long-term average profit.
TABLE 2 basic scenario
AFVINN algorithm | FCFS method |
---|---|
Average profit: 367.2365 | Average profit: 309.4537 |
Order acceptance rate: 0.1324 | Order acceptance rate: 0.2698 |
Production capacity is critical to the profitability of an MTO enterprise. The unit production capacity of the MTO enterprise is varied while the other parameters are kept the same as in the basic scenario, in order to observe the changes of the AFVINN algorithm and the FCFS method in the MTO enterprise order acceptance policy.
As shown in fig. 5: (1) the AFVINN algorithm always maintains a higher profit level under different enterprise production capacities; (2) when the unit production capacity of the enterprise is reduced, the average profit under the AFVINN algorithm and the FCFS method decreases by 38.667% and 41.857%, respectively, whereas when the unit production capacity increases from 20 to 35, the average profit under the AFVINN algorithm and the FCFS method increases by 128.6104% and 122.9773%, respectively. Therefore, the AFVINN algorithm makes reasonable use of the enterprise's limited resources, creating higher profit for the enterprise and adapting better under resource constraints.
The order arrival rate is also an important factor in the MTO enterprise's order acceptance decision. The order arrival rate is varied while the other parameters are kept the same as in the basic scenario, to observe the changes of the AFVINN algorithm (the former column of each group in fig. 6 and fig. 7) and the FCFS method (the latter column of each group in fig. 6 and fig. 7) in the MTO enterprise order acceptance policy. As can be seen from fig. 6 and fig. 7: (1) the order acceptance rate increases when λ decreases and decreases when λ increases; this is because as the number of orders arriving per unit time increases, i.e. the interval between two order arrivals decreases, the slack available to the MTO enterprise for scheduling accepted orders shrinks, so the probability that an accepted order can be completed within the latest delivery deadline falls, resulting in a lower order acceptance rate. (2) The order acceptance rate under the AFVINN algorithm is lower than that of the FCFS method, but its average profit is higher. It follows that the AFVINN algorithm better accommodates the uncertainty in customer order arrivals.
5.2 model comparison
Neither of the prior documents [15-16] considers inventory cost when modeling and solving the order acceptance problem with reinforcement learning algorithms. This section compares the AFVINN order acceptance policy with and without inventory cost, where the former includes inventory cost in the modeling and solving of the order acceptance problem and the latter ignores it. As can be seen from fig. 8 and fig. 9: (1) the order acceptance rate without inventory cost is higher than that with inventory cost, but the average profit with the inventory cost factor is always higher than that without it; (2) with other factors unchanged, the enterprise's order acceptance policy under the inventory-cost factor changes as the inventory cost changes, whereas the policy that ignores inventory cost is unaffected by such changes; (3) as the inventory cost keeps increasing, the average profit of the enterprise that considers inventory cost declines more slowly than that of the enterprise that does not. Therefore, taking inventory cost into account when modeling and solving the MTO enterprise order acceptance problem allows the enterprise to make different order acceptance decisions for different inventory costs, so as to maximize its long-term average profit; in practice, inventory cost affects enterprise profit and ties up enterprise funds, so this factor cannot be ignored in the order acceptance process.
Most prior documents consider only order features and assume that all customers are equally important. Although document [16] models and solves the problem with reinforcement learning and involves a customer priority factor, customer priority is divided into only three classes, whereas in real life there are many kinds of customer priorities. In the experiments of this section, based on the basic scenario, the unit delay penalty cost is first varied while the rejection cost is held constant, and then the rejection cost is varied while the unit delay penalty cost is held constant.
TABLE 3 Varying the unit delay penalty cost and the rejection cost
As can be seen from fig. 10 and Table 3: (1) the long-term average profit of the enterprise that considers multiple customer priorities is greater than that of the enterprise that considers three customer priorities or no customer priority; (2) under the AFVINN algorithm, the acceptance rate of orders from customers with level greater than or equal to 0.5 decreases as the delay penalty cost increases, while the acceptance rate of orders from customers with level less than 0.5 increases with the delay penalty cost; (3) when the rejection cost increases, i.e. its influence on MTO enterprise profit grows, the AFVINN algorithm raises the acceptance rate of orders from customers with level greater than or equal to 0.5 and lowers the acceptance rate of orders from customers with level less than 0.5.
Therefore, when the delay penalty cost is high, the enterprise has to pay a high cost if it accepts a high-priority customer's order and fails to complete production within the scheduled period, so the acceptance rate of high-priority customer orders decreases as the delay penalty cost increases; when the rejection cost is high, rejecting a high-priority customer's order entails a higher cost, so the acceptance rate of high-priority customer orders increases with the rejection cost. Thus, when the delay penalty cost is larger, the enterprise can increase acceptance of orders with lower customer priority while appropriately reducing orders with higher customer priority; and when the rejection cost is larger, the enterprise can increase acceptance of high-priority customers' orders. The AFVINN algorithm therefore adjusts the order acceptance policy in time when facing different delay penalty and rejection costs, minimizing their impact on the MTO enterprise's average profit and maximizing the enterprise's long-term average profit.
On the basis of the factors considered in the conventional MTO enterprise order acceptance problem, the invention adds the order inventory cost and multiple customer priority factors, constructs a Markov decision process order acceptance model, and applies the AFVINN algorithm to solve it; the algorithm not only converts the multidimensional state space of the MTO enterprise order acceptance problem into a one-dimensional state space and simplifies the solving process, but also keeps the enterprise's long-term average profit at a high level.
Simulation experiments show that, in the MTO enterprise order acceptance problem, customer priority and inventory cost factors are important to the enterprise's order acceptance policy and profit. Compared with the traditional Q-learning algorithm, the AFVINN algorithm converts the high-dimensional control problem into a one-dimensional control problem, improves sample utilization efficiency, and simplifies the solving process. The AFVINN algorithm is superior to the FCFS method in maximizing the enterprise's long-term average profit, has a stronger ability to select orders, adapts better to environmental changes, and can balance order profit against the various cost factors to bring higher profit to the MTO enterprise. The invention models orders as arriving dynamically (order information cannot be acquired in advance, and there is uncertainty in the currently arriving order), which better matches the actual behaviour of orders in reality; the modeling and solving are more comprehensive in the factors considered, such as inventory cost and customer priority; and reducing the state space dimension lowers the computational difficulty of solving the model.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, the claimed subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of the various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising" as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a "non-exclusive or".
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention; it is not meant to limit the scope of the invention or to confine the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. made within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (6)
1. An MTO enterprise order processing method, comprising:
when the current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current arriving order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the current arriving order represents an order received by the MTO corporation but not yet decided whether to accept; the policy is to accept the current arriving order or reject the current arriving order, and the optimal policy is that the benefit of the MTO enterprise is optimal when the policy is selected;
converting the order receiving strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion; the difficulty in solving the MDP model based on the rear state is reduced through the learning process of the learning parameter vector of the reinforcement learning algorithm, and the MDP optimization model based on the rear state is obtained;
solving the MDP optimization model based on the rear state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether an optimal strategy is accepted for the current arriving order or not according to the solving result;
After the current order arrives at the order-oriented production MTO enterprise, establishing an order acceptance strategy model based on a Markov decision process MDP theory aiming at a current order queue and the current arriving order of the MTO enterprise, wherein the order acceptance strategy model specifically comprises the following steps:
assume that each order is not split, that the order is issued once to the customer after the order is completed, and that the order cannot be altered or canceled once accepted by the MTO corporation;
determining information of each order in the current order queue and of the current arriving order according to the current order queue and the current arriving order, wherein the order information comprises: the customer priority μ, the unit product price pr, the product required quantity q, the lead time lt and the latest delivery period dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals obey a Poisson distribution with parameter λ, and the unit product price and the product required quantity of an order respectively obey uniform distributions;
determining the return sub-items faced according to the information of each order in the current order queue and of the current arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the current arriving order, the profit of accepting the current arriving order, the delay penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the current arriving order is rejected, a rejection cost is incurred: μ*J, where J represents the rejection cost when customer priority is not considered;
if the current arriving order is accepted, the profit I of the order is obtained: I = pr*q, while the production cost C is consumed: C = c*q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise carries out production according to the first-come-first-served principle; if a delayed order exists in the current order queue, the delivery time of the delayed order lies within the latest delivery period, and the MTO enterprise incurs a delay penalty cost Y paid to the corresponding customer of the delayed order: where t represents the production time still required for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the delay penalty cost of the MTO enterprise per unit product per unit time;
if the product of an order in the current order queue is produced to completion within the lead time, the product is temporarily stored in the MTO enterprise warehouse, thereby generating an inventory cost N: where h represents the inventory cost per unit product per unit time;
according to the MDP theory of a Markov decision process, and information and return sub-items of each order and a current arriving order in a current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a quadruple (S, A, f, R), and the quadruple is a state space S, an action space A, a state transfer function f and a reward function R respectively, wherein:
The state space S represents the state of a system where the MTO enterprise order processing method is located; the state space S is an nx 6-dimensional vector, where n represents the number of order types, and 6 represents 6 order information: customer priority mu, unit product price pr, product demand quantity q, lead period lt, latest delivery period dt and production completion time t still needed by orders in the current order queue, wherein t has a preset maximum upper limit value;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision to accept or reject the order, and the accept/reject actions are aggregated into the action space A, A = (a_1, a_2), where a_1 represents accepting the order and a_2 represents rejecting the order;
the state transfer function f represents the transfer from the current state to the state at the m decision time, the decision time being the time at which an action is taken on the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt are all independent and identically distributed, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density function f(·|(s, a)) of the state at the next decision instant m+1 is expressed on the basis of the initial state s and the action a already taken at the m decision instant;
and the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision instant m+1 is independent of (s_m, a_m); where t_{m+1} is expressed as:
formula (1) indicates that t_{m+1} depends on (s_m, a_m), and that different (q_m, t_m, a_m) lead to different order production times; t_{m+1} is also affected by the order arrival interval, where AT_{m→m+1} represents the inter-arrival interval between two orders, order arrivals obeying a Poisson distribution with parameter λ;
according to the current state s and action a, and to the independence of the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} from (s_m, a_m), the conditional probability density of the state s′ at the next decision instant m+1 is obtained; the conditional probability density of the state s′ at the next decision instant m+1 is expressed as:
f(s′|s,a) = f_M(μ′)*f_PR(pr′)*f_Q(q′)*f_LT(lt′)*f_DT(dt′)*f_T(t′|s,a),
where f_T(t′|s, a) represents the production time still needed for the accepted orders at the next decision instant m+1 after action a is taken in the current state s; the specific form of f_T(t′|s, a) is defined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at m, the corresponding return obtained after the action taken on the current arriving order is represented by a reward function R, which is represented as:
where a_m = 1 means that the MTO enterprise accepts the current arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the current arriving order, in which case the reward function R is −μ*J;
for any policy within the MDP theory-based order acceptance policy model, defining a corresponding cost function according to a reward function, and representing average long-term profits corresponding to the policy through the cost function, wherein the cost function is expressed as:
where π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (γ is set so that the summation defined by the formula is meaningful), n represents the total number of decision instants, and each current order corresponds to one decision instant;
determining the optimal policy π* from the average long-term profit of any policy π, the optimal policy π* being used to guarantee the maximization of the enterprise's long-term profit, at which point the benefit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
where Π represents the set of all policies;
the method comprises the steps of converting an order acceptance strategy model based on MDP theory according to a post-state theory of a reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion, and specifically comprises the following steps:
after the current arriving order is received at time m and a decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm, the post-state variable representing the production time still required by the accepted orders after action a_m is selected at the m decision instant; wherein the post-state is an intermediate variable between two successive states;
determining the post-state variable p_m according to the current state s_m and action a_m, the post-state variable p_m being expressed as:
according to p_m, the production time t_{m+1} still needed for the accepted orders at the next decision instant m+1 is expressed as:
wherein in formula (5), (x)^+ represents the greater of the variable x and 0, and AT is the inter-arrival interval between two orders; based on the current state and the post-state, the conditional probability density of the next decision instant state s′ = (μ′, pr′, q′, lt′, dt′, t′) is expressed as:
where the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the conditional expectation E[·] is rewritten; the post-state cost function is expressed as:
J*(p) = γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby converting the optimal policy π* into a one-dimensional state space; the optimal policy π* is expressed as:
2. The MTO enterprise order processing method of claim 1, wherein the learning process of the learning parameter vector through the reinforcement learning algorithm reduces the solving difficulty of the post-state-based MDP model, and obtains the post-state-based MDP optimization model, and specifically comprises:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state value function J*; when solving, J* is not computed directly, but is instead solved by a learning process over a learned parameter vector; solving J* by learning a parameter vector is specifically as follows:
defining, from a given parameter vector θ, a parameterized approximating function; and
performing parameter learning on the parameter vector θ from the data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function determined by the learned parameter vector θ* to approximate J*; the optimal policy π* is determined according to the approximated J*.
3. The MTO enterprise order processing method of claim 2, wherein the adopting the three-layer artificial neural network to solve the MDP optimization model based on the post-state to obtain a solution result, and determining whether to accept the optimal policy for the current arriving order according to the solution result specifically comprises:
solving J*(p) through a three-layer artificial neural network ANN, which approximates J*(p) to arbitrary precision; the three-layer artificial neural network ANN adopted is as follows:
wherein the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β],
Φ_H(x) = 1/(1 + e^(−x))
the function in formula (11) is a three-layer single-input single-output neural network, which has only one single-node input layer whose output represents the value of the post-state p, and one hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i*p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is represented by the function Φ_H(·), where Φ_H(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the final approximated function value.
4. An MTO enterprise order processing system, comprising:
the system comprises a model construction unit, wherein the model construction unit is used for establishing, after a current order arrives at the order-oriented production MTO enterprise, an order acceptance policy model based on Markov decision process MDP theory for the current order queue of the MTO enterprise and the current arriving order, and the order acceptance policy model based on the MDP theory is used for determining an optimal policy; wherein the current arriving order represents an order received by the MTO enterprise but for which it has not yet been decided whether to accept; the policy is to accept the current arriving order or reject the current arriving order, and the optimal policy is the policy whose selection makes the benefit of the MTO enterprise optimal;
the model conversion unit is used for converting the order receiving strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining an MDP model based on the post-state after conversion;
The model optimization unit is used for reducing the solving difficulty of the MDP model based on the rear state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain the MDP optimization model based on the rear state;
the solving unit is used for solving the MDP optimization model based on the rear state by adopting the three-layer artificial neural network to obtain a solving result, and determining whether an optimal strategy is accepted for the current arriving order or not according to the solving result;
the model construction unit is specifically configured to:
assume that each order is not split, that the order is issued once to the customer after the order is completed, and that the order cannot be altered or canceled once accepted by the MTO corporation;
determining information of each order in the current order queue and of the current arriving order according to the current order queue and the current arriving order, wherein the order information comprises: the customer priority μ, the unit product price pr, the product required quantity q, the lead time lt and the latest delivery period dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals obey a Poisson distribution with parameter λ, and the unit product price and the product required quantity of an order respectively obey uniform distributions;
determining the return sub-items faced according to the information of each order in the current order queue and of the current arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the current arriving order, the profit of accepting the current arriving order, the delay penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the current arriving order is rejected, a rejection cost is incurred: μ*J, where J represents the rejection cost when customer priority is not considered;
if the current arriving order is accepted, the profit I of the order is obtained: I = pr*q, while the production cost C is consumed: C = c*q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise carries out production according to the first-come-first-served principle; if a delayed order exists in the current order queue, the delivery time of the delayed order lies within the latest delivery period, and the MTO enterprise incurs a delay penalty cost Y paid to the corresponding customer of the delayed order: where t represents the production time still required for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the delay penalty cost of the MTO enterprise per unit product per unit time;
if the product of an order in the current order queue is produced to completion within the lead time, the product is temporarily stored in the MTO enterprise warehouse, thereby generating an inventory cost N: where h represents the inventory cost per unit product per unit time;
according to the MDP theory of a Markov decision process, and information and return sub-items of each order and a current arriving order in a current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a quadruple (S, A, f, R), and the quadruple is a state space S, an action space A, a state transfer function f and a reward function R respectively, wherein:
The state space S represents the state of a system where the MTO enterprise order processing method is located; the state space S is an nx 6-dimensional vector, where n represents the number of order types, and 6 represents 6 order information: customer priority mu, unit product price pr, product demand quantity q, lead period lt, latest delivery period dt and production completion time t still needed by orders in the current order queue, wherein t has a preset maximum upper limit value;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision to accept or reject the order, and the accept/reject actions are aggregated into the action space A, A = (a_1, a_2), where a_1 represents accepting the order and a_2 represents rejecting the order;
the state transfer function f represents the transfer from the current state to the state at the m decision time, the decision time being the time at which an action is taken on the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt are all independent and identically distributed, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density function f(·|(s, a)) of the state at the next decision instant m+1 is expressed on the basis of the initial state s and the action a already taken at the m decision instant;
and the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision instant m+1 is independent of (s_m, a_m); where t_{m+1} is expressed as:
formula (1) indicates that t_{m+1} depends on (s_m, a_m), and that different (q_m, t_m, a_m) lead to different order production times; t_{m+1} is also affected by the order arrival interval, where AT_{m→m+1} represents the inter-arrival interval between two orders, order arrivals obeying a Poisson distribution with parameter λ;
according to the current state s and action a, and to the independence of the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} from (s_m, a_m), the conditional probability density of the state s′ at the next decision instant m+1 is obtained; the conditional probability density of the state s′ at the next decision instant m+1 is expressed as:
f(s′|s,a) = f_M(μ′)*f_PR(pr′)*f_Q(q′)*f_LT(lt′)*f_DT(dt′)*f_T(t′|s,a)
where f_T(t′|s, a) represents the production time still needed for the accepted orders at the next decision instant m+1 after action a is taken in the current state s; the specific form of f_T(t′|s, a) is defined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at m, the corresponding return obtained after the action taken on the current arriving order is represented by a reward function R, which is represented as:
where a_m = 1 means that the MTO enterprise accepts the current arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the current arriving order, in which case the reward function R is −μ*J;
for any policy within the MDP theory-based order acceptance policy model, defining a corresponding cost function according to a reward function, and representing average long-term profits corresponding to the policy through the cost function, wherein the cost function is expressed as:
where π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (γ is set so that the summation defined by the formula is meaningful), n represents the total number of decision instants, and each current order corresponds to one decision instant;
determining the optimal policy π* from the average long-term profit of any policy π, the optimal policy π* being used to guarantee the maximization of the enterprise's long-term profit, at which point the benefit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
where Π represents the set of all policies;
the model conversion unit is specifically used for:
after the current arriving order is received at time m and a decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm, the post-state variable representing the production time still required by the accepted orders after action a_m is selected at the m decision instant; wherein the post-state is an intermediate variable between two successive states;
determining the post-state variable p_m according to the current state s_m and action a_m, the post-state variable p_m being expressed as:
according to p_m, the production time t_{m+1} still needed for the accepted orders at the next decision instant m+1 is expressed as:
wherein in formula (5), (x)^+ represents the greater of the variable x and 0, and AT is the inter-arrival interval between two orders; the conditional probability density of the next decision instant state s′ = (μ′, pr′, q′, lt′, dt′, t′) is expressed as:
where the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the conditional expectation E[·] is rewritten; the post-state cost function is expressed as:
J*(p) = γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby converting the optimal policy π* into a one-dimensional state space; the optimal policy π* is expressed as:
5. The MTO enterprise order processing system of claim 4, wherein the model optimization unit is specifically configured to:
construct, through the reinforcement learning algorithm, the optimal policy π* from the post-state value function J*; when solving, J* is not computed directly, but is instead solved by a learning process over a learned parameter vector; solving J* by learning a parameter vector is specifically as follows:
defining, from a given parameter vector θ, a parameterized approximating function; and
performing parameter learning on the parameter vector θ from the data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function determined by the learned parameter vector θ* to approximate J*; the optimal policy π* is determined according to the approximated J*.
6. The MTO enterprise order processing system of claim 5, wherein the solving unit is specifically configured to:
solving J*(p) through a three-layer artificial neural network ANN, which approximates J*(p) to arbitrary precision; the three-layer artificial neural network ANN adopted is as follows:
wherein the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β]
Φ_H(x) = 1/(1 + e^(−x))
the function in formula (11) is a three-layer single-input single-output neural network, which has only one single-node input layer whose output represents the value of the post-state p, and one hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i*p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is represented by the function Φ_H(·), where Φ_H(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the final approximated function value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110749378.1A CN113592240B (en) | 2021-07-02 | 2021-07-02 | MTO enterprise order processing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110749378.1A CN113592240B (en) | 2021-07-02 | 2021-07-02 | MTO enterprise order processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113592240A CN113592240A (en) | 2021-11-02 |
CN113592240B true CN113592240B (en) | 2023-10-13 |
Family
ID=78245474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110749378.1A Active CN113592240B (en) | 2021-07-02 | 2021-07-02 | MTO enterprise order processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113592240B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112990584B (en) * | 2021-03-19 | 2022-08-02 | 山东大学 | Automatic production decision system and method based on deep reinforcement learning |
CN117421705B (en) * | 2023-11-02 | 2024-06-14 | 升励五金(深圳)有限公司 | Information analysis method and system applied to intelligent production |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111080408A (en) * | 2019-12-06 | 2020-04-28 | 广东工业大学 | Order information processing method based on deep reinforcement learning |
CN111126905A (en) * | 2019-12-16 | 2020-05-08 | 武汉理工大学 | Casting enterprise raw material inventory management control method based on Markov decision theory |
CN112149987A (en) * | 2020-09-17 | 2020-12-29 | 清华大学 | Multi-target flexible job shop scheduling method and device based on deep reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN113592240A (en) | 2021-11-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||