CN113592240A - Order processing method and system for MTO enterprise - Google Patents

Order processing method and system for MTO enterprise

Info

Publication number
CN113592240A
Authority
CN
China
Prior art keywords
order
state
current
enterprise
mto
Prior art date
Legal status
Granted
Application number
CN202110749378.1A
Other languages
Chinese (zh)
Other versions
CN113592240B (en)
Inventor
吴克宇
钱静
胡星辰
陈超
成清
程光权
冯旸赫
杜航
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110749378.1A
Publication of CN113592240A
Application granted
Publication of CN113592240B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063: Operations research, analysis or management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Abstract

An embodiment of the invention provides an MTO enterprise order processing method and system, comprising the following steps: when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory, used to determine an optimal policy, from the current order queue of the MTO enterprise and the currently arriving order; converting the MDP-theory-based order acceptance policy model according to the post-state theory of a reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model; and solving the post-state-based MDP optimization model with a three-layer artificial neural network to determine whether the currently arriving order is accepted. The modeling of dynamically arriving orders is more consistent with the actual order state.

Description

Order processing method and system for MTO enterprise
Technical Field
The invention relates to the field of order acceptance optimization, in particular to an order processing method and system for an MTO enterprise.
Background
As customer demand becomes increasingly personalized, more and more enterprises are adopting the make-to-order (MTO) model, which lets them face and serve end users directly and satisfy personalized demand as far as possible. In the MTO model, an enterprise produces according to customer orders: different customers have different requirements for order types, and the MTO enterprise organizes production according to the order requirements provided by the customers. In general, the capacity of an MTO enterprise is limited, and because of various cost factors the enterprise may not be able to accept every randomly arriving customer order, so the MTO enterprise must formulate a corresponding order acceptance policy. Studying how an MTO enterprise makes order selection decisions with limited resources therefore plays a great role in making full use of those limited resources and maximizing the enterprise's long-term profit.
In the process of implementing the invention, the applicant found that the prior art has at least the following problem: the policy models are too simplified to approximate the real situation.
Disclosure of Invention
Embodiments of the invention provide an order processing method and system for an MTO enterprise in which the modeling of dynamically arriving orders is more consistent with the actual order state.
To achieve the above object, in one aspect, an embodiment of the present invention provides an MTO enterprise order processing method, including:
when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
converting the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and solving the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determining, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
In another aspect, an embodiment of the present invention provides an MTO enterprise order processing system, including:
the model building unit, configured to establish, when a current order arrives at the make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
the model conversion unit, configured to convert the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
the model optimization unit, configured to reduce the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and the solving unit, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determine, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
The above technical solution has the following beneficial effect: the modeling of dynamically arriving orders is more consistent with the actual order state.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for processing orders of an MTO enterprise according to an embodiment of the present invention;
FIG. 2 is a block diagram of an MTO enterprise order processing system according to an embodiment of the present invention;
FIG. 3 is a three-layer neural network architecture;
FIG. 4 is a sample learning rate;
FIG. 5 is a graph of different unit capacities;
FIG. 6 is the average profit for different order arrival rates;
FIG. 7 is an acceptance rate for different order arrival rates;
FIG. 8 is the average profit for different inventory costs;
FIG. 9 is the order acceptance rate for different inventory costs;
FIG. 10 is a customer priority factor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in accordance with an embodiment of the present invention, there is provided an MTO enterprise order processing method, including:
s101: when a current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
s102: converting the order acceptance strategy model based on the MDP theory according to a post-state theory of the reinforcement learning algorithm to obtain an MDP model based on a post-state;
s103: reducing the solving difficulty of the MDP model based on the post state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain an MDP optimization model based on the post state;
s104: and solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted for the current arriving order according to the solving result.
Preferably, step 101 specifically includes:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of the currently arriving order and of each order in the current order queue, wherein the order information comprises: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determining the return sub-items faced, according to the information of each order in the current order queue and of the currently arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the deferral penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order that cannot be delivered within its lead time but is delivered by its latest delivery time, the MTO enterprise incurs a deferral penalty cost Y to be paid to the customer of the delayed order:
[formula for Y, given as an image in the original document]
wherein t represents the production time still needed for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the deferral penalty cost per unit product per unit time of the MTO enterprise;
if the product of an order in the current order queue is completed before the end of its lead time, the product is temporarily stored in the MTO enterprise warehouse, which incurs an inventory cost N:
[formula for N, given as an image in the original document]
wherein h represents the inventory cost per unit product per unit time;
establishing the MDP-theory-based order acceptance policy model of the MTO enterprise according to the Markov decision process (MDP) theory, the information of each order in the current order queue and of the currently arriving order, and the return sub-items; the MDP-theory-based order acceptance policy model of the MTO enterprise is a four-tuple (S, A, f, R) consisting of a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method runs; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still needed to complete the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to decide whether to accept or reject the order, and these actions form the action space A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting the order;
the state transition function f describes the transition from the state at decision time m to the state at the next decision time, where decision time m refers to the time at which an action is taken for the currently arriving order; the state transition function f is derived as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, the probability density function f(·|(s, a)) of the state at the next decision time m+1, given the initial state s and the action a taken at decision time m, is expressed in terms of the densities fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 at the next decision time m+1 is independent of (sm, am); wherein tm+1 is expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
formula (1) shows that tm+1 depends on (sm, am): different (qm, tm, am) lead to different order production times, and tm+1 is also affected by the order arrival interval; wherein ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ;
given the current state s and action a, and using the fact that the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s; the specific form of fT(t′|s,a) is determined by formula (1) and the associated random variables;
when the MTO enterprise takes an action on the currently arriving order at decision time m, the corresponding reward obtained is expressed by a reward function R, wherein the reward function R is expressed as:
R(sm, am) = I - C - Y - N, if am = 1;  R(sm, am) = -μ × J, if am = 0
wherein am = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I - C - Y - N;
am = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is -μ × J;
for any policy in the MDP-theory-based order acceptance policy model, a corresponding value function is defined from the reward function; the value function represents the average long-term profit of the policy and is expressed as:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]
wherein π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (setting γ ensures that the summation defined by the formula is meaningful), n represents the total number of decision times, and each currently arriving order corresponds to one decision time;
determining the optimal policy π* as the policy whose average long-term profit is largest over all policies π; the optimal policy π* ensures that the long-term profit of the enterprise is maximized, i.e. the profit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
π* = argmax (π ∈ Π) Vπ
where Π represents the set of all policies.
Preferably, step 102 specifically includes:
after the decision on the currently arriving order is made at time m, a post-state variable pm is set according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action am has been selected at decision time m; the post-state is an intermediate variable between two consecutive states;
determining the post-state variable pm according to the current state sm and the action am, the post-state variable pm being expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
according to pm, the production time tm+1 still needed for the accepted orders at the next decision time m+1 is expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
wherein, in formula (5), (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders; given the current post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision time is expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the rewriting; the post-state cost function is expressed as:
J*(p)=γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby reducing the optimal policy π* to a one-dimensional state space; the optimal policy π* is expressed as:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
preferably, step 103 specifically includes:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state cost function J*; when solving, J* is not calculated directly; instead, an approximate solution of J* is obtained by a learning process on a learning parameter vector; the approximate solution of J* through the learning process is obtained as follows:
determining the approximate function Ĵ(·;θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function Ĵ(·;θ*) determined by the learned parameter vector θ* to approximate J*; determining the optimal policy π* according to the approximation of J*.
Preferably, step 104 specifically includes:
solving J*(p) through the three-layer artificial neural network (ANN), which can approximate J*(p) to arbitrary precision; the approximate function Ĵ(p;θ) is expressed by the three-layer artificial neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β]
ΦH(x)=1/(1+e^(-x))
the function Ĵ(p;θ) of formula (11) is a three-layer single-input single-output neural network: it has an input layer with a single node, whose output represents the value of the post-state p, and a hidden layer containing N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi; the input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p;θ).
As shown in fig. 2, in accordance with an embodiment of the present invention, there is provided an MTO enterprise order processing system, comprising:
the model building unit 21 is configured to establish, when a current order arrives at the make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
the model conversion unit 22 is configured to convert the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
the model optimization unit 23 is configured to reduce the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and the solving unit 24 is configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determine, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
Preferably, the model building unit 21 is specifically configured to:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of the currently arriving order and of each order in the current order queue, wherein the order information comprises: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determining the return sub-items faced, according to the information of each order in the current order queue and of the currently arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the deferral penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order that cannot be delivered within its lead time but is delivered by its latest delivery time, the MTO enterprise incurs a deferral penalty cost Y to be paid to the customer of the delayed order:
[formula for Y, given as an image in the original document]
wherein t represents the production time still needed for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the deferral penalty cost per unit product per unit time of the MTO enterprise;
if the product of an order in the current order queue is completed before the end of its lead time, the product is temporarily stored in the MTO enterprise warehouse, which incurs an inventory cost N:
[formula for N, given as an image in the original document]
wherein h represents the inventory cost per unit product per unit time;
establishing the MDP-theory-based order acceptance policy model of the MTO enterprise according to the Markov decision process (MDP) theory, the information of each order in the current order queue and of the currently arriving order, and the return sub-items; the MDP-theory-based order acceptance policy model of the MTO enterprise is a four-tuple (S, A, f, R) consisting of a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method runs; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still needed to complete the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to decide whether to accept or reject the order, and these actions form the action space A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting the order;
the state transition function f describes the transition from the state at decision time m to the state at the next decision time, where decision time m refers to the time at which an action is taken for the currently arriving order; the state transition function f is derived as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, the probability density function f(·|(s, a)) of the state at the next decision time m+1, given the initial state s and the action a taken at decision time m, is expressed in terms of the densities fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 at the next decision time m+1 is independent of (sm, am); wherein tm+1 is expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
formula (1) shows that tm+1 depends on (sm, am): different (qm, tm, am) lead to different order production times, and tm+1 is also affected by the order arrival interval; wherein ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ;
given the current state s and action a, and using the fact that the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s; the specific form of fT(t′|s,a) is determined by formula (1) and the associated random variables;
when the MTO enterprise takes an action on the currently arriving order at decision time m, the corresponding reward obtained is expressed by a reward function R, wherein the reward function R is expressed as:
R(sm, am) = I - C - Y - N, if am = 1;  R(sm, am) = -μ × J, if am = 0
wherein am = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I - C - Y - N;
am = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is -μ × J;
for any policy in the MDP-theory-based order acceptance policy model, a corresponding value function is defined from the reward function; the value function represents the average long-term profit of the policy and is expressed as:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]
wherein π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (setting γ ensures that the summation defined by the formula is meaningful), n represents the total number of decision times, and each currently arriving order corresponds to one decision time;
determining the optimal policy π* as the policy whose average long-term profit is largest over all policies π; the optimal policy π* ensures that the long-term profit of the enterprise is maximized, i.e. the profit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
π* = argmax (π ∈ Π) Vπ
where Π represents the set of all policies.
Preferably, the model transformation unit 22 is specifically configured to:
after the decision on the currently arriving order is made at time m, a post-state variable pm is set according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action am has been selected at decision time m; the post-state is an intermediate variable between two consecutive states;
determining the post-state variable pm according to the current state sm and the action am, the post-state variable pm being expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
according to pm, the production time tm+1 still needed for the accepted orders at the next decision time m+1 is expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
wherein, in formula (5), (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders; given the current post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision time is expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the rewriting; the post-state cost function is expressed as:
J*(p)=γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby reducing the optimal policy π* to a one-dimensional state space; the optimal policy π* is expressed as:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
preferably, the model optimization unit 23 is specifically configured to:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state cost function J*; when solving, J* is not calculated directly; instead, an approximate solution of J* is obtained by a learning process on a learning parameter vector; the approximate solution of J* through the learning process is obtained as follows:
determining the approximate function Ĵ(·;θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function Ĵ(·;θ*) determined by the learned parameter vector θ* to approximate J*; determining the optimal policy π* according to the approximation of J*.
Preferably, the solving unit 24 is specifically configured to:
solving J*(p) through the three-layer artificial neural network (ANN), which can approximate J*(p) to arbitrary precision; the approximate function Ĵ(p;θ) is expressed by the three-layer artificial neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β]
ΦH(x)=1/(1+e^(-x))
the function Ĵ(p;θ) of formula (11) is a three-layer single-input single-output neural network: it has an input layer with a single node, whose output represents the value of the post-state p, and a hidden layer containing N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi; the input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p;θ).
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The invention relates to an order acceptance policy method and system for MTO enterprises based on post-state reinforcement learning, aiming at solving the problems of incomplete model consideration and high solving complexity in existing research. At the same time, a low-complexity solving algorithm based on the combination of the after-state (post-state) and a neural network is provided to solve the order acceptance problem of the MTO enterprise.
As the diversified demands of customers keep increasing, the make-to-order (MTO) mode, in which production is organized according to customers' differing order requirements, is becoming more and more important in enterprise production activities. Because of the limitation of its production capacity, an MTO enterprise needs to formulate a reasonable order acceptance policy, that is, to determine whether to accept an arriving order according to its production capacity and the order state, so as to improve its production efficiency.
On the basis of the traditional order receiving problem, the invention provides a more complete MTO enterprise order receiving problem model: on the basis of traditional model elements of postponed delivery cost, rejection cost and production cost, the invention further considers the order inventory cost and various customer priority factors and models the optimal order acceptance problem as a Markov Decision Process (MDP). Furthermore, since the classical MDP solution method relies on solving and estimating a high-dimensional state cost function, its computational complexity is high. Therefore, to reduce complexity, the present invention proposes to use a one-dimensional post-state cost function instead of a high-dimensional state cost function, and to approximate the post-state cost function in conjunction with a neural network. Finally, the applicability and superiority of the order acceptance strategy model and the algorithm provided by the invention are verified through simulation.
First, problem description and model assumptions to be solved by the invention
The invention assumes that an MTO enterprise with limited capacity produces through a single production line. Assume there are n types of customer orders on the market; the order-related information includes the customer priority μ, the unit product price pr (pr is short for price), the quantity q, the lead time lt, and the latest delivery time dt (short for delivery time). The lead time lt refers to the agreed delivery time, i.e. the working period of order production when nothing unexpected happens, from the start of production work on the order to its completion; however, since the enterprise cannot fully guarantee that delivery will be completed within the agreed time, a preset period is added to the agreed delivery time, and the agreed time extended by this preset period is the latest delivery time dt. Customer orders arrive according to a Poisson distribution with parameter λ. Both the unit product price within an order and the demanded quantity of the corresponding product follow uniform distributions.
When an order arrives, the enterprise needs to judge whether to accept it according to its own production capacity. If the order is rejected, a rejection cost μ × J is incurred; the higher the customer priority, the higher the rejection cost, where μ denotes the customer priority coefficient and J denotes the rejection cost when customer priority is not considered (J is also part of the order information).
If an order is accepted, the profit of that order is obtained, i.e. I = pr × q, while the production cost is consumed, i.e. C = c × q, where c is the unit product production cost. The MTO enterprise produces accepted orders on a first-come-first-served basis. If an order cannot be delivered within the lead time required by the customer, the enterprise needs to pay a delay penalty cost Y:
[formula for Y, given as an image in the original document]
where t represents the production time still required for the accepted orders before accepting the current order, b represents the unit capacity of the enterprise, and u represents the deferral penalty cost per unit product per unit time; the higher the customer priority, the higher the deferral penalty cost. The customer does not pick up in advance products that are produced before the lead time; when an order is completed ahead of its lead time, the products are temporarily stored in the MTO enterprise warehouse, resulting in an inventory cost N:
[formula for N, given as an image in the original document]
where h represents the inventory cost per unit product per unit time. Each order is not split for production; an order is sent to the customer once its production is completed; and once an order has been accepted by the MTO enterprise, the customer can neither change nor cancel it.
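For illustration only (this is not part of the claimed method), the order information and the reward sub-items described above can be sketched in Python as follows. The profit I = pr × q, the production cost C = c × q and the rejection cost μ × J follow the text directly; the functional forms of the deferral penalty Y and the inventory cost N are given only as images in the original document, so the linear tardiness/earliness forms used below, the completion time t + q/b, the priority levels and all numeric ranges are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_order(lam=1.0, price_range=(10, 20), qty_range=(5, 15)):
    """Sample one arriving order: unit price and quantity are uniform, as in the text;
    the inter-arrival time is drawn as exponential, an assumed way of simulating
    Poisson arrivals with rate lam."""
    pr = rng.uniform(*price_range)          # unit product price
    q = rng.integers(*qty_range)            # required quantity
    mu = rng.choice([1.0, 1.5, 2.0])        # customer priority (assumed levels)
    lt = rng.uniform(5, 15)                 # lead time (assumed range)
    dt = lt + rng.uniform(1, 5)             # latest delivery time > lead time (assumed)
    at = rng.exponential(1.0 / lam)         # inter-arrival time
    return dict(mu=mu, pr=pr, q=q, lt=lt, dt=dt, at=at)

def reward_subitems(order, t, b=1.0, c=8.0, u=0.5, h=0.1, J=20.0, accept=True):
    """Immediate reward for accepting/rejecting one order.
    t: production time still required for previously accepted orders.
    The forms of Y and N below are assumptions (images in the original document)."""
    if not accept:
        return -order["mu"] * J                              # rejection cost, as in the text
    I = order["pr"] * order["q"]                             # profit I = pr * q
    C = c * order["q"]                                       # production cost C = c * q
    finish = t + order["q"] / b                              # assumed completion time of this order
    Y = u * order["mu"] * order["q"] * max(finish - order["lt"], 0.0)  # assumed tardiness penalty
    N = h * order["q"] * max(order["lt"] - finish, 0.0)                # assumed inventory cost
    return I - C - Y - N
```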
The problem to be solved by the present invention is that when a customer order arrives randomly, the MTO enterprise decides whether to accept the currently arriving order to ensure the long-term average profit maximization of the enterprise, taking into account the current order queue, and the deferred delivery cost, rejection cost, production cost, inventory cost, and various customer priority factors.
Second, order acceptance strategy modeling based on MDP theory
It can be seen that the MTO enterprise order acceptance decision problem is a type of stochastic sequential decision problem (also called a multi-stage decision problem for stochastic systems). After the decision maker of the MTO enterprise decides to accept or reject an order, the state of the system changes, but the evolution after the current stage is not influenced by the states of the stages before it, i.e. the process has no after-effect. Therefore, according to MDP theory, the problem can be abstracted into an MDP (Markov decision process) model. The MDP model is defined as a quadruple (S, A, f, R) representing a state space S, an action space A, a state transition function f and a reward function R:
1) System state: assuming there are n order types in the order acceptance system, the system state can be represented by the vector S = (μ, pr, q, lt, dt, t), where t represents the production completion time still needed by the accepted orders; because the MTO enterprise has limited production capacity, t has a maximum upper limit.
2) System action set: at time m, when a customer order arrives, the MTO enterprise needs to decide whether to accept or reject it; the action set in the model can be represented by the vector A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting it. The actions in vector A apply only to the currently arriving order and do not include the orders already in the order queue.
3) State transition model: given an initial state s (the initial state represents the first order arriving in the order acceptance system) and the action a that has been taken, the probability density function of the next state is represented by f(·|(s, a)). Here it is assumed that the order information μ, pr, q, lt, dt is independent and identically distributed; each distribution can be represented by a probability density function, and the densities corresponding to the distributions of μ, pr, q, lt, dt are denoted fM(x), fPR(x), fQ(x), fLT(x), fDT(x). Thus the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am); that is, it represents the order information at the next decision time (the next state sm+1) after action am is taken in the current state sm. However, tm+1 depends on (sm, am), because different (qm, tm, am) may result in different order production times; tm+1 is also affected by the order arrival interval, i.e. tm+1 can be expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
where ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ.
Because the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), when the current state s and action a are given, the conditional probability density of the state s′ at the next decision time can be obtained from the conditional probability density of the current state and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
where fT(t′|s,a) represents the density of the production time still required at the next decision time to produce the accepted orders after taking action a in the current state s; its specific form can be defined by (1) and the associated random variables.
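A minimal sketch of the state transition just described, under stated assumptions: the attributes of the next order are drawn independently of (s, a), an accepted order is assumed to add q/b to the remaining production time, and the inter-arrival time AT is sampled as an exponential variable consistent with Poisson arrivals of rate λ; the distributions and the helper sample_order are hypothetical.

```python
import random

def next_state(state, accept, b=1.0, lam=1.0, sample_order=None):
    """state: dict with keys mu, pr, q, lt, dt, t (mirroring S = (mu, pr, q, lt, dt, t)).
    accept: True for a1 (accept), False for a2 (reject).
    Returns the state at the next decision time m+1."""
    # Post-decision remaining production time (assumed: an accepted order adds q/b).
    p = state["t"] + (state["q"] / b if accept else 0.0)
    # Inter-arrival time AT of the next order (assumed exponential for Poisson arrivals).
    at = random.expovariate(lam)
    t_next = max(p - at, 0.0)                      # formulas (1)/(5): (p - AT)+
    new_order = sample_order() if sample_order else {
        "mu": random.choice([1.0, 1.5, 2.0]),      # assumed priority levels
        "pr": random.uniform(10, 20),              # assumed f_PR
        "q": random.uniform(5, 15),                # assumed f_Q
        "lt": random.uniform(5, 15),               # assumed f_LT
        "dt": random.uniform(15, 25),              # assumed f_DT
    }
    new_order["t"] = t_next
    return new_order
```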
4) The reward function: at the decision moment m, after the MTO enterprise makes a decision whether to accept the order, the immediate reward function obtained by the MTO enterprise is:
R(sm, am) = I - C - Y - N, if am = 1 (the order is accepted);  R(sm, am) = -μ × J, if am = 0 (the order is rejected)
where I represents the profit of the order (unit product price × product quantity); C represents the production cost; Y represents the delay penalty cost (incurred if the order exceeds its lead time); N represents the inventory cost (incurred if production finishes before the lead time, since the customer does not pick up the goods early).
The reward function gives the system's evaluation of how good the action taken is, after the decision maker takes the corresponding action (rejecting or accepting the order) in the current state.
5) Optimal policy: in the MTO enterprise order acceptance problem, the aim is to find an optimal order acceptance policy π* that maximizes the long-term profit of the enterprise. Each policy π is a function from the system state to an action, which determines how the enterprise chooses whether to accept an order based on the current state information. For any policy π, its cost function is defined as its average long-term profit, i.e. the cumulative discounted reward obtained by taking actions according to the policy from the current state onward. The cost function is as follows:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]    (2)
where 0 < γ ≤ 1 denotes the future reward discount (which ensures that the sum defined in (2) is meaningful). We are interested in the optimal policy π* among all policies, defined as:
π* = argmax (π ∈ Π) Vπ
where Π denotes the set of all policies.
The theory for standard MDP is presented below:
In MDP theory, the optimal policy π* can be constructed from the state cost function V*:
π*(s) = argmax (a ∈ A) { R(s,a) + γ·E[V*(s′)|s,a] }    (3)
where the expectation E[·] is over the next random state s′ given the current state s and the action a. At the same time, V* is a solution of the Bellman equation, namely:
V*(s) = max (a ∈ A) { R(s,a) + γ·E[V*(s′)|s,a] }    (4)
and V* can be solved by the value iteration method.
Therefore, classical MDP theory provides an optimal-policy solution method based on the state cost function V*. However, solving for V* is difficult here, because the system state of the problem addressed by the invention is continuous and high-dimensional, which makes the computational complexity of value iteration based on V* and the expectation E[·] unaffordable. To solve this problem, the invention proposes an optimal policy construction method based on a post-state cost function.
Third, MDP model transformation based on post state
The after-state (post-state) is an intermediate variable between two successive states that can be used to simplify the optimal control of certain MDPs. The post-state concept is a technique in reinforcement learning that is often used in board-game learning tasks. For example, when an agent uses a reinforcement learning algorithm to play chess, it controls its own moves deterministically, while the opponent's moves appear random to it. Before deciding on a move, the agent faces a certain piece position on the board, which corresponds to the state in the classical MDP model. The agent's post-state for each move is defined as the state of the board after this move has been made but before the opponent moves. If the agent can learn the winning probability of every post-state, these known probabilities can be used to achieve optimal behaviour: simply select the post-state with the largest winning probability and act accordingly.
The invention transforms the order acceptance problem with a similar post-state method. Specifically, in the order acceptance problem considered here, the post-state variable pm is defined as the production time still required for the accepted orders after action am is selected at decision time m. Thus, given the current state sm and action am, the post-state can be expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
It is then readily apparent that, given pm, the production time tm+1 still needed at the next decision time m+1 can be expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
where (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders. Therefore, given the current post-state p, the conditional probability density of the next decision-time state s′ = (μ′, pr′, q′, lt′, dt′, t′) can be expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)    (6)
where the conditional probability density function fT(·|p) is defined by (5) and the associated random variables.
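A small sketch of the post-state mapping σ(s, a) and of transition (5). The increment q/b for an accepted order is an assumption (the σ formula is given as an image in the original document); t′ = max(p − AT, 0) follows the text, with the exponential inter-arrival sampling an assumption consistent with Poisson arrivals.

```python
import random

def post_state(t, q, accept, b=1.0):
    """sigma(s, a): remaining production time right after the decision.
    Assumed form: an accepted order of quantity q adds q / b units of time."""
    return t + q / b if accept else t

def sample_next_remaining_time(p, lam=1.0):
    """Formula (5): t_{m+1} = max(p - AT, 0), with AT the next inter-arrival time
    (assumed exponential, consistent with Poisson arrivals of rate lam)."""
    at = random.expovariate(lam)
    return max(p - at, 0.0)

# Example: accept an order of 8 units with 3 time units of work already queued.
p = post_state(t=3.0, q=8.0, accept=True)
print(p, sample_next_remaining_time(p))
```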
Therefore, the conditional expectation E[V(s′)|s,a] in formulas (3) and (4) can be rewritten as the conditional expectation E[V(s′)|σ(s,a)] implied by (6), so π* can be redefined as follows. First, the post-state value function is defined as:
J*(p)=γE[V*(s′)|p] (7)
Substituting (7) into (3), the optimal policy π* can be constructed from the post-state cost function J* as follows:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
further, substituting (7) into (4) yields:
V*(s) = max (a ∈ A) { R(s,a) + J*(σ(s,a)) }
Therefore, the following is obtained:
γ·E[V*(s′)|p] = γ·E[ max (a ∈ A) { R(s′,a) + J*(σ(s′,a)) } | p ]
In fact, as shown in (7), γE[V*(s′)|p] is J*(p), so we obtain:
J*(p) = γ·E[ max (a ∈ A) { R(s′,a) + J*(σ(s′,a)) } | p ]
Finally, we solve for J* by the value iteration algorithm in reinforcement learning [19]; that is, with J0 an arbitrary initialization function, we have
Jk+1(p) = γ·E[ max (a ∈ A) { R(s′,a) + Jk(σ(s′,a)) } | p ]    (9)
and when k → ∞, Jk converges to J*.
As can be seen from equation (3), the expectation E[V*(s′)|s,a] must be evaluated when computing the optimal policy, whereas the optimal policy in equation (8) does not need to take this expectation: it uses J* directly in place of E[V*(s′)|s,a], reducing the high-dimensional state space to a one-dimensional one and thus greatly lowering the solving complexity.
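To make the dimensionality-reduction point concrete, the following hedged sketch of formula (8) shows that, once a post-state value function J is available, the acceptance decision only requires evaluating J at two scalar post-states. The helpers immediate_reward and post_state are hypothetical stand-ins for R(s, a) and σ(s, a) as defined above.

```python
def greedy_action(state, J, immediate_reward, post_state):
    """Formula (8): pi*(s) = argmax_a { R(s, a) + J(sigma(s, a)) }.
    J maps a scalar post-state to a value; the two helpers are assumed to
    implement the reward and post-state mapping described in the text."""
    best_action, best_value = None, float("-inf")
    for accept in (True, False):                       # a1 = accept, a2 = reject
        value = immediate_reward(state, accept) + J(post_state(state, accept))
        if value > best_value:
            best_action, best_value = accept, value
    return best_action

# Example with trivial stand-ins: a constant-slope J and a reward that favors accepting.
decision = greedy_action({"t": 3.0, "q": 8.0},
                         J=lambda p: -0.1 * p,
                         immediate_reward=lambda s, a: 5.0 if a else -2.0,
                         post_state=lambda s, a: s["t"] + (s["q"] if a else 0.0))
print(decision)
```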
Fourthly, optimal control based on neural network
It has been shown above that π* can be constructed from J*, and that J* can be solved by the value iteration of formula (9). However, formula (9) presents two difficulties in implementation: first, if fM(·), fPR(·), fQ(·), fLT(·), fDT(·) and fT(·|σ(s,a)) are unavailable, the expectation E[·|p] cannot be calculated; second, since the post-state varies continuously, each iteration of formula (9) would have to be computed for infinitely many values of p. Reinforcement learning provides an effective solution to both difficulties: instead of calculating J* directly, it approximates J* by learning a parameter vector, and the learning process uses data samples. In other words, the design of the RL (reinforcement learning) algorithm includes:
1) Parameterization: this determines how the function Ĵ(·;θ) is obtained from a given parameter vector θ, where θ denotes the parameter vector of the approximation to the post-state value function.
2) Parameter learning: the parameter vector θ* is learned from a collection of data samples, and Ĵ(·;θ*) is used to approximate J*, i.e. the optimal policy can be expressed as:
π̂(s) = argmax (a ∈ A) { R(s,a) + Ĵ(σ(s,a);θ*) }    (10)
Comparing (10) with (8): if Ĵ(p;θ*) is close to J*(p), then π̂ is close to the optimal policy π*.
4.1 neural network approximation
The universal approximation theorem indicates that a three-layer artificial neural network (ANN) can approximate a continuous function to arbitrary precision, so an ANN is a good choice for the J*(p) to be solved in the invention. Therefore Ĵ(p;θ) can be represented by a neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
where the parameter vector can be expressed as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β],
ΦH(x)=1/(1+e^(-x)).
The function Ĵ(p;θ) in equation (11), as shown in fig. 3, is in fact a three-layer single-input single-output neural network. Specifically, there is only one input layer with a single node, whose output represents the value of the post-state p, and a hidden layer with N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi. The input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function. Finally, the output layer has one node, whose output represents the finally approximated function value Ĵ(p;θ); its input is the sum of the weighted hidden-layer outputs and the output-layer bias β.
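A minimal numpy sketch of the single-input, N-hidden-node, single-output network of formula (11) with the logistic activation ΦH(x) = 1/(1 + e^(−x)); the parameter names follow the text (wi, αi, ui, β), while the initialization and the example value of N are assumptions.

```python
import numpy as np

def ann_value(p, w, alpha, u, beta):
    """Formula (11): J_hat(p; theta) = sum_i u_i * Phi_H(w_i * p + alpha_i) + beta,
    with Phi_H the logistic sigmoid; w, alpha, u are length-N vectors."""
    hidden = 1.0 / (1.0 + np.exp(-(w * p + alpha)))   # Phi_H applied elementwise
    return float(np.dot(u, hidden) + beta)

# Example with N = 5 randomly initialized hidden nodes (assumed initialization).
rng = np.random.default_rng(1)
theta = dict(w=rng.normal(size=5), alpha=rng.normal(size=5),
             u=rng.normal(size=5), beta=0.0)
print(ann_value(2.5, **theta))
```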
4.2 Training the ANN (three-layer artificial neural network) by value iteration
In order to achieve optimal control (namely obtaining an optimal strategy), parameters in the three-layer artificial neural network are trained through a value iteration method. The specific training process is as follows.
1) Acquisition of training data: the invention requires a batch of training samples
Γ = {(pm, μm, prm, qm, ltm, dtm, tm)}
where for each m a sample is drawn to obtain pm, μm, prm, qm, ltm, dtm, with pm subject to a uniform distribution, μm ~ fM(·), prm ~ fPR(·), qm ~ fQ(·), ltm ~ fLT(·), dtm ~ fDT(·). Further, tm is generated from pm as tm = (pm - AT)+, where AT is a random variable subject to a Poisson distribution.
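A sketch of step 1) under stated assumptions: pm is drawn uniformly from [0, p_max] with an assumed upper bound, the order attributes are drawn from assumed distributions standing in for fM, fPR, fQ, fLT, fDT, and tm = max(pm − AT, 0) as in the text; the text only says AT is subject to a Poisson distribution, so sampling the inter-arrival time from an exponential here is an assumption.

```python
import numpy as np

def make_training_samples(M=1000, p_max=30.0, lam=1.0, seed=0):
    """Generate Gamma = {(p_m, mu_m, pr_m, q_m, lt_m, dt_m, t_m)}; all ranges are assumed."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, p_max, size=M)                # p_m ~ uniform
    mu = rng.choice([1.0, 1.5, 2.0], size=M)           # f_M: assumed priority levels
    pr = rng.uniform(10.0, 20.0, size=M)               # f_PR (assumed range)
    q = rng.uniform(5.0, 15.0, size=M)                 # f_Q (assumed range)
    lt = rng.uniform(5.0, 15.0, size=M)                # f_LT (assumed range)
    dt = lt + rng.uniform(1.0, 5.0, size=M)            # f_DT (assumed: later than lt)
    at = rng.exponential(1.0 / lam, size=M)            # AT: inter-arrival time
    t = np.maximum(p - at, 0.0)                        # t_m = (p_m - AT)+
    return np.column_stack([p, mu, pr, q, lt, dt, t])

samples = make_training_samples()
print(samples.shape)   # (1000, 7)
```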
2) Iterative fitting (see Algorithm 1 in Table 1): in the k-th iteration, the current ANN parameter vector is θk and the function defined by (11) is Ĵ(·;θk), i.e. the current value function. According to (9), the updated value function is:
Jk+1(p) = γ·E[ max (a ∈ A) { R(s′,a) + Ĵ(σ(s′,a);θk) } | p ]    (12)
It is desired that the new function Ĵ(·;θk+1), with the updated parameters, approximates Jk+1(p). Therefore, from Γ and θk we construct a set of training data:
Υk = {(pm, om)}
where om denotes, given Ĵ(·;θk), the desired value Jk+1(pm) shown in formula (12); the explicit expression for om is:
[expression for om, given as an image in the original document]
Based on the training data Υ_k, the parameter update can be expressed as
θ_{k+1} = argmin_θ L(θ|Υ_k),   (14)
wherein L(θ|Υ_k) is the training error
L(θ|Υ_k) = (1/M)·Σ_{m=1}^{M} ( J~(p_m;θ) − o_m )².   (15)
3) Training parameters: the ANN parameters θ_{k+1} are solved by gradient descent so that J~(·;θ_{k+1}) has minimal error over Υ_k, see equation (14). Gradient descent iteratively searches the parameter space: the initial parameter θ^(0) of the gradient iteration is set to θ_k, and the parameters are then updated in the iterative process as
θ^(z+1) = θ^(z) − α·∇_θ L(θ^(z)|Υ_k),
wherein α is the update step-size parameter and ∇_θ L(θ^(z)|Υ_k) is the gradient, at θ^(z), of the error L defined in formula (15). Thus, given a sufficient number of iterations Z, we use θ_{k+1} = θ^(Z) as an approximate solution to (14). Finally, the resulting approximation of J*(p) is expressed as J~(p;θ*), where θ* is the parameter vector obtained when the value iteration terminates.
TABLE 1 Algorithm 1: approximation of J*(p) (the pseudocode listing is given as a figure)
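The fitting loop described above can be sketched in Python as follows, under stated assumptions: j_tilde is the network of equation (11) as sketched earlier, reward and post_state stand for R(s,a) and σ(s,a), the discount gamma and all step counts are placeholder values, and the gradient of L is taken numerically for brevity rather than by an analytic formula.

```python
import numpy as np

def fit_value_function(batch, theta0, j_tilde, n_hidden, reward, post_state,
                       gamma=0.95, K=50, Z=200, alpha=1e-3, eps=1e-5):
    """Sketch of the value-iteration fitting loop: build targets o_m of (12)
    from the current parameters, then refit theta by gradient descent on the
    mean-squared training error of (14)-(15)."""
    theta = np.asarray(theta0, dtype=float)
    p_values = np.array([sample[0] for sample in batch])
    for _ in range(K):                                     # outer value iterations
        targets = []
        for (p, mu, pr, q, lt, dt, t) in batch:
            s_next = (mu, pr, q, lt, dt, t)
            # o_m = gamma * max_a { R(s', a) + J~(sigma(s', a); theta_k) }
            o = gamma * max(reward(s_next, a)
                            + j_tilde(post_state(s_next, a), theta, n_hidden)
                            for a in (1, 0))
            targets.append(o)
        o_values = np.array(targets)

        def loss(th):
            pred = np.array([j_tilde(p, th, n_hidden) for p in p_values])
            return float(np.mean((pred - o_values) ** 2))  # training error L

        for _ in range(Z):                                 # inner gradient descent
            grad = np.zeros_like(theta)
            for i in range(theta.size):                    # numerical gradient of L
                step = np.zeros_like(theta)
                step[i] = eps
                grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
            theta = theta - alpha * grad                   # theta^(z+1) update
    return theta
```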
In summary, the beneficial effects obtained by the invention are as follows:
1) Starting from the idea of revenue management, the invention considers the order acceptance problem of MTO enterprises in a stochastic dynamic environment and, on the basis of the enterprise's production cost, delay penalty cost and rejection cost, additionally takes into account the inventory cost of orders completed before the lead time and multiple customer priority factors, and constructs an MDP (Markov decision process) order acceptance model.
2) The invention transforms the solution of the optimal strategy in the traditional MDP through a post-state method, proves that the optimal strategy based on the state value function in the classic MDP problem can be equivalently defined and constructed by a value function based on the post-state, and converts the multidimensional control problem into a one-dimensional control problem, thereby greatly simplifying the solution process.
3) Traditional algorithms such as SARSA and SMART belong to tabular reinforcement learning methods, which can only handle optimal decision problems over a discrete state space. In order to learn the order acceptance strategy over a continuous state space, the invention parameterizes the post-state value function with a neural network and designs a corresponding training algorithm, thereby realizing the estimation of the post-state value function and the fast solution of the order acceptance strategy.
5 Numerical simulation experiments
The order information required in the simulation is generated according to the following rules: the order price pr follows a uniform distribution U(e1, l1); the order quantity q follows a uniform distribution U(e2, l2); order arrivals follow a Poisson distribution with parameter λ; the order lead time lt decreases linearly with the order price, i.e. lt = δ − β_pr·pr; the latest acceptable delivery time is an integer and is set to satisfy the relation
Figure BDA0003145461280000182
where φ is the elasticity coefficient of the lead time.
Here we select pr ~ U[30, 50], q ~ U[300, 500], λ = 0.3, δ = 36, β_pr = 0.4 and φ = 0.8. The unit production capacity, unit production cost and rejection cost of the enterprise are 20, 15 and 200, respectively. The deferral penalty cost per unit quantity per unit time is u = 4, the customer grade follows the uniform distribution μ ~ U(0, 1), and the inventory cost per unit quantity per unit time is h = 4. Finally, the initial learning rate of the algorithm is α = 0.001 and the exploration rate is ε = 0.1.
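A minimal sketch of this data-generation rule is given below; exponential inter-arrival times are the standard way to realize Poisson arrivals with rate λ, and the integer rounding rule used for the latest delivery time dt is an assumption of the sketch, since the exact relation is shown only as a figure above.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_order(lam=0.3, delta=36.0, beta_pr=0.4, phi=0.8):
    """One simulated order under the stated rules: pr ~ U[30, 50],
    q ~ U[300, 500], lead time lt = delta - beta_pr * pr, Poisson arrivals
    with rate lam, and an integer latest delivery time derived from lt and
    the elasticity coefficient phi (assumed rounding rule)."""
    inter_arrival = rng.exponential(1.0 / lam)   # gap to the next order arrival
    mu = rng.uniform(0.0, 1.0)                   # customer priority ~ U(0, 1)
    pr = rng.uniform(30.0, 50.0)                 # unit product price
    q = rng.uniform(300.0, 500.0)                # required product quantity
    lt = delta - beta_pr * pr                    # lead time, linear in price
    dt = int(np.ceil((1.0 + phi) * lt))          # latest delivery time (assumption)
    return inter_arrival, mu, pr, q, lt, dt
```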
The simulation experiments are programmed in Python and implement the reinforcement-learning value iteration with a neural network over the one-dimensional post-state space proposed by the invention, namely the AFVINN algorithm, in order to analyze the effectiveness of the resulting MTO enterprise order acceptance strategy.
The simulation experiment consists of two parts. In the first part, the learning efficiency of the AFVINN algorithm is first compared with that of the traditional Q-learning algorithm, the comparison being evaluated through sample utilization efficiency; secondly, following the comparison setup of documents [13, 16], the long-term average profit and the order acceptance rate of the proposed algorithm are compared with those of the FCFS method, where the FCFS method means that when an order arrives, the order is accepted directly if the enterprise has the capacity to complete its production by the latest delivery date, and the order acceptance rate is the number of accepted orders divided by the total number of arriving orders. In the second part, the influence of the AFVINN algorithm on the average profit and the order acceptance rate of the MTO enterprise is first examined in the two scenarios of considering and not considering the inventory cost; secondly, the influence of the AFVINN algorithm on the average profit and the order acceptance rate is analyzed by adjusting the deferral penalty cost and rejection cost factors related to customer priority; finally, the influence of the AFVINN algorithm on the average profit of the MTO enterprise is examined in the three scenarios of considering multiple customer priorities, considering three customer priorities, and not considering customer priority. The FCFS acceptance rule used as the baseline is sketched below.
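The FCFS baseline can be sketched as a simple capacity check; the names backlog_time (production time already committed to accepted orders) and b (unit production capacity) are hypothetical.

```python
def fcfs_accept(order_q, order_dt, backlog_time, b):
    """FCFS rule: accept an arriving order iff the current backlog plus this
    order's production time can be finished by its latest delivery date dt."""
    production_time = order_q / b        # time needed to produce this order
    return backlog_time + production_time <= order_dt
```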
5.1 Algorithm comparison
Existing research on reinforcement-learning order acceptance strategies for MTO enterprises models and solves the problem over a traditional multidimensional state space. Here, the learning efficiency of the AFVINN algorithm is compared with that of the traditional Q-learning algorithm; each iteration of the comparison consumes 200 data samples, and learning efficiency is evaluated by the number of data samples consumed.
As can be seen from fig. 4: (1) after enough data samples, the AFVINN algorithm and the traditional Q-learning algorithm converge to the same value; (2) the learning efficiency of the AFVINN algorithm is far higher than that of the traditional Q-learning algorithm, and is about 1000 times that of the traditional Q-learning algorithm. Therefore, the AFVINN algorithm provided by the invention can not only convert a high-dimensional control problem into a one-dimensional control problem, simplify the solving process, but also keep the long-term average profit of MTO enterprises at a higher level.
As shown by table 2: (1) the AFVINN algorithm is superior to the FCFS method in the aspect of maximizing the long-term average profit of MTO enterprises; (2) the AFVINN algorithm may still maintain a high average profit when the order acceptance rate is lower than the FCFS method. Therefore, the AFVINN algorithm can accept orders with higher profit with higher probability on the order acceptance strategy so as to achieve the purpose of maximizing the long-term average profit of enterprises.
TABLE 2 Basic scenario

                        AFVINN algorithm    FCFS method
Average profit          367.2365            309.4537
Order acceptance rate   0.1324              0.2698
Production capacity is critical to the profitability of an MTO enterprise. The unit production capacity of the MTO enterprise is therefore varied while the other parameters are kept the same as in the basic scenario, and the resulting changes in the order acceptance behaviour of the AFVINN algorithm and the FCFS method are observed.
As can be seen from FIG. 5, (1) the AFVINN algorithm can maintain a high profit level all the time in the face of different production capacities of the enterprise; (2) when the production capacity of an enterprise unit is reduced, the average profit of the AFVINN algorithm and the FCFS method is respectively reduced by 38.667% and 41.857%; while the AFVINN algorithm and the FCFS method increase the average profit by 128.6104% and 122.9773%, respectively, when the unit capacity increases from 20 to 35. Therefore, the AFVINN algorithm can reasonably utilize the limited resources of the enterprises, so that higher profits are created for the enterprises, and the AFVINN algorithm has better adaptability to the condition of limited resources.
The order arrival rate is also an important factor in the order acceptance decisions of MTO enterprises. The arrival rate of orders is varied, with the other parameters the same as in the basic scenario, and the changes in the MTO enterprise order acceptance strategies of the AFVINN algorithm (the first column of each group in fig. 6 and 7) and the FCFS method (the second column of each group in fig. 6 and 7) are observed. As can be seen from fig. 6 and 7: (1) when λ decreases the order acceptance rate increases, and when λ increases the order acceptance rate decreases; this is because as the number of orders arriving per unit time increases, i.e. the time interval between two arriving orders decreases, the orders the MTO enterprise can schedule for acceptance decrease, so the probability of completing accepted orders within the latest delivery deadline falls, resulting in a lower order acceptance rate. (2) The order acceptance rate under the AFVINN algorithm is lower than that of the FCFS method, but its average profit is higher. It can be seen that the AFVINN algorithm better accommodates uncertainty in the arrival of customer orders.
5.2 model comparison
The existing documents [15-16] do not consider the inventory cost when applying a reinforcement learning algorithm to model and solve the order acceptance problem. In this section, the AFVINN order acceptance strategy with inventory cost considered is compared with the strategy without inventory cost: the former includes the inventory cost in the modeling and solving of the order acceptance problem, while the latter does not. As can be seen from fig. 8 and 9: (1) the order acceptance rate without inventory cost considered is higher than that with inventory cost considered, but the average profit with inventory cost considered is always higher than that without; (2) with other factors unchanged, the order acceptance strategy of the enterprise that considers inventory cost changes as the inventory cost changes, whereas the strategy that ignores inventory cost is unaffected by such changes; (3) as the inventory cost keeps increasing, the average profit of the enterprise that considers inventory cost declines more slowly than that of the enterprise that does not. Therefore, considering the inventory cost in the modeling and solving of the MTO enterprise order acceptance problem allows the enterprise to make different acceptance decisions for different inventory costs so as to maximize its long-term average profit; in practice, the inventory cost affects enterprise profit and ties up enterprise capital, affecting the operation of that capital, so the inventory cost cannot be ignored in the order acceptance process.
Although document [16] involves a customer priority factor when modeling and solving with reinforcement learning, it divides customer priority into only three levels, whereas in practice there are many customer priorities. In this experiment, taking the basic scenario as the reference, the unit deferral penalty cost is first varied while the rejection cost is kept unchanged, and then the rejection cost is varied while the unit deferral penalty cost is kept unchanged.
TABLE 3 Varying the unit deferral penalty cost and the rejection cost
Figure BDA0003145461280000211
As can be seen from fig. 10 and table 3: (1) the long-term average income of an enterprise considering various customer priorities is larger than the average income considering three customer priorities and not considering the customer priorities; (2) the order acceptance rate of the customer grade which is greater than or equal to 0.5 based on the AFVINN algorithm is reduced along with the increase of the deferred penalty cost; while order acceptance rates with customer grades less than 0.5 increase with increasing deferred penalty costs; (3) when the rejection cost increases, namely the rejection cost has greater and greater influence on profit of the MTO enterprise, the order acceptance rate of the customer class of 0.5 or more is in an increasing trend, and the order acceptance rate of the customer class of less than 0.5 is in a decreasing trend when the AFVINN algorithm makes an order acceptance decision.
It can be seen that when the deferral penalty cost is large, accepting an order from a higher-priority customer means the enterprise must pay a higher cost if production is not finished within the specified time limit, so the acceptance rate of high-priority orders falls as the deferral penalty cost increases; when the rejection cost is high, rejecting an order from a high-priority customer imposes a high cost, so the acceptance rate of high-priority orders rises as the rejection cost increases. Accordingly, when the deferral penalty cost is large the enterprise will accept more lower-priority orders and appropriately reduce higher-priority ones, and when the rejection cost is larger it will increase acceptance of orders from high-priority customers. Therefore, when facing different deferral penalty costs and rejection costs, the AFVINN algorithm can adjust the order acceptance strategy in time and reduce the influence of the rejection cost on the average profit of the MTO enterprise as much as possible, so that the long-term average profit of the enterprise is maximized.
On the basis of the factors considered in the traditional MTO enterprise order acceptance problem, the invention adds the order inventory cost and multiple customer priority factors, constructs a Markov decision process order acceptance model, and solves it with the AFVINN algorithm; the algorithm not only converts the multidimensional state space of the MTO enterprise order acceptance problem into a one-dimensional state space and simplifies the solution process, but also keeps the long-term average profit of the enterprise at a high level.
Simulation experiments show that in the order acceptance problem of MTO enterprises, customer priority and inventory cost factors are important to the order acceptance strategy and profit of the enterprise; compared with the traditional Q-learning algorithm, the AFVINN algorithm converts a high-dimensional control problem into a one-dimensional control problem, improves sample utilization efficiency and simplifies the solution process; the AFVINN algorithm outperforms the FCFS method in maximizing the long-term average profit of the enterprise, has strong order selection capability and good adaptability to environmental change, and balances order profit against the various cost factors to bring higher profit to the MTO enterprise. The invention models orders that arrive dynamically (order information cannot be obtained in advance, and the currently arriving order is uncertain), which is more consistent with the actual order situation; the modeling and solving consider more comprehensive factors, such as the inventory cost factor and customer priority; and reducing the state space dimension reduces the computational difficulty of solving the model.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An MTO enterprise order processing method is characterized by comprising the following steps:
when a current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
converting the order acceptance strategy model based on the MDP theory according to a post-state theory of the reinforcement learning algorithm to obtain an MDP model based on a post-state; reducing the solving difficulty of the MDP model based on the post state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain an MDP optimization model based on the post state;
and solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted for the current arriving order according to the solving result.
2. The method according to claim 1, wherein when a current order arrives at an order-oriented production MTO enterprise, an order acceptance policy model based on a markov decision process MDP theory is established for a current order queue and the currently arrived order of the MTO enterprise, specifically comprising:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of each order and the current arrival order in the current order queue according to the current order queue and the current arrival order, wherein the order information comprises: the customer priority mu, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, the orders reach Poisson distribution with a compliance parameter of lambda, and the price of a unit product and the quantity of a product required in the orders respectively comply with uniform distribution;
determining the faced return sub-items according to the information of each order in the current order queue and the current arriving order, wherein the faced return sub-items comprise: rejecting the rejection cost of the current arrival order, accepting the profit of the current arrival order, the deferred penalty cost of the order in the current order queue and the inventory cost of the order in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost is generated: μ × J, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise produces according to the principle of first-come first-serve, if a delayed order exists in the current order queue, the delivery time of the delayed order is within the latest delivery time index, and the MTO enterprise generates a deferred penalty cost Y to be paid to a customer corresponding to the delayed order:
Figure FDA0003145461270000021
wherein t represents the production time still needed in the accepted order, b represents the unit production capacity of the MTO enterprise, and u represents the unit product delay penalty cost of the MTO enterprise in unit time;
if the product of the order in the current order queue is generated and completed within the lead period and the product is temporarily stored in the warehouse, the product is temporarily stored in the MTO enterprise warehouse so as to generate the inventory cost N:
Figure FDA0003145461270000022
wherein h represents the inventory cost of a unit product per unit time;
according to the Markov decision process MDP theory, the information and return sub-items of each order and the current arrival order in the current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a four-tuple (S, A, f and R), the four-tuple is a state space S, an action space A, a state transfer function f and a reward function R, and the method comprises the following steps:
the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 types of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still required by the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision of accepting or rejecting the order, and these actions are integrated into the action space A = (a1, a2), where a1 indicates accepting the order and a2 indicates rejecting the order;
the state transition function f represents the state of transition from the current state to the m decision time, wherein the m decision time refers to the time of taking action aiming at the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, according to the initial state s and the action a taken at decision time m, the probability density function f(·|(s,a)) of the state at the next decision time m+1 is expressed through fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); wherein t_{m+1} is expressed as:
Figure FDA0003145461270000023
formula (1) indicates that t_{m+1} depends on (s_m, a_m), and that different (q_m, t_m, a_m) lead to different order production times; t_{m+1} is also affected by the order arrival time interval, where AT_{m→m+1} represents the arrival time interval between two orders, the arrival of orders following a Poisson distribution with parameter λ;
according to the current state s, the action a, and the property that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a),
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s, and the specific form of fT(t′|s,a) is defined by formula (1) and the associated random variables;
when the MTO enterprise is at the decision time m, the corresponding reward obtained after the action taken on the current arriving order is expressed by a reward function R, wherein the reward function R is expressed as:
R(s_m, a_m) = I − C − Y − N if a_m = 1, and R(s_m, a_m) = −μ × J if a_m = 0,
wherein a_m = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is −μ × J;
for any strategy in the order acceptance strategy model based on the MDP theory, a corresponding value function is defined according to a reward function, the average long-term profit corresponding to the strategy is represented by the value function, and the value function is represented as:
Figure FDA0003145461270000032
wherein pi represents any strategy, gamma represents future reward discount, gamma is more than 0 and less than or equal to 1, the summation item defined by the formula is ensured to have significance by setting gamma, n represents the total decision time quantity, and each current order corresponds to one decision time;
determining an optimal strategy pi by average long-term profit for an arbitrary strategy pi*Optimum strategy pi*The method is used for ensuring that the long-term profit of the enterprise is maximized, and the profit of the MTO enterprise is optimal at the moment; the optimal strategy pi*Expressed as:
Figure FDA0003145461270000033
where Π represents all policy sets.
3. The method according to claim 2, wherein the step of converting the MDP theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain the post-state-based MDP model includes:
after the currently arriving order is received and decided at time m, a post-state variable p_m is set according to the reinforcement learning algorithm, the post-state variable representing the production time still required, once action a_m has been selected at decision time m, to produce the orders that have been accepted; wherein the post-state is an intermediate variable between two consecutive states;
according to the current state s_m and the action a_m, the post-state variable p_m is determined and expressed as:
Figure FDA0003145461270000041
according to p_m, the production time t_{m+1} still required at the next decision time m+1 for the orders already accepted is expressed as:
t_{m+1} = (p_m − AT)^+   (5)
wherein, in formula (5), (x)^+ = max(x, 0) denotes the larger of the variable x and 0, and AT is the time interval between the arrivals of two orders; the conditional probability density of the next decision time state s′ = (μ′, pr′, q′, lt′, dt′, t′) is expressed, according to the current post-state, as:
f(s′|p) = fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)   (6)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the cost function of the post-state is defined on this basis and expressed as:
J*(p)=γE[V*(s′)|p] (7)
the optimal strategy π* is constructed through the post-state cost function, whereby the problem is converted into a one-dimensional state space; the optimal strategy π* is expressed as:
π*(s) = argmax_{a∈A} { R(s,a) + J*(σ(s,a)) }   (8)
where σ(s,a) denotes the post-state reached from state s under action a.
4. the method according to claim 3, wherein the learning process of learning the parameter vector through the reinforcement learning algorithm reduces the solution difficulty of the post-state-based MDP model to obtain the post-state-based MDP optimization model, and specifically comprises:
the reinforcement learning algorithm constructs the optimal strategy π* through the post-state cost function J*; when solving, J* is not calculated directly, but an approximate solution of J* is realized through a learning process over a learned parameter vector; the approximate solution of J* through the learning process by learning the parameter vector is as follows:
determining, from a given parameter vector θ, the function J~(·;θ); and
performing parameter learning on the parameter vector θ from the data samples with the reinforcement learning algorithm to obtain the parameter vector θ*, and using J~(·;θ*) determined by the learned parameter vector θ* to approximate J*; the optimal strategy π* is then determined according to the approximation of J*.
5. The method according to claim 4, wherein the step of solving the post-state-based MDP optimization model using a three-layer artificial neural network to obtain a solution result, and the step of determining an optimal policy for whether a currently arriving order is accepted according to the solution result specifically comprises:
J*(p) is solved through a three-layer artificial neural network ANN, which approximates J*(p) to arbitrary precision; J~(p;θ) is expressed with the three-layer artificial neural network ANN as:
J~(p;θ) = Σ_{i=1}^{N} u_i·ΦH(w_i·p + α_i) + β   (11)
wherein the parameter vector can be represented as:
θ=[w1,...,wN,α1,...αN,u1,...uN,β],
ΦH(x)=1/(1+e-x)
the function J~(p;θ) of formula (11) is a three-layer single-input single-output neural network, which has a single-node input layer whose output represents the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the weighted post-state value w_i·p plus the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value J~(p;θ).
6. An MTO enterprise order processing system, comprising:
the model building unit is used for building an order acceptance strategy model based on an MDP theory in a Markov decision process aiming at a current order queue and a current arrival order of an MTO enterprise after the current order arrives at the MTO enterprise facing the order production, and the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
the model conversion unit is used for converting the order acceptance strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion;
the model optimization unit is used for reducing the solving difficulty of the MDP model based on the rear state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain the MDP optimization model based on the rear state;
and the solving unit is used for solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted or not for the current arriving order according to the solving result.
7. The MTO enterprise order processing system according to claim 6, wherein said model building unit is specifically configured to:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of each order and the current arrival order in the current order queue according to the current order queue and the current arrival order, wherein the order information comprises: the customer priority mu, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, the orders reach Poisson distribution with a compliance parameter of lambda, and the price of a unit product and the quantity of a product required in the orders respectively comply with uniform distribution;
determining the faced return sub-items according to the information of each order in the current order queue and the current arriving order, wherein the faced return sub-items comprise: rejecting the rejection cost of the current arrival order, accepting the profit of the current arrival order, the deferred penalty cost of the order in the current order queue and the inventory cost of the order in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost is generated: μ × J, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise produces according to the principle of first-come first-serve, if a delayed order exists in the current order queue, the delivery time of the delayed order is within the latest delivery time index, and the MTO enterprise generates a deferred penalty cost Y to be paid to a customer corresponding to the delayed order:
Figure FDA0003145461270000061
wherein t represents the production time still needed in the accepted order, b represents the unit production capacity of the MTO enterprise, and u represents the unit product delay penalty cost of the MTO enterprise in unit time;
if the product of the order in the current order queue is generated and completed within the lead period and the product is temporarily stored in the warehouse, the product is temporarily stored in the MTO enterprise warehouse so as to generate the inventory cost N:
Figure FDA0003145461270000062
wherein h represents the inventory cost of a unit product per unit time;
according to the Markov decision process MDP theory, the information and return sub-items of each order and the current arrival order in the current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a four-tuple (S, A, f and R), the four-tuple is a state space S, an action space A, a state transfer function f and a reward function R, and the method comprises the following steps:
the state space S represents the state of a system where the MTO enterprise order processing method is located; the state space S is an nx6 dimensional vector, where n represents the number of order types, and 6 represents 6 types of order information: the method comprises the following steps that (1) the priority mu of a customer, the price pr of a unit product, the quantity q of product demands, the lead time lt, and the production completion time t still needed by orders in a current order queue at the latest delivery time dt, wherein t has a preset maximum upper limit value;
action space A represents the set of actions for the currently arriving order; when a current order arrives at the time m, the MTO enterprise needs to make an action decision of accepting the order or rejecting the order, and the actions of accepting the order or rejecting the order are integrated into an action space quantity A, wherein A is (a)1,a2) Wherein a is1Indicating acceptance of an order, a2Indicating a rejection of the order;
the state transition function f represents the state of transition from the current state to the m decision time, wherein the m decision time refers to the time of taking action aiming at the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt are all independent and identically distributed, the probability density function f (· | (s, a)) of the state of the next decision time m +1 is expressed as f according to the initial state s and the action a that has been taken at the decision time mM(x)、fPR(x)、fQ(x)、fLT(x)、fDT(x);
And order information mu at the next decision time m +1m+1,prm+1,qm+1,ltm+1,dtm+1About(s)m,am) Is independent; wherein, tm+1Expressed as:
Figure FDA0003145461270000071
formula (1) represents tm+1To(s)m,am) And is different from (q)m,tm,am) Resulting in different order production times, and tm+1But also by order arrival time interval; wherein, ATm→m+1Representing the time interval of arrival between two orders, i.e. a poisson distribution subject to a parameter λ according to the arrival of each order;
according to the current state s andaction a and order information mum+1,prm+1,qm+1,ltm+1,dtm+1About(s)m,am) The method is an independent feature, and obtains the conditional probability density of the next decision time m +1 state s ', where the conditional probability density of the next decision time m +1 state s' is expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein f isT(t' | s, a) represents the production time still required for the next decision moment m +1 to produce the accepted order after taking action a in the current state s, fTThe specific form of (t' | s, a) is defined by formula (1) and the associated random variable;
when the MTO enterprise is at the decision time m, the corresponding reward obtained after the action taken on the current arriving order is expressed by a reward function R, wherein the reward function R is expressed as:
R(s_m, a_m) = I − C − Y − N if a_m = 1, and R(s_m, a_m) = −μ × J if a_m = 0,
wherein a_m = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is −μ × J;
for any strategy in the order acceptance strategy model based on the MDP theory, a corresponding value function is defined according to a reward function, the average long-term profit corresponding to the strategy is represented by the value function, and the value function is represented as:
Figure FDA0003145461270000073
wherein pi represents any strategy, gamma represents future reward discount, gamma is more than 0 and less than or equal to 1, the summation item defined by the formula is ensured to have significance by setting gamma, n represents the total decision time quantity, and each current order corresponds to one decision time;
determining an optimal strategy pi by average long-term profit for an arbitrary strategy pi*Optimum strategy pi*The method is used for ensuring that the long-term profit of the enterprise is maximized, and the profit of the MTO enterprise is optimal at the moment; the optimal strategy pi*Expressed as:
Figure FDA0003145461270000074
where Π represents all policy sets.
8. The MTO enterprise order processing system according to claim 7, wherein said model conversion unit is specifically configured to:
after the current arrival order is received and decided at the moment m, a post-state variable p is set according to a reinforcement learning algorithmmSelecting action a at decision time m by means of a post-state variable representationmProduction time still required for orders that have been accepted by post production; wherein the post state is an intermediate variable between two consecutive states;
according to the current state smAnd action amDetermining a post-state variable pmThe post-state variable pmExpressed as:
Figure FDA0003145461270000081
according to pmThe production time t still needed by the order which has been accepted at the next decision moment m +1m+1Expressed as:
Figure FDA0003145461270000082
wherein, in the formula (5),
Figure FDA0003145461270000083
representing a larger number between variable x and 0, AT being the time interval between the arrival of two orders; the conditional probability density of the next decision time state S ' ═ of (μ ', p ', q ', lt ', dt ', t ') is expressed according to the current posterior state as:
Figure FDA0003145461270000084
wherein the conditional probability density function fT(. | p) is defined by equation (5) and the associated random variable;
rewriting the condition expectation E [ ] in the MDP theory after setting the post-state variables, and defining a cost function of the post-state after rewriting the condition expectation E [ ], expressing the cost function of the post-state as:
J*(p)=γE[V*(s′)|p] (7)
constructing an optimal policy π through a post-state cost function*Thereby optimizing the strategy by pi*Changing into one-dimensional state space, the optimal strategy pi*Expressed as:
Figure FDA0003145461270000085
9. the MTO enterprise order processing system according to claim 8, wherein the model optimization unit is specifically configured to:
the strengthening algorithm passes a state cost function J*Constructing an optimal strategy pi*Does not directly calculate J when solving*J is realized by learning process through learning parameter vector*The approximation of (a) is solved; implementing J with a learning process by learning a parameter vector*The approximation of (d) is solved as follows:
from a given parameter vector theta
Figure FDA0003145461270000086
And
performing parameter learning on the parameter vector theta from the data sample by adopting an enhanced algorithm to obtain the parameter vector theta*And using the parameter vector theta obtained by learning*Is determined
Figure FDA0003145461270000087
To approximate J*According to the approximation of J*Determining an optimal strategy pi*
10. The MTO enterprise order processing system according to claim 9, wherein said solving unit is specifically configured to:
solving for J through three-layer artificial neural network ANN*(p) mixing J*(p) to an arbitrary precision, will
Figure FDA0003145461270000088
The three-layer artificial neural network ANN is adopted to be expressed as:
Figure FDA0003145461270000091
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...αN,u1,...uN,β]
ΦH(x)=1/(1+e-x)
function of formula (11)
Figure FDA0003145461270000092
Is a three-layer single-input single-output neural network, which has a single-node input layer, the output of which represents the value of the post-state p, and a hidden layer containing N nodes, the input of the ith node is the weighted post-state value wiP and hidden layer bias αiThe sum of (1); and the output of each hidden layer nodeInput and output relation is formed by phiH(. o) a functional representation in whichH(. cndot.) is called an activation function; the output layer has a node, the input of the output layer is the sum of the weighted output of the hidden layer and the offset beta of the output layer, and the output of the output layer represents the function value of the final approximation
Figure FDA0003145461270000093
CN202110749378.1A 2021-07-02 2021-07-02 MTO enterprise order processing method and system Active CN113592240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749378.1A CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110749378.1A CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Publications (2)

Publication Number Publication Date
CN113592240A true CN113592240A (en) 2021-11-02
CN113592240B CN113592240B (en) 2023-10-13

Family

ID=78245474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749378.1A Active CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Country Status (1)

Country Link
CN (1) CN113592240B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning
CN111126905A (en) * 2019-12-16 2020-05-08 武汉理工大学 Casting enterprise raw material inventory management control method based on Markov decision theory
CN112149987A (en) * 2020-09-17 2020-12-29 清华大学 Multi-target flexible job shop scheduling method and device based on deep reinforcement learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990584A (en) * 2021-03-19 2021-06-18 山东大学 Automatic production decision system and method based on deep reinforcement learning
CN112990584B (en) * 2021-03-19 2022-08-02 山东大学 Automatic production decision system and method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113592240B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
Yimer et al. A genetic approach to two-phase optimization of dynamic supply chain scheduling
Gayon et al. Using imperfect advance demand information in production-inventory systems with multiple customer classes
Carr et al. The inverse newsvendor problem: Choosing an optimal demand portfolio for capacitated resources
Kleywegt et al. Stochastic optimization
US11593878B2 (en) Order execution for stock trading
Gilli et al. A global optimization heuristic for portfolio choice with VaR and expected shortfall
EP1922677A1 (en) Novel methods for supply chain management incorporating uncertainity
Yang et al. A new adaptive neural network and heuristics hybrid approach for job-shop scheduling
US20220414570A1 (en) Method and System of Demand Forecasting for Inventory Management of Slow-Moving Inventory in a Supply Chain
Taheri-Bavil-Oliaei et al. Bi-objective build-to-order supply chain network design under uncertainty and time-dependent demand: An automobile case study
CN113283671B (en) Method and device for predicting replenishment quantity, computer equipment and storage medium
Horng et al. Ordinal optimization based metaheuristic algorithm for optimal inventory policy of assemble-to-order systems
US11593877B2 (en) Order execution for stock trading
CN113592240A (en) Order processing method and system for MTO enterprise
Perakis et al. Leveraging the newsvendor for inventory distribution at a large fashion e-retailer with depth and capacity constraints
Chih-Ting Du et al. Building an active material requirements planning system
Bansal et al. Brief application description. neural networks based forecasting techniques for inventory control applications
Katanyukul et al. Approximate dynamic programming for an inventory problem: Empirical comparison
CN114742657A (en) Investment target planning method and system
Maiti et al. Two storage inventory model in a mixed environment
CN113298316A (en) Intelligent manufacturing framework and method based on block chain, scheduling matching method and model
CN113626966A (en) Inventory replenishment method, computer-readable storage medium and terminal device
Tang et al. Online learning and matching for multiproduct systems with general upgrading
Kaynov Deep Reinforcement Learning for Asymmetric One-Warehouse Multi-Retailer Inventory Management
Guo et al. Designing the Customer Order Decoupling Point to Facilitate Mass Customization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant