CN113592240B - MTO enterprise order processing method and system - Google Patents

MTO enterprise order processing method and system

Info

Publication number
CN113592240B
CN113592240B (application CN202110749378.1A)
Authority
CN
China
Prior art keywords
order
state
current
mto
enterprise
Prior art date
Legal status
Active
Application number
CN202110749378.1A
Other languages
Chinese (zh)
Other versions
CN113592240A (en)
Inventor
吴克宇
钱静
胡星辰
陈超
成清
程光权
冯旸赫
杜航
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority claimed from CN202110749378.1A
Publication of CN113592240A
Application granted
Publication of CN113592240B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing


Abstract

An embodiment of the invention provides an MTO enterprise order processing method and system comprising the following steps: when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory, used to determine an optimal policy, from the enterprise's current order queue and the currently arriving order; converting the MDP-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the difficulty of solving the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model; and solving the post-state-based MDP optimization model with a three-layer artificial neural network to determine the optimal policy on whether to accept the currently arriving order. Modeling the dynamically changing arrival of orders in this way better matches how orders actually arrive in practice.

Description

MTO enterprise order processing method and system
Technical Field
The invention relates to the field of order acceptance optimization, in particular to an order processing method and system for an MTO enterprise.
Background
As customer demand for personalization continues to rise, more and more enterprises adopt the make-to-order (MTO) production mode so as to stay closer to end customers and meet their individual requirements. In the MTO mode an enterprise produces according to customer orders: different customers demand different order types, and the MTO enterprise organizes production around the orders its customers place. In general, the capacity of an MTO enterprise is limited and, with various cost factors added, the enterprise cannot accept every randomly arriving customer order, so the MTO enterprise needs to formulate a corresponding order acceptance policy. Studying how an MTO enterprise makes order selection decisions with limited resources therefore plays a great role in fully utilizing those limited resources and maximizing the enterprise's long-term profit.
In carrying out the present invention, the applicant has found that the prior art has at least the following problem: the existing policy models are too simplified to approximate reality.
Disclosure of Invention
Embodiments of the invention provide an MTO enterprise order processing method and system in which the dynamically changing arrival of orders is modeled in a way that better matches how orders actually behave.
In order to achieve the above objective, in one aspect, an embodiment of the present invention provides a method for processing an order of an MTO enterprise, including:
when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory for the enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;
converting the MDP-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the difficulty of solving the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and solving the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determining from the solution result the optimal policy on whether to accept the currently arriving order.
In another aspect, an embodiment of the present invention provides an MTO enterprise order processing system, including:
a model construction unit, configured to establish, after a current order arrives at a make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;
a model conversion unit, configured to convert the MDP-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
a model optimization unit, configured to reduce the difficulty of solving the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm, thereby obtaining a post-state-based MDP optimization model;
and a solving unit, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and to determine from the solution result the optimal policy on whether to accept the currently arriving order.
The above technical solutions have the following beneficial effect: the dynamically changing arrival of orders is modeled in a way that better matches how orders actually behave in practice.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of MTO Enterprise order processing according to an embodiment of the present invention;
FIG. 2 is a block diagram of an MTO Enterprise order processing system according to an embodiment of the present invention;
FIG. 3 is a three-layer neural network architecture;
FIG. 4 shows the sample learning rate;
FIG. 5 shows results for different unit production capacities;
FIG. 6 shows the average profit at different order arrival rates;
FIG. 7 shows the acceptance rate at different order arrival rates;
FIG. 8 shows the average profit at different inventory costs;
FIG. 9 shows the order acceptance rate at different inventory costs;
FIG. 10 shows results for different customer priority factors.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided an MTO enterprise order processing method, including:
s101: when the current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current arriving order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the current arriving order represents an order received by the MTO corporation but not yet decided whether to accept; the policy is to accept the current arriving order or reject the current arriving order, and the optimal policy is that the benefit of the MTO enterprise is optimal when the policy is selected;
s102: converting the order receiving strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion;
s103: the difficulty in solving the MDP model based on the rear state is reduced through the learning process of the learning parameter vector of the reinforcement learning algorithm, and the MDP optimization model based on the rear state is obtained;
s104: and solving the MDP optimization model based on the rear state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether an optimal strategy is accepted for the current arriving order or not according to the solving result.
Preferably, step 101 specifically includes:
assuming that no order is split, that each order is delivered to the customer in one shipment once its production is complete, and that an order, once accepted by the MTO enterprise, cannot be altered or canceled;
determining, from the current order queue and the currently arriving order, the information of each order in the queue and of the currently arriving order, the order information comprising: the customer priority μ, unit product price pr, required product quantity q, lead time lt and latest delivery date dt of the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determining the return sub-items involved according to the information of each order in the current order queue and of the currently arriving order, the return sub-items comprising: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ*J is incurred, where J is the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit of the order I = pr × q is obtained while the production cost C = c × q is consumed, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order whose delivery falls after its lead time but no later than its latest delivery date, the MTO enterprise incurs a delay penalty cost Y to be paid to the corresponding customer, where t represents the production time still required for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the delay penalty cost per unit product per unit time;
if the product of an order in the current order queue is completed within the lead time, the product is temporarily stored in the MTO enterprise's warehouse, generating an inventory cost N, where h represents the inventory cost per unit product per unit time;
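As an illustration of the return sub-items above, the following Python sketch computes each component for a candidate order. The exact delay-penalty and inventory-cost formulas are not reproduced in this text, so the forms used here (a penalty proportional to tardiness and a holding cost proportional to earliness) and all function names are assumptions for illustration only.

```python
# Illustrative sketch of the return sub-items (assumed functional forms).
def rejection_cost(mu, J):
    """Cost of rejecting the arriving order: mu * J."""
    return mu * J

def acceptance_profit(pr, q, c):
    """Profit I = pr*q minus production cost C = c*q for an accepted order."""
    return pr * q - c * q

def delay_penalty(mu, u, q, t, b, lt):
    """Assumed form: penalty grows with the time by which completion
    (remaining work t plus q/b for this order) exceeds the lead time lt."""
    tardiness = max(0.0, t + q / b - lt)
    return mu * u * q * tardiness

def inventory_cost(h, q, t, b, lt):
    """Assumed form: holding cost for finishing earlier than the lead time lt."""
    earliness = max(0.0, lt - (t + q / b))
    return h * q * earliness
```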
establishing, according to Markov decision process (MDP) theory and the information and return sub-items of each order in the current order queue and of the currently arriving order, the MDP-based order acceptance policy model of the MTO enterprise as a quadruple (S, A, f, R), namely a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S describes the state of the system in which the MTO enterprise order processing method runs; the state space S is an n × 6-dimensional vector, where n is the number of order types and 6 is the number of order information items: customer priority μ, unit product price pr, required product quantity q, lead time lt, latest delivery date dt, and the production completion time t still needed by the orders in the current order queue, t having a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when an order arrives at time m, the MTO enterprise must make an action decision to accept or reject it, and these actions form the action space A = (a_1, a_2), where a_1 represents accepting the order and a_2 represents rejecting it;
the state transition function f describes the transition from the state at the current decision time to the state at the next decision time, a decision time being a time at which an action is taken on a currently arriving order; the state transition function f is obtained as follows:
assuming that the order information items μ, pr, q, lt, dt are mutually independent and identically distributed across orders, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, and given the state s and the action a taken at decision time m, the probability density function of the state at the next decision time m+1 is written f(·|(s, a));
the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); the remaining production time t_{m+1} is given by formula (1);
formula (1) expresses that t_{m+1} depends on (s_m, a_m): different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order arrival interval AT_{m→m+1}, the inter-arrival time implied by the Poisson arrival process with parameter λ;
given the current state s and action a, and using the fact that μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} are independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is expressed as:
f(s′|s, a) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|s, a)
where f_T(t′|s, a) is the density of the production time still needed, at the next decision time m+1, for the accepted orders after action a is taken in state s; its specific form is defined by formula (1) and the related random variables;
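To make the transition structure concrete, the sketch below samples a next state under the independence assumptions above. The marginal distributions, the latest-delivery rule, and the update of the remaining production time (accepted work t + a·q/b minus the inter-arrival time, floored at zero) are assumptions consistent with the description, not the patent's exact formula (1).

```python
import random

def sample_next_state(s, a, lam=0.3, b=20):
    """s = (mu, pr, q, lt, dt, t); a = 1 accept, 0 reject.
    Returns the state observed when the next order arrives."""
    mu, pr, q, lt, dt, t = s
    p = t + a * q / b                     # remaining production time after the decision (assumed form)
    at = random.expovariate(lam)          # inter-arrival time of the Poisson arrival stream
    t_next = max(0.0, p - at)
    # The new order's attributes are drawn independently of (s, a); illustrative ranges.
    mu_n = random.uniform(0.0, 1.0)
    pr_n = random.uniform(30, 50)
    q_n = random.uniform(300, 500)
    lt_n = 36 - 0.4 * pr_n                # lead-time rule used in the simulation section
    dt_n = lt_n / 0.8                     # assumed latest-delivery rule with elasticity 0.8
    return (mu_n, pr_n, q_n, lt_n, dt_n, t_next)
```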
when the MTO enterprise makes a decision at time m, the return obtained after the action taken on the currently arriving order is given by the reward function R, where:
when a_m = 1, i.e. the MTO enterprise accepts the currently arriving order, the reward function R is I − C − Y − N;
when a_m = 0, i.e. the MTO enterprise rejects the currently arriving order, the reward function R is −μ*J;
for any policy within the MDP-based order acceptance policy model, a corresponding value function is defined from the reward function, and the average long-term profit of the policy is represented by this value function, where π denotes an arbitrary policy, γ denotes the future reward discount with 0 < γ ≤ 1 (setting γ guarantees that the defining summation is meaningful), and n denotes the total number of decision times, each currently arriving order corresponding to one decision time;
the optimal policy π* is determined from the average long-term profit of the policies π; the optimal policy π* guarantees that the long-term profit of the enterprise is maximized, so that the benefit of the MTO enterprise is optimal; π* is the policy in the set Π of all policies whose value function is largest.
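Since the value function above is an expected accumulation of the per-decision rewards, a policy can be scored by simulation. The following sketch estimates a policy's average discounted long-term profit, assuming the discounted-sum form described above and simulator helpers like the reward and transition sketches shown earlier; all names are illustrative assumptions.

```python
def evaluate_policy(pi, reward, step, s0, gamma=0.95, n_decisions=1000, episodes=100):
    """Estimate the average discounted long-term profit of policy pi by simulation."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(n_decisions):
            a = pi(s)                     # pi maps a state to accept(1)/reject(0)
            ret += discount * reward(s, a)
            discount *= gamma
            s = step(s, a)                # sample the next state (see the earlier sketch)
        total += ret
    return total / episodes
```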
Preferably, step 102 specifically includes:
after the currently arriving order is received at decision time m and a decision is made, a post-state variable p_m is set according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders once action a_m has been selected at decision time m; the post-state is an intermediate variable between two successive states;
the post-state variable p_m is determined from the current state s_m and the action a_m;
given p_m, the production time t_{m+1} still needed for the accepted orders at the next decision time m+1 is expressed as:
t_{m+1} = (p_m − AT)^+   (5)
where (x)^+ denotes the larger of the variable x and 0, and AT is the inter-arrival time between two orders; the conditional probability density of the next decision-time state s′ = (μ′, pr′, q′, lt′, dt′, t′) is then expressed as:
f(s′|p) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|p)
where the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is introduced, the conditional expectation E[·] in the MDP theory is rewritten, and a post-state value function is defined:
J*(p) = γE[V*(s′)|p]   (7)
the optimal policy π* is then constructed from the post-state value function, so that the optimal policy π* is transformed onto a one-dimensional state space; the optimal policy π* is expressed as formula (8), i.e. for each state it selects the action whose immediate reward plus post-state value is largest.
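The one-dimensional decision rule of formula (8) can be sketched as follows. Here post_state encodes the remaining production time after the decision (t + q/b if accepted, t if rejected), which is an assumed form consistent with the description, and J is any approximation of the post-state value function J*; the function names are illustrative.

```python
def post_state(s, a, b=20):
    """Remaining production time right after the decision (assumed form of the post-state)."""
    mu, pr, q, lt, dt, t = s
    return t + (q / b if a == 1 else 0.0)

def optimal_action(s, reward, J, b=20):
    """Greedy rule of formula (8): compare reward + J(post-state) for both actions."""
    scores = {a: reward(s, a) + J(post_state(s, a, b)) for a in (1, 0)}
    return max(scores, key=scores.get)    # 1 = accept, 0 = reject
```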
preferably, step 103 specifically includes:
after the optimal policy π* has been constructed through the post-state value function J*, the reinforcement learning algorithm does not compute J* directly when solving; instead, J* is solved for through a learning process on a parameter vector; solving for J* by learning a parameter vector specifically comprises:
determining an approximate post-state value function J̃(·; θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples with the reinforcement learning algorithm to obtain the learned parameter vector θ*, using J̃(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal policy π* from the approximation of J*.
Preferably, step 104 specifically includes:
J*(p) is solved for through a three-layer artificial neural network (ANN), which can approximate J*(p) to any accuracy; the approximate post-state value function realized by the three-layer ANN is:
J̃(p; θ) = Σ_{i=1}^{N} u_i * Φ_H(w_i * p + α_i) + β   (11)
where the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β]
Φ_H(x) = 1/(1 + e^{-x})
the function of formula (11) is a three-layer single-input single-output neural network: it has a single-node input layer whose output is the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th hidden node being the sum of the weighted post-state value w_i * p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function; the output layer has one node whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output is the final approximated function value J̃(p; θ).
As shown in fig. 2, in connection with an embodiment of the present invention, there is provided an MTO enterprise order processing system comprising:
a model construction unit 21, configured to establish, after a current order arrives at a make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the enterprise's current order queue and the currently arriving order, the MDP-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy under which the MTO enterprise's benefit is best;
a model conversion unit 22, configured to convert the MDP-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
a model optimization unit 23, configured to reduce the difficulty of solving the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm, thereby obtaining a post-state-based MDP optimization model;
and a solving unit 24, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and to determine from the solution result the optimal policy on whether to accept the currently arriving order.
Preferably, the model construction unit 21 is specifically configured to:
assume that no order is split, that each order is delivered to the customer in one shipment once its production is complete, and that an order, once accepted by the MTO enterprise, cannot be altered or canceled;
determine, from the current order queue and the currently arriving order, the information of each order in the queue and of the currently arriving order, the order information comprising: the customer priority μ, unit product price pr, required product quantity q, lead time lt and latest delivery date dt of the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determine the return sub-items involved according to the information of each order in the current order queue and of the currently arriving order, the return sub-items comprising: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ*J is incurred, where J is the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit of the order I = pr × q is obtained while the production cost C = c × q is consumed, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order whose delivery falls after its lead time but no later than its latest delivery date, the MTO enterprise incurs a delay penalty cost Y to be paid to the corresponding customer, where t represents the production time still required for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the delay penalty cost per unit product per unit time;
if the product of an order in the current order queue is completed within the lead time, the product is temporarily stored in the MTO enterprise's warehouse, generating an inventory cost N, where h represents the inventory cost per unit product per unit time;
establish, according to Markov decision process (MDP) theory and the information and return sub-items of each order in the current order queue and of the currently arriving order, the MDP-based order acceptance policy model of the MTO enterprise as a quadruple (S, A, f, R), namely a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S describes the state of the system in which the MTO enterprise order processing method runs; the state space S is an n × 6-dimensional vector, where n is the number of order types and 6 is the number of order information items: customer priority μ, unit product price pr, required product quantity q, lead time lt, latest delivery date dt, and the production completion time t still needed by the orders in the current order queue, t having a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when an order arrives at time m, the MTO enterprise must make an action decision to accept or reject it, and these actions form the action space A = (a_1, a_2), where a_1 represents accepting the order and a_2 represents rejecting it;
the state transition function f describes the transition from the state at the current decision time to the state at the next decision time, a decision time being a time at which an action is taken on a currently arriving order; the state transition function f is obtained as follows:
assuming that the order information items μ, pr, q, lt, dt are mutually independent and identically distributed across orders, with probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, and given the state s and the action a taken at decision time m, the probability density function of the state at the next decision time m+1 is written f(·|(s, a));
the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); the remaining production time t_{m+1} is given by formula (1);
formula (1) expresses that t_{m+1} depends on (s_m, a_m): different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order arrival interval AT_{m→m+1}, the inter-arrival time implied by the Poisson arrival process with parameter λ;
given the current state s and action a, and using the fact that μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} are independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is expressed as:
f(s′|s, a) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|s, a)
where f_T(t′|s, a) is the density of the production time still needed, at the next decision time m+1, for the accepted orders after action a is taken in state s; its specific form is defined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at time m, the return obtained after the action taken on the currently arriving order is given by the reward function R, where:
when a_m = 1, i.e. the MTO enterprise accepts the currently arriving order, the reward function R is I − C − Y − N;
when a_m = 0, i.e. the MTO enterprise rejects the currently arriving order, the reward function R is −μ*J;
for any policy within the MDP-based order acceptance policy model, a corresponding value function is defined from the reward function, and the average long-term profit of the policy is represented by this value function, where π denotes an arbitrary policy, γ denotes the future reward discount with 0 < γ ≤ 1 (setting γ guarantees that the defining summation is meaningful), and n denotes the total number of decision times, each currently arriving order corresponding to one decision time;
the optimal policy π* is determined from the average long-term profit of the policies π; the optimal policy π* guarantees that the long-term profit of the enterprise is maximized, so that the benefit of the MTO enterprise is optimal; π* is the policy in the set Π of all policies whose value function is largest.
Preferably, the model transformation unit 22 is specifically configured to:
after the currently arriving order is received at decision time m and a decision is made, set a post-state variable p_m according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders once action a_m has been selected at decision time m; the post-state is an intermediate variable between two successive states;
the post-state variable p_m is determined from the current state s_m and the action a_m;
given p_m, the production time t_{m+1} still needed for the accepted orders at the next decision time m+1 is expressed as:
t_{m+1} = (p_m − AT)^+   (5)
where (x)^+ denotes the larger of the variable x and 0, and AT is the inter-arrival time between two orders; the conditional probability density of the next decision-time state s′ = (μ′, pr′, q′, lt′, dt′, t′) is then expressed as:
f(s′|p) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|p)
where the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is introduced, the conditional expectation E[·] in the MDP theory is rewritten, and a post-state value function is defined:
J*(p) = γE[V*(s′)|p]   (7)
the optimal policy π* is then constructed from the post-state value function, so that the optimal policy π* is transformed onto a one-dimensional state space; the optimal policy π* is expressed as formula (8), i.e. for each state it selects the action whose immediate reward plus post-state value is largest.
preferably, the model optimization unit 23 is specifically configured to:
after the optimal policy π* has been constructed through the post-state value function J*, the reinforcement learning algorithm does not compute J* directly when solving; instead, J* is solved for through a learning process on a parameter vector; solving for J* by learning a parameter vector specifically comprises:
determining an approximate post-state value function J̃(·; θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples with the reinforcement learning algorithm to obtain the learned parameter vector θ*, using J̃(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal policy π* from the approximation of J*.
Preferably, the solving unit 24 is specifically configured to:
solve for J*(p) through a three-layer artificial neural network (ANN), which can approximate J*(p) to any accuracy; the approximate post-state value function realized by the three-layer ANN is:
J̃(p; θ) = Σ_{i=1}^{N} u_i * Φ_H(w_i * p + α_i) + β   (11)
where the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β]
Φ_H(x) = 1/(1 + e^{-x})
the function of formula (11) is a three-layer single-input single-output neural network: it has a single-node input layer whose output is the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th hidden node being the sum of the weighted post-state value w_i * p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function; the output layer has one node whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output is the final approximated function value J̃(p; θ).
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
The invention relates to an MTO enterprise order acceptance policy method and system based on post-state reinforcement learning, which address the incompletely considered modelling factors and the high complexity of the solution process in existing research. At the same time, a low-complexity solution algorithm combining the post-state with a neural network is proposed to solve the MTO enterprise order acceptance problem.
As customer demand for diversity continues to rise, the make-to-order (MTO) mode, in which production is organized according to the customer's order requirements, plays an increasingly important role in enterprise production activities. Because of its limited production capacity, an MTO enterprise needs to formulate a reasonable order acceptance policy, i.e. a way to decide, from its production capacity and the order state, whether to accept an arriving order, so as to improve the enterprise's production benefit.
Building on the traditional order acceptance problem, the invention proposes a more complete model of the MTO enterprise order acceptance problem: in addition to the traditional model elements of delayed delivery cost, rejection cost and production cost, it further considers the order inventory cost and multiple customer priority factors, and models the optimal order acceptance problem as a Markov decision process (MDP). Furthermore, since classical MDP solution methods rely on solving and estimating a high-dimensional state value function, their computational complexity is high. To reduce the complexity, the invention proposes replacing the high-dimensional state value function with a one-dimensional post-state value function, combined with a neural network to approximate the post-state value function. Finally, the applicability and superiority of the proposed order acceptance policy model and algorithm are verified through simulation.
1. Description of the problem to be solved by the invention and model assumption
The present invention assumes an MTO enterprise with limited capacity that produces on a single production line. Assume there are n types of customer orders on the market; the order-related information includes the customer priority μ, unit product price pr (pr abbreviates price), quantity q, lead time lt, and latest delivery date dt. The lead time lt is the contracted delivery time, i.e. the working time from the start to the end of order production when nothing unexpected happens; when the enterprise cannot fully guarantee delivery within the agreed time, a preset period is appended to the agreed delivery time, and the resulting date is the latest delivery date dt. Customer orders arrive according to a Poisson distribution with parameter λ. The unit price of the product within an order and the demanded quantity of the corresponding product each follow a uniform distribution.
When an order arrives, the enterprise must judge, according to its own production capacity, whether to accept it. If the order is rejected, a rejection cost μ*J is incurred: the higher the customer priority, the higher the rejection cost, where μ is the customer priority coefficient and J is the rejection cost when customer priority is not considered (J is also part of the order-related information).
If an order is accepted, the profit of the order, I = pr × q, is obtained, while the production cost C = c × q is consumed, where c is the unit production cost. The MTO enterprise produces accepted orders on a first-come-first-served basis. If an accepted order cannot be delivered within the lead time requested by the customer, the enterprise must pay a delay penalty cost Y, where t represents the production time still required for the orders accepted before the current order, b represents the enterprise's unit production capacity, and u represents the delay penalty cost per unit product per unit time; the higher the customer priority, the higher the delay penalty cost. Customers do not pick up products produced before the lead time, so when an order is completed early its products are temporarily stored in the MTO enterprise's warehouse, generating an inventory cost N, where h represents the inventory cost per unit product per unit time. No order is split for production, each order is delivered to the customer in one shipment once production is complete, and once an order has been accepted by the MTO enterprise the customer can no longer change or cancel it.
The problem the invention aims to solve is: when customer orders arrive randomly, how the MTO enterprise decides whether to accept the currently arriving order, taking into account the current order queue, delayed delivery cost, rejection cost, production cost, inventory cost and multiple customer priority factors, so as to maximize the enterprise's long-term average profit.
2. Order acceptance strategy modeling based on MDP theory
It can be seen that the MTO enterprise order acceptance decision problem is a class of stochastic sequential decision problems (also called multi-stage decision problems of stochastic systems). After the decision maker of the MTO enterprise decides to accept or reject an order, the system state changes, but the evolution after the current stage is not influenced by the states of the stages before it, i.e. the process has no after-effect (the Markov property). The problem can therefore be abstracted into an MDP model according to Markov decision process (MDP) theory. The MDP model is defined as a quadruple (S, A, f, R), respectively representing the state space S, the action space A, the state transition function f, and the reward function R:
1) System state: assuming there are n order types in the order acceptance system, the system state can be represented by the vector S = (μ, pr, q, lt, dt, t), where t represents the production completion time still required for the accepted orders; since the MTO enterprise has limited capacity, t has a maximum upper limit.
2) System action set: at time m, when a customer order arrives, the MTO enterprise needs to decide whether to accept or reject it, and the action set in the model can be represented by the vector A = (a_1, a_2), where a_1 represents accepting the order and a_2 represents rejecting it. The actions in A apply only to the currently arriving order, not to the orders already in the order queue.
3) State transition model: given an initial state s (the state corresponding to the first arriving order in the order acceptance system) and an action a that has been taken, the probability density function of the next state is denoted f(·|(s, a)). Assuming the order information items μ, pr, q, lt, dt are mutually independent and identically distributed, their distributions can be represented by the probability density functions f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively. The order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is therefore independent of (s_m, a_m), i.e. the information of the customer order arriving at the next decision time does not depend on the state s_m and action a_m at the current decision time. However, t_{m+1} does depend on (s_m, a_m): different (q_m, t_m, a_m) lead to different order production times, and t_{m+1} is also affected by the order arrival interval, as expressed in formula (1),
where AT_{m→m+1} denotes the inter-arrival time between two orders, implied by the Poisson arrival process with parameter λ.
Since the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), given the current state s and action a the conditional probability density of the next decision-time state s′ can be written as:
f(s′|s, a) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|s, a)
where f_T(t′|s, a) denotes the density of the production time still required, at the next decision time, for the accepted orders after action a is taken in state s; its specific form can be defined from (1) and the related random variables.
4) Reward function: at decision time m, after deciding whether to accept an order, the MTO enterprise obtains an immediate reward, where I denotes the profit of the order (unit product price × product quantity), C the production cost, Y the delay penalty cost (incurred if the order is completed after its lead time), and N the inventory cost (incurred if the order is completed before its lead time, since customers do not pick up products early).
In the current state, after the decision maker takes the corresponding action (rejecting or accepting the order), the reward function evaluates how good that action is for the system.
5) Optimal policy: for the MTO enterprise order acceptance problem, the goal is to find an optimal order acceptance policy π* that maximizes the enterprise's long-term profit. Each policy π is a function from system states to actions that determines how the enterprise chooses whether to accept an order given the current state information. For any policy π, its value function is defined as its average long-term profit, i.e. the cumulative discounted reward obtained by following the policy from the current state. The value function is given by formula (2),
where 0 < γ ≤ 1 is the future reward discount (it guarantees that the summation defined in (2) is meaningful). We are interested in the optimal policy π* among all policies, defined as the policy in the set Π of all policies whose value function is largest.
Standard MDP theory is introduced as follows:
in MDP theory, the optimal policy π* can be constructed from the state value function V*:
π*(s) = argmax_{a∈A} { R(s, a) + γE[V*(s′)|s, a] }   (3)
where E[·] denotes the expectation over the next random state s′ given the current state s and action a. V* itself is the solution of the Bellman equation:
V*(s) = max_{a∈A} { R(s, a) + γE[V*(s′)|s, a] }   (4)
and V* can be solved by the value iteration method.
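As a reminder of what value iteration over the state value function involves, the sketch below performs the standard Bellman backup of formula (4) on a finite set of sampled states, approximating the expectation by Monte-Carlo samples. The helper names (reward(s, a), a two-argument sampler sample_next(s, a)) are assumptions carried over from the earlier sketches; its cost grows quickly with the number and dimension of states, which is the difficulty the post-state formulation is meant to avoid.

```python
def value_iteration(states, reward, sample_next, gamma=0.95, n_samples=50, sweeps=100):
    """Approximate V* on a finite set of representative states (standard MDP backup)."""
    V = {s: 0.0 for s in states}

    def nearest(s):
        # Crude nearest-neighbour lookup so sampled next states map back onto the grid.
        return min(V, key=lambda x: sum((xi - si) ** 2 for xi, si in zip(x, s)))

    for _ in range(sweeps):
        for s in states:
            backups = []
            for a in (1, 0):
                # Monte-Carlo estimate of E[V*(s') | s, a]
                exp_v = sum(V[nearest(sample_next(s, a))] for _ in range(n_samples)) / n_samples
                backups.append(reward(s, a) + gamma * exp_v)
            V[s] = max(backups)
    return V
```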
It can be seen that classical MDP theory provides an optimal-policy solution method based on the state value function V*. However, solving for the state value function V* is difficult here, because the system state of the problem considered by the invention is high-dimensional and continuous, so the computational complexity of value iteration, which is based on evaluating the expectation E[·], is not affordable. To overcome this, the invention proposes an optimal policy construction method based on a post-state value function.
3. Post-state based MDP model conversion
The post-state is an intermediate variable between two successive states and can be used to simplify the optimal control of certain MDPs. The concept of the post-state is a technique in reinforcement learning often used for board-game learning tasks. For example, an agent playing chess with a reinforcement learning algorithm controls its own moves but not the randomized moves of its opponent. Before deciding on an action, the agent faces a specific position of the pieces on the board, which corresponds to the state in the classical MDP model. The post-state of each move is defined as the state of the board after the agent's move but before the opponent moves. If the agent can learn the winning probability of every possible post-state, these known probabilities can be used to behave optimally, i.e. simply select the action whose post-state has the greatest winning probability.
The present invention uses a similar post-state approach to transform the order acceptance problem. Specifically, in the order acceptance problem considered here, the post-state variable p_m is defined as the production time still required for the accepted orders after action a_m is selected at decision time m. Thus, given the current state s_m and action a_m, the post-state is a deterministic function of them, written p_m = σ(s_m, a_m).
It can be seen that, given p_m, the production time t_{m+1} still needed at the next decision time m+1 can be expressed as:
t_{m+1} = (p_m − AT)^+   (5)
where (x)^+ denotes the larger of the variable x and 0, and AT is the inter-arrival time between two orders. Therefore, given the current post-state p, the conditional probability density of the next decision-time state s′ = (μ′, pr′, q′, lt′, dt′, t′) can be expressed as:
f(s′|p) = f_M(μ′) * f_PR(pr′) * f_Q(q′) * f_LT(lt′) * f_DT(dt′) * f_T(t′|p)   (6)
where the conditional probability density function f_T(·|p) is defined from (5) and the related random variables.
The conditional expectation E[V(s′)|s, a] appearing in formulae (3) and (4) can therefore be rewritten as the conditional expectation E[V(s′)|σ(s, a)] represented by (6), so π* can be redefined as follows. First, the post-state value function is defined as:
J*(p) = γE[V*(s′)|p]   (7)
Substituting (7) into (3), the optimal policy π* can be constructed with the post-state value function J* as:
π*(s) = argmax_{a∈A} { R(s, a) + J*(σ(s, a)) }   (8)
Further, substituting (7) into (4) yields:
V*(s) = max_{a∈A} { R(s, a) + J*(σ(s, a)) }
and since, by (7), γE[V*(s′)|p] is exactly J*(p), it can be derived that:
J*(p) = γE[ max_{a∈A} { R(s′, a) + J*(σ(s′, a)) } | p ]
Finally, the value iteration algorithm of reinforcement learning is applied to solve for J* [19]: with J_0 initialized arbitrarily,
J_{k+1}(p) = γE[ max_{a∈A} { R(s′, a) + J_k(σ(s′, a)) } | p ]   (9)
and as k → ∞, J_k converges to J*.
From equation (3), computing the optimal policy requires the expectation E[V*(s′)|s, a], whereas the optimal policy in formula (8) involves no expectation at all: it directly uses J* in place of E[V*(s′)|s, a], and the high-dimensional state space is reduced to a one-dimensional one, which greatly reduces the solution complexity.
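A minimal sketch of the post-state backup of formula (9): for a sampled post-state p, the target is the discounted expectation, over the next arriving order, of the best "immediate reward plus next post-state value". The helper names and the next post-state form t′ + a·q′/b are assumptions consistent with the earlier description.

```python
def post_state_backup(p, J, reward, sample_order, sample_interval, b=20,
                      gamma=0.95, n_samples=50):
    """One sample-based backup: J_{k+1}(p) ~= gamma * E[ max_a { R(s',a) + J_k(sigma(s',a)) } | p ]."""
    total = 0.0
    for _ in range(n_samples):
        mu, pr, q, lt, dt = sample_order()            # new order information, drawn i.i.d.
        t_next = max(0.0, p - sample_interval())      # remaining work when the next order arrives
        s_next = (mu, pr, q, lt, dt, t_next)
        best = max(reward(s_next, a) + J(t_next + a * q / b) for a in (1, 0))
        total += best
    return gamma * total / n_samples
```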
4. Optimal control based on neural network
It has been shown above that π* can be constructed from J*, and that J* can be solved by the value iteration of formula (9). However, implementing (9) raises two difficulties: first, if f_M(·), f_PR(·), f_Q(·), f_LT(·), f_DT(·) and f_T(·|σ(s, a)) are not available, the expectation E[·|p] cannot be computed; second, since the post-state varies continuously, each iteration of (9) would have to be computed over an infinite number of p values. Reinforcement learning provides an effective way to address both difficulties: instead of computing J* directly, it approximates J* by learning a parameter vector, and the learning process uses data samples. In other words, the design of the reinforcement learning (RL) algorithm includes:
1) Parameterization: this determines how the approximate function J̃(·; θ) is obtained from a given parameter vector θ, where θ is the parameter vector approximating the post-state value function.
2) Parameter learning: the parameter vector θ* is learned from a batch of data samples, and J̃(·; θ*) is used to approximate J*, i.e. the resulting policy is obtained by replacing J* in (8) with J̃(·; θ*):
π(s) = argmax_{a∈A} { R(s, a) + J̃(σ(s, a); θ*) }   (10)
Comparing (10) with (8), if J̃(·; θ*) approximates J*(p) well, then the policy of (10) is near the optimal policy π*.
4.1 neural network approximation
The universal approximation theorem shows that a three-layer artificial neural network (ANN) can approximate a continuous function to any accuracy, so an ANN is a good choice for the J*(p) to be solved in the invention. The approximation J̃(p; θ) can therefore be expressed by a neural network as:
J̃(p; θ) = Σ_{i=1}^{N} u_i * Φ_H(w_i * p + α_i) + β   (11)
where the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β],
Φ_H(x) = 1/(1 + e^{-x}).
The function in equation (11), shown in FIG. 3, is a three-layer single-input single-output neural network. Specifically, there is a single-node input layer whose output is the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i * p and the hidden-layer bias α_i. The input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function. Finally, the output layer has one node whose output is the final approximated function value J̃(p; θ); its input is the sum of the weighted hidden-layer outputs and the output-layer bias β.
4.2 iterative training ANN (three-layer artificial neural network)
In order to achieve optimal control (i.e. to obtain an optimal strategy), parameters in the three-layer artificial neural network are trained through a value iteration method. The specific training procedure is as follows.
1) Acquisition of training data: the invention requires a batch of training samples Γ, where for each m the sample contains p_m, μ_m, pr_m, q_m, lt_m, dt_m, with p_m drawn from a uniform distribution and μ_m ~ f_M(·), pr_m ~ f_PR(·), q_m ~ f_Q(·), lt_m ~ f_LT(·), dt_m ~ f_DT(·). Further, t_m is generated from p_m as t_m = (p_m − AT)^+, where AT is the random inter-arrival time of the Poisson arrival stream with parameter λ.
2) Iterative fitting (see Algorithm 1 in Table 1): following (9), at the k-th iteration the current ANN parameter vector is θ_k, which through (11) fixes the current value function J̃(·; θ_k). Given this current value function, an updated value function J_{k+1}(p) is defined by applying the backup of (9) with J_k replaced by J̃(·; θ_k), as in formula (12).
The updated parameters should then yield a new function J̃(·; θ_{k+1}) that approaches J_{k+1}(p). Therefore, from Γ and θ_k we construct a batch of training data Y_k consisting of pairs (p_m, o_m),
where o_m is the target value, i.e. the desired value of J_{k+1}(p_m) given the sample, as shown in formula (12).
Based on the training data Y_k, the parameter update can be expressed as minimizing the training error over θ, as in formula (14),
where L(θ|Y_k) is the training error between the network outputs J̃(p_m; θ) and the targets o_m, as in formula (15).
3) Training parameters: the ANN parameters θ_{k+1} are solved for using gradient descent so that the error of J̃(·; θ) on Y_k is minimal, as in formula (14). Gradient descent searches the parameter space iteratively: the initial parameter θ^(0) of the gradient iteration is set to θ_k, and the parameters are updated during the iteration as in formula (16),
where α is the update step-size parameter and the update direction is the gradient of L(θ|Y_k), defined in formula (15), evaluated at θ^(z). Thus, given a sufficient number of iterations Z, θ_{k+1} = θ^(Z) is used as an approximate solution of (14), and J̃(·; θ_{k+1}) is the resulting value-function approximation.
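Putting the training steps together, the sketch below first generates a batch of samples as in step 1) and then fits the 1-N-1 network of formula (11) to regression targets by gradient descent, in the spirit of Algorithm 1. The distribution ranges, the latest-delivery rule, the squared-error loss and all helper names are assumptions for illustration; they are not the patent's exact formulas (12)-(16).

```python
import random
import numpy as np

def generate_training_batch(M, lam=0.3, p_max=100.0):
    """Step 1): sample M tuples (p, mu, pr, q, lt, dt, t) as described above."""
    batch = []
    for _ in range(M):
        p = random.uniform(0.0, p_max)              # post-state sampled uniformly
        mu = random.uniform(0.0, 1.0)               # customer priority
        pr = random.uniform(30, 50)                 # unit price
        q = random.uniform(300, 500)                # quantity
        lt = 36 - 0.4 * pr                          # lead-time rule from the simulation section
        dt = lt / 0.8                               # assumed latest-delivery rule
        at = random.expovariate(lam)                # inter-arrival time of the order stream
        t = max(0.0, p - at)                        # t_m = (p_m - AT)^+
        batch.append((p, mu, pr, q, lt, dt, t))
    return batch

def fit_post_state_value(p, o, theta, lr=0.001, iters=500):
    """Steps 2)-3): fit J~(p; theta) = u . sigmoid(w*p + alpha) + beta to targets o."""
    w, alpha, u, beta = (np.asarray(theta[0], float), np.asarray(theta[1], float),
                         np.asarray(theta[2], float), float(theta[3]))
    p = np.asarray(p, float)[:, None]               # shape (M, 1)
    o = np.asarray(o, float)                        # shape (M,)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(p * w + alpha)))  # (M, N) hidden activations
        err = h @ u + beta - o                      # prediction error, shape (M,)
        # Gradients of the mean squared training error for each parameter group.
        g_u = 2 * (err[:, None] * h).mean(axis=0)
        g_beta = 2 * err.mean()
        back = 2 * (err[:, None] * u) * h * (1 - h)
        g_w = (back * p).mean(axis=0)
        g_alpha = back.mean(axis=0)
        w, alpha = w - lr * g_w, alpha - lr * g_alpha
        u, beta = u - lr * g_u, beta - lr * g_beta
    return w, alpha, u, beta
```

In the outer loop of Algorithm 1, the targets o_m would be recomputed from the current parameters before each call (for example with a sample-based backup like the post_state_backup sketch above), and the fitted parameters become θ_{k+1}.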
TABLE 1. Algorithm 1: approximating J*(p)
In summary, the beneficial effects obtained by the invention are as follows:
1) Starting from the idea of revenue management, the invention addresses the MTO enterprise order acceptance problem in a stochastic dynamic environment; in addition to the enterprise's production cost, delay penalty cost and rejection cost, it also considers the inventory cost of orders completed ahead of the lead time and multiple customer priority levels, and on this basis constructs an MDP (Markov decision process) order acceptance model.
2) The invention converts the solution of the optimal strategy in the traditional MDP through the post-state method, and proves that the optimal strategy based on the state value function in the classical MDP problem can be equivalently defined and constructed using the post-state value function; the multidimensional control problem is thereby converted into a one-dimensional control problem, which greatly simplifies the solution process.
3) Traditional algorithms such as SARSA and SMART are tabular reinforcement learning methods and can only handle optimal decision problems over a discrete state space. In order to learn the order acceptance strategy in a continuous state space, the invention uses a neural network to parameterize the post-state value function and designs a corresponding training algorithm, thereby realizing the estimation of the post-state value function and a fast solution of the order acceptance strategy.
5. Numerical simulation experiment
The order information required in the simulation is generated according to the following rules: the order price pr obeys the uniform distribution U(e_1, l_1); the order quantity q obeys the uniform distribution U(e_2, l_2); order arrivals obey a Poisson distribution with parameter λ; the order lead time lt decreases linearly with the order price, namely lt = δ − β·pr; and the latest acceptable delivery time is taken as an integer set to satisfy a relation in which φ is the earliness elasticity coefficient.
Here we choose pr ~ U[30, 50], q ~ U[300, 500], λ = 0.3, δ = 36, β = 0.4, φ = 0.8. The unit production capacity, unit production cost and rejection cost of the enterprise are b = 20, c = 15 and J = 200, respectively. The per-unit-time, per-unit-quantity delay penalty cost is u = 4, the customer level obeys the uniform distribution μ ~ U(0, 1], and the per-unit-time, per-unit-quantity inventory cost is h = 4. Finally, the initial learning rate of the algorithm is α = 0.001 and the exploration rate is ε = 0.1.
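For concreteness, one possible Python implementation of these order-generation rules under the parameter values above is sketched below; the rounding used for dt is a placeholder assumption, since the exact integer relation involving φ is not reproduced in the text:

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_order(lam=0.3, delta=36.0, beta=0.4, phi=0.8):
    """Generate one arriving order (mu, pr, q, lt, dt) and its inter-arrival time.

    Follows the simulation rules of Section 5; dt = ceil(lt / phi) is used
    purely as a placeholder for the unspecified integer relation.
    """
    at = rng.exponential(1.0 / lam)      # inter-arrival time of a Poisson(lam) stream
    mu = rng.uniform(0.0, 1.0)           # customer priority level
    pr = rng.uniform(30.0, 50.0)         # unit product price
    q = rng.uniform(300.0, 500.0)        # required quantity
    lt = delta - beta * pr               # lead time, linearly decreasing in price
    dt = float(np.ceil(lt / phi))        # latest delivery time (placeholder rounding)
    return at, (mu, pr, q, lt, dt)

orders = [generate_order() for _ in range(5)]
```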
The simulation experiment is programmed in Python, and the effectiveness of the MTO enterprise order acceptance policy produced by the reinforcement-learning value iteration with neural network algorithm over the one-dimensional post-state space provided by the invention, hereinafter the AFVINN algorithm, is analyzed.
The simulation experiment consists of two parts. In the first part, the learning efficiency of the AFVINN algorithm is first compared with that of the traditional Q-learning algorithm, the comparison being evaluated through sample utilization efficiency; second, following the comparison strategy of documents [13, 16], the long-term average profit and the order acceptance rate of the proposed algorithm are compared with those of the FCFS method, where the FCFS method accepts an arriving order directly if the enterprise has the capacity to finish production within the latest delivery deadline, and the order acceptance rate is the number of orders accepted divided by the total number of orders that arrived. In the second part, the influence of the AFVINN algorithm on the average profit and the order acceptance rate of the MTO enterprise is first examined under the two scenarios of considering and not considering inventory cost; second, the influence of the AFVINN algorithm on the average profit and the order acceptance rate is analyzed by adjusting the delay penalty cost and rejection cost factors related to customer priority; finally, the influence of the AFVINN algorithm on the average profit of the MTO enterprise is examined under the three scenarios of multiple customer priorities, three customer priorities, and no customer priorities.
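A minimal sketch of the FCFS acceptance rule used as the baseline is given below; the capacity test (whether the existing backlog plus the new order can be produced by its latest delivery time) is an interpretation of the textual description, not code from the patent:

```python
def fcfs_accept(t_backlog, q, dt, b):
    """FCFS baseline: accept the arriving order iff it can be finished by dt.

    t_backlog: remaining production time of already-accepted orders,
    q: quantity of the arriving order, dt: its latest delivery time,
    b: unit production capacity of the enterprise.
    """
    return t_backlog + q / b <= dt

# Example: backlog of 10 time units, order of 400 units, dt = 40, b = 20
print(fcfs_accept(10.0, 400.0, 40.0, 20.0))   # True -> accept
```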
5.1 Algorithm comparison
In prior research on reinforcement-learning order acceptance strategies for MTO enterprises, modeling and solving are carried out over the traditional multidimensional state space. Here, the learning efficiency of the AFVINN algorithm is compared with that of the traditional Q-learning algorithm; each iteration consumes 200 data samples, and learning efficiency is evaluated according to the number of data samples consumed.
As can be seen from fig. 4: (1) given enough data samples, the AFVINN algorithm and the traditional Q-learning algorithm converge to the same value; (2) the learning efficiency of the AFVINN algorithm is far higher than that of the traditional Q-learning algorithm, roughly 1000 times higher. Therefore, the AFVINN algorithm provided by the invention not only converts the high-dimensional control problem into a one-dimensional control problem and simplifies the solution process, but also keeps the long-term average profit of the MTO enterprise at a higher level.
Table 2 shows that: (1) the AFVINN algorithm is superior to the FCFS method in maximizing the long-term average profit of the MTO enterprise; (2) the AFVINN algorithm maintains a higher average profit even though its order acceptance rate is lower than that of the FCFS method. The AFVINN algorithm therefore accepts higher-profit orders with higher probability, achieving the goal of maximizing the enterprise's long-term average profit.
TABLE 2 Basic scenario

                        AFVINN algorithm    FCFS method
Average profit               367.2365         309.4537
Order acceptance rate          0.1324           0.2698
Production capacity is critical to the profitability of an MTO enterprise. The unit production capacity of the MTO enterprise is varied while the other parameters remain the same as in the basic scenario, in order to observe how the order acceptance strategies of the AFVINN algorithm and the FCFS method change.
As shown in FIG. 5: (1) the AFVINN algorithm always maintains a higher profit level under different enterprise production capacities; (2) when the unit production capacity of the enterprise is reduced, the average profit under the AFVINN algorithm and the FCFS method falls by 38.667% and 41.857% respectively, whereas when the unit production capacity increases from 20 to 35, the average profit under the AFVINN algorithm and the FCFS method rises by 128.6104% and 122.9773% respectively. Therefore, the AFVINN algorithm can make reasonable use of the enterprise's limited resources, creating higher profits for the enterprise and adapting better when resources are constrained.
The order arrival rate is also an important factor in MTO enterprise order acceptance decisions. The order arrival rate is varied while the other parameters remain the same as in the basic scenario, to observe how the order acceptance strategies of the AFVINN algorithm (the former column of each group in FIG. 6 and FIG. 7) and the FCFS method (the latter column of each group in FIG. 6 and FIG. 7) change. As can be seen from FIG. 6 and FIG. 7: (1) the order acceptance rate increases when λ decreases and decreases when λ increases; this is because as the number of orders arriving per unit time increases, i.e. the interval between two order arrivals shrinks, the scheduling slack of the MTO enterprise for accepted orders decreases, so the probability of completing an accepted order within the latest delivery deadline falls, and the order acceptance rate falls with it. (2) The order acceptance rate under the AFVINN algorithm is lower than that of the FCFS method, but the average profit is higher. It follows that the AFVINN algorithm better accommodates the uncertainty of customer order arrivals.
5.2 model comparison
Neither of the prior documents [15-16] considers inventory cost when modeling and solving the order acceptance problem with reinforcement learning algorithms. This section compares the AFVINN order acceptance strategy with and without inventory cost, where the former includes inventory cost in the modeling and solving of the order acceptance problem and the latter ignores it. As can be seen from FIG. 8 and FIG. 9: (1) the order acceptance rate without inventory cost is higher than that with inventory cost, but the average profit with the inventory cost factor is always higher than that without it; (2) with other factors unchanged, the order acceptance strategy that considers inventory cost changes as the inventory cost changes, while the strategy that ignores inventory cost is unaffected by such changes; (3) as the inventory cost keeps increasing, the average profit of an enterprise that considers inventory cost declines more slowly than that of an enterprise that does not. Therefore, inventory cost should be considered in modeling and solving the MTO enterprise order acceptance problem, so that the enterprise can make different acceptance decisions under different inventory costs and maximize its long-term average profit; in practice, inventory cost affects enterprise profit, ties up enterprise funds and hampers their circulation, so this factor cannot be ignored in the order acceptance process.
Most prior documents consider only order features and assume that all customers are equally important; although document [16] models and solves with reinforcement learning and involves a customer priority factor, it classifies customer priority into only three classes, whereas in real life there are many levels of customer priority. In this group of experiments, based on the basic scenario, the unit delay penalty cost is first varied with the rejection cost held fixed, and then the rejection cost is varied with the unit delay penalty cost held fixed.
TABLE 3 Varying the unit delay penalty cost and the rejection cost
As can be seen from fig. 10 and Table 3: (1) the long-term average profit of the enterprise that considers multiple customer priorities is greater than that of the enterprise that considers only three customer priorities or ignores customer priority; (2) under the AFVINN algorithm, the acceptance rate of orders from customers with grade greater than or equal to 0.5 decreases as the delay penalty cost increases, while the acceptance rate of orders from customers with grade less than 0.5 increases with the delay penalty cost; (3) when the rejection cost increases, i.e. the rejection cost has a growing influence on MTO enterprise profit, the AFVINN algorithm makes the acceptance rate of orders with customer grade greater than or equal to 0.5 rise and the acceptance rate of orders with customer grade less than 0.5 fall.
The reason is that when the delay penalty cost is high, accepting an order from a high-priority customer and failing to complete production within the stipulated period forces the enterprise to pay a high cost, so the acceptance rate of high-priority customer orders falls as the delay penalty cost rises; when the rejection cost is high, rejecting an order from a high-priority customer entails a higher cost, so the acceptance rate of high-priority customer orders rises with the rejection cost. Thus, when the delay penalty cost is larger, the enterprise may accept more orders from lower-priority customers and appropriately fewer from higher-priority customers, while when the rejection cost is larger, the enterprise may accept more orders from high-priority customers. The AFVINN algorithm therefore adjusts the order acceptance strategy in time when facing different delay penalty and rejection costs, reducing their impact on the MTO enterprise's average profit as much as possible and maximizing the long-term average profit.
On the basis of the factors considered in conventional MTO enterprise order acceptance studies, the invention adds order inventory cost and multiple customer priority factors, constructs a Markov decision process order acceptance model, and applies the AFVINN algorithm to solve it. The algorithm not only converts the multidimensional state space of the MTO order acceptance problem into a one-dimensional state space and simplifies the solution process, but also keeps the enterprise's long-term average profit at a higher level.
Simulation experiments show that, in the MTO enterprise order acceptance problem, customer priority and inventory cost factors are important to the enterprise's acceptance strategy and profit; compared with the traditional Q-learning algorithm, the AFVINN algorithm converts the high-dimensional control problem into a one-dimensional control problem, improves sample utilization efficiency, and simplifies the solution process; the AFVINN algorithm outperforms the FCFS method in maximizing the enterprise's long-term average profit, has stronger order selection ability and better adaptability to environmental changes, and can balance order profit against the various cost factors to bring higher profit to MTO enterprises. The invention models orders as dynamically arriving (order information cannot be obtained in advance, and the currently arriving order is uncertain), which better matches the actual state of orders; the modeling and solving are more comprehensive in the factors considered, such as inventory cost and customer priority; and reducing the state-space dimension lowers the computational difficulty of solving the model.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of the various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "includes" is intended to be inclusive in a manner similar to the term "comprising" as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a "non-exclusive or".
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not meant to limit the invention to the particular embodiments or to restrict its scope; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (6)

1. An MTO enterprise order processing method, comprising:
when the current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on Markov decision process (MDP) theory for the current order queue and the current arriving order of the MTO enterprise, wherein the MDP-theory-based order acceptance strategy model is used for determining an optimal strategy; wherein the current arriving order represents an order received by the MTO enterprise for which it has not yet been decided whether to accept; the strategy is to accept or reject the current arriving order, and the optimal strategy is the strategy under which the benefit of the MTO enterprise is optimal;
converting the MDP-theory-based order acceptance strategy model according to the post-state theory of the reinforcement learning algorithm to obtain, after conversion, the post-state-based MDP model; reducing the solving difficulty of the post-state-based MDP model through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain the post-state-based MDP optimization model;
solving the post-state-based MDP optimization model by adopting a three-layer artificial neural network to obtain a solving result, and determining according to the solving result the optimal strategy of whether to accept the current arriving order;
wherein, after the current order arrives at the order-oriented production MTO enterprise, establishing the order acceptance strategy model based on Markov decision process MDP theory for the current order queue and the current arriving order of the MTO enterprise specifically comprises the following steps:
assuming that each order is not split, that an order is delivered to the customer in one batch after completion, and that an order cannot be altered or canceled once accepted by the MTO enterprise;
determining information of each order in the current order queue and of the current arriving order according to the current order queue and the current arriving order, wherein the order information comprises: the customer priority μ, unit product price pr, required product quantity q, lead time lt and latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals obey a Poisson distribution with parameter λ, and the unit product price and the required product quantity in an order each obey a uniform distribution;
determining the return sub-items faced according to the information of each order in the current order queue and the current arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the current arriving order, the profit of accepting the current arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:
if the current arriving order is rejected, a rejection cost μ·J is incurred, where J represents the rejection cost when customer priority is not considered;
if the current arriving order is accepted, the profit I of the order is obtained: I = pr·q, while the production cost C is consumed: C = c·q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces according to the first-come-first-served principle; if a delayed order exists in the current order queue, that is, an order whose delivery time exceeds its latest delivery date, the MTO enterprise incurs a delay penalty cost Y paid to the corresponding customer of the delayed order, wherein t represents the production time still required for orders that have been accepted, b represents the unit production capacity of the MTO enterprise, and u represents the per-unit-time, per-unit-product delay penalty cost of the MTO enterprise;
if the product of an order in the current order queue is produced to completion within the lead time, the product is temporarily stored in the MTO enterprise warehouse, generating an inventory cost N, wherein h represents the per-unit-time inventory cost of a unit product;
according to Markov decision process MDP theory, and the information and return sub-items of each order in the current order queue and of the current arriving order, establishing the MDP-theory-based order acceptance strategy model of the MTO enterprise, wherein the MDP-theory-based order acceptance strategy model of the MTO enterprise is a quadruple (S, A, f, R) consisting of a state space S, an action space A, a state transfer function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, unit product price pr, required product quantity q, lead time lt, latest delivery time dt, and the production completion time t still needed by orders in the current order queue, wherein t has a preset maximum upper limit;
the action space A represents the collection of actions for the current arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision to accept or reject the order, and the actions of accepting or rejecting the order form the action space A, A = (a_1, a_2), wherein a_1 represents accepting the order and a_2 represents rejecting the order;
the state transfer function f represents the transfer from the current state to the state at the next decision time, the m-th decision time being the time at which an action is taken on the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, with probability density functions denoted f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density function of the state at the next decision instant m+1, given the initial state s and the action a taken at the m-th decision instant, is written f(·|(s, a));
the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); wherein t_{m+1} is expressed as:
formula (1) indicates that t_{m+1} depends on (s_m, a_m), that different (q_m, t_m, a_m) lead to different order production times, and that t_{m+1} is also affected by the order arrival interval; wherein AT_{m→m+1} represents the arrival interval between two orders, determined by the Poisson arrival of each order with parameter λ;
according to the current state s and action a, and the fact that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s, a) = f_M(μ′)·f_PR(pr′)·f_Q(q′)·f_LT(lt′)·f_DT(dt′)·f_T(t′|s, a),
wherein f_T(t′|s, a) represents the density of the production time still needed for accepted orders at the next decision time m+1 after action a is taken in the current state s; the specific form of f_T(t′|s, a) is determined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at m, the corresponding return obtained after the action taken on the current arriving order is represented by a reward function R, which is represented as:
wherein a_m = 1 means that the MTO enterprise accepts the current arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the current arriving order, in which case the reward function R is −μ·J;
for any policy within the MDP theory-based order acceptance policy model, defining a corresponding cost function according to a reward function, and representing average long-term profits corresponding to the policy through the cost function, wherein the cost function is expressed as:
wherein π represents any strategy, γ represents the discount on future rewards with 0 < γ ≤ 1, γ being set so that the summation defined by the formula is well defined, and n represents the total number of decision moments, each arriving order corresponding to one decision moment;
determining the optimal strategy π* from the average long-term profit of each strategy π, the optimal strategy π* being used to guarantee maximization of the enterprise's long-term profit, at which point the benefit of the MTO enterprise is optimal; the optimal strategy π* is expressed as:
wherein Π represents the set of all strategies;
wherein converting the MDP-theory-based order acceptance strategy model according to the post-state theory of the reinforcement learning algorithm to obtain, after conversion, the post-state-based MDP model specifically comprises:
after the current arriving order is received at the m-th decision time and a decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm, the post-state variable representing the production time still required by the accepted orders after action a_m is selected at the m-th decision time; wherein the post-state is an intermediate variable between two successive states;
according to the current state s_m and action a_m, determining the post-state variable p_m, the post-state variable p_m being expressed as:
according to p_m, the production time t_{m+1} still needed for the accepted orders at the next decision instant m+1 is expressed as:
wherein in formula (5), (x)^+ denotes the greater of the variable x and 0, and AT is the arrival interval between two orders; based on the current state and the post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision moment is expressed as:
wherein the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is set, rewriting the conditional expectation E[·] in MDP theory, and defining the post-state cost function after the rewriting, the post-state cost function being expressed as:
J*(p) = γ·E[V*(s′) | p]    (7)
constructing the optimal strategy π* from the post-state cost function, thereby converting the optimal strategy π* into a one-dimensional state space, the optimal strategy π* being expressed as:
2. The MTO enterprise order processing method of claim 1, wherein reducing the solving difficulty of the post-state-based MDP model through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain the post-state-based MDP optimization model specifically comprises:
the reinforcement learning algorithm constructs the optimal strategy π* through the post-state value function J*; when solving, J* is not computed directly but is obtained through a learning process over the parameter vector; solving J* by learning the parameter vector is specifically as follows:
defining, from a given parameter vector θ, the approximate function Ĵ(·; θ); and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, using Ĵ(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal strategy π* according to the approximated J*.
3. The MTO enterprise order processing method of claim 2, wherein solving the post-state-based MDP optimization model by adopting the three-layer artificial neural network to obtain a solving result, and determining according to the solving result the optimal strategy of whether to accept the current arriving order, specifically comprises:
solving J*(p) through a three-layer artificial neural network (ANN), which can approximate J*(p) to any precision; the approximation Ĵ(p; θ) realised by the three-layer ANN is as follows:
wherein the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β],
Φ_H(x) = 1/(1 + e^(-x));
the function Ĵ(p; θ) in formula (11) is a three-layer single-input single-output neural network, which has a single-node input layer whose output is the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i·p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function; the output layer has one node whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β and whose output is the final approximated function value Ĵ(p; θ).
4. An MTO enterprise order processing system, comprising:
the system comprises a model construction unit, a model conversion unit, a model optimization unit and a solving unit, wherein the model construction unit is used for establishing, after a current order arrives at the order-oriented production (MTO) enterprise, an order acceptance strategy model based on Markov decision process (MDP) theory for the current order queue and the current arriving order of the MTO enterprise, the MDP-theory-based order acceptance strategy model being used for determining an optimal strategy; wherein the current arriving order represents an order received by the MTO enterprise for which it has not yet been decided whether to accept; the strategy is to accept or reject the current arriving order, and the optimal strategy is the strategy under which the benefit of the MTO enterprise is optimal;
the model conversion unit is used for converting the order receiving strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining an MDP model based on the post-state after conversion;
the model optimization unit is used for reducing the solving difficulty of the post-state-based MDP model through the learning process of the learning parameter vector of the reinforcement learning algorithm, to obtain the post-state-based MDP optimization model;
the solving unit is used for solving the post-state-based MDP optimization model by adopting the three-layer artificial neural network to obtain a solving result, and for determining according to the solving result the optimal strategy of whether to accept the current arriving order;
the model construction unit is specifically configured to:
assuming that each order is not split, that an order is delivered to the customer in one batch after completion, and that an order cannot be altered or canceled once accepted by the MTO enterprise;
determining information of each order in the current order queue and of the current arriving order according to the current order queue and the current arriving order, wherein the order information comprises: the customer priority μ, unit product price pr, required product quantity q, lead time lt and latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals obey a Poisson distribution with parameter λ, and the unit product price and the required product quantity in an order each obey a uniform distribution;
determining the return sub-items faced according to the information of each order in the current order queue and the current arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the current arriving order, the profit of accepting the current arriving order, the delay penalty cost of orders in the current order queue, and the inventory cost of orders in the current order queue; wherein:
if the current arriving order is rejected, a rejection cost μ·J is incurred, where J represents the rejection cost when customer priority is not considered;
if the current arriving order is accepted, the profit I of the order is obtained: I = pr·q, while the production cost C is consumed: C = c·q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces according to the first-come-first-served principle; if a delayed order exists in the current order queue, that is, an order whose delivery time exceeds its latest delivery date, the MTO enterprise incurs a delay penalty cost Y paid to the corresponding customer of the delayed order, wherein t represents the production time still required for orders that have been accepted, b represents the unit production capacity of the MTO enterprise, and u represents the per-unit-time, per-unit-product delay penalty cost of the MTO enterprise;
if the product of an order in the current order queue is produced to completion within the lead time, the product is temporarily stored in the MTO enterprise warehouse, generating an inventory cost N, wherein h represents the per-unit-time inventory cost of a unit product;
according to Markov decision process MDP theory, and the information and return sub-items of each order in the current order queue and of the current arriving order, establishing the MDP-theory-based order acceptance strategy model of the MTO enterprise, wherein the MDP-theory-based order acceptance strategy model of the MTO enterprise is a quadruple (S, A, f, R) consisting of a state space S, an action space A, a state transfer function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, unit product price pr, required product quantity q, lead time lt, latest delivery time dt, and the production completion time t still needed by orders in the current order queue, wherein t has a preset maximum upper limit;
the action space A represents the collection of actions for the current arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision to accept or reject the order, and the actions of accepting or rejecting the order form the action space A, A = (a_1, a_2), wherein a_1 represents accepting the order and a_2 represents rejecting the order;
the state transfer function f represents the transfer from the current state to the state at the next decision time, the m-th decision time being the time at which an action is taken on the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, with probability density functions denoted f_M(x), f_PR(x), f_Q(x), f_LT(x), f_DT(x) respectively, the probability density function of the state at the next decision instant m+1, given the initial state s and the action a taken at the m-th decision instant, is written f(·|(s, a));
the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); wherein t_{m+1} is expressed as:
formula (1) indicates that t_{m+1} depends on (s_m, a_m), that different (q_m, t_m, a_m) lead to different order production times, and that t_{m+1} is also affected by the order arrival interval; wherein AT_{m→m+1} represents the arrival interval between two orders, determined by the Poisson arrival of each order with parameter λ;
according to the current state s and action a, and the fact that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s, a) = f_M(μ′)·f_PR(pr′)·f_Q(q′)·f_LT(lt′)·f_DT(dt′)·f_T(t′|s, a),
wherein f_T(t′|s, a) represents the density of the production time still needed for accepted orders at the next decision time m+1 after action a is taken in the current state s; the specific form of f_T(t′|s, a) is determined by formula (1) and the related random variables;
when the MTO enterprise makes a decision at m, the corresponding return obtained after the action taken on the current arriving order is represented by a reward function R, which is represented as:
wherein a_m = 1 means that the MTO enterprise accepts the current arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the current arriving order, in which case the reward function R is −μ·J;
for any policy within the MDP theory-based order acceptance policy model, defining a corresponding cost function according to a reward function, and representing average long-term profits corresponding to the policy through the cost function, wherein the cost function is expressed as:
wherein π represents any strategy, γ represents the discount on future rewards with 0 < γ ≤ 1, γ being set so that the summation defined by the formula is well defined, and n represents the total number of decision moments, each arriving order corresponding to one decision moment;
determining the optimal strategy π* from the average long-term profit of each strategy π, the optimal strategy π* being used to guarantee maximization of the enterprise's long-term profit, at which point the benefit of the MTO enterprise is optimal; the optimal strategy π* is expressed as:
wherein Π represents the set of all strategies;
the model conversion unit is specifically used for:
after the current arriving order is received at the m-th decision time and a decision is made, setting a post-state variable p_m according to the reinforcement learning algorithm, the post-state variable representing the production time still required by the accepted orders after action a_m is selected at the m-th decision time; wherein the post-state is an intermediate variable between two successive states;
according to the current state s_m and action a_m, determining the post-state variable p_m, the post-state variable p_m being expressed as:
according to p_m, the production time t_{m+1} still needed for the accepted orders at the next decision instant m+1 is expressed as:
wherein in formula (5), (x)^+ denotes the greater of the variable x and 0, and AT is the arrival interval between two orders; based on the current state and the post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision moment is expressed as:
wherein the conditional probability density function f_T(·|p) is defined by formula (5) and the related random variables;
after the post-state variable is set, rewriting the conditional expectation E[·] in MDP theory, and defining the post-state cost function after the rewriting, the post-state cost function being expressed as:
J*(p) = γ·E[V*(s′) | p]    (7)
constructing the optimal strategy π* from the post-state cost function, thereby converting the optimal strategy π* into a one-dimensional state space, the optimal strategy π* being expressed as:
5. The MTO enterprise order processing system of claim 4, wherein the model optimization unit is specifically configured to:
the reinforcement learning algorithm constructs the optimal strategy π* through the post-state value function J*; when solving, J* is not computed directly but is obtained through a learning process over the parameter vector; solving J* by learning the parameter vector is specifically as follows:
defining, from a given parameter vector θ, the approximate function Ĵ(·; θ); and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, using Ĵ(·; θ*) determined by the learned parameter vector θ* to approximate J*, and determining the optimal strategy π* according to the approximated J*.
6. The MTO enterprise order processing system of claim 5, wherein the solving unit is specifically configured to:
solving J*(p) through a three-layer artificial neural network (ANN), which can approximate J*(p) to any precision; the approximation Ĵ(p; θ) realised by the three-layer ANN is as follows:
wherein the parameter vector can be expressed as:
θ = [w_1, ..., w_N, α_1, ..., α_N, u_1, ..., u_N, β],
Φ_H(x) = 1/(1 + e^(-x));
the function Ĵ(p; θ) in formula (11) is a three-layer single-input single-output neural network, which has a single-node input layer whose output is the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the sum of the weighted post-state value w_i·p and the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function Φ_H(·), called the activation function; the output layer has one node whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β and whose output is the final approximated function value Ĵ(p; θ).
CN202110749378.1A 2021-07-02 2021-07-02 MTO enterprise order processing method and system Active CN113592240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749378.1A CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Publications (2)

Publication Number Publication Date
CN113592240A CN113592240A (en) 2021-11-02
CN113592240B true CN113592240B (en) 2023-10-13

Family

ID=78245474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749378.1A Active CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Country Status (1)

Country Link
CN (1) CN113592240B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990584B (en) * 2021-03-19 2022-08-02 山东大学 Automatic production decision system and method based on deep reinforcement learning
CN117421705B (en) * 2023-11-02 2024-06-14 升励五金(深圳)有限公司 Information analysis method and system applied to intelligent production

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning
CN111126905A (en) * 2019-12-16 2020-05-08 武汉理工大学 Casting enterprise raw material inventory management control method based on Markov decision theory
CN112149987A (en) * 2020-09-17 2020-12-29 清华大学 Multi-target flexible job shop scheduling method and device based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113592240A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant