CN113592240A - Order processing method and system for MTO enterprise - Google Patents

Order processing method and system for MTO enterprise

Info

Publication number
CN113592240A
Authority
CN
China
Prior art keywords
order
state
current
enterprise
mto
Prior art date
Legal status
Granted
Application number
CN202110749378.1A
Other languages
Chinese (zh)
Other versions
CN113592240B (en)
Inventor
吴克宇
钱静
胡星辰
陈超
成清
程光权
冯旸赫
杜航
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110749378.1A
Publication of CN113592240A
Application granted
Publication of CN113592240B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063: Operations research, analysis or management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Abstract

An embodiment of the invention provides an MTO enterprise order processing method and system, comprising the following steps: when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory, used to determine an optimal policy, from the current order queue of the MTO enterprise and the currently arriving order; converting the MDP-theory-based order acceptance policy model according to the post-state theory of a reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model; and solving the post-state-based MDP optimization model with a three-layer artificial neural network to determine whether the currently arriving order is accepted. The modeling of dynamically arriving orders is more consistent with the actual order state.

Description

Order processing method and system for MTO enterprise
Technical Field
The invention relates to the field of order acceptance optimization, in particular to an order processing method and system for an MTO enterprise.
Background
As customer demand becomes increasingly personalized, more and more enterprises are adopting the make-to-order (MTO) model, which lets them face and serve end users directly and satisfy personalized demand as far as possible. In the MTO model, an enterprise produces according to customer orders: different customers have different requirements for order types, and the MTO enterprise organizes production according to the order requirements provided by the customers. In general, the capacity of an MTO enterprise is limited, and because of various cost factors the enterprise may not be able to accept every randomly arriving customer order, so the MTO enterprise must formulate a corresponding order acceptance policy. Studying how an MTO enterprise makes order selection decisions with limited resources therefore plays a great role in making full use of those limited resources and maximizing the enterprise's long-term profit.
In the process of implementing the invention, the applicant found that the prior art has at least the following problem: the policy models are too simplified to approximate the real situation.
Disclosure of Invention
Embodiments of the invention provide an order processing method and system for an MTO enterprise in which the modeling of dynamically arriving orders is more consistent with the actual order state.
To achieve the above object, in one aspect, an embodiment of the present invention provides an MTO enterprise order processing method, including:
when a current order arrives at a make-to-order (MTO) enterprise, establishing an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
converting the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model; reducing the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and solving the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determining, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
In another aspect, an embodiment of the present invention provides an MTO enterprise order processing system, including:
the model building unit, configured to establish, when a current order arrives at the make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
the model conversion unit, configured to convert the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
the model optimization unit, configured to reduce the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and the solving unit, configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determine, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
The above technical solution has the following beneficial effect: the modeling of dynamically arriving orders is more consistent with the actual order state.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for processing orders of an MTO enterprise according to an embodiment of the present invention;
FIG. 2 is a block diagram of an MTO enterprise order processing system according to an embodiment of the present invention;
FIG. 3 is a three-layer neural network architecture;
FIG. 4 is a sample learning rate;
FIG. 5 is a graph of different unit capacities;
FIG. 6 is the average profit for different order arrival rates;
FIG. 7 is an acceptance rate for different order arrival rates;
FIG. 8 is the average profit for different inventory costs;
FIG. 9 is the order acceptance rate for different inventory costs;
FIG. 10 is a customer priority factor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in accordance with an embodiment of the present invention, there is provided an MTO enterprise order processing method, including:
s101: when a current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
s102: converting the order acceptance strategy model based on the MDP theory according to a post-state theory of the reinforcement learning algorithm to obtain an MDP model based on a post-state;
s103: reducing the solving difficulty of the MDP model based on the post state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain an MDP optimization model based on the post state;
s104: and solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted for the current arriving order according to the solving result.
Preferably, step 101 specifically includes:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of the currently arriving order and of each order in the current order queue, wherein the order information comprises: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determining the return sub-items faced, according to the information of each order in the current order queue and of the currently arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the deferral penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order that cannot be delivered within its lead time but is delivered by its latest delivery time, the MTO enterprise incurs a deferral penalty cost Y to be paid to the customer of the delayed order:
[formula for Y, given as an image in the original document]
wherein t represents the production time still needed for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the deferral penalty cost per unit product per unit time of the MTO enterprise;
if the product of an order in the current order queue is completed before the end of its lead time, the product is temporarily stored in the MTO enterprise warehouse, which incurs an inventory cost N:
[formula for N, given as an image in the original document]
wherein h represents the inventory cost per unit product per unit time;
establishing the MDP-theory-based order acceptance policy model of the MTO enterprise according to the Markov decision process (MDP) theory, the information of each order in the current order queue and of the currently arriving order, and the return sub-items; the MDP-theory-based order acceptance policy model of the MTO enterprise is a four-tuple (S, A, f, R) consisting of a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method runs; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still needed to complete the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to decide whether to accept or reject the order, and these actions form the action space A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting the order;
the state transition function f describes the transition from the state at decision time m to the state at the next decision time, where decision time m refers to the time at which an action is taken for the currently arriving order; the state transition function f is derived as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, the probability density function f(·|(s, a)) of the state at the next decision time m+1, given the initial state s and the action a taken at decision time m, is expressed in terms of the densities fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 at the next decision time m+1 is independent of (sm, am); wherein tm+1 is expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
formula (1) shows that tm+1 depends on (sm, am): different (qm, tm, am) lead to different order production times, and tm+1 is also affected by the order arrival interval; wherein ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ;
given the current state s and action a, and using the fact that the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s; the specific form of fT(t′|s,a) is determined by formula (1) and the associated random variables;
when the MTO enterprise takes an action on the currently arriving order at decision time m, the corresponding reward obtained is expressed by a reward function R, wherein the reward function R is expressed as:
R(sm, am) = I - C - Y - N, if am = 1;  R(sm, am) = -μ × J, if am = 0
wherein am = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I - C - Y - N;
am = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is -μ × J;
for any policy in the MDP-theory-based order acceptance policy model, a corresponding value function is defined from the reward function; the value function represents the average long-term profit of the policy and is expressed as:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]
wherein π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (setting γ ensures that the summation defined by the formula is meaningful), n represents the total number of decision times, and each currently arriving order corresponds to one decision time;
determining the optimal policy π* as the policy whose average long-term profit is largest over all policies π; the optimal policy π* ensures that the long-term profit of the enterprise is maximized, i.e. the profit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
π* = argmax (π ∈ Π) Vπ
where Π represents the set of all policies.
Preferably, step 102 specifically includes:
after the decision on the currently arriving order is made at time m, a post-state variable pm is set according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action am has been selected at decision time m; the post-state is an intermediate variable between two consecutive states;
determining the post-state variable pm according to the current state sm and the action am, the post-state variable pm being expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
according to pm, the production time tm+1 still needed for the accepted orders at the next decision time m+1 is expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
wherein, in formula (5), (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders; given the current post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision time is expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the rewriting; the post-state cost function is expressed as:
J*(p)=γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby reducing the optimal policy π* to a one-dimensional state space; the optimal policy π* is expressed as:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
preferably, step 103 specifically includes:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state cost function J*; when solving, J* is not calculated directly; instead, an approximate solution of J* is obtained by a learning process on a learning parameter vector; the approximate solution of J* through the learning process is obtained as follows:
determining the approximate function Ĵ(·;θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function Ĵ(·;θ*) determined by the learned parameter vector θ* to approximate J*; determining the optimal policy π* according to the approximation of J*.
Preferably, step 104 specifically includes:
solving J*(p) through the three-layer artificial neural network (ANN), which can approximate J*(p) to arbitrary precision; the approximate function Ĵ(p;θ) is expressed by the three-layer artificial neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β]
ΦH(x)=1/(1+e^(-x))
the function Ĵ(p;θ) of formula (11) is a three-layer single-input single-output neural network: it has an input layer with a single node, whose output represents the value of the post-state p, and a hidden layer containing N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi; the input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p;θ).
As shown in fig. 2, in accordance with an embodiment of the present invention, there is provided an MTO enterprise order processing system, comprising:
the model building unit 21 is configured to establish, when a current order arrives at the make-to-order (MTO) enterprise, an order acceptance policy model based on Markov decision process (MDP) theory for the current order queue of the MTO enterprise and the currently arriving order, the MDP-theory-based order acceptance policy model being used to determine an optimal policy; wherein the currently arriving order is an order that the MTO enterprise has received but has not yet decided whether to accept; a policy is to accept or reject the currently arriving order, and the optimal policy is the policy whose selection yields the best profit for the MTO enterprise;
the model conversion unit 22 is configured to convert the MDP-theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain a post-state-based MDP model;
the model optimization unit 23 is configured to reduce the solving difficulty of the post-state-based MDP model through the learning of a parameter vector in the reinforcement learning algorithm to obtain a post-state-based MDP optimization model;
and the solving unit 24 is configured to solve the post-state-based MDP optimization model with a three-layer artificial neural network to obtain a solution result, and determine, according to the solution result, whether the currently arriving order is accepted under the optimal policy.
Preferably, the model building unit 21 is specifically configured to:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of the currently arriving order and of each order in the current order queue, wherein the order information comprises: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, order arrivals follow a Poisson distribution with parameter λ, and the unit product price and the required product quantity of an order each follow a uniform distribution;
determining the return sub-items faced, according to the information of each order in the current order queue and of the currently arriving order, wherein the return sub-items comprise: the rejection cost of rejecting the currently arriving order, the profit of accepting the currently arriving order, the deferral penalty cost of the orders in the current order queue, and the inventory cost of the orders in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost μ × J is incurred, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for the orders in the current order queue, the MTO enterprise produces on a first-come-first-served basis; if a delayed order exists in the current order queue, i.e. an order that cannot be delivered within its lead time but is delivered by its latest delivery time, the MTO enterprise incurs a deferral penalty cost Y to be paid to the customer of the delayed order:
[formula for Y, given as an image in the original document]
wherein t represents the production time still needed for the accepted orders, b represents the unit production capacity of the MTO enterprise, and u represents the deferral penalty cost per unit product per unit time of the MTO enterprise;
if the product of an order in the current order queue is completed before the end of its lead time, the product is temporarily stored in the MTO enterprise warehouse, which incurs an inventory cost N:
[formula for N, given as an image in the original document]
wherein h represents the inventory cost per unit product per unit time;
establishing the MDP-theory-based order acceptance policy model of the MTO enterprise according to the Markov decision process (MDP) theory, the information of each order in the current order queue and of the currently arriving order, and the return sub-items; the MDP-theory-based order acceptance policy model of the MTO enterprise is a four-tuple (S, A, f, R) consisting of a state space S, an action space A, a state transition function f and a reward function R, wherein:
the state space S represents the state of the system in which the MTO enterprise order processing method runs; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 items of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still needed to complete the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to decide whether to accept or reject the order, and these actions form the action space A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting the order;
the state transition function f describes the transition from the state at decision time m to the state at the next decision time, where decision time m refers to the time at which an action is taken for the currently arriving order; the state transition function f is derived as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, the probability density function f(·|(s, a)) of the state at the next decision time m+1, given the initial state s and the action a taken at decision time m, is expressed in terms of the densities fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 at the next decision time m+1 is independent of (sm, am); wherein tm+1 is expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
formula (1) shows that tm+1 depends on (sm, am): different (qm, tm, am) lead to different order production times, and tm+1 is also affected by the order arrival interval; wherein ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ;
given the current state s and action a, and using the fact that the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s; the specific form of fT(t′|s,a) is determined by formula (1) and the associated random variables;
when the MTO enterprise takes an action on the currently arriving order at decision time m, the corresponding reward obtained is expressed by a reward function R, wherein the reward function R is expressed as:
R(sm, am) = I - C - Y - N, if am = 1;  R(sm, am) = -μ × J, if am = 0
wherein am = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I - C - Y - N;
am = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is -μ × J;
for any policy in the MDP-theory-based order acceptance policy model, a corresponding value function is defined from the reward function; the value function represents the average long-term profit of the policy and is expressed as:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]
wherein π represents any policy, γ represents the future reward discount with 0 < γ ≤ 1 (setting γ ensures that the summation defined by the formula is meaningful), n represents the total number of decision times, and each currently arriving order corresponds to one decision time;
determining the optimal policy π* as the policy whose average long-term profit is largest over all policies π; the optimal policy π* ensures that the long-term profit of the enterprise is maximized, i.e. the profit of the MTO enterprise is optimal; the optimal policy π* is expressed as:
π* = argmax (π ∈ Π) Vπ
where Π represents the set of all policies.
Preferably, the model transformation unit 22 is specifically configured to:
after the decision on the currently arriving order is made at time m, a post-state variable pm is set according to the reinforcement learning algorithm; the post-state variable represents the production time still required for the accepted orders after action am has been selected at decision time m; the post-state is an intermediate variable between two consecutive states;
determining the post-state variable pm according to the current state sm and the action am, the post-state variable pm being expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
according to pm, the production time tm+1 still needed for the accepted orders at the next decision time m+1 is expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
wherein, in formula (5), (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders; given the current post-state, the conditional probability density of the state s′ = (μ′, pr′, q′, lt′, dt′, t′) at the next decision time is expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the post-state cost function is defined after the rewriting; the post-state cost function is expressed as:
J*(p)=γE[V*(s′)|p] (7)
constructing the optimal policy π* through the post-state cost function, thereby reducing the optimal policy π* to a one-dimensional state space; the optimal policy π* is expressed as:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
preferably, the model optimization unit 23 is specifically configured to:
the reinforcement learning algorithm constructs the optimal policy π* through the post-state cost function J*; when solving, J* is not calculated directly; instead, an approximate solution of J* is obtained by a learning process on a learning parameter vector; the approximate solution of J* through the learning process is obtained as follows:
determining the approximate function Ĵ(·;θ) from a given parameter vector θ; and
performing parameter learning on the parameter vector θ from data samples by the reinforcement learning algorithm to obtain the parameter vector θ*, and using the function Ĵ(·;θ*) determined by the learned parameter vector θ* to approximate J*; determining the optimal policy π* according to the approximation of J*.
Preferably, the solving unit 24 is specifically configured to:
solving J*(p) through the three-layer artificial neural network (ANN), which can approximate J*(p) to arbitrary precision; the approximate function Ĵ(p;θ) is expressed by the three-layer artificial neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β]
ΦH(x)=1/(1+e^(-x))
the function Ĵ(p;θ) of formula (11) is a three-layer single-input single-output neural network: it has an input layer with a single node, whose output represents the value of the post-state p, and a hidden layer containing N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi; the input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value Ĵ(p;θ).
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
The invention relates to an order acceptance policy method and system for MTO enterprises based on post-state reinforcement learning, aiming at solving the problems of incomplete model consideration and high solving complexity in existing research. At the same time, a low-complexity solving algorithm based on the combination of the after-state (post-state) and a neural network is provided to solve the order acceptance problem of the MTO enterprise.
As the diversified demands of customers keep increasing, the make-to-order (MTO) mode, in which production is organized according to customers' differing order requirements, is becoming more and more important in enterprise production activities. Because of the limitation of its production capacity, an MTO enterprise needs to formulate a reasonable order acceptance policy, that is, to determine whether to accept an arriving order according to its production capacity and the order state, so as to improve its production efficiency.
On the basis of the traditional order receiving problem, the invention provides a more complete MTO enterprise order receiving problem model: on the basis of traditional model elements of postponed delivery cost, rejection cost and production cost, the invention further considers the order inventory cost and various customer priority factors and models the optimal order acceptance problem as a Markov Decision Process (MDP). Furthermore, since the classical MDP solution method relies on solving and estimating a high-dimensional state cost function, its computational complexity is high. Therefore, to reduce complexity, the present invention proposes to use a one-dimensional post-state cost function instead of a high-dimensional state cost function, and to approximate the post-state cost function in conjunction with a neural network. Finally, the applicability and superiority of the order acceptance strategy model and the algorithm provided by the invention are verified through simulation.
First, problem description and model assumptions to be solved by the invention
The invention assumes that an MTO enterprise with limited capacity produces through a single production line. Assume there are n types of customer orders on the market; the order-related information includes the customer priority μ, the unit product price pr (pr is short for price), the quantity q, the lead time lt, and the latest delivery time dt (short for delivery time). The lead time lt refers to the agreed delivery time, i.e. the working period of order production when nothing unexpected happens, from the start of production work on the order to its completion; however, since the enterprise cannot fully guarantee that delivery will be completed within the agreed time, a preset period is added to the agreed delivery time, and the agreed time extended by this preset period is the latest delivery time dt. Customer orders arrive according to a Poisson distribution with parameter λ. Both the unit product price within an order and the demanded quantity of the corresponding product follow uniform distributions.
When an order arrives, the enterprise needs to judge whether to accept it according to its own production capacity. If the order is rejected, a rejection cost μ × J is incurred; the higher the customer priority, the higher the rejection cost, where μ denotes the customer priority coefficient and J denotes the rejection cost when customer priority is not considered (J is also part of the order information).
If an order is accepted, the profit of that order is obtained, i.e. I = pr × q, while the production cost is consumed, i.e. C = c × q, where c is the unit product production cost. The MTO enterprise produces accepted orders on a first-come-first-served basis. If an order cannot be delivered within the lead time required by the customer, the enterprise needs to pay a delay penalty cost Y:
[formula for Y, given as an image in the original document]
where t represents the production time still required for the accepted orders before accepting the current order, b represents the unit capacity of the enterprise, and u represents the deferral penalty cost per unit product per unit time; the higher the customer priority, the higher the deferral penalty cost. The customer does not pick up in advance products that are produced before the lead time; when an order is completed ahead of its lead time, the products are temporarily stored in the MTO enterprise warehouse, resulting in an inventory cost N:
[formula for N, given as an image in the original document]
where h represents the inventory cost per unit product per unit time. Each order is not split for production; an order is sent to the customer once its production is completed; and once an order has been accepted by the MTO enterprise, the customer can neither change nor cancel it.
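For illustration only (this is not part of the claimed method), the order information and the reward sub-items described above can be sketched in Python as follows. The profit I = pr × q, the production cost C = c × q and the rejection cost μ × J follow the text directly; the functional forms of the deferral penalty Y and the inventory cost N are given only as images in the original document, so the linear tardiness/earliness forms used below, the completion time t + q/b, the priority levels and all numeric ranges are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_order(lam=1.0, price_range=(10, 20), qty_range=(5, 15)):
    """Sample one arriving order: unit price and quantity are uniform, as in the text;
    the inter-arrival time is drawn as exponential, an assumed way of simulating
    Poisson arrivals with rate lam."""
    pr = rng.uniform(*price_range)          # unit product price
    q = rng.integers(*qty_range)            # required quantity
    mu = rng.choice([1.0, 1.5, 2.0])        # customer priority (assumed levels)
    lt = rng.uniform(5, 15)                 # lead time (assumed range)
    dt = lt + rng.uniform(1, 5)             # latest delivery time > lead time (assumed)
    at = rng.exponential(1.0 / lam)         # inter-arrival time
    return dict(mu=mu, pr=pr, q=q, lt=lt, dt=dt, at=at)

def reward_subitems(order, t, b=1.0, c=8.0, u=0.5, h=0.1, J=20.0, accept=True):
    """Immediate reward for accepting/rejecting one order.
    t: production time still required for previously accepted orders.
    The forms of Y and N below are assumptions (images in the original document)."""
    if not accept:
        return -order["mu"] * J                              # rejection cost, as in the text
    I = order["pr"] * order["q"]                             # profit I = pr * q
    C = c * order["q"]                                       # production cost C = c * q
    finish = t + order["q"] / b                              # assumed completion time of this order
    Y = u * order["mu"] * order["q"] * max(finish - order["lt"], 0.0)  # assumed tardiness penalty
    N = h * order["q"] * max(order["lt"] - finish, 0.0)                # assumed inventory cost
    return I - C - Y - N
```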
The problem to be solved by the present invention is that when a customer order arrives randomly, the MTO enterprise decides whether to accept the currently arriving order to ensure the long-term average profit maximization of the enterprise, taking into account the current order queue, and the deferred delivery cost, rejection cost, production cost, inventory cost, and various customer priority factors.
Second, order acceptance strategy modeling based on MDP theory
It can be seen that the MTO enterprise order acceptance decision problem is a type of stochastic sequential decision problem (also called a multi-stage decision problem for stochastic systems). After the decision maker of the MTO enterprise decides to accept or reject an order, the state of the system changes, but the evolution after the current stage is not influenced by the states of the stages before it, i.e. the process has no after-effect. Therefore, according to MDP theory, the problem can be abstracted into an MDP (Markov decision process) model. The MDP model is defined as a quadruple (S, A, f, R) representing a state space S, an action space A, a state transition function f and a reward function R:
1) System state: assuming there are n order types in the order acceptance system, the system state can be represented by the vector S = (μ, pr, q, lt, dt, t), where t represents the production completion time still needed by the accepted orders; because the MTO enterprise has limited production capacity, t has a maximum upper limit.
2) System action set: at time m, when a customer order arrives, the MTO enterprise needs to decide whether to accept or reject it; the action set in the model can be represented by the vector A = (a1, a2), where a1 denotes accepting the order and a2 denotes rejecting it. The actions in vector A apply only to the currently arriving order and do not include the orders already in the order queue.
3) State transition model: given an initial state s (the initial state represents the first order arriving in the order acceptance system) and the action a that has been taken, the probability density function of the next state is represented by f(·|(s, a)). Here it is assumed that the order information μ, pr, q, lt, dt is independent and identically distributed; each distribution can be represented by a probability density function, and the densities corresponding to the distributions of μ, pr, q, lt, dt are denoted fM(x), fPR(x), fQ(x), fLT(x), fDT(x). Thus the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am); that is, it represents the order information at the next decision time (the next state sm+1) after action am is taken in the current state sm. However, tm+1 depends on (sm, am), because different (qm, tm, am) may result in different order production times; tm+1 is also affected by the order arrival interval, i.e. tm+1 can be expressed as:
tm+1 = max( tm + qm/b - ATm→m+1 , 0 ), if am = a1 (accept);  tm+1 = max( tm - ATm→m+1 , 0 ), if am = a2 (reject)    (1)
where ATm→m+1 represents the arrival interval between two consecutive orders, the orders arriving according to a Poisson distribution with parameter λ.
Because the order information μm+1, prm+1, qm+1, ltm+1, dtm+1 is independent of (sm, am), when the current state s and action a are given, the conditional probability density of the state s′ at the next decision time can be obtained from the conditional probability density of the current state and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
where fT(t′|s,a) represents the density of the production time still required at the next decision time to produce the accepted orders after taking action a in the current state s; its specific form can be defined by (1) and the associated random variables.
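A minimal sketch of the state transition just described, under stated assumptions: the attributes of the next order are drawn independently of (s, a), an accepted order is assumed to add q/b to the remaining production time, and the inter-arrival time AT is sampled as an exponential variable consistent with Poisson arrivals of rate λ; the distributions and the helper sample_order are hypothetical.

```python
import random

def next_state(state, accept, b=1.0, lam=1.0, sample_order=None):
    """state: dict with keys mu, pr, q, lt, dt, t (mirroring S = (mu, pr, q, lt, dt, t)).
    accept: True for a1 (accept), False for a2 (reject).
    Returns the state at the next decision time m+1."""
    # Post-decision remaining production time (assumed: an accepted order adds q/b).
    p = state["t"] + (state["q"] / b if accept else 0.0)
    # Inter-arrival time AT of the next order (assumed exponential for Poisson arrivals).
    at = random.expovariate(lam)
    t_next = max(p - at, 0.0)                      # formulas (1)/(5): (p - AT)+
    new_order = sample_order() if sample_order else {
        "mu": random.choice([1.0, 1.5, 2.0]),      # assumed priority levels
        "pr": random.uniform(10, 20),              # assumed f_PR
        "q": random.uniform(5, 15),                # assumed f_Q
        "lt": random.uniform(5, 15),               # assumed f_LT
        "dt": random.uniform(15, 25),              # assumed f_DT
    }
    new_order["t"] = t_next
    return new_order
```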
4) The reward function: at the decision moment m, after the MTO enterprise makes a decision whether to accept the order, the immediate reward function obtained by the MTO enterprise is:
R(sm, am) = I - C - Y - N, if am = 1 (the order is accepted);  R(sm, am) = -μ × J, if am = 0 (the order is rejected)
where I represents the profit of the order (unit product price × product quantity); C represents the production cost; Y represents the delay penalty cost (incurred if the order exceeds its lead time); N represents the inventory cost (incurred if production finishes before the lead time, since the customer does not pick up the goods early).
The reward function gives the system's evaluation of how good the action taken is, after the decision maker takes the corresponding action (rejecting or accepting the order) in the current state.
5) Optimal policy: in the MTO enterprise order acceptance problem, the aim is to find an optimal order acceptance policy π* that maximizes the long-term profit of the enterprise. Each policy π is a function from the system state to an action, which determines how the enterprise chooses whether to accept an order based on the current state information. For any policy π, its cost function is defined as its average long-term profit, i.e. the cumulative discounted reward obtained by taking actions according to the policy from the current state onward. The cost function is as follows:
Vπ(s) = E[ Σ (m = 0 to n) γ^m · R(sm, π(sm)) | s0 = s ]    (2)
where 0 < γ ≤ 1 denotes the future reward discount (which ensures that the sum defined in (2) is meaningful). We are interested in the optimal policy π* among all policies, defined as:
π* = argmax (π ∈ Π) Vπ
where Π denotes the set of all policies.
The theory for standard MDP is presented below:
In MDP theory, the optimal policy π* can be constructed from the state cost function V*:
π*(s) = argmax (a ∈ A) { R(s,a) + γ·E[V*(s′)|s,a] }    (3)
where the expectation E[·] is over the next random state s′ given the current state s and the action a. At the same time, V* is a solution of the Bellman equation, namely:
V*(s) = max (a ∈ A) { R(s,a) + γ·E[V*(s′)|s,a] }    (4)
and V* can be solved by the value iteration method.
Therefore, classical MDP theory provides an optimal-policy solution method based on the state cost function V*. However, solving for V* is difficult here, because the system state of the problem addressed by the invention is continuous and high-dimensional, which makes the computational complexity of value iteration based on V* and the expectation E[·] unaffordable. To solve this problem, the invention proposes an optimal policy construction method based on a post-state cost function.
Third, MDP model transformation based on post state
The after-state (post-state) is an intermediate variable between two successive states that can be used to simplify the optimal control of certain MDPs. The post-state concept is a technique in reinforcement learning that is often used in board-game learning tasks. For example, when an agent uses a reinforcement learning algorithm to play chess, it controls its own moves deterministically, while the opponent's moves appear random to it. Before deciding on a move, the agent faces a certain piece position on the board, which corresponds to the state in the classical MDP model. The agent's post-state for each move is defined as the state of the board after this move has been made but before the opponent moves. If the agent can learn the winning probability of every post-state, these known probabilities can be used to achieve optimal behaviour: simply select the post-state with the largest winning probability and act accordingly.
The invention transforms the order acceptance problem with a similar post-state method. Specifically, in the order acceptance problem considered here, the post-state variable pm is defined as the production time still required for the accepted orders after action am is selected at decision time m. Thus, given the current state sm and action am, the post-state can be expressed as:
pm = σ(sm, am) = tm + qm/b, if am = a1 (accept);  pm = tm, if am = a2 (reject)
It is then readily apparent that, given pm, the production time tm+1 still needed at the next decision time m+1 can be expressed as:
tm+1 = (pm - ATm→m+1)+ = max(pm - ATm→m+1, 0)    (5)
where (x)+ denotes the larger of the variable x and 0, and AT is the arrival interval between two orders. Therefore, given the current post-state p, the conditional probability density of the next decision-time state s′ = (μ′, pr′, q′, lt′, dt′, t′) can be expressed as:
f(s′|p)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)    (6)
where the conditional probability density function fT(·|p) is defined by (5) and the associated random variables.
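A small sketch of the post-state mapping σ(s, a) and of transition (5). The increment q/b for an accepted order is an assumption (the σ formula is given as an image in the original document); t′ = max(p − AT, 0) follows the text, with the exponential inter-arrival sampling an assumption consistent with Poisson arrivals.

```python
import random

def post_state(t, q, accept, b=1.0):
    """sigma(s, a): remaining production time right after the decision.
    Assumed form: an accepted order of quantity q adds q / b units of time."""
    return t + q / b if accept else t

def sample_next_remaining_time(p, lam=1.0):
    """Formula (5): t_{m+1} = max(p - AT, 0), with AT the next inter-arrival time
    (assumed exponential, consistent with Poisson arrivals of rate lam)."""
    at = random.expovariate(lam)
    return max(p - at, 0.0)

# Example: accept an order of 8 units with 3 time units of work already queued.
p = post_state(t=3.0, q=8.0, accept=True)
print(p, sample_next_remaining_time(p))
```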
Therefore, the conditional expectation E[V(s′)|s,a] in formulas (3) and (4) can be rewritten as the conditional expectation E[V(s′)|σ(s,a)] implied by (6), so π* can be redefined as follows. First, the post-state value function is defined as:
J*(p)=γE[V*(s′)|p] (7)
Substituting (7) into (3), the optimal policy π* can be constructed from the post-state cost function J* as follows:
π*(s) = argmax (a ∈ A) { R(s,a) + J*(σ(s,a)) }    (8)
further, substituting (7) into (4) yields:
V*(s) = max (a ∈ A) { R(s,a) + J*(σ(s,a)) }
Therefore, the following is obtained:
γ·E[V*(s′)|p] = γ·E[ max (a ∈ A) { R(s′,a) + J*(σ(s′,a)) } | p ]
In fact, as shown in (7), γE[V*(s′)|p] is J*(p), so we obtain:
J*(p) = γ·E[ max (a ∈ A) { R(s′,a) + J*(σ(s′,a)) } | p ]
Finally, we solve for J* by the value iteration algorithm in reinforcement learning [19]; that is, with J0 an arbitrary initialization function, we have
Jk+1(p) = γ·E[ max (a ∈ A) { R(s′,a) + Jk(σ(s′,a)) } | p ]    (9)
and when k → ∞, Jk converges to J*.
As can be seen from equation (3), the expectation E[V*(s′)|s,a] must be evaluated when computing the optimal policy, whereas the optimal policy in equation (8) does not need to take this expectation: it uses J* directly in place of E[V*(s′)|s,a], reducing the high-dimensional state space to a one-dimensional one and thus greatly lowering the solving complexity.
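To make the dimensionality-reduction point concrete, the following hedged sketch of formula (8) shows that, once a post-state value function J is available, the acceptance decision only requires evaluating J at two scalar post-states. The helpers immediate_reward and post_state are hypothetical stand-ins for R(s, a) and σ(s, a) as defined above.

```python
def greedy_action(state, J, immediate_reward, post_state):
    """Formula (8): pi*(s) = argmax_a { R(s, a) + J(sigma(s, a)) }.
    J maps a scalar post-state to a value; the two helpers are assumed to
    implement the reward and post-state mapping described in the text."""
    best_action, best_value = None, float("-inf")
    for accept in (True, False):                       # a1 = accept, a2 = reject
        value = immediate_reward(state, accept) + J(post_state(state, accept))
        if value > best_value:
            best_action, best_value = accept, value
    return best_action

# Example with trivial stand-ins: a constant-slope J and a reward that favors accepting.
decision = greedy_action({"t": 3.0, "q": 8.0},
                         J=lambda p: -0.1 * p,
                         immediate_reward=lambda s, a: 5.0 if a else -2.0,
                         post_state=lambda s, a: s["t"] + (s["q"] if a else 0.0))
print(decision)
```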
Fourthly, optimal control based on neural network
It has been shown above that π* can be constructed from J*, and that J* can be solved by the value iteration of formula (9). However, formula (9) presents two difficulties in implementation: first, if fM(·), fPR(·), fQ(·), fLT(·), fDT(·) and fT(·|σ(s,a)) are unavailable, the expectation E[·|p] cannot be calculated; second, since the post-state varies continuously, each iteration of formula (9) would have to be computed for infinitely many values of p. Reinforcement learning provides an effective solution to both difficulties: instead of calculating J* directly, it approximates J* by learning a parameter vector, and the learning process uses data samples. In other words, the design of the RL (reinforcement learning) algorithm includes:
1) Parameterization: this determines how the function Ĵ(·;θ) is obtained from a given parameter vector θ, where θ denotes the parameter vector of the approximation to the post-state value function.
2) Parameter learning: the parameter vector θ* is learned from a collection of data samples, and Ĵ(·;θ*) is used to approximate J*, i.e. the optimal policy can be expressed as:
π̂(s) = argmax (a ∈ A) { R(s,a) + Ĵ(σ(s,a);θ*) }    (10)
Comparing (10) with (8): if Ĵ(p;θ*) is close to J*(p), then π̂ is close to the optimal policy π*.
4.1 neural network approximation
The universal approximation theorem indicates that a three-layer artificial neural network (ANN) can approximate a continuous function to arbitrary precision, so an ANN is a good choice for the J*(p) to be solved in the invention. Therefore Ĵ(p;θ) can be represented by a neural network as:
Ĵ(p;θ) = Σ (i = 1 to N) ui·ΦH(wi·p + αi) + β    (11)
where the parameter vector can be expressed as:
θ=[w1,...,wN,α1,...,αN,u1,...,uN,β],
ΦH(x)=1/(1+e^(-x)).
The function Ĵ(p;θ) in equation (11), as shown in fig. 3, is in fact a three-layer single-input single-output neural network. Specifically, there is only one input layer with a single node, whose output represents the value of the post-state p, and a hidden layer with N nodes, where the input of the i-th node is the sum of the weighted post-state value wi·p and the hidden-layer bias αi. The input-output relation of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function. Finally, the output layer has one node, whose output represents the finally approximated function value Ĵ(p;θ); its input is the sum of the weighted hidden-layer outputs and the output-layer bias β.
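A minimal numpy sketch of the single-input, N-hidden-node, single-output network of formula (11) with the logistic activation ΦH(x) = 1/(1 + e^(−x)); the parameter names follow the text (wi, αi, ui, β), while the initialization and the example value of N are assumptions.

```python
import numpy as np

def ann_value(p, w, alpha, u, beta):
    """Formula (11): J_hat(p; theta) = sum_i u_i * Phi_H(w_i * p + alpha_i) + beta,
    with Phi_H the logistic sigmoid; w, alpha, u are length-N vectors."""
    hidden = 1.0 / (1.0 + np.exp(-(w * p + alpha)))   # Phi_H applied elementwise
    return float(np.dot(u, hidden) + beta)

# Example with N = 5 randomly initialized hidden nodes (assumed initialization).
rng = np.random.default_rng(1)
theta = dict(w=rng.normal(size=5), alpha=rng.normal(size=5),
             u=rng.normal(size=5), beta=0.0)
print(ann_value(2.5, **theta))
```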
4.2 Training the ANN (three-layer artificial neural network) by value iteration
In order to achieve optimal control (namely obtaining an optimal strategy), parameters in the three-layer artificial neural network are trained through a value iteration method. The specific training process is as follows.
1) Acquisition of training data: the invention requires a batch of training samples
Γ = {(pm, μm, prm, qm, ltm, dtm, tm)}
where for each m a sample is drawn to obtain pm, μm, prm, qm, ltm, dtm, with pm subject to a uniform distribution, μm ~ fM(·), prm ~ fPR(·), qm ~ fQ(·), ltm ~ fLT(·), dtm ~ fDT(·). Further, tm is generated from pm as tm = (pm - AT)+, where AT is a random variable subject to a Poisson distribution.
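A sketch of step 1) under stated assumptions: pm is drawn uniformly from [0, p_max] with an assumed upper bound, the order attributes are drawn from assumed distributions standing in for fM, fPR, fQ, fLT, fDT, and tm = max(pm − AT, 0) as in the text; the text only says AT is subject to a Poisson distribution, so sampling the inter-arrival time from an exponential here is an assumption.

```python
import numpy as np

def make_training_samples(M=1000, p_max=30.0, lam=1.0, seed=0):
    """Generate Gamma = {(p_m, mu_m, pr_m, q_m, lt_m, dt_m, t_m)}; all ranges are assumed."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, p_max, size=M)                # p_m ~ uniform
    mu = rng.choice([1.0, 1.5, 2.0], size=M)           # f_M: assumed priority levels
    pr = rng.uniform(10.0, 20.0, size=M)               # f_PR (assumed range)
    q = rng.uniform(5.0, 15.0, size=M)                 # f_Q (assumed range)
    lt = rng.uniform(5.0, 15.0, size=M)                # f_LT (assumed range)
    dt = lt + rng.uniform(1.0, 5.0, size=M)            # f_DT (assumed: later than lt)
    at = rng.exponential(1.0 / lam, size=M)            # AT: inter-arrival time
    t = np.maximum(p - at, 0.0)                        # t_m = (p_m - AT)+
    return np.column_stack([p, mu, pr, q, lt, dt, t])

samples = make_training_samples()
print(samples.shape)   # (1000, 7)
```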
2) Iterative fitting (see Algorithm 1 in Table 1): in the k-th iteration, the current ANN parameter vector is θk and the function defined by (11) is Ĵ(·;θk), i.e. the current value function. According to (9), the updated value function is:
Jk+1(p) = γ·E[ max (a ∈ A) { R(s′,a) + Ĵ(σ(s′,a);θk) } | p ]    (12)
It is desired that the new function Ĵ(·;θk+1), with the updated parameters, approximates Jk+1(p). Therefore, from Γ and θk we construct a set of training data:
Υk = {(pm, om)}
where om denotes, given Ĵ(·;θk), the desired value Jk+1(pm) shown in formula (12); the explicit expression for om is:
[expression for om, given as an image in the original document]
Based on the training data Υ_k, the parameter update can be expressed as
θ_{k+1} = argmin_θ L(θ|Υ_k),   (14)
wherein L(θ|Υ_k) is the training error
L(θ|Υ_k) = (1/M)·Σ_{m=1}^{M} ( J~(p_m;θ) − o_m )².   (15)
3) Training parameters: the ANN parameters θ_{k+1} are solved by gradient descent so that J~(·;θ_{k+1}) has minimal error over Υ_k, see equation (14). Gradient descent iteratively searches the parameter space: the initial parameter θ^(0) of the gradient iteration is set to θ_k, and the parameters are then updated in the iterative process as
θ^(z+1) = θ^(z) − α·∇_θ L(θ^(z)|Υ_k),
wherein α is the update step-size parameter and ∇_θ L(θ^(z)|Υ_k) is the gradient, at θ^(z), of the error L defined in formula (15). Thus, given a sufficient number of iterations Z, we use θ_{k+1} = θ^(Z) as an approximate solution to (14). Finally, the resulting approximation of J*(p) is expressed as J~(p;θ*), where θ* is the parameter vector obtained when the value iteration terminates.
TABLE 1 Algorithm 1: approximation of J*(p) (the pseudocode listing is given as a figure)
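The fitting loop described above can be sketched in Python as follows, under stated assumptions: j_tilde is the network of equation (11) as sketched earlier, reward and post_state stand for R(s,a) and σ(s,a), the discount gamma and all step counts are placeholder values, and the gradient of L is taken numerically for brevity rather than by an analytic formula.

```python
import numpy as np

def fit_value_function(batch, theta0, j_tilde, n_hidden, reward, post_state,
                       gamma=0.95, K=50, Z=200, alpha=1e-3, eps=1e-5):
    """Sketch of the value-iteration fitting loop: build targets o_m of (12)
    from the current parameters, then refit theta by gradient descent on the
    mean-squared training error of (14)-(15)."""
    theta = np.asarray(theta0, dtype=float)
    p_values = np.array([sample[0] for sample in batch])
    for _ in range(K):                                     # outer value iterations
        targets = []
        for (p, mu, pr, q, lt, dt, t) in batch:
            s_next = (mu, pr, q, lt, dt, t)
            # o_m = gamma * max_a { R(s', a) + J~(sigma(s', a); theta_k) }
            o = gamma * max(reward(s_next, a)
                            + j_tilde(post_state(s_next, a), theta, n_hidden)
                            for a in (1, 0))
            targets.append(o)
        o_values = np.array(targets)

        def loss(th):
            pred = np.array([j_tilde(p, th, n_hidden) for p in p_values])
            return float(np.mean((pred - o_values) ** 2))  # training error L

        for _ in range(Z):                                 # inner gradient descent
            grad = np.zeros_like(theta)
            for i in range(theta.size):                    # numerical gradient of L
                step = np.zeros_like(theta)
                step[i] = eps
                grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
            theta = theta - alpha * grad                   # theta^(z+1) update
    return theta
```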
In summary, the beneficial effects obtained by the invention are as follows:
1) Starting from the idea of revenue management, the invention considers the order acceptance problem of MTO enterprises in a stochastic dynamic environment and, on the basis of the enterprise's production cost, delay penalty cost and rejection cost, additionally takes into account the inventory cost of orders completed before the lead time and multiple customer priority factors, and constructs an MDP (Markov decision process) order acceptance model.
2) The invention transforms the solution of the optimal strategy in the traditional MDP through a post-state method, proves that the optimal strategy based on the state value function in the classic MDP problem can be equivalently defined and constructed by a value function based on the post-state, and converts the multidimensional control problem into a one-dimensional control problem, thereby greatly simplifying the solution process.
3) Traditional algorithms such as SARSA and SMART belong to tabular reinforcement learning methods, which can only handle optimal decision problems over a discrete state space. In order to learn the order acceptance strategy over a continuous state space, the invention parameterizes the post-state value function with a neural network and designs a corresponding training algorithm, thereby realizing the estimation of the post-state value function and the fast solution of the order acceptance strategy.
5 Numerical simulation experiments
The order information required in the simulation is generated according to the following rules: the order price pr follows a uniform distribution U(e1, l1); the order quantity q follows a uniform distribution U(e2, l2); order arrivals follow a Poisson distribution with parameter λ; the order lead time lt decreases linearly with the order price, i.e. lt = δ − β_pr·pr; the latest acceptable delivery time is an integer and is set to satisfy the relation
Figure BDA0003145461280000182
where φ is the elasticity coefficient of the lead time.
Here we select pr ~ U[30, 50], q ~ U[300, 500], λ = 0.3, δ = 36, β_pr = 0.4 and φ = 0.8. The unit production capacity, unit production cost and rejection cost of the enterprise are 20, 15 and 200, respectively. The deferral penalty cost per unit quantity per unit time is u = 4, the customer grade follows the uniform distribution μ ~ U(0, 1), and the inventory cost per unit quantity per unit time is h = 4. Finally, the initial learning rate of the algorithm is α = 0.001 and the exploration rate is ε = 0.1.
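A minimal sketch of this data-generation rule is given below; exponential inter-arrival times are the standard way to realize Poisson arrivals with rate λ, and the integer rounding rule used for the latest delivery time dt is an assumption of the sketch, since the exact relation is shown only as a figure above.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_order(lam=0.3, delta=36.0, beta_pr=0.4, phi=0.8):
    """One simulated order under the stated rules: pr ~ U[30, 50],
    q ~ U[300, 500], lead time lt = delta - beta_pr * pr, Poisson arrivals
    with rate lam, and an integer latest delivery time derived from lt and
    the elasticity coefficient phi (assumed rounding rule)."""
    inter_arrival = rng.exponential(1.0 / lam)   # gap to the next order arrival
    mu = rng.uniform(0.0, 1.0)                   # customer priority ~ U(0, 1)
    pr = rng.uniform(30.0, 50.0)                 # unit product price
    q = rng.uniform(300.0, 500.0)                # required product quantity
    lt = delta - beta_pr * pr                    # lead time, linear in price
    dt = int(np.ceil((1.0 + phi) * lt))          # latest delivery time (assumption)
    return inter_arrival, mu, pr, q, lt, dt
```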
The simulation experiments are programmed in Python and implement the reinforcement-learning value iteration with a neural network over the one-dimensional post-state space proposed by the invention, namely the AFVINN algorithm, in order to analyze the effectiveness of the resulting MTO enterprise order acceptance strategy.
The simulation experiment consists of two parts. In the first part, the learning efficiency of the AFVINN algorithm is first compared with that of the traditional Q-learning algorithm, the comparison being evaluated through sample utilization efficiency; secondly, following the comparison setup of documents [13, 16], the long-term average profit and the order acceptance rate of the proposed algorithm are compared with those of the FCFS method, where the FCFS method means that when an order arrives, the order is accepted directly if the enterprise has the capacity to complete its production by the latest delivery date, and the order acceptance rate is the number of accepted orders divided by the total number of arriving orders. In the second part, the influence of the AFVINN algorithm on the average profit and the order acceptance rate of the MTO enterprise is first examined in the two scenarios of considering and not considering the inventory cost; secondly, the influence of the AFVINN algorithm on the average profit and the order acceptance rate is analyzed by adjusting the deferral penalty cost and rejection cost factors related to customer priority; finally, the influence of the AFVINN algorithm on the average profit of the MTO enterprise is examined in the three scenarios of considering multiple customer priorities, considering three customer priorities, and not considering customer priority. The FCFS acceptance rule used as the baseline is sketched below.
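The FCFS baseline can be sketched as a simple capacity check; the names backlog_time (production time already committed to accepted orders) and b (unit production capacity) are hypothetical.

```python
def fcfs_accept(order_q, order_dt, backlog_time, b):
    """FCFS rule: accept an arriving order iff the current backlog plus this
    order's production time can be finished by its latest delivery date dt."""
    production_time = order_q / b        # time needed to produce this order
    return backlog_time + production_time <= order_dt
```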
5.1 Algorithm comparison
Existing research on reinforcement-learning order acceptance strategies for MTO enterprises models and solves the problem over a traditional multidimensional state space. Here, the learning efficiency of the AFVINN algorithm is compared with that of the traditional Q-learning algorithm; each iteration of the comparison consumes 200 data samples, and learning efficiency is evaluated by the number of data samples consumed.
As can be seen from fig. 4: (1) after enough data samples, the AFVINN algorithm and the traditional Q-learning algorithm converge to the same value; (2) the learning efficiency of the AFVINN algorithm is far higher than that of the traditional Q-learning algorithm, and is about 1000 times that of the traditional Q-learning algorithm. Therefore, the AFVINN algorithm provided by the invention can not only convert a high-dimensional control problem into a one-dimensional control problem, simplify the solving process, but also keep the long-term average profit of MTO enterprises at a higher level.
As shown by table 2: (1) the AFVINN algorithm is superior to the FCFS method in the aspect of maximizing the long-term average profit of MTO enterprises; (2) the AFVINN algorithm may still maintain a high average profit when the order acceptance rate is lower than the FCFS method. Therefore, the AFVINN algorithm can accept orders with higher profit with higher probability on the order acceptance strategy so as to achieve the purpose of maximizing the long-term average profit of enterprises.
TABLE 2 Basic scenario

                        AFVINN algorithm    FCFS method
Average profit          367.2365            309.4537
Order acceptance rate   0.1324              0.2698
Production capacity is critical to the profitability of an MTO enterprise. The unit production capacity of the MTO enterprise is therefore varied while the other parameters are kept the same as in the basic scenario, and the resulting changes in the order acceptance behaviour of the AFVINN algorithm and the FCFS method are observed.
As can be seen from FIG. 5, (1) the AFVINN algorithm can maintain a high profit level all the time in the face of different production capacities of the enterprise; (2) when the production capacity of an enterprise unit is reduced, the average profit of the AFVINN algorithm and the FCFS method is respectively reduced by 38.667% and 41.857%; while the AFVINN algorithm and the FCFS method increase the average profit by 128.6104% and 122.9773%, respectively, when the unit capacity increases from 20 to 35. Therefore, the AFVINN algorithm can reasonably utilize the limited resources of the enterprises, so that higher profits are created for the enterprises, and the AFVINN algorithm has better adaptability to the condition of limited resources.
The order arrival rate is also an important factor in the order acceptance decisions of MTO enterprises. The arrival rate of orders is varied, with the other parameters the same as in the basic scenario, and the changes in the MTO enterprise order acceptance strategies of the AFVINN algorithm (the first column of each group in fig. 6 and 7) and the FCFS method (the second column of each group in fig. 6 and 7) are observed. As can be seen from fig. 6 and 7: (1) when λ decreases the order acceptance rate increases, and when λ increases the order acceptance rate decreases; this is because as the number of orders arriving per unit time increases, i.e. the time interval between two arriving orders decreases, the orders the MTO enterprise can schedule for acceptance decrease, so the probability of completing accepted orders within the latest delivery deadline falls, resulting in a lower order acceptance rate. (2) The order acceptance rate under the AFVINN algorithm is lower than that of the FCFS method, but its average profit is higher. It can be seen that the AFVINN algorithm better accommodates uncertainty in the arrival of customer orders.
5.2 model comparison
The existing documents [15-16] do not consider the inventory cost when applying a reinforcement learning algorithm to model and solve the order acceptance problem. In this section, the AFVINN order acceptance strategy with inventory cost considered is compared with the strategy without inventory cost: the former includes the inventory cost in the modeling and solving of the order acceptance problem, while the latter does not. As can be seen from fig. 8 and 9: (1) the order acceptance rate without inventory cost considered is higher than that with inventory cost considered, but the average profit with inventory cost considered is always higher than that without; (2) with other factors unchanged, the order acceptance strategy of the enterprise that considers inventory cost changes as the inventory cost changes, whereas the strategy that ignores inventory cost is unaffected by such changes; (3) as the inventory cost keeps increasing, the average profit of the enterprise that considers inventory cost declines more slowly than that of the enterprise that does not. Therefore, considering the inventory cost in the modeling and solving of the MTO enterprise order acceptance problem allows the enterprise to make different acceptance decisions for different inventory costs so as to maximize its long-term average profit; in practice, the inventory cost affects enterprise profit and ties up enterprise capital, affecting the operation of that capital, so the inventory cost cannot be ignored in the order acceptance process.
Although document [16] involves a customer priority factor when modeling and solving with reinforcement learning, it divides customer priority into only three levels, whereas in practice there are many customer priorities. In this experiment, taking the basic scenario as the reference, the unit deferral penalty cost is first varied while the rejection cost is kept unchanged, and then the rejection cost is varied while the unit deferral penalty cost is kept unchanged.
TABLE 3 Varying the unit deferral penalty cost and the rejection cost
Figure BDA0003145461280000211
As can be seen from fig. 10 and table 3: (1) the long-term average income of an enterprise considering various customer priorities is larger than the average income considering three customer priorities and not considering the customer priorities; (2) the order acceptance rate of the customer grade which is greater than or equal to 0.5 based on the AFVINN algorithm is reduced along with the increase of the deferred penalty cost; while order acceptance rates with customer grades less than 0.5 increase with increasing deferred penalty costs; (3) when the rejection cost increases, namely the rejection cost has greater and greater influence on profit of the MTO enterprise, the order acceptance rate of the customer class of 0.5 or more is in an increasing trend, and the order acceptance rate of the customer class of less than 0.5 is in a decreasing trend when the AFVINN algorithm makes an order acceptance decision.
It can be seen that when the deferral penalty cost is large, accepting an order from a higher-priority customer means the enterprise must pay a higher cost if production is not finished within the specified time limit, so the acceptance rate of high-priority orders falls as the deferral penalty cost increases; when the rejection cost is high, rejecting an order from a high-priority customer imposes a high cost, so the acceptance rate of high-priority orders rises as the rejection cost increases. Accordingly, when the deferral penalty cost is large the enterprise will accept more lower-priority orders and appropriately reduce higher-priority ones, and when the rejection cost is larger it will increase acceptance of orders from high-priority customers. Therefore, when facing different deferral penalty costs and rejection costs, the AFVINN algorithm can adjust the order acceptance strategy in time and reduce the influence of the rejection cost on the average profit of the MTO enterprise as much as possible, so that the long-term average profit of the enterprise is maximized.
On the basis of the factors considered in the traditional MTO enterprise order acceptance problem, the invention adds the order inventory cost and multiple customer priority factors, constructs a Markov decision process order acceptance model, and solves it with the AFVINN algorithm; the algorithm not only converts the multidimensional state space of the MTO enterprise order acceptance problem into a one-dimensional state space and simplifies the solution process, but also keeps the long-term average profit of the enterprise at a high level.
Simulation experiments show that in the order acceptance problem of MTO enterprises, customer priority and inventory cost factors are important to the order acceptance strategy and profit of the enterprise; compared with the traditional Q-learning algorithm, the AFVINN algorithm converts a high-dimensional control problem into a one-dimensional control problem, improves sample utilization efficiency and simplifies the solution process; the AFVINN algorithm outperforms the FCFS method in maximizing the long-term average profit of the enterprise, has strong order selection capability and good adaptability to environmental change, and balances order profit against the various cost factors to bring higher profit to the MTO enterprise. The invention models orders that arrive dynamically (order information cannot be obtained in advance, and the currently arriving order is uncertain), which is more consistent with the actual order situation; the modeling and solving consider more comprehensive factors, such as the inventory cost factor and customer priority; and reducing the state space dimension reduces the computational difficulty of solving the model.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An MTO enterprise order processing method is characterized by comprising the following steps:
when a current order arrives at an order-oriented production (MTO) enterprise, establishing an order acceptance strategy model based on a Markov Decision Process (MDP) theory aiming at a current order queue and the current order of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
converting the order acceptance strategy model based on the MDP theory according to a post-state theory of the reinforcement learning algorithm to obtain an MDP model based on a post-state; reducing the solving difficulty of the MDP model based on the post state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain an MDP optimization model based on the post state;
and solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted for the current arriving order according to the solving result.
2. The method according to claim 1, wherein when a current order arrives at an order-oriented production MTO enterprise, an order acceptance policy model based on a markov decision process MDP theory is established for a current order queue and the currently arrived order of the MTO enterprise, specifically comprising:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of each order and the current arrival order in the current order queue according to the current order queue and the current arrival order, wherein the order information comprises: the customer priority mu, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, the orders reach Poisson distribution with a compliance parameter of lambda, and the price of a unit product and the quantity of a product required in the orders respectively comply with uniform distribution;
determining the faced return sub-items according to the information of each order in the current order queue and the current arriving order, wherein the faced return sub-items comprise: rejecting the rejection cost of the current arrival order, accepting the profit of the current arrival order, the deferred penalty cost of the order in the current order queue and the inventory cost of the order in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost is generated: μ × J, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise produces according to the principle of first-come first-serve, if a delayed order exists in the current order queue, the delivery time of the delayed order is within the latest delivery time index, and the MTO enterprise generates a deferred penalty cost Y to be paid to a customer corresponding to the delayed order:
Figure FDA0003145461270000021
wherein t represents the production time still needed in the accepted order, b represents the unit production capacity of the MTO enterprise, and u represents the unit product delay penalty cost of the MTO enterprise in unit time;
if the product of the order in the current order queue is generated and completed within the lead period and the product is temporarily stored in the warehouse, the product is temporarily stored in the MTO enterprise warehouse so as to generate the inventory cost N:
Figure FDA0003145461270000022
wherein h represents the inventory cost of a unit product per unit time;
according to the Markov decision process MDP theory, the information and return sub-items of each order and the current arrival order in the current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a four-tuple (S, A, f and R), the four-tuple is a state space S, an action space A, a state transfer function f and a reward function R, and the method comprises the following steps:
the state space S represents the state of the system in which the MTO enterprise order processing method operates; the state space S is an n×6-dimensional vector, where n represents the number of order types and 6 represents the 6 types of order information: the customer priority μ, the unit product price pr, the required product quantity q, the lead time lt, the latest delivery time dt, and the production time t still required by the orders in the current order queue, where t has a preset maximum upper limit;
the action space A represents the set of actions for the currently arriving order; when a current order arrives at time m, the MTO enterprise needs to make an action decision of accepting or rejecting the order, and these actions are integrated into the action space A = (a1, a2), where a1 indicates accepting the order and a2 indicates rejecting the order;
the state transition function f represents the state of transition from the current state to the m decision time, wherein the m decision time refers to the time of taking action aiming at the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt is independent and identically distributed, according to the initial state s and the action a taken at decision time m, the probability density function f(·|(s,a)) of the state at the next decision time m+1 is expressed through fM(x), fPR(x), fQ(x), fLT(x), fDT(x);
and the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} at the next decision time m+1 is independent of (s_m, a_m); wherein t_{m+1} is expressed as:
Figure FDA0003145461270000023
formula (1) indicates that t_{m+1} depends on (s_m, a_m), and that different (q_m, t_m, a_m) lead to different order production times; t_{m+1} is also affected by the order arrival time interval, where AT_{m→m+1} represents the arrival time interval between two orders, the arrival of orders following a Poisson distribution with parameter λ;
according to the current state s, the action a, and the property that the order information μ_{m+1}, pr_{m+1}, q_{m+1}, lt_{m+1}, dt_{m+1} is independent of (s_m, a_m), the conditional probability density of the state s′ at the next decision time m+1 is obtained and expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a),
wherein fT(t′|s,a) represents the density of the production time still required at the next decision time m+1 to produce the accepted orders after taking action a in the current state s, and the specific form of fT(t′|s,a) is defined by formula (1) and the associated random variables;
when the MTO enterprise is at the decision time m, the corresponding reward obtained after the action taken on the current arriving order is expressed by a reward function R, wherein the reward function R is expressed as:
R(s_m, a_m) = I − C − Y − N if a_m = 1, and R(s_m, a_m) = −μ × J if a_m = 0,
wherein a_m = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is −μ × J;
for any strategy in the order acceptance strategy model based on the MDP theory, a corresponding value function is defined according to a reward function, the average long-term profit corresponding to the strategy is represented by the value function, and the value function is represented as:
Figure FDA0003145461270000032
wherein pi represents any strategy, gamma represents future reward discount, gamma is more than 0 and less than or equal to 1, the summation item defined by the formula is ensured to have significance by setting gamma, n represents the total decision time quantity, and each current order corresponds to one decision time;
determining an optimal strategy pi by average long-term profit for an arbitrary strategy pi*Optimum strategy pi*The method is used for ensuring that the long-term profit of the enterprise is maximized, and the profit of the MTO enterprise is optimal at the moment; the optimal strategy pi*Expressed as:
Figure FDA0003145461270000033
where Π represents all policy sets.
3. The method according to claim 2, wherein the step of converting the MDP theory-based order acceptance policy model according to the post-state theory of the reinforcement learning algorithm to obtain the post-state-based MDP model includes:
after the currently arriving order is received and decided at time m, a post-state variable p_m is set according to the reinforcement learning algorithm, the post-state variable representing the production time still required, once action a_m has been selected at decision time m, to produce the orders that have been accepted; wherein the post-state is an intermediate variable between two consecutive states;
according to the current state s_m and the action a_m, the post-state variable p_m is determined and expressed as:
Figure FDA0003145461270000041
according to p_m, the production time t_{m+1} still required at the next decision time m+1 for the orders already accepted is expressed as:
t_{m+1} = (p_m − AT)^+   (5)
wherein, in formula (5), (x)^+ = max(x, 0) denotes the larger of the variable x and 0, and AT is the time interval between the arrivals of two orders; the conditional probability density of the next decision time state s′ = (μ′, pr′, q′, lt′, dt′, t′) is expressed, according to the current post-state, as:
f(s′|p) = fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|p)   (6)
wherein the conditional probability density function fT(·|p) is defined by formula (5) and the associated random variables;
after the post-state variable is set, the conditional expectation E[·] in the MDP theory is rewritten, and the cost function of the post-state is defined on this basis and expressed as:
J*(p)=γE[V*(s′)|p] (7)
the optimal strategy π* is constructed through the post-state cost function, whereby the problem is converted into a one-dimensional state space; the optimal strategy π* is expressed as:
π*(s) = argmax_{a∈A} { R(s,a) + J*(σ(s,a)) }   (8)
where σ(s,a) denotes the post-state reached from state s under action a.
4. the method according to claim 3, wherein the learning process of learning the parameter vector through the reinforcement learning algorithm reduces the solution difficulty of the post-state-based MDP model to obtain the post-state-based MDP optimization model, and specifically comprises:
the reinforcement learning algorithm constructs the optimal strategy π* through the post-state cost function J*; when solving, J* is not calculated directly, but an approximate solution of J* is realized through a learning process over a learned parameter vector; the approximate solution of J* through the learning process by learning the parameter vector is as follows:
determining, from a given parameter vector θ, the function J~(·;θ); and
performing parameter learning on the parameter vector θ from the data samples with the reinforcement learning algorithm to obtain the parameter vector θ*, and using J~(·;θ*) determined by the learned parameter vector θ* to approximate J*; the optimal strategy π* is then determined according to the approximation of J*.
5. The method according to claim 4, wherein the step of solving the post-state-based MDP optimization model using a three-layer artificial neural network to obtain a solution result, and the step of determining an optimal policy for whether a currently arriving order is accepted according to the solution result specifically comprises:
J*(p) is solved through a three-layer artificial neural network ANN, which approximates J*(p) to arbitrary precision; J~(p;θ) is expressed with the three-layer artificial neural network ANN as:
J~(p;θ) = Σ_{i=1}^{N} u_i·ΦH(w_i·p + α_i) + β   (11)
wherein the parameter vector can be represented as:
θ=[w1,...,wN,α1,...αN,u1,...uN,β],
ΦH(x)=1/(1+e-x)
the function J~(p;θ) of formula (11) is a three-layer single-input single-output neural network, which has a single-node input layer whose output represents the value of the post-state p, and a hidden layer containing N nodes, the input of the i-th node being the weighted post-state value w_i·p plus the hidden-layer bias α_i; the input-output relationship of each hidden-layer node is given by the function ΦH(·), where ΦH(·) is called the activation function; the output layer has one node, whose input is the sum of the weighted hidden-layer outputs and the output-layer bias β, and whose output represents the finally approximated function value J~(p;θ).
6. An MTO enterprise order processing system, comprising:
the model building unit is used for building an order acceptance strategy model based on an MDP theory in a Markov decision process aiming at a current order queue and a current arrival order of an MTO enterprise after the current order arrives at the MTO enterprise facing the order production, and the order acceptance strategy model based on the MDP theory is used for determining an optimal strategy; wherein the currently arriving order represents an order that the MTO enterprise receives but has not yet decided whether to accept; the strategy is to accept the current arrival order or reject the current arrival order, and the optimal strategy is to optimize the benefits of the MTO enterprise when the strategy is selected;
the model conversion unit is used for converting the order acceptance strategy model based on the MDP theory according to the post-state theory of the reinforcement learning algorithm, and obtaining the MDP model based on the post-state after conversion;
the model optimization unit is used for reducing the solving difficulty of the MDP model based on the rear state through the learning process of the learning parameter vector of the reinforcement learning algorithm to obtain the MDP optimization model based on the rear state;
and the solving unit is used for solving the MDP optimization model based on the post-state by adopting a three-layer artificial neural network to obtain a solving result, and determining whether the optimal strategy is accepted or not for the current arriving order according to the solving result.
7. The MTO enterprise order processing system according to claim 6, wherein said model building unit is specifically configured to:
assuming that each order is not split for production, the order is sent to the customer once production is completed, and the order cannot be changed or cancelled once the order is accepted by the MTO enterprise;
determining information of each order and the current arrival order in the current order queue according to the current order queue and the current arrival order, wherein the order information comprises: the customer priority mu, the unit product price pr, the required product quantity q, the lead time lt and the latest delivery time dt corresponding to the order; the orders in the current order queue are accepted orders, the orders reach Poisson distribution with a compliance parameter of lambda, and the price of a unit product and the quantity of a product required in the orders respectively comply with uniform distribution;
determining the faced return sub-items according to the information of each order in the current order queue and the current arriving order, wherein the faced return sub-items comprise: rejecting the rejection cost of the current arrival order, accepting the profit of the current arrival order, the deferred penalty cost of the order in the current order queue and the inventory cost of the order in the current order queue; wherein:
if the currently arriving order is rejected, a rejection cost is generated: μ × J, where J represents the rejection cost when customer priority is not considered;
if the currently arriving order is accepted, the profit I of the order is obtained: I = pr × q, while the production cost C is consumed: C = c × q, where c is the unit product production cost;
for each order in the current order queue, the MTO enterprise produces according to the principle of first-come first-serve, if a delayed order exists in the current order queue, the delivery time of the delayed order is within the latest delivery time index, and the MTO enterprise generates a deferred penalty cost Y to be paid to a customer corresponding to the delayed order:
Figure FDA0003145461270000061
wherein t represents the production time still needed in the accepted order, b represents the unit production capacity of the MTO enterprise, and u represents the unit product delay penalty cost of the MTO enterprise in unit time;
if the product of the order in the current order queue is generated and completed within the lead period and the product is temporarily stored in the warehouse, the product is temporarily stored in the MTO enterprise warehouse so as to generate the inventory cost N:
Figure FDA0003145461270000062
wherein h represents the inventory cost of a unit product per unit time;
according to the Markov decision process MDP theory, the information and return sub-items of each order and the current arrival order in the current order queue, establishing an order acceptance strategy model based on the MDP theory of the MTO enterprise, wherein the order acceptance strategy model based on the MDP theory of the MTO enterprise is a four-tuple (S, A, f and R), the four-tuple is a state space S, an action space A, a state transfer function f and a reward function R, and the method comprises the following steps:
the state space S represents the state of a system where the MTO enterprise order processing method is located; the state space S is an nx6 dimensional vector, where n represents the number of order types, and 6 represents 6 types of order information: the method comprises the following steps that (1) the priority mu of a customer, the price pr of a unit product, the quantity q of product demands, the lead time lt, and the production completion time t still needed by orders in a current order queue at the latest delivery time dt, wherein t has a preset maximum upper limit value;
action space A represents the set of actions for the currently arriving order; when a current order arrives at the time m, the MTO enterprise needs to make an action decision of accepting the order or rejecting the order, and the actions of accepting the order or rejecting the order are integrated into an action space quantity A, wherein A is (a)1,a2) Wherein a is1Indicating acceptance of an order, a2Indicating a rejection of the order;
the state transition function f represents the state of transition from the current state to the m decision time, wherein the m decision time refers to the time of taking action aiming at the current arriving order; the generation process of the state transfer function f is as follows:
assuming that the order information μ, pr, q, lt, dt are all independent and identically distributed, the probability density function f (· | (s, a)) of the state of the next decision time m +1 is expressed as f according to the initial state s and the action a that has been taken at the decision time mM(x)、fPR(x)、fQ(x)、fLT(x)、fDT(x);
And order information mu at the next decision time m +1m+1,prm+1,qm+1,ltm+1,dtm+1About(s)m,am) Is independent; wherein, tm+1Expressed as:
Figure FDA0003145461270000071
formula (1) represents tm+1To(s)m,am) And is different from (q)m,tm,am) Resulting in different order production times, and tm+1But also by order arrival time interval; wherein, ATm→m+1Representing the time interval of arrival between two orders, i.e. a poisson distribution subject to a parameter λ according to the arrival of each order;
according to the current state s andaction a and order information mum+1,prm+1,qm+1,ltm+1,dtm+1About(s)m,am) The method is an independent feature, and obtains the conditional probability density of the next decision time m +1 state s ', where the conditional probability density of the next decision time m +1 state s' is expressed as:
f(s′|s,a)=fM(μ′)*fPR(pr′)*fQ(q′)*fLT(lt′)*fDT(dt′)*fT(t′|s,a)
wherein f isT(t' | s, a) represents the production time still required for the next decision moment m +1 to produce the accepted order after taking action a in the current state s, fTThe specific form of (t' | s, a) is defined by formula (1) and the associated random variable;
when the MTO enterprise is at the decision time m, the corresponding reward obtained after the action taken on the current arriving order is expressed by a reward function R, wherein the reward function R is expressed as:
R(s_m, a_m) = I − C − Y − N if a_m = 1, and R(s_m, a_m) = −μ × J if a_m = 0,
wherein a_m = 1 means that the MTO enterprise accepts the currently arriving order, in which case the reward function R is I − C − Y − N;
and a_m = 0 means that the MTO enterprise rejects the currently arriving order, in which case the reward function R is −μ × J;
for any strategy in the order acceptance strategy model based on the MDP theory, a corresponding value function is defined according to a reward function, the average long-term profit corresponding to the strategy is represented by the value function, and the value function is represented as:
Figure FDA0003145461270000073
wherein pi represents any strategy, gamma represents future reward discount, gamma is more than 0 and less than or equal to 1, the summation item defined by the formula is ensured to have significance by setting gamma, n represents the total decision time quantity, and each current order corresponds to one decision time;
determining an optimal strategy pi by average long-term profit for an arbitrary strategy pi*Optimum strategy pi*The method is used for ensuring that the long-term profit of the enterprise is maximized, and the profit of the MTO enterprise is optimal at the moment; the optimal strategy pi*Expressed as:
Figure FDA0003145461270000074
where Π represents all policy sets.
8. The MTO enterprise order processing system according to claim 7, wherein said model conversion unit is specifically configured to:
after the current arrival order is received and decided at the moment m, a post-state variable p is set according to a reinforcement learning algorithmmSelecting action a at decision time m by means of a post-state variable representationmProduction time still required for orders that have been accepted by post production; wherein the post state is an intermediate variable between two consecutive states;
according to the current state smAnd action amDetermining a post-state variable pmThe post-state variable pmExpressed as:
Figure FDA0003145461270000081
according to pmThe production time t still needed by the order which has been accepted at the next decision moment m +1m+1Expressed as:
Figure FDA0003145461270000082
wherein, in the formula (5),
Figure FDA0003145461270000083
representing a larger number between variable x and 0, AT being the time interval between the arrival of two orders; the conditional probability density of the next decision time state S ' ═ of (μ ', p ', q ', lt ', dt ', t ') is expressed according to the current posterior state as:
Figure FDA0003145461270000084
wherein the conditional probability density function fT(. | p) is defined by equation (5) and the associated random variable;
rewriting the condition expectation E [ ] in the MDP theory after setting the post-state variables, and defining a cost function of the post-state after rewriting the condition expectation E [ ], expressing the cost function of the post-state as:
J*(p)=γE[V*(s′)|p] (7)
constructing an optimal policy π through a post-state cost function*Thereby optimizing the strategy by pi*Changing into one-dimensional state space, the optimal strategy pi*Expressed as:
Figure FDA0003145461270000085
9. the MTO enterprise order processing system according to claim 8, wherein the model optimization unit is specifically configured to:
the strengthening algorithm passes a state cost function J*Constructing an optimal strategy pi*Does not directly calculate J when solving*J is realized by learning process through learning parameter vector*The approximation of (a) is solved; implementing J with a learning process by learning a parameter vector*The approximation of (d) is solved as follows:
from a given parameter vector theta
Figure FDA0003145461270000086
And
performing parameter learning on the parameter vector theta from the data sample by adopting an enhanced algorithm to obtain the parameter vector theta*And using the parameter vector theta obtained by learning*Is determined
Figure FDA0003145461270000087
To approximate J*According to the approximation of J*Determining an optimal strategy pi*
10. The MTO enterprise order processing system according to claim 9, wherein said solving unit is specifically configured to:
solving for J through three-layer artificial neural network ANN*(p) mixing J*(p) to an arbitrary precision, will
Figure FDA0003145461270000088
The three-layer artificial neural network ANN is adopted to be expressed as:
Figure FDA0003145461270000091
wherein the parameter vector may be represented as:
θ=[w1,...,wN,α1,...αN,u1,...uN,β]
ΦH(x)=1/(1+e-x)
function of formula (11)
Figure FDA0003145461270000092
Is a three-layer single-input single-output neural network, which has a single-node input layer, the output of which represents the value of the post-state p, and a hidden layer containing N nodes, the input of the ith node is the weighted post-state value wiP and hidden layer bias αiThe sum of (1); and the output of each hidden layer nodeInput and output relation is formed by phiH(. o) a functional representation in whichH(. cndot.) is called an activation function; the output layer has a node, the input of the output layer is the sum of the weighted output of the hidden layer and the offset beta of the output layer, and the output of the output layer represents the function value of the final approximation
Figure FDA0003145461270000093
CN202110749378.1A 2021-07-02 2021-07-02 MTO enterprise order processing method and system Active CN113592240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749378.1A CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110749378.1A CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Publications (2)

Publication Number Publication Date
CN113592240A true CN113592240A (en) 2021-11-02
CN113592240B CN113592240B (en) 2023-10-13

Family

ID=78245474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749378.1A Active CN113592240B (en) 2021-07-02 2021-07-02 MTO enterprise order processing method and system

Country Status (1)

Country Link
CN (1) CN113592240B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning
CN111126905A (en) * 2019-12-16 2020-05-08 武汉理工大学 Casting enterprise raw material inventory management control method based on Markov decision theory
CN112149987A (en) * 2020-09-17 2020-12-29 清华大学 Multi-target flexible job shop scheduling method and device based on deep reinforcement learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990584A (en) * 2021-03-19 2021-06-18 山东大学 Automatic production decision system and method based on deep reinforcement learning
CN112990584B (en) * 2021-03-19 2022-08-02 山东大学 Automatic production decision system and method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113592240B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
Yimer et al. A genetic approach to two-phase optimization of dynamic supply chain scheduling
Gayon et al. Using imperfect advance demand information in production-inventory systems with multiple customer classes
Carr et al. The inverse newsvendor problem: Choosing an optimal demand portfolio for capacitated resources
Kleywegt et al. Stochastic optimization
US11593878B2 (en) Order execution for stock trading
Gilli et al. A global optimization heuristic for portfolio choice with VaR and expected shortfall
EP1922677A1 (en) Novel methods for supply chain management incorporating uncertainity
Yang et al. A new adaptive neural network and heuristics hybrid approach for job-shop scheduling
US20220414570A1 (en) Method and System of Demand Forecasting for Inventory Management of Slow-Moving Inventory in a Supply Chain
Taheri-Bavil-Oliaei et al. Bi-objective build-to-order supply chain network design under uncertainty and time-dependent demand: An automobile case study
CN113283671B (en) Method and device for predicting replenishment quantity, computer equipment and storage medium
Horng et al. Ordinal optimization based metaheuristic algorithm for optimal inventory policy of assemble-to-order systems
US11593877B2 (en) Order execution for stock trading
CN113592240A (en) Order processing method and system for MTO enterprise
Perakis et al. Leveraging the newsvendor for inventory distribution at a large fashion e-retailer with depth and capacity constraints
Chih-Ting Du et al. Building an active material requirements planning system
Bansal et al. Brief application description. neural networks based forecasting techniques for inventory control applications
Katanyukul et al. Approximate dynamic programming for an inventory problem: Empirical comparison
CN114742657A (en) Investment target planning method and system
Maiti et al. Two storage inventory model in a mixed environment
CN113298316A (en) Intelligent manufacturing framework and method based on block chain, scheduling matching method and model
CN113626966A (en) Inventory replenishment method, computer-readable storage medium and terminal device
Tang et al. Online learning and matching for multiproduct systems with general upgrading
Kaynov Deep Reinforcement Learning for Asymmetric One-Warehouse Multi-Retailer Inventory Management
Guo et al. Designing the Customer Order Decoupling Point to Facilitate Mass Customization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant