CN109754104B - Method, system, equipment and medium for optimizing enterprise supply chain by applying artificial intelligence


Info

Publication number: CN109754104B (grant); other version: CN109754104A (application)
Application number: CN201711071809.3A
Authority: CN (China)
Prior art keywords: network, value, supply chain, strategy, scores
Other languages: Chinese (zh)
Inventors: 费翔, 刘珂, 卢维成
Assignee (current and original): Suzhou Feiliu Technology Co ltd
Filed: 2017-11-03; published as CN109754104A on 2019-05-14; granted as CN109754104B on 2023-08-08
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method, a system, a device, and a medium for optimizing an enterprise supply chain by applying artificial intelligence, comprising the following steps: applying deep reinforcement learning to historical policies to obtain a policy network; analyzing the historical demand for different types of products to generate a time-varying market demand distribution; training the policy network with first sampled data drawn from the market demand distribution to obtain an enhanced policy network; applying deep reinforcement learning to the enhanced policy network to obtain a value network; the enhanced policy network derives initial scores for the various candidate next policies from real-time supply chain data; predicted market demand data are sampled to obtain second sampled data, from which the enhanced policy network derives first value scores for the candidate next policies while the value network derives second value scores, and the first and second value scores are combined into a value network composite score; the final next policy is determined by combining the value network composite score with the initial scores.

Description

Method, system, equipment and medium for optimizing enterprise supply chain by applying artificial intelligence
Technical Field
The invention relates to the technical field of supply chain management, in particular to a method, a system, equipment and a medium for optimizing an enterprise supply chain by applying artificial intelligence.
Background
The purpose of supply chain management is to integrate all processes in production and marketing activities by coordinating the activities of every link of the supply chain, so that enterprises can bring products from concept through research and development to market as quickly as possible and achieve optimal business performance. Efficient supply chain design, information sharing among supply chain members, inventory visibility, and well-coordinated production can lead to lower inventory levels, more efficient transportation operations, and improved order fulfillment and other critical business functions.
With the acceleration of globalization, the number of suppliers and the volume of information an enterprise must handle have increased dramatically; supply chain management has become more complex, more costly, and more vulnerable, facing challenges in cost control, supply chain visibility, risk management, diversified customer needs, and globalization. If these challenges are addressed with conventional supply chain management policies and designs, supply chain management becomes very difficult.
A variety of problems recur in the daily business of an enterprise, and these problems are interrelated, with departments passing blame to one another: for example, the production department complains about stock shortages caused by the purchasing department and the rapidly changing demands of the sales department; the purchasing department complains that the sales department does not provide sales plans in time and that production plans are adjusted too frequently; the sales department complains that the production department cannot meet the sales plan; and the marketing department promises customers delivery times that are too short to be met, and so on.
Because existing enterprises lack supply chain management and optimization modules, as the scale of the supply chain grows and its structure becomes increasingly complex, a problem in any single link of the supply chain can easily cause the whole chain to collapse.
Disclosure of Invention
In view of the above-described shortcomings of the prior art, it is an object of the present invention to provide a method, system, apparatus and medium for optimizing an enterprise supply chain using artificial intelligence, which solves the above-described problems of the prior art.
To achieve the above and other related objects, the present invention provides a method for optimizing an enterprise supply chain using artificial intelligence, comprising: applying deep reinforcement learning to historical policies to obtain a policy network; analyzing the historical demand for different types of products to generate a time-varying market demand distribution; training the policy network with first sampled data of the market demand distribution to obtain an enhanced policy network; applying deep reinforcement learning to the enhanced policy network to obtain a value network; the enhanced policy network obtaining initial scores of the various candidate next policies from real-time supply chain data; sampling predicted market demand data to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and determining the final next policy by combining the value network composite score with the initial scores.
In an embodiment of the present invention, the supply chain passes through one or more supply chain states in each supply chain decision period, and each supply chain state transitions to the next supply chain state when an action is taken; each state transition produces a corresponding feedback, and the feedbacks within one supply chain decision period accumulate into an accumulated feedback; each action corresponds to a policy.
In an embodiment of the present invention, one supply chain period is taken as a sample, and the policy network obtains, for each supply chain state in each sample, the optimal action value and the maximum accumulated feedback.
In an embodiment of the present invention, the policy network is a deep neural network trained with a deep deterministic policy gradient model, the model comprising an action network and a correction network: the action network takes the supply chain state as input and outputs an action value; the correction network takes the supply chain state and the action value output by the action network as input and outputs the accumulated feedback. The correction network updates the accumulated feedback using the Bellman equation, and the action network is updated from the derivative that the correction network computes with respect to the input action value.
In one embodiment of the present invention, training the policy network with the first sampled data of the market demand distribution to obtain an enhanced policy network includes: generating a sample in each supply chain period by combining the first sampled data with the information of each business department on the supply chain, and extracting from it the supply chain states and the required feedback; the policy network processing each sample, that is, generating and outputting the optimal action value and the maximum accumulated feedback of each supply chain period from the extracted supply chain states and required feedback; and the policy network becoming the enhanced policy network once the number of processed samples exceeds a first preset threshold.
In one embodiment of the present invention, after a sample is generated, it is stored in the policy network for the next round of sampling of the market demand distribution.
In one embodiment of the present invention, applying deep reinforcement learning to the enhanced policy network to obtain a value network includes: taking the supply chain state at step t of a supply chain period as a feature state, inputting the feature state as a feature value, taking the supply chain period containing the feature state as a sample, and having the enhanced policy network calculate the corresponding accumulated feedback as a label, so as to construct a deep convolutional network as the value network; where t is a random number in [1, T] and T is the maximum number of steps in the supply chain period.
In an embodiment of the present invention, calculating the accumulated feedback of a sample via the enhanced policy network, according to the supply chain period in which the feature state is located, includes: the enhanced policy network outputting the action values up to step t; and the action value of each step after step t being determined randomly.
In an embodiment of the present invention, sampling the predicted market demand data to obtain second sampled data, having the enhanced policy network obtain first value scores of the candidate next policies from the second sampled data and the value network obtain second value scores, combining the first and second value scores into a value network composite score, and updating the initial scores through the value network composite score, includes: at time x, generating the corresponding feature state from the real-time business information of each business department on the supply chain and inputting it into the enhanced policy network, whose regression layer is set to a softmax function so that it outputs the initial scores of the various possible actions that may be taken in that feature state; predicting the market demand data for a predetermined future time period after time x; inputting second sampled data of the predicted market demand data into the enhanced policy network to obtain the first value score of each possible action in the feature state, having the value network obtain the second value score of each possible action in the feature state, and combining the first and second value scores into a current value network score; iterating this step until a maximum preset number of iterations is reached, and computing the value network composite score from the current value network scores obtained in each iteration. Combining the value network composite score with the initial scores to determine the final next policy includes: among the possible actions, taking the action value with the highest combined initial score and value network composite score as the final next action value, its corresponding policy being the final next policy.
In an embodiment of the present invention, the historical policies include expert historical policies and/or the enterprise's own historical policies.
To achieve the above and other related objects, the present invention provides a system for optimizing an enterprise supply chain using artificial intelligence, comprising: a policy network obtained by applying deep reinforcement learning to historical policies; a market demand analyzer for generating a time-varying market demand distribution by analyzing the historical demand for different types of products; an enhanced policy network obtained by training the policy network with first sampled data of the market demand distribution; and a value network obtained by applying deep reinforcement learning to the enhanced policy network; wherein the enhanced policy network obtains initial scores of the various candidate next policies from real-time supply chain data; predicted market demand data are sampled to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and the final next policy is determined by combining the value network composite score with the initial scores.
To achieve the above and other related objects, the present invention provides a computer device, comprising a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device performs the method for optimizing an enterprise supply chain by applying artificial intelligence.
To achieve the above and other related objects, the present invention provides a computer storage medium having a computer program stored thereon which, when executed by a processor, performs the method for optimizing an enterprise supply chain by applying artificial intelligence.
As described above, the present invention provides a method, a system, a device, and a medium for optimizing an enterprise supply chain by applying artificial intelligence, comprising: applying deep reinforcement learning to historical policies to obtain a policy network; analyzing the historical demand for different types of products to generate a time-varying market demand distribution; training the policy network with first sampled data of the market demand distribution to obtain an enhanced policy network; applying deep reinforcement learning to the enhanced policy network to obtain a value network; the enhanced policy network obtaining initial scores of the various candidate next policies from real-time supply chain data; sampling predicted market demand data to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and determining the final next policy by combining the value network composite score with the initial scores.
The invention uses Deep Reinforcement Learning to imitate human thinking: when solving the supply chain optimization problem, past experience is used to weigh both the optimization of the current state and the influence of the adopted policy on the future benefit of the supply chain. By modeling the "local" supply chain optimization strategies of human supply chain decision makers with deep reinforcement learning, the system gains the ability to learn autonomously and evolve on its own; by further fusing a Monte Carlo simulation algorithm with the deep neural networks, the system can, like a human, "think" its way from local optima toward a global optimum.
Drawings
FIG. 1 is a flow chart of a method for optimizing an enterprise supply chain using artificial intelligence in accordance with an embodiment of the present invention.
FIG. 2 is a flow chart of the enhanced policy network generation in an embodiment of the invention.
FIG. 3 is a flow chart illustrating the generation of a value network according to an embodiment of the present invention.
FIG. 4 is a flow chart of real-time policy evaluation according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details in this description may be modified or changed in various respects without departing from the spirit of the invention. It should be noted that, where no conflict arises, the embodiments and the features of the embodiments may be combined with one another.
The basic idea of the invention is to use Deep Reinforcement Learning to imitate human thinking: when handling the complexity of supply chain optimization, past experience is used to weigh both the optimization of the current state and the influence of the adopted policy on the future benefit of the supply chain. By modeling the "local" supply chain optimization strategies of human supply chain decision makers with deep reinforcement learning, the system gains the ability to learn autonomously and evolve on its own; by further fusing a Monte Carlo simulation algorithm with the deep neural networks, the system can, like a human, "think" its way from local optima toward a global optimum.
The scheme of the invention is therefore divided into two parts: an offline learning part, which models the "local" supply chain optimization strategies of human supply chain decision makers and gives the system the ability to learn autonomously and self-evolve; and a real-time policy evaluation part, which fuses a Monte Carlo simulation algorithm with the deep neural networks so that the system can, through "thinking", reach the "global optimum" from the "local optimum" as a human would.
As shown in fig. 1, a flow chart of a method of optimizing an enterprise supply chain using artificial intelligence of the present invention is shown, which includes:
Step S101: applying deep reinforcement learning to historical policies to obtain a policy network.
Corresponding to the supply chain period, define:
State s: at each time node t, the environment in which the supply chain finds itself, e.g., enterprise-wide information such as inventory quantity, purchase quantity, production quantity, and restocking quantity.
Action a: in each State, a value of an action the supply chain can take, e.g., changing the operational mode (MTO, ATO, MTS, etc.), adjusting forecasting procedures, changing the safety stock strategy, changing planning procedures, changing the inventory-shortage allocation strategy, reducing the number of product varieties, and so on. Each time an Action is taken, the supply chain moves to the next State accordingly.
Reward r: with each State, the supply chain receives a Reward; e.g., an inventory backlog yields a negative feedback, while a reduction in enterprise operating cost yields a positive feedback.
Policy P: the action-value policy; each Action taken corresponds to a policy.
The supply chain passes through one or more supply chain states in each supply chain decision period, and each supply chain State transitions to the next supply chain State when an Action value a is taken; each state transition produces a corresponding feedback, and the feedbacks within one decision period accumulate into an accumulated feedback. Each action corresponds to a policy.
Further, a supply chain decision period may be modeled as a Markov Decision Process (MDP), repeatedly advancing through state, action, state, ... until the end of the period; the trajectory of one period is defined as a sample:
sample 1: (s1, a1, r1, s2, a2, r2, ..., sn)
sample 2: (s1, a1, r1, s2, a2, r2, ..., sn)
......
A policy network is built with the goal of using a finite MDP with a discount factor γ to learn, from the historical policies, a policy that maximizes the accumulated feedback Q over the supply chain period, calculated as follows:
Q_0 = r_1 + γ·r_2 + γ²·r_3 + … + γⁿ·r_{n+1} = r_1 + γ·Q_1
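For concreteness, this recursion can be computed in a single backward pass over the feedbacks of one period. The following Python sketch is illustrative only; the function name and interface are assumptions, not part of the patent:

```python
# Minimal sketch: discounted accumulated feedback Q_0 = r_1 + γ·r_2 + γ²·r_3 + ...
# computed by the backward recursion Q_t = r_{t+1} + γ·Q_{t+1}.
def cumulative_feedback(rewards, gamma=0.99):
    """Return Q_0 for the feedback sequence of one supply chain decision period."""
    q = 0.0
    for r in reversed(rewards):
        q = r + gamma * q
    return q

# Example: three feedbacks with discount factor 0.9:
# 1.0 + 0.9*(-0.5) + 0.81*2.0 = 2.17
print(cumulative_feedback([1.0, -0.5, 2.0], gamma=0.9))
```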
The historical policies include expert historical policies, i.e., historical policies from supply chain experts, and/or the enterprise's own historical policies, i.e., policies the enterprise has accumulated over past years of operation.
The policy network is built by training a Deep Deterministic Policy Gradient (DDPG) model on the historical policies. In the policy network, besides an action network (Actor Network) that directly estimates the action value, a correction network (Critic Network) is used to estimate the Q value: the action network takes the State as input and outputs the action value (Actions); the correction network takes the State together with the Actions generated by the action network as input and outputs the corresponding Q value, which is continuously updated using the Bellman Equation; and the action network is updated from the derivative that the correction network computes with respect to the input Actions.
The DDPG model trains the action network and the correction network to obtain the weights of the two networks, so that the policy network outputs the optimal action value A and the maximum accumulated feedback Q for each supply chain State in each sample.
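The following PyTorch sketch illustrates one DDPG update step as just described. The layer sizes, learning rates, and the replay-batch interface are assumptions for illustration (target networks and exploration noise are omitted for brevity); it is not the patent's implementation:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 16, 4  # assumed sizes of State s and Action a

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())        # action network
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                           # correction network
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ddpg_step(s, a, r, s_next):
    """One update from a replay batch; s, a, s_next are [batch, dim], r is [batch, 1]."""
    # Correction network: regress Q(s, a) onto the Bellman target
    # r + gamma * Q(s', actor(s')).
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Action network: ascend the correction network's estimate of Q(s, actor(s));
    # the critic's derivative with respect to the action is what updates the actor.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```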
Step S102: a time-varying market demand distribution is generated by analyzing the historical demand for different types of products.
Step S103: the enhanced policy network is obtained by training the policy network with first sampled data of the market demand distribution.
Fig. 2 shows the generation process of the enhanced policy network in an embodiment of the present invention.
A sample k is generated in each supply chain period, each sample containing n supply chain states and their corresponding actions; preferably, k is stored in the experience replay memory of the policy network for the next round of sampling. The first sampled data and the business information of each business department on the supply chain (for example, the process settings of each department, drawn from ERP, MES, MRP, and similar systems) are combined and fed into a supply chain simulator as configuration parameters, so that the simulator reproduces the actual production and operating environment of the enterprise. From the simulator, the supply chain state S and the required feedback r are extracted and input into the policy network, and a counter t is set. The policy network processes each sample as follows: from each extracted supply chain state S and the required feedback within a supply chain period, it generates and outputs the optimal action value A and the maximum accumulated feedback Q, and t is incremented by 1;
if t is less than or equal to n, i.e., the sample has not been fully processed, sampling is repeated and the result is input into the policy network;
if t is greater than n, i.e., one sample has been processed, a new sample k is generated and input into the policy network for processing;
if the number of samples processed by the policy network exceeds a first preset threshold K, i.e., k > K, the enhanced policy network is formed (see the sketch below).
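The loop above can be summarized as follows; `SupplyChainSimulator`-style methods (`configure`, `run_period`), `policy_net`, and `replay_memory` are hypothetical names standing in for the simulator and networks described in this embodiment:

```python
# Sketch of the fig. 2 loop under an assumed simulator interface.
def train_enhanced_policy(policy_net, simulator, demand_distribution, replay_memory, K):
    k = 0
    while k <= K:                                  # K: first preset threshold
        demand = demand_distribution.sample()      # first sampled data
        simulator.configure(demand)                # plus ERP/MES/MRP settings
        sample = simulator.run_period()            # n (state S, feedback r) pairs
        replay_memory.append(sample)               # stored for the next sampling
        for state, feedback in sample:             # counter t runs from 1 to n
            action, q = policy_net.step(state, feedback)  # optimal A, maximum Q
        k += 1
    return policy_net                              # now the enhanced policy network
```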
Step S104: applying deep reinforcement learning to the enhanced policy network to obtain a value network.
Fig. 3 shows how the value network is formed.
The enhanced policy network completes the policy output for the first t steps of one sample (a supply chain period), where t is a random number in [1, T] and T is the maximum number of steps in one supply chain period. The supply chain state at step t is taken as the feature state t-state and input as the feature value; the supply chain period containing the feature state is taken as a sample, and the corresponding accumulated feedback Q, calculated through the enhanced policy network, is taken as the label; a deep convolutional network (CNN) is constructed on this data as the value network.
In an embodiment of the present invention, calculating the accumulated feedback of a sample via the enhanced policy network, according to the supply chain period in which the feature state is located, includes: the enhanced policy network outputs the action values up to step t; the action value of each step after step t is determined randomly, which increases diversity and prevents overfitting; the rollout of the supply chain period is then completed until t > T, yielding the accumulated feedback Q (a sketch follows).
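A sketch of generating one (feature state, label) training pair for the value network under these rules; `enhanced_policy` and `simulator` are assumed interfaces, `cumulative_feedback` is the helper sketched earlier, and taking the label as the feedback accumulated after the t-state is an assumption where the text is ambiguous:

```python
import random

def value_net_training_pair(enhanced_policy, simulator, T, gamma=0.99):
    """Build one (t-state feature, accumulated-feedback label) pair (fig. 3)."""
    t = random.randint(1, T)                      # t is a random step in [1, T]
    state = simulator.reset()
    feature_state, rewards = None, []
    for step in range(1, T + 1):
        if step <= t:
            action = enhanced_policy.act(state)   # policy output up to step t
        else:
            action = simulator.random_action()    # random actions after step t
        state, reward = simulator.step(action)
        rewards.append(reward)
        if step == t:
            feature_state = state                 # the t-state input feature
    label = cumulative_feedback(rewards[t:], gamma)  # feedback after the t-state
    return feature_state, label
```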
Step S105: the enhanced policy network obtains initial scores of the various candidate next policies from the real-time supply chain data.
Step S106: the predicted market demand data is sampled to obtain second sampled data; the enhanced policy network obtains first value scores of the candidate next policies from the second sampled data, the value network obtains second value scores of the candidate next policies, and the first and second value scores are combined into a value network composite score.
It should be noted that the first sampled data and/or the second sampled data are obtained by randomly sampling the market demand distribution; the market demand prediction method may be a Monte Carlo algorithm, an arithmetic average method, a moving average method, or the like.
Step S107: the final next policy is determined by combining the value network composite score with the initial scores.
In an embodiment of the present invention, steps S101 to S104 constitute the offline deep learning part, and steps S105 to S107 constitute the real-time next-step policy evaluation part shown in fig. 4; specifically:
at the time x, generating corresponding characteristic state input enhancement strategy network according to real-time service information of each service department on a supply chain, setting a regression layer of the enhancement strategy network as a softmax function, so that the output of the regression layer adopts initial scores P of various possible actions under the characteristic state, for example, increasing inventory action corresponding scores to 0.3, changing logistics distribution action corresponding scores to 0.2 and the like.
The market demand data in a predetermined future time period W after time x is predicted using a Monte Carlo simulation algorithm, taking into account influencing factors such as promotion events, seasonal variations, holiday effects, economic order quantities, ordering cycles, and forecast span settings.
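A Monte Carlo sketch of such demand forecasting; the factor values and distributions below are illustrative assumptions, not figures from the patent:

```python
import random

def simulate_demand_path(base_demand, weeks, seasonality, promo_prob=0.1):
    """One simulated demand path over the future window W (here in weeks)."""
    path = []
    for w in range(weeks):
        season = seasonality[w % len(seasonality)]            # seasonal variation
        promo = 1.3 if random.random() < promo_prob else 1.0  # promotion events
        noise = random.gauss(1.0, 0.05)                       # residual uncertainty
        path.append(base_demand * season * promo * noise)
    return path

# Each path is one scenario from which second sampled data can be drawn.
scenarios = [simulate_demand_path(100.0, 12, [0.9, 1.0, 1.1, 1.2]) for _ in range(1000)]
```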
From the second sampled data of the market demand forecast (which may be drawn by random sampling; see the Monte Carlo simulation algorithm), the enhanced policy network takes the accumulated feedback Q of the samples generated by taking each possible action a in the feature state as the first value score A, and the value network takes the accumulated feedback it obtains for each possible action a in the feature state as the second value score B.
The current value score V(s, a, i) is obtained from A and B, where s is the feature state, a is the action value in that feature state, and i is the iteration count; each time V(s, a, i) is computed, i is incremented by 1, the market demand data for the predetermined duration is predicted again, sampled to obtain new second sampled data, and fed into the enhanced policy network.
V(s, a, i) = (1 - ε)·A + ε·B;
N(s, a, i) = N(s, a, i) + 1, if V(s, a, i) > 0;
where ε is a real number between 0 and 1. The formula states that, at the current time x and in feature state s, when action a is taken, the scores of the enhanced policy network and the value network are combined, with 1-ε and ε serving as the weights of A and B respectively; the weights are adjusted automatically by the system according to the reliability of the value network and the enhanced policy network.
Each V(s, a, i) is obtained iteratively until i reaches the maximum preset number of iterations N, and the value network composite score V is computed from all the obtained V(s, a, i); for example, V may be taken as the average of all V(s, a, i).
When i > N, the action of the optimal next policy in the feature state is computed by the formula:
a* = argmax_a (P + V), i.e., among all possible actions, the action a with the highest P + V score is the action of the optimal next policy (a sketch follows).
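Putting steps S105 to S107 together, a sketch of the real-time evaluation loop follows; all helper names (`initial_scores`, `value_score`, `forecast.sample`) are assumed interfaces, not patent text, and P is taken to be a mapping from action to initial score:

```python
def choose_next_action(enhanced_policy, value_net, feature_state, forecast, N, eps=0.5):
    """Iterate N Monte Carlo rounds, form V(s, a, i), and return argmax_a (P + V)."""
    P = enhanced_policy.initial_scores(feature_state)   # softmax initial scores per action
    V_sum = {a: 0.0 for a in P}
    for i in range(N):                                  # maximum preset iterations N
        sample = forecast.sample()                      # second sampled data
        for a in P:
            A = enhanced_policy.value_score(feature_state, a, sample)  # first value score
            B = value_net.value_score(feature_state, a)                # second value score
            V_sum[a] += (1 - eps) * A + eps * B         # V(s, a, i)
    V = {a: V_sum[a] / N for a in P}                    # composite score (average)
    return max(P, key=lambda a: P[a] + V[a])            # a* = argmax_a (P + V)
```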
The embodiment of the invention also provides a system for optimizing an enterprise supply chain by applying artificial intelligence, comprising: a policy network obtained by applying deep reinforcement learning to historical policies; a market demand analyzer for generating a time-varying market demand distribution by analyzing the historical demand for different types of products; an enhanced policy network obtained by training the policy network with first sampled data of the market demand distribution; and a value network obtained by applying deep reinforcement learning to the enhanced policy network; wherein the enhanced policy network obtains initial scores of the various candidate next policies from real-time supply chain data; predicted market demand data are sampled to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and the final next policy is determined by combining the value network composite score with the initial scores.
It should be noted that the division into modules in the above system embodiment is only a division of logical functions; in practice the modules may be wholly or partly integrated into one physical entity or physically separated. The modules may all be implemented as software invoked by a processing element, all in hardware, or partly as software invoked by a processing element and partly in hardware. For example, a module may be a separately established processing element, may be integrated into a chip of the system, or may be stored in the memory of the system in the form of program code to be invoked by a processing element of the system to perform its function; the other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit in hardware within a processor element or by instructions in the form of software.
For example, the modules above may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when a module is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can invoke the program code. For yet another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The embodiment of the invention also provides a computer device, comprising a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device performs the above method for optimizing an enterprise supply chain by applying artificial intelligence.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer system, connecting its various parts through interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the computer system by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area: the program storage area may store the operating system and the application programs required for at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to use (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The embodiment of the present invention also provides a computer storage medium having a computer program stored thereon which, when executed by a processor, implements the above method for optimizing an enterprise supply chain by applying artificial intelligence.
It should be noted that the computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be adjusted as required by the legislation and patent practice of each jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
In summary, the present invention provides a method, a system, a device, and a medium for optimizing an enterprise supply chain by applying artificial intelligence, comprising: applying deep reinforcement learning to historical policies to obtain a policy network; analyzing the historical demand for different types of products to generate a time-varying market demand distribution; training the policy network with first sampled data of the market demand distribution to obtain an enhanced policy network; applying deep reinforcement learning to the enhanced policy network to obtain a value network; the enhanced policy network obtaining initial scores of the various candidate next policies from real-time supply chain data; sampling predicted market demand data to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and determining the final next policy by combining the value network composite score with the initial scores.
The invention uses Deep Reinforcement Learning to imitate human thinking: when solving the supply chain optimization problem, past experience is used to weigh both the optimization of the current state and the influence of the adopted policy on the future benefit of the supply chain. By modeling the "local" supply chain optimization strategies of human supply chain decision makers with deep reinforcement learning, the system gains the ability to learn autonomously and evolve on its own; by further fusing a Monte Carlo simulation algorithm with the deep neural networks, the system can, like a human, "think" its way from local optima toward a global optimum.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (13)

1. A method for optimizing an enterprise supply chain using artificial intelligence, comprising:
applying deep reinforcement learning to historical policies to obtain a policy network;
analyzing the historical demand for different types of products to generate a time-varying market demand distribution;
training the policy network with first sampled data of the market demand distribution to obtain an enhanced policy network;
applying deep reinforcement learning to the enhanced policy network to obtain a value network;
the enhanced policy network obtaining initial scores of various candidate next policies from real-time supply chain data;
sampling predicted market demand data to obtain second sampled data, the enhanced policy network obtaining first value scores of the candidate next policies from the second sampled data, the value network obtaining second value scores of the candidate next policies, and the first and second value scores being combined into a value network composite score;
combining the value network composite score and the initial scores to determine a final next policy, comprising: among the possible actions in a supply chain state, taking the action value with the highest combined initial score and value network composite score as the final next action value, the corresponding policy being the final next policy.
2. The method of claim 1, wherein the supply chain passes through one or more supply chain states in each supply chain decision period, each supply chain state transitioning to the next supply chain state when an action is taken, each state transition forming a corresponding feedback, and the feedbacks within one supply chain decision period accumulating into an accumulated feedback;
each action corresponds to a policy.
3. The method of claim 2, wherein, with one supply chain period as a sample, the policy network obtains the optimal action value and the maximum accumulated feedback for each supply chain state in each sample.
4. A method according to claim 2 or 3, wherein the policy network is a deep neural network trained with a deep deterministic policy gradient model, the model comprising an action network and a correction network: the action network takes the supply chain state as input and outputs an action value; the correction network takes the supply chain state and the action value output by the action network as input and outputs the accumulated feedback; the correction network updates the accumulated feedback using the Bellman equation, and the action network is updated from the derivative that the correction network computes with respect to the input action value.
5. A method according to claim 3, wherein training the policy network with the first sampled data of the market demand distribution to obtain an enhanced policy network comprises:
generating a sample in each supply chain period by combining the first sampled data with the business information of each business department on the supply chain, and extracting from it the supply chain states and the required feedback;
the policy network processing each sample, including generating and outputting the optimal action value and the maximum accumulated feedback of each supply chain period from the extracted supply chain states and required feedback; and
the policy network becoming the enhanced policy network once the number of processed samples exceeds a first preset threshold.
6. The method of claim 5, wherein, after a sample is generated, it is stored in the policy network for the next round of sampling of the market demand distribution.
7. The method of claim 5, wherein applying deep reinforcement learning to the enhanced policy network to obtain a value network comprises:
taking the supply chain state at step t of a supply chain period as a feature state, inputting the feature state as a feature value, taking the supply chain period containing the feature state as a sample, and having the enhanced policy network calculate the corresponding accumulated feedback as a label, so as to construct a deep convolutional network as the value network; where t is a random number in [1, T] and T is the maximum number of steps in the supply chain period.
8. The method of claim 7, wherein calculating the accumulated feedback of a sample via the enhanced policy network, according to the supply chain period in which the feature state is located, comprises:
the enhanced policy network outputting the action values up to step t; and
randomly determining the action value of each step after step t.
9. The method of claim 7, wherein sampling the predicted market demand data to obtain second sampled data, the enhanced policy network obtaining first value scores of the candidate next policies from the second sampled data and the value network obtaining second value scores of the candidate next policies, combining the first and second value scores into a value network composite score, and updating the initial scores through the value network composite score, comprises:
at time x, generating the corresponding feature state from the real-time business information of each business department on the supply chain and inputting it into the enhanced policy network, whose regression layer is set to a softmax function so that it outputs the initial scores of the various possible actions that may be taken in the feature state;
predicting the market demand data in a predetermined future time period after time x;
inputting second sampled data of the predicted market demand data into the enhanced policy network to obtain the first value score of each of the possible actions in the feature state, having the value network obtain the second value score of each of the possible actions in the feature state, and combining the first and second value scores into a current value network score;
iterating the step of obtaining the current value network score until a maximum preset number of iterations is reached, and computing the value network composite score from the current value network scores obtained in each iteration.
10. The method of claim 1, wherein the historical policies comprise expert historical policies and/or the enterprise's own historical policies.
11. A system for optimizing an enterprise supply chain using artificial intelligence, comprising:
a policy network obtained by applying deep reinforcement learning to historical policies;
a market demand analyzer for generating a time-varying market demand distribution by analyzing the historical demand for different types of products;
an enhanced policy network obtained by training the policy network with first sampled data of the market demand distribution; and
a value network obtained by applying deep reinforcement learning to the enhanced policy network;
wherein the enhanced policy network obtains initial scores of various candidate next policies from real-time supply chain data; predicted market demand data are sampled to obtain second sampled data, from which the enhanced policy network obtains first value scores of the candidate next policies while the value network obtains second value scores, the first and second value scores being combined into a value network composite score; and the final next policy is determined by combining the value network composite score and the initial scores, comprising: among the possible actions in a supply chain state, taking the action value with the highest combined initial score and value network composite score as the final next action value, the corresponding policy being the final next policy.
12. A computer device, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so as to cause the device to perform the method according to any one of claims 1 to 10.
13. A computer storage medium having a computer program stored thereon which, when executed by a processor, implements the method according to any one of claims 1 to 10.
Application CN201711071809.3A, filed 2017-11-03 (priority date 2017-11-03): Method, system, equipment and medium for optimizing enterprise supply chain by applying artificial intelligence — granted as CN109754104B (Active).

Priority Applications (1)

Application Number: CN201711071809.3A
Priority Date / Filing Date: 2017-11-03
Title: Method, system, equipment and medium for optimizing enterprise supply chain by applying artificial intelligence

Publications (2)

CN109754104A, published 2019-05-14
CN109754104B, granted 2023-08-08

Family ID: 66398700

Family Applications (1)

CN201711071809.3A (Active), filed 2017-11-03: Method, system, equipment and medium for optimizing enterprise supply chain by applying artificial intelligence

Country Status (1)

CN: CN109754104B (en)




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
GR01: Patent grant

Change of applicant information (CB02):

Address after: 19/F, North Building (A Building), Yunshang Building, No. 1, Mogan Road, Changshu City, Suzhou City, Jiangsu Province, 215556
Applicant after: Suzhou Feiliu Technology Co.,Ltd.
Address before: 200122, Room 1005, No. 707 Zhangyang Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai
Applicant before: FEILIU (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.