WO2022135066A1 - Temporal difference-based hybrid flow-shop scheduling method - Google Patents

Temporal difference-based hybrid flow-shop scheduling method

Info

Publication number
WO2022135066A1
WO2022135066A1 (PCT/CN2021/133905)
Authority
WO
WIPO (PCT)
Prior art keywords
state
behavior
scheduling
value
machine
Prior art date
Application number
PCT/CN2021/133905
Other languages
French (fr)
Chinese (zh)
Inventor
陆宝春
陈志峰
顾钱
翁朝阳
张卫
张哲
Original Assignee
南京理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京理工大学 (Nanjing University of Science and Technology)
Publication of WO2022135066A1 publication Critical patent/WO2022135066A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06316Sequencing of tasks or work
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • The invention belongs to hybrid flow-shop scheduling control technology, and in particular relates to a temporal difference-based hybrid flow-shop scheduling method.
  • The hybrid flow-shop scheduling problem (HFSP) is also known as the flexible flow-shop scheduling problem.
  • The problem can be regarded as a combination of the classical flow-shop scheduling problem and the parallel-machine scheduling problem; it is characterized by parallel-machine stages in the processing route, so machine allocation must be performed while the workpiece processing sequence is determined.
  • In the HFSP, the number of processors in at least one stage is greater than 1, which greatly increases the difficulty of solving the problem; even the two-stage HFSP with 2 and 1 processors has been proved to be NP-hard.
  • Exact algorithms, including mathematical programming and branch-and-bound methods, can obtain optimal solutions for small-scale problems; for large-scale practical scheduling problems, heuristic and meta-heuristic algorithms have attracted researchers' attention because they can obtain near-optimal solutions in a relatively short time.
  • However, heuristic and meta-heuristic algorithms design rules and algorithms for specific instances and are not suited to complex and changeable actual production environments, whereas reinforcement learning algorithms can generate scheduling policies that adapt to the actual production state.
  • Wei Y and Zhao M used Q-learning to select combined dispatching rules in the job shop by defining a "production pressure" feature and two-step scheduling rules, but the tabular reinforcement learning model adopted by this method cannot describe the actual complex processing process.
  • Zhang Zhicong and Zheng Li defined 15 state features for each machine and used the TD method to train a linear state-value function approximator to solve the NPFS problem, but a linear function approximator has limited fitting and generalization capability.
  • Deep reinforcement learning algorithms can overcome the limited capacity of such function approximators.
  • The weight-sharing strategy of the convolutional neural network reduces the number of parameters to be trained, and the shared weights allow a filter to detect a signal regardless of its position, which gives the trained model stronger generalization ability; however, there is little research at home and abroad on deep reinforcement learning algorithms for shop scheduling problems.
  • The purpose of the present invention is to provide a temporal difference-based hybrid flow-shop scheduling method to solve the hybrid flow-shop scheduling problem with related parallel machines.
  • The temporal difference-based hybrid flow-shop scheduling method described in the present invention takes minimizing the weighted average completion time as the scheduling goal, combines a neural network with reinforcement learning, and trains the model with the temporal difference method.
  • The model uses existing scheduling knowledge and empirical rules to refine the candidate scheduling-decision behaviors, and combines the online evaluation-execution mechanism of reinforcement learning to select the optimal combined behavior strategy for each scheduling decision of the scheduling system; it includes the following steps:
  • Step 1: Obtain the production constraints and objective function according to the production characteristics of the hybrid flow shop, introduce the machine state features, build the hybrid flow-shop scheduling environment and initialize it, initialize the experience memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) to realize interaction with the agent; go to step 2.
  • Step 2: With probability ε the agent randomly selects a behavior a_t, otherwise it selects the current optimal behavior a_t according to the state value after executing each candidate behavior; after executing the selected behavior it obtains the reward r_{t+1} and the next state s_{t+1}.
  • Record the state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state is reached as the single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end); store the obtained single-step state transition in the memory D and store its priority, computed from the TD-error, in the priority queue P; go to step 3.
  • Step 3: Determine whether the number of single-step state transitions in the memory D has reached the set threshold Batch_Size; if so, go to step 4, otherwise repeat step 2.
  • Step 4: Randomly extract a number of single-step state transitions from D, use the next state and the reward obtained by executing the corresponding behavior to calculate the target value of the current state, calculate the mean-square-error cost between the target value and the network output value, update the parameters with the mini-batch gradient descent algorithm, and go to step 5.
  • Step 5: Determine whether the agent has reached the terminal state; if so, go to step 6, otherwise repeat step 2.
  • Step 6: Determine whether the scheduling system has experienced Max_Episode complete state-transition sequences; if so, go to step 7; if not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat step 2.
  • Step 7: Output the behavior strategy combination a_1, a_2, … corresponding to the optimal state sequence.
  • Compared with the prior art, the present invention has the following significant advantages:
  • (1) The present invention proposes a deep reinforcement learning algorithm based on TD learning, which adopts a convolutional neural network with a dual-network structure, separates action selection from value estimation, and exploits the deep convolution computation of the CNN, thereby effectively avoiding overestimation.
  • (2) The present invention designs an algorithm model based on state-value updates to handle the multi-dimensional discrete behavior space, so that it can solve the hybrid flow-shop scheduling problem. Shallow-sampling TD learning is used to estimate the state value; it does not depend on a complete state sequence and selects the optimal action through look-ahead trials, which is more consistent with the actual scheduling process in principle and more suitable for large-scale or dynamic problems.
  • (3) The present invention introduces a stochastic prioritized sampling method when selecting training samples, which effectively alleviates the frequent high errors and overfitting caused by greedy prioritization during function approximation.
  • FIG. 1 is a comparison of the network structures and fitted value functions of the proposed CTDN algorithm and the DQN algorithm.
  • FIG. 2 is an operation model diagram of the CTDN algorithm in a hybrid flow shop of scale 4×4×3.
  • FIG. 3 is a structural diagram of the convolutional neural network used in the present invention.
  • FIG. 4 is the optimal scheduling Gantt chart for the small-scale problem.
  • FIG. 5 is the Gantt chart of instance tai_20_10_2.
  • FIG. 6 is the iteration curve of instance tai_20_10_2.
  • FIG. 7 is a flow chart of the temporal difference-based hybrid flow-shop scheduling method of the present invention.
  • Step 1: Obtain the production constraints and objective function according to the production characteristics of the hybrid flow shop, introduce the machine state features, build the hybrid flow-shop scheduling environment and initialize it, initialize the experience memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) to realize interaction with the agent; go to step 2.
  • The objective function of the scheduling system described in step 1 is to minimize the weighted average completion time.
  • w_j is the weight value of workpiece j, i.e. the priority of the order.
  • c_j is the completion time of workpiece j.
  • The average completion time index can be used to measure the inventory level of intermediate products and the processing cycle of a batch of workpieces, which is of great practical significance to enterprises.
  • The machine state features described in step 1 are defined as shown in the table; by introducing appropriate parameters, selecting features that properly describe the state, and constructing functions to approximate the state, the features characterize the machine and workpiece information in a given state.
  • The k-th feature of the i-th machine M_i in the hybrid flow shop is denoted f_{i,k}, and l denotes the total number of stages.
  • For machines belonging to the first l-1 stages, 13 real-valued features f_{i,k} (1≤k≤13) are defined; for the machines belonging to the l-th stage, 9 real-valued features f_{i,k} (1≤k≤9) are defined; the defined state feature set jointly reveals the global and local information of the environment, as shown in Table 3.
  • i represents the ith machine
  • q represents the qth process
  • m represents the total number of machines
  • l represents the total number of processes
  • Q_q represents the waiting queue of the q-th stage
  • n represents the number of workpieces waiting to be processed in the q-th stage
  • p_q represents the average processing time of all workpieces waiting to be processed in the q-th stage
  • p_{q,j} represents the processing time of the j-th workpiece in the q-th stage.
  • State feature 1 characterizes the distribution of workpieces over the stages of the production line; state feature 2 characterizes the current workload of the equipment in each stage; state feature 3 characterizes the total amount of work that the machines in each stage still have to complete from the current moment; state features 4 and 5 describe the extreme values of the processing times of the operations in the current waiting queues; state feature 6 represents the elapsed processing time of the work-in-process on the equipment, thereby characterizing whether the equipment is running or idle and the processing progress of the workpiece; state features 7 and 8 represent the extreme values of the remaining completion time in the workpiece waiting queue; state feature 9 characterizes the utilization rate of each machine from the start of processing to the current moment; state features 10 and 11 represent the extreme values of the ratio of a queued workpiece's processing time in the current stage to its processing time in the next stage; state features 12 and 13 represent the extreme values of the processing time required by the subsequent operations of the queued workpieces.
  • Step 2: With probability ε the agent randomly selects a behavior a_t, otherwise it selects the current optimal behavior a_t according to the state value after executing each candidate behavior; after executing the selected behavior it obtains the reward r_{t+1} and the next state s_{t+1}.
  • Record the state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state is reached as the single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end); store the obtained single-step state transition in the memory D and store its priority, computed from the TD-error, in the priority queue P; go to step 3.
  • Step 2 is specifically as follows:
  • Step 21: To guarantee continuous exploration, the ε-greedy strategy is adopted; a small ε value is set, and with probability 1-ε a greedy selection is made from the current set of optional behaviors.
  • The greedy selection picks the behavior that maximizes the sum of the reward obtained by performing the behavior and the state value of the resulting next state given by the state-value convolutional neural network, where A(s) is the set of optional behaviors, γ is the decay coefficient, φ_{i+1} denotes the state features of the state reached by executing behavior a, and V(φ_{i+1}) denotes the state value of that next state obtained from the state-value network; with probability ε a behavior is selected randomly from the set of all optional behaviors.
  • Step 22: If the scheduling system needs to assign workpieces to several stages at the current moment, then after a behavior is selected for one stage according to step 21, the scheduling system tentatively executes that behavior in a look-ahead manner and transitions to a temporary state; step 21 is repeated to select behaviors for the remaining machines until all selections are completed, so the behavior executed by the scheduling system in the current state is a multi-dimensional behavior.
  • The definition of the reward R in step 21 is directly or indirectly related to the objective function of the scheduling system.
  • The scheduling objective adopted by the present invention is to minimize the weighted average completion time, and the agent obtains a larger reward for a shorter weighted average completion time.
  • An indicator function δ_j(τ) representing the workpiece state is defined as follows:
  • The reward function is defined as follows:
  • num is the total number of workpieces
  • w_j is the weight value of workpiece j
  • t is the time node of the scheduling system.
  • r_u represents the weighted time (the sum of waiting time and processing time) accumulated by the workpieces between two adjacent decision points (the (u-1)-th decision point and the u-th decision point).
  • The reward function has the property that minimizing the objective function is equivalent to maximizing the cumulative reward R obtained over a complete state sequence.
  • C_j represents the total completion time of the j-th workpiece; it can be seen from the formula that the smaller the weighted average completion time, the larger the total reward, so the reward function defined above directly links the reward to the scheduling objective and directly reflects the long-term impact of behavior on the objective function.
  • The set of optional machine behaviors in step 21 is defined for each machine on the basis of simple constructive heuristics; using priority dispatching rules within reinforcement learning can overcome their short-sighted nature. Both state-dependent and state-independent behaviors should be adopted, to take full advantage of existing scheduling rules and theory and of the agent's ability to learn from experience. The present invention therefore selects 13 behaviors commonly used for minimizing the weighted completion time objective, as shown in Table 4.
  • The scheduling problem studied in the present invention is the identical parallel machine scheduling problem, that is, all machines in a parallel-machine stage have the same processing time for the same workpiece, so under ideal conditions the choice of idle machine does not affect the processing cycle of the workpiece.
  • To balance machine utilization, idle machines are selected according to the principle of minimum machine load in the bottleneck stage.
  • Behavior a_14 selects the machine with the shortest total machining time among the parallel machines to process the workpiece.
  • I is the set of idle machines in the stage
  • J is the set of workpieces already processed by machine M_i.
  • The set of behaviors that a machine belonging to the first l-1 stages can take is {a_k | 1≤k≤13}
  • The set of behaviors that the machine belonging to the l-th stage can take is {a_k | 1≤k≤8, 13}
  • For a stage with parallel machines, if it is not the last stage the set of behaviors taken by the scheduling system is {(a_14, a_k) | 1≤k≤13}, and if it is the last stage it is {(a_14, a_k) | 1≤k≤8, 13}; idle machines that are not selected continue to take behavior a_13.
  • Step 3: Determine whether the number of single-step state transitions in the memory D has reached the set threshold Batch_Size; if so, go to step 4, otherwise repeat step 2.
  • Step 4: Randomly extract a number of single-step state transitions from D, use the next state and the reward obtained by executing the corresponding behavior to calculate the target value of the current state, calculate the mean-square-error cost between the target value and the network output value, update the parameters with the mini-batch gradient descent algorithm, and go to step 5.
  • Step 4 is specifically as follows:
  • Step 41: Extract a number of single-step state transitions from D according to the proportional weights calculated from the TD-error, and compute the current target value y_i = r_{i+1} + γV(φ_{i+1}; θ⁻), where y_i is the target value of the current state, γ is the decay coefficient, r_{i+1} is the reward of the behavior within the single-step state transition, φ_{i+1} is the state feature of the next state s_{t+1} within the single-step state transition, and V(φ_{i+1}; θ⁻) is the state value of the next state obtained from the target network.
  • Step 42: Calculate the mean-square-error cost between the target value and the network output value, where loss is the required mean-square-error cost, h is Batch_Size, y_i is the target value obtained above, and V(φ_i; θ) is the state value of the current state estimated by the state-value network; the mini-batch gradient descent algorithm is used to update the network parameters and the priority queue.
  • Step 43: Use the mini-batch gradient descent algorithm to update the state-value network parameters, and replace the target network parameters every T steps.
  • In step 41, when sampling according to the prioritized-replay probability distribution, the sampling proportions are first calculated from the priorities, where p_i is the priority of a sample and h is Batch_Size; Batch_Size samples are then drawn at random from D according to these proportional weights.
  • Step 5: Determine whether the agent has reached the terminal state; if so, go to step 6, otherwise repeat step 2.
  • Step 6: Determine whether the scheduling system has experienced Max_Episode complete state-transition sequences.
  • If not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat step 2.
  • Step 7: Output the behavior strategy combination a_1, a_2, … corresponding to the optimal state sequence.
  • The DQN algorithm has several nodes in the output layer of its deep neural network, each node directly corresponding to the value of one behavior; such one-dimensional behavior values cannot express a multi-dimensional behavior space, and the off-policy Q-learning it uses replaces the actual interaction value with the optimal behavior value during evaluation, which easily leads to overestimation. Therefore, TD learning is used instead of Q-learning, and behavior values are computed indirectly from state values, which suits a multi-dimensional behavior space.
  • The deep BP neural network is replaced by a convolutional neural network; the CNN weight-sharing strategy reduces the number of parameters to be trained, and the pooling operation reduces the spatial resolution of the network, eliminating small shifts and distortions of the signal, so the requirement on translation invariance of the input data is low. The difference between the two algorithms lies in the network structure and the value function being fitted.
  • the triangle in the figure represents the workpiece
  • the cuboid represents the machine
  • the rectangle represents the waiting queue before each process.
  • In the initial state s_0, all machines are idle and all workpieces are in the waiting queue Q_1 of the first stage.
  • After the system starts running, the machine of the first stage selects a behavior a_k, i.e. selects a workpiece from this stage's waiting queue to process, while the machines of the other stages select behavior a_13 because their waiting queues are empty.
  • When a machine finishes processing a workpiece, the system transfers to a new state s_t and a state transition is triggered; the system selects a feasible behavior for each machine, and when another machine completes its processing the system transfers to the next state s_{t+1} and the agent obtains the reward r_{t+1}.
  • When a workpiece enters a parallel-machine stage, the system selects the workpiece from the waiting queue according to the current state and selects a machine for processing from the stage's idle-machine queue. Since every machine chooses a behavior to execute at each decision point, the system actually executes in each state a multi-dimensional behavior (a_1, a_2, …, a_m) composed of m sub-behaviors. When the system reaches the terminal state, every waiting queue is empty, i.e. all workpieces have been processed, and the system obtains a scheduling plan.
  • The structure of the convolutional neural network is shown in FIG. 3, and the network parameters adopt a random initialization strategy.
  • For the small-scale test, a 10×8×6 scheduling problem is taken as an example to verify the feasibility of the algorithm.
  • The example includes 10 workpieces and 8 machines; each workpiece must go through 6 production stages, and the third and fifth stages each contain two identical parallel machines available for scheduling.
  • The specific data of this example are shown in Table 5.
  • The baseline workpiece priority weight is 1.
  • The priority weight coefficients of Job3, Job5 and Job8 are randomly assigned different values, namely 1.2, 1.5 and 1.3 respectively, to test the effect of the weights on the scheduling result.
  • Table 6 shows the results of solving the example with the algorithm of the present invention and several traditional algorithms, with the better solutions shown in bold. It can be seen from Table 6 that the algorithm of the present invention obtains a better solution than the traditional algorithms; the Gantt chart corresponding to the solution is shown in FIG. 4, in which the red vertical lines represent the decision nodes of the scheduling system. Compared with the IDE algorithm and the HOMA algorithm, the optimal solution of this algorithm is improved by 4.3% and 3.9% respectively.
  • Job5, Job8 and Job3, which have higher workpiece priorities, are processed first, and workpieces with higher priority are generally processed earlier, which shows that the reward function defined above reflects the objective function.
  • The present invention randomly selects 15 instances from the OR_Library instance set for experimental testing and compares the results with the Migratory Birds Optimization (MBO) algorithm and other comparison algorithms, as shown in Table 7, with the better results shown in bold.
  • The CTDN algorithm proposed by the present invention obtains better solutions, and the solutions of some instances are already below the upper bounds of the original instances.
  • The deep neural network takes a certain amount of time to train, but the trained network can obtain the optimal behavior in a very short time from the input state values.
  • FIG. 5 is the Gantt chart corresponding to the optimal strategy obtained for instance tai_20_10_2 by the algorithm of the present invention.
  • The red vertical dotted lines in the figure represent the scheduling decision points, i.e. the time points at which workpieces complete their operations.
  • FIG. 6 shows the variation of the weighted average completion time with training progress for instance tai_20_10_2. As can be seen from the trend in the figure, the scheduling objective value gradually decreases as the episodes proceed. At the beginning the agent is in a completely unfamiliar environment and keeps learning through trial and error with autonomous random behavior selection, moving toward the goal; within 900 iterations a better solution can be obtained.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)

Abstract

Disclosed is a temporal difference-based deep reinforcement learning algorithm, which is used to solve a hybrid flow-shop scheduling problem with related parallel machines. The algorithm combines a convolutional neural network with TD learning in reinforcement learning, and performs behavior selection according to the inputted state features so as to better match the scheduling decision-making process of an actual order-response production and fabrication system. The scheduling problem is transformed into a multi-stage decision-making problem; a convolutional neural network model is used to fit the state-value function, the processing-state feature data of the fabrication system are inputted into the model, the temporal difference method is used to train the model, heuristic algorithms or allocation rules are used as candidate scheduling-decision behaviors, and an optimal combined behavior strategy is selected for each scheduling decision in combination with the reinforcement learning reward-and-punishment mechanism. Compared with the prior art, the algorithm proposed by the present invention has the advantages of strong real-time performance and high flexibility.

Description

A temporal difference-based hybrid flow-shop scheduling method

Technical Field

The invention belongs to hybrid flow-shop scheduling control technology, and in particular relates to a temporal difference-based hybrid flow-shop scheduling method.

Background Art

The hybrid flow-shop scheduling problem (HFSP), also known as the flexible flow-shop scheduling problem, was first proposed by Salvador in 1973. The problem can be regarded as a combination of the classical flow-shop scheduling problem and the parallel-machine scheduling problem; it is characterized by parallel-machine stages in the processing route, so machine allocation must be performed while the workpiece processing sequence is determined. In the HFSP, the number of processors in at least one stage is greater than 1, which greatly increases the difficulty of solving the problem; even the two-stage HFSP with 2 and 1 processors has been proved to be NP-hard.

At present, exact algorithms, heuristics and meta-heuristics are the three classical families of methods for solving flow-shop scheduling problems. Exact algorithms, including mathematical programming and branch-and-bound methods, can obtain optimal solutions for small-scale problems; for large-scale practical scheduling problems, heuristic and meta-heuristic algorithms have attracted researchers' attention because they can obtain near-optimal solutions in a relatively short time. However, heuristic and meta-heuristic algorithms design rules and algorithms for specific instances and are not suited to complex and changeable actual production environments. Reinforcement learning algorithms can generate scheduling policies that adapt to the actual production state. Wei Y and Zhao M used Q-learning to select combined dispatching rules in the job shop by defining a "production pressure" feature and two-step scheduling rules, but the tabular reinforcement learning model adopted by this method cannot describe the actual complex processing process. Zhang Zhicong and Zheng Li defined 15 state features for each machine and used the TD method to train a linear state-value function approximator to solve the NPFS problem, but a linear function approximator has limited fitting and generalization capability.
Summarizing and analyzing the existing research results, research on the hybrid flow-shop scheduling problem mainly has the following problems:

(1) Traditional scheduling algorithms cannot effectively use historical data for learning, and their poor real-time performance makes it difficult to cope with large-scale, complex and changeable actual production scheduling environments.

(2) Although research on the traditional HFSP is mature, there is little research on applying reinforcement learning to the hybrid flow-shop problem, and existing work has difficulty characterizing the processing environment and suffers from the limited capacity of function approximators.

(3) Deep reinforcement learning can overcome the limited capacity of function approximators; the weight-sharing strategy of the convolutional neural network reduces the number of parameters to be trained, and the shared weights allow a filter to detect a signal regardless of its position, which gives the trained model stronger generalization ability. However, there is little research at home and abroad on deep reinforcement learning algorithms for shop scheduling problems.

Summary of the Invention

The purpose of the present invention is to provide a temporal difference-based hybrid flow-shop scheduling method to solve the hybrid flow-shop scheduling problem with related parallel machines.

The technical solution for achieving the purpose of the present invention is as follows: the temporal difference-based hybrid flow-shop scheduling method described in the present invention takes minimizing the weighted average completion time as the scheduling goal, combines a neural network with reinforcement learning, and trains the model with the temporal difference method. The model uses existing scheduling knowledge and empirical rules to refine the candidate scheduling-decision behaviors, and combines the online evaluation-execution mechanism of reinforcement learning to select the optimal combined behavior strategy for each scheduling decision of the scheduling system. The method specifically includes the following steps:
Step 1: Obtain the production constraints and objective function according to the production characteristics of the hybrid flow shop, introduce the machine state features, build the hybrid flow-shop scheduling environment and initialize it, initialize the experience memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) to realize interaction with the agent; go to step 2.

Step 2: With probability ε the agent randomly selects a behavior a_t, otherwise it selects the current optimal behavior a_t according to the state value after executing each candidate behavior; after executing the selected behavior it obtains the reward r_{t+1} and the next state s_{t+1}. Record the state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state is reached as the single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end); store the obtained single-step state transition in the memory D and store its priority, computed from the TD-error, in the priority queue P; go to step 3.

Step 3: Determine whether the number of single-step state transitions in the memory D has reached the set threshold Batch_Size:

If the set threshold Batch_Size is reached, go to step 4;

If the set threshold Batch_Size is not reached, repeat step 2;

Step 4: Randomly extract a number of single-step state transitions from D, use the next state and the reward obtained by executing the corresponding behavior to calculate the target value of the current state, calculate the mean-square-error cost between the target value and the network output value, update the parameters with the mini-batch gradient descent algorithm, and go to step 5;

Step 5: Determine whether the agent has reached the terminal state; if so, go to step 6; if not, repeat step 2;

Step 6: Determine whether the scheduling system has experienced Max_Episode complete state-transition sequences:

If so, go to step 7;

If not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat step 2;

Step 7: Output the behavior strategy combination a_1, a_2, … corresponding to the optimal state sequence.
Compared with the prior art, the present invention has the following significant advantages:

(1) The present invention proposes a deep reinforcement learning algorithm based on TD learning, which adopts a convolutional neural network with a dual-network structure, separates action selection from value estimation, and exploits the deep convolution computation of the CNN, thereby effectively avoiding overestimation.

(2) After reinforcement learning is applied to the hybrid flow-shop scheduling problem, the behavior space becomes a multi-dimensional discrete space, so it is no longer appropriate to use Q-learning based on a one-dimensional discrete behavior-value function. The present invention therefore designs an algorithm model based on state-value updates to handle the multi-dimensional discrete space, so that it can solve the hybrid flow-shop scheduling problem. Shallow-sampling TD learning is used to estimate the state value; it does not depend on a complete state sequence and selects the optimal action through look-ahead trials, which is more consistent with the actual scheduling process in principle and more suitable for large-scale or dynamic problems.

(3) The present invention introduces a stochastic prioritized sampling method when selecting training samples, which effectively alleviates the frequent high errors and overfitting caused by greedy prioritization during function approximation.
Brief Description of the Drawings

FIG. 1 is a comparison of the network structures and fitted value functions of the proposed CTDN algorithm and the DQN algorithm.

FIG. 2 is an operation model diagram of the CTDN algorithm in a hybrid flow shop of scale 4×4×3.

FIG. 3 is a structural diagram of the convolutional neural network used in the present invention.

FIG. 4 is the optimal scheduling Gantt chart for the small-scale problem.

FIG. 5 is the Gantt chart of instance tai_20_10_2.

FIG. 6 is the iteration curve of instance tai_20_10_2.

FIG. 7 is a flow chart of the temporal difference-based hybrid flow-shop scheduling method of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

With reference to FIG. 7, the steps of the temporal difference-based hybrid flow-shop scheduling method of the present invention are as follows:
Step 1: Obtain the production constraints and objective function according to the production characteristics of the hybrid flow shop, introduce the machine state features, build the hybrid flow-shop scheduling environment and initialize it, initialize the experience memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) to realize interaction with the agent; go to step 2.

Further, the objective function of the scheduling system described in step 1 is to minimize the weighted average completion time:

$$\min f=\frac{1}{num}\sum_{j=1}^{num} w_j c_j$$

where w_j is the weight value of workpiece j, i.e. the priority of the order, and c_j is the completion time of workpiece j. The average completion time index can be used to measure the inventory level of intermediate products and the processing cycle of a batch of workpieces, which is of great practical significance to enterprises.
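As a minimal sketch of this objective (not part of the patent text; the normalization by the number of workpieces follows the definition above), the weighted average completion time of a finished schedule can be computed as follows:

```python
# Hypothetical helper: computes the weighted average completion time of a schedule.
# completion_times[j] is c_j and weights[j] is w_j for workpiece j.
def weighted_average_completion_time(completion_times, weights):
    assert len(completion_times) == len(weights)
    num = len(completion_times)
    return sum(w * c for w, c in zip(weights, completion_times)) / num

# Example: three workpieces with order-priority weights.
print(weighted_average_completion_time([12.0, 20.0, 17.0], [1.0, 1.2, 1.5]))  # 20.5
```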
Further, the machine state features described in step 1 are defined as shown in Table 3. By introducing appropriate parameters, selecting features that properly describe the state, and constructing functions to approximate the state, the features characterize the machine and workpiece information in a given state. The k-th feature of the i-th machine M_i in the hybrid flow shop is denoted f_{i,k}, and l denotes the total number of stages. For machines belonging to the first l-1 stages, 13 real-valued features f_{i,k} (1≤k≤13) are defined; for the machines belonging to the l-th stage, 9 real-valued features f_{i,k} (1≤k≤9) are defined. The defined state feature set jointly reveals the global and local information of the environment, as shown in Table 3.

The definitions of the state features are given in Table 3:

Table 3. Machine state feature definition table (the feature formulas are reproduced as images in the original publication).

The parameters used in Table 3 are explained uniformly here: i denotes the i-th machine, q denotes the q-th stage, m denotes the total number of machines, l denotes the total number of stages, Q_q denotes the waiting queue of the q-th stage, n denotes the number of workpieces waiting to be processed in the q-th stage, p_q denotes the average processing time of all workpieces waiting to be processed in the q-th stage, and p_{q,j} denotes the processing time of the j-th workpiece in the q-th stage.

State feature 1 characterizes the distribution of workpieces over the stages of the production line; state feature 2 characterizes the current workload of the equipment in each stage; state feature 3 characterizes the total amount of work that the machines in each stage still have to complete from the current moment; state features 4 and 5 describe the extreme values of the processing times of the operations in the current waiting queues; state feature 6 represents the elapsed processing time of the work-in-process on the equipment, thereby characterizing whether the equipment is running or idle and the processing progress of the workpiece; state features 7 and 8 represent the extreme values of the remaining completion time in the workpiece waiting queue; state feature 9 characterizes the utilization rate of each machine from the start of processing to the current moment; state features 10 and 11 represent the extreme values of the ratio of a queued workpiece's processing time in the current stage to its processing time in the next stage; state features 12 and 13 represent the extreme values of the processing time required by the subsequent operations of the queued workpieces.
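A minimal sketch of how such a feature vector might be assembled for one machine is shown below; the exact formulas are given in Table 3 of the original publication, so the three features computed here (current-queue workload, elapsed processing time of the work-in-process, and machine utilization) are only illustrative approximations of features 2, 6 and 9.

```python
# Illustrative (hypothetical) computation of a few machine state features.
# queue_proc_times: processing times of workpieces waiting at this machine's stage
# wip_elapsed: processing time already spent on the current work-in-process (0 if idle)
# busy_time, now: total busy time of the machine and the current clock time
def machine_features(queue_proc_times, wip_elapsed, busy_time, now):
    workload = sum(queue_proc_times)                  # ~ state feature 2
    elapsed = wip_elapsed                             # ~ state feature 6
    utilization = busy_time / now if now > 0 else 0.0 # ~ state feature 9
    return [workload, elapsed, utilization]

print(machine_features([5.0, 3.0, 7.0], 2.0, busy_time=18.0, now=25.0))
```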
Step 2: With probability ε the agent randomly selects a behavior a_t, otherwise it selects the current optimal behavior a_t according to the state value after executing each candidate behavior; after executing the selected behavior it obtains the reward r_{t+1} and the next state s_{t+1}. Record the state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state is reached as the single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end); store the obtained single-step state transition in the memory D and store its priority, computed from the TD-error, in the priority queue P; go to step 3.

Further, step 2 is specifically as follows:

Step 21: To guarantee continuous exploration, the ε-greedy strategy is adopted. A small ε value is set, and with probability 1-ε a greedy selection is made from the current set of optional behaviors, choosing the behavior for which the sum of the reward obtained by performing the behavior and the state value of the resulting next state, as computed by the state-value convolutional neural network, is maximal:

$$a_t=\arg\max_{a\in A(s)}\left[r_a+\gamma V(\varphi_{i+1})\right]$$

where A(s) is the set of optional behaviors, γ is the decay coefficient, r_a is the reward obtained by the agent for executing behavior a, φ_{i+1} denotes the state features of the state reached by executing behavior a, and V(φ_{i+1}) denotes the state value of that next state obtained from the state-value network. With probability ε a behavior is selected randomly from the set of all optional behaviors.
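A minimal sketch of this ε-greedy selection is given below; the `value_net` callable (mapping next-state features to a scalar state value) and the simulator hook `peek(state, a)` (returning the reward and next-state features of a candidate behavior) are hypothetical names introduced only for illustration.

```python
import random

def select_behavior(state, candidate_behaviors, value_net, peek, epsilon=0.1, gamma=0.95):
    """Epsilon-greedy selection over the candidate behavior set A(s)."""
    if random.random() < epsilon:
        return random.choice(candidate_behaviors)         # explore
    best_a, best_score = None, float("-inf")
    for a in candidate_behaviors:                          # greedy look-ahead evaluation
        reward, next_features = peek(state, a)             # r_a and phi_{i+1}
        score = reward + gamma * value_net(next_features)  # r_a + gamma * V(phi_{i+1})
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```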
Step 22: If the scheduling system needs to assign workpieces to several stages at the current moment, then after a behavior is selected for one stage according to step 21, the scheduling system tentatively executes this behavior in a look-ahead manner and the system state transitions to a temporary state; step 21 is then repeated to select behaviors for the remaining machines until all selections are completed. The behavior executed by the scheduling system in the current state is therefore a multi-dimensional behavior.
Step 23: After the multi-dimensional behavior is obtained, the scheduling system executes it, the agent obtains the reward r_{t+1} and the next state s_{t+1}, and the single-step state transition is stored in the memory D. The TD-error is then calculated as ξ_i = R_{t+1} + γV(S_{t+1}) - V(S_t), where γ is the decay coefficient, R_{t+1} is the reward within the single-step state transition, V(S_{t+1}) is the state value of the next state and V(S_t) is the value of the current state. The priority probability is then calculated as p_i = |ξ_i| + β and stored in the priority queue P, where ξ_i is the TD-error calculated above and β is a small positive constant that ensures that special edge cases with a TD-error of 0 can still be sampled.
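A minimal sketch of the bookkeeping in step 23 follows; the replay memory is shown as a plain list and the priority queue as a parallel list, and `value_net` is assumed to be a callable returning the scalar state value — these are implementation assumptions, not the patent's prescribed data structures.

```python
def store_transition(memory, priorities, value_net, phi_t, reward, phi_next, is_end,
                     gamma=0.95, beta=1e-4):
    # TD-error: xi = R_{t+1} + gamma * V(S_{t+1}) - V(S_t); terminal states get no bootstrap term.
    v_next = 0.0 if is_end else value_net(phi_next)
    td_error = reward + gamma * v_next - value_net(phi_t)
    memory.append((phi_t, reward, phi_next, is_end))
    priorities.append(abs(td_error) + beta)  # p_i = |xi_i| + beta
```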
The definition of the reward R in step 21 is directly or indirectly related to the objective function of the scheduling system. In order for the scheduling system to respond to the urgency of orders, the scheduling objective adopted by the present invention is to minimize the weighted average completion time, and the agent obtains a larger reward for a shorter weighted average completion time.
Considering that the weighted average completion time is closely related to the workpiece states, an indicator function δ_j(τ) representing the state of workpiece j is defined as follows:

$$\delta_j(\tau)=\begin{cases}1, & \text{workpiece } j \text{ is unfinished at time } \tau\\ 0, & \text{otherwise}\end{cases}$$

The reward function is defined as follows: the reward received at the u-th decision point is

$$R_u=-r_u=-\sum_{j=1}^{num} w_j\,\delta_j(t)\,(t_u-t_{u-1})$$

where num is the total number of workpieces, w_j is the weight value of workpiece j, and t is the time node of the scheduling system. r_u represents the weighted time (the sum of waiting time and processing time) accumulated by the workpieces between two adjacent decision points (the (u-1)-th decision point and the u-th decision point). The reward function has the property that minimizing the objective function is equivalent to maximizing the cumulative reward R obtained over a complete state sequence. The proof is as follows: summing over all decision points of an episode,

$$R=\sum_u R_u=-\sum_u\sum_{j=1}^{num} w_j\,\delta_j(t)\,(t_u-t_{u-1})=-\sum_{j=1}^{num} w_j C_j$$

where C_j represents the total completion time of the j-th workpiece. It can be seen from the formula that the smaller the weighted average completion time, the larger the total reward. Therefore, the reward function defined above directly links the reward to the scheduling objective and directly reflects the long-term impact of behavior on the objective function.
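A minimal sketch of how this reward could be accumulated between two adjacent decision points is shown below; it is an interpretation of the definition above, and the unfinished-job bookkeeping is an implementation assumption.

```python
def interval_reward(weights, unfinished, t_prev, t_now):
    """Negative weighted time accumulated by unfinished workpieces in [t_prev, t_now]."""
    return -sum(weights[j] for j in unfinished) * (t_now - t_prev)

# Example: jobs 0 and 2 are still unfinished between decision points at t=10 and t=14.
print(interval_reward({0: 1.0, 1: 1.2, 2: 1.5}, unfinished={0, 2}, t_prev=10.0, t_now=14.0))
```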
The set of optional machine behaviors in step 21 is defined as shown in Table 4. A candidate behavior set is defined for each machine on the basis of simple constructive heuristics; using priority dispatching rules within reinforcement learning can overcome their short-sighted nature. Both state-dependent and state-independent behaviors should be adopted, so as to take full advantage of existing scheduling rules and theory and of the agent's ability to learn from experience. The present invention therefore selects 13 behaviors commonly used for minimizing the weighted completion time objective, as shown in Table 4.

Table 4. Candidate behavior set for each machine (the behavior definitions are reproduced as images in the original publication).
Since some stages of the production process contain parallel machines, the definition of a behavior must consider not only which workpiece to select but also which idle machine the selected workpiece is assigned to. The scheduling problem studied in the present invention is the identical parallel machine scheduling problem, that is, all machines in a parallel-machine stage have the same processing time for the same workpiece, so under ideal conditions the choice of idle machine does not affect the processing cycle of the workpiece. In order to balance machine utilization, idle machines are selected according to the principle of minimum machine load in the bottleneck stage.

Behavior a_14 selects, among the parallel machines, the machine with the shortest total machining time to process the workpiece:

$$a_{14}:\ \arg\min_{M_i\in I}\sum_{j\in J} p_{q,j}$$

where I is the set of idle machines in the stage and J is the set of workpieces already processed by machine M_i. For a stage with only one processing machine, the set of behaviors that a machine belonging to the first l-1 stages can take is {a_k | 1≤k≤13}, and the set of behaviors that the machine belonging to the l-th stage can take is {a_k | 1≤k≤8, 13}. For a stage with parallel machines, if it is not the last stage the set of behaviors taken by the scheduling system is {(a_14, a_k) | 1≤k≤13}, and if it is the last stage it is {(a_14, a_k) | 1≤k≤8, 13}; idle machines that are not selected continue to take behavior a_13.
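A minimal sketch of behavior a_14, i.e. picking the least-loaded idle machine of a parallel-machine stage, is shown below; measuring the load as the total processing time already assigned to each machine follows the definition above.

```python
def select_idle_machine(idle_machines, processed_times):
    """Return the idle machine whose accumulated machining time is smallest (behavior a_14)."""
    # processed_times[m] = sum of processing times of workpieces already handled by machine m
    return min(idle_machines, key=lambda m: processed_times.get(m, 0.0))

print(select_idle_machine(["M3a", "M3b"], {"M3a": 42.0, "M3b": 35.5}))  # -> "M3b"
```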
Step 3: Determine whether the number of single-step state transitions in the memory D has reached the set threshold Batch_Size:

If the set threshold Batch_Size is reached, go to step 4;

If the set threshold Batch_Size is not reached, repeat step 2;

Step 4: Randomly extract a number of single-step state transitions from D, use the next state and the reward obtained by executing the corresponding behavior to calculate the target value of the current state, calculate the mean-square-error cost between the target value and the network output value, update the parameters with the mini-batch gradient descent algorithm, and go to step 5;

Further, step 4 is specifically as follows:
Step 41: Extract a number of single-step state transitions from D according to the proportional weights calculated from the TD-error, and compute the current target value using the formula

$$y_i=r_{i+1}+\gamma V(\varphi_{i+1};\theta^{-})$$

where y_i is the obtained target value of the current state, γ is the decay coefficient, r_{i+1} is the reward of the behavior within the single-step state transition, φ_{i+1} is the state feature of the next state s_{t+1} within the single-step state transition, and V(φ_{i+1}; θ⁻) is the state value of the next state obtained from the target network.
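A minimal sketch of the target computation for a sampled mini-batch follows; using the stored is_end flag to suppress the bootstrap term at terminal transitions is an implementation assumption.

```python
def td_targets(batch, target_net, gamma=0.95):
    """Compute y_i = r_{i+1} + gamma * V(phi_{i+1}; theta^-) for each sampled transition."""
    targets = []
    for phi_t, reward, phi_next, is_end in batch:
        bootstrap = 0.0 if is_end else gamma * target_net(phi_next)
        targets.append(reward + bootstrap)
    return targets
```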
步骤42:再计算目标价值与网络输出价值之间的均方差代价,
Figure PCTCN2021133905-appb-000015
其中loss为所求的均方差代价,h为Batch_Size,y i表示上面求得的当前状态价值,φ i+1表示单步状态转移内下一个状态s t+1的状态特征,V(φ i+1;θ)表示根据状态价值网络求得的下一状态的状态价值,使用小批量梯度下降算法更新网络参数与优先级队列;
Step 42: Calculate the mean square error cost between the target value and the network output value,
Figure PCTCN2021133905-appb-000015
where loss is the required mean square error cost, h is Batch_Size, y i represents the current state value obtained above, φ i+1 represents the state feature of the next state s t+1 in the single-step state transition, V(φ i +1 ; θ) represents the state value of the next state obtained according to the state value network, and uses the mini-batch gradient descent algorithm to update the network parameters and the priority queue;
Step 43: Update the parameters θ of the state-value network with the mini-batch gradient descent algorithm, and replace the target network parameters every T steps.
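For illustration only, a minimal PyTorch-style sketch of Steps 41-43 under the definitions above is given here. The class ValueNet is a small fully connected stand-in rather than the CNN of Figure 3, and all names (td_update, batch layout, the SGD optimizer standing in for mini-batch gradient descent) are assumptions, not the original implementation.

import torch
import torch.nn as nn

class ValueNet(nn.Module):
    # Illustrative stand-in for the state-value network V(phi; theta).
    def __init__(self, n_features):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, phi):
        return self.body(phi).squeeze(-1)

def td_update(value_net, target_net, optimizer, batch, gamma=0.95):
    """One mini-batch update (Steps 41-42):
    y_i = r_{i+1} + gamma * V(phi_{i+1}; theta-),  loss = mean (y_i - V(phi_i; theta))^2."""
    phi, r, phi_next, is_end = batch        # tensors sampled from memory D
    with torch.no_grad():
        y = r + gamma * target_net(phi_next) * (1.0 - is_end)  # terminal states keep only r
    loss = nn.functional.mse_loss(value_net(phi), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Mini-batch gradient descent, e.g. optimizer = torch.optim.SGD(value_net.parameters(), lr=0.02)
# Step 43: every T updates, copy the online parameters into the target network:
#     target_net.load_state_dict(value_net.state_dict())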
When Step 41 samples according to the probability distribution of prioritized replay, the sampling proportion of each transition is first computed with the formula

P(i) = p_i / Σ_k p_k

where p_i is the priority probability of sample i and h is the Batch_Size; Batch_Size samples are then drawn at random from D according to these proportion weights.
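A minimal NumPy sketch of this proportional sampling is shown below, together with the priority p_i = |ξ_i| + β defined for the TD-error in Step 23 of claim 3; the memory layout and the function names priority and sample_batch are illustrative assumptions.

import numpy as np

def priority(td_error, beta=1e-4):
    # p_i = |xi_i| + beta, with beta a small positive constant
    return abs(td_error) + beta

def sample_batch(priorities, batch_size, rng=np.random.default_rng()):
    """Draw batch_size indices from memory D with probability P(i) = p_i / sum_k p_k."""
    p = np.asarray(priorities, dtype=float)
    probs = p / p.sum()
    return rng.choice(len(p), size=batch_size, replace=False, p=probs)

# Example: transitions with larger TD-error are sampled more often.
prios = [priority(e) for e in [0.9, 0.1, 0.4, 2.0]]
print(sample_batch(prios, batch_size=2))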
Step 5: Check whether the agent has reached the terminal state; if it has, go to Step 6; if not, repeat Step 2.
Step 6: Check whether the scheduling system has experienced Max_Episode complete state-transition sequences:
If it has, go to Step 7;
If it has not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat Step 2.
Step 7: Output the behavior-policy combination a_1, a_2, … corresponding to the optimal state sequence.
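Purely as an orientation aid, the outer control flow of Steps 1-7 can be sketched as follows; env, agent and memory are hypothetical objects standing in for the scheduling environment, the TD agent and the memory bank D, and none of these identifiers comes from the original text.

def train(env, agent, memory, max_episode=1000, batch_size=256):
    """Outer loop of the scheduling method: interaction (Step 2), learning once the
    memory holds enough transitions (Steps 3-4), and termination checks (Steps 5-7)."""
    best_schedule = None
    for episode in range(max_episode):                 # Step 6 loop over episodes
        state = env.reset()                            # reset machines and workpieces
        done = False
        while not done:                                # Step 5 loop within one episode
            action = agent.epsilon_greedy(state)       # Step 2: behavior selection
            next_state, reward, done = env.step(action)
            memory.store(state, reward, next_state, done)
            if len(memory) >= batch_size:              # Step 3: threshold check
                agent.learn(memory.sample(batch_size)) # Step 4: TD update
            state = next_state
        best_schedule = env.best_schedule()            # keep the best schedule found so far
    return best_schedule                               # Step 7: output a1, a2, ...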
The invention is further described below with reference to the accompanying drawings.
As shown in Figure 1, the DQN algorithm has several nodes in the output layer of its deep neural network, each of which directly corresponds to one behavior value; such a one-dimensional output cannot express a multi-dimensional behavior space, and off-policy Q-learning, which substitutes the optimal value for the actually experienced value when evaluating behaviors, easily leads to overestimation. Therefore, TD learning is used in place of Q-learning: behavior values are computed indirectly from state values, which suits a multi-dimensional behavior space. In addition, a convolutional neural network replaces the deep BP neural network; the weight-sharing strategy of the CNN reduces the number of parameters to be trained, and the pooling operation lowers the spatial resolution of the network, eliminating small shifts and distortions of the signal, so that strict translation invariance of the input data is not required. The two approaches thus differ in their network structures and in the value functions they fit.
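To make the indirect behavior evaluation concrete, the following sketch scores each candidate behavior by r(s, a) + γ·V(φ(s′)) and takes the argmax, instead of reading one output node per action as DQN does. The helper simulate (returning the reward and the next-state features for a tentatively executed behavior) and all other names are assumptions introduced for illustration.

def greedy_action(state, candidate_actions, simulate, value_net, gamma=0.95):
    """Score each candidate a by r(s, a) + gamma * V(phi(s')), where s' is the state
    reached by tentatively executing a, then return the behavior with the best score."""
    best_a, best_score = None, float("-inf")
    for a in candidate_actions:
        reward, phi_next = simulate(state, a)          # assumed look-ahead helper
        score = reward + gamma * float(value_net(phi_next))
        if score > best_score:
            best_a, best_score = a, score
    return best_a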
To make the state-transition mechanism easier to understand, the invention takes a hybrid flow-shop scheduling problem of scale n=4, m=4, l=3 as an example to illustrate how the algorithm runs. As shown in Figure 2, the triangles represent workpieces, the cuboids represent machines, and the rectangles represent the waiting queue before each process.
At system start-up the initial state is s_0: all machines are idle and all workpieces are in the waiting queue Q_1 of the first process. Once the system runs, the machine of the first process selects an action a_k, that is, it picks a workpiece from the waiting queue of this process for processing, while the machines of the other processes select behavior a_13 because their waiting queues are empty. When a machine finishes processing a workpiece, the system moves to a new state s_t; the state transition is triggered and the system selects a feasible behavior for each machine. When another machine subsequently finishes processing, the system moves to the next state s_{t+1} and the agent receives the reward r_{t+1}. When a workpiece enters a process with parallel machines, the system selects a workpiece from the waiting queue according to the current state and selects a machine from the idle-machine queue of that process to perform the processing. Since every machine chooses one behavior to execute at each decision point, the system in fact performs, in that state, a multi-dimensional behavior (a_1, a_2, ..., a_m) composed of m sub-behaviors. When the system reaches the terminal state, every waiting queue is empty, i.e. all workpieces have been processed, and the system has obtained a scheduling plan.
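A rough sketch of how such a multi-dimensional behavior could be assembled at a decision point is given below; the look-ahead to a temporary state mirrors Step 22 of claim 3, and every identifier here is a hypothetical placeholder rather than part of the original disclosure.

def compose_multidim_action(state, machines, feasible_actions, choose, lookahead):
    """Pick one sub-behavior per machine at a decision point.

    choose(state, actions)   -> one sub-behavior (e.g. epsilon-greedy selection)
    lookahead(state, action) -> temporary state after tentatively executing it
    Machines whose waiting queue is empty fall back to the idle behavior a13.
    """
    IDLE = "a13"
    action = []
    temp_state = state
    for m in machines:
        acts = feasible_actions(temp_state, m)
        a = choose(temp_state, acts) if acts else IDLE
        action.append(a)
        temp_state = lookahead(temp_state, a)   # forward-looking temporary state
    return tuple(action)                        # the multi-dimensional behavior (a1, ..., am)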
Example
Parameter selection can affect solution quality, and some general principles can be followed. The discount factor γ weighs the contribution of subsequent state values to the total return and is therefore generally set close to 1; here γ = 0.95. In the ε-greedy strategy, ε should first be large and then decrease, so that the policy space is explored sufficiently in the initial stage and the learned optimal policy is exploited toward the end; the initial value is therefore ε = 1 with exponential decay at a rate of 0.995. The learning rate is set to α = 0.02, the maximum number of interaction episodes to MAX_EPISODE = 1000, the capacity of memory D to N = 6000, and the sampling batch to BATCH_SIZE = 256. The structure of the agent's convolutional neural network is shown in Figure 3, and the network parameters are randomly initialized.
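For concreteness, the stated hyperparameters can be collected as below; the per-episode decay schedule for ε is an assumption (the text gives the initial value and decay rate but not the exact update point), and the names CONFIG and epsilon are illustrative.

CONFIG = {
    "gamma": 0.95,          # discount factor
    "epsilon_init": 1.0,    # initial exploration rate
    "epsilon_decay": 0.995, # exponential decay rate per episode (assumed schedule)
    "alpha": 0.02,          # learning rate
    "max_episode": 1000,    # MAX_EPISODE
    "memory_size": 6000,    # capacity N of memory D
    "batch_size": 256,      # BATCH_SIZE
}

def epsilon(episode, cfg=CONFIG):
    # exploration rate after a given number of episodes
    return cfg["epsilon_init"] * cfg["epsilon_decay"] ** episode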
(1) Small-scale problem
The small-scale test takes a 10×8×6 scheduling problem as an example to verify the feasibility of the algorithm. The instance contains 10 workpieces and 8 machines, and each workpiece must pass through 6 production processes; parallel machines exist at the third and fifth processes, each with two identical devices available for scheduling. The data of this instance are given in Table 5. The baseline workpiece priority is 1; to test the influence of the workpiece priorities on the scheduling scheme, the priority weight coefficients of Job3, Job5 and Job8 are set to different values, namely 1.2, 1.5 and 1.3 respectively.
Table 5  Instance data of the 10×8×6 scheduling problem
(the instance data are provided as an image in the original publication)
The machines are distributed over the processes as {1, 2, [3,4], 5, [6,7], 8}. Table 6 compares the results of solving the instance with the algorithm of the present invention and with several traditional algorithms; the better solutions are shown in bold. As Table 6 shows, the proposed algorithm obtains a better solution than the traditional algorithms; the Gantt chart of its solution is shown in Figure 4, where the red vertical lines indicate the decision nodes of the scheduling system. Compared with the IDE algorithm and the HOMA algorithm, the optimal solution of the proposed algorithm improves efficiency by 4.3% and 3.9%, respectively.
Table 6  Comparison of results on the small-scale test instance
(the comparison data are provided as an image in the original publication)
As the figure shows, Job5, Job8 and Job3, which have higher workpiece priorities, are processed first; the higher a workpiece's priority, the earlier it is processed, which indicates that the reward function defined above does reflect the objective function.
(2) Large-scale problems
Fifteen instances were randomly selected from the [OR_Library] instance set for experimental testing and compared with the migrating birds optimization (MBO) algorithm and other comparison algorithms, as shown in Table 7; the better results are shown in bold.
Table 7  Comparison results on the large-scale instances
(the comparison data are provided as an image in the original publication)
As Table 7 shows, compared with the other algorithms, the proposed CTDN algorithm obtains better solutions, and for some instances the solution is already below the published upper bound of the original instance. The deep neural network takes some time to train, but once trained it can derive the optimal behavior from the input state values within a very short time.
Figure 5 is the Gantt chart of the optimal policy obtained for instance tai_20_10_2 with the algorithm of the present invention. The red vertical dashed lines represent the scheduling decision points, i.e. the time points at which workpieces complete each process.
Figure 6 shows how the weighted average completion time changes as training proceeds on instance tai_20_10_2. The trend in the figure shows that the scheduling objective value decreases gradually as the episodes accumulate. At the beginning the agent is in a completely unfamiliar environment and learns by trial and error through autonomous random behavior selection; as ε decays, the agent increasingly tends to take the optimal choice given by the model, so that the system keeps moving toward the objective, and a good solution is obtained within 900 iterations.
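Under the decay schedule assumed above (ε multiplied by 0.995 once per episode), the exploration rate is already negligible by episode 900, which is consistent with the curve flattening toward the end of training:

eps0, decay = 1.0, 0.995
print(eps0 * decay ** 900)   # ~0.011: the agent exploits the learned policy almost exclusively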

Claims (5)

  1. A temporal-difference-based hybrid flow-shop scheduling method, characterized in that: taking the minimization of the weighted average completion time as the scheduling objective, a neural network is combined with reinforcement learning, the model is trained by the temporal-difference method, candidate scheduling-decision behaviors are distilled from existing scheduling knowledge and empirical rules, and the online evaluation-execution mechanism of reinforcement learning is used to select the optimal combined behavior policy for each scheduling decision of the scheduling system; the method specifically comprises the following steps:
    Step 1: obtain the production constraints and the objective function from the production characteristics of the hybrid flow shop, introduce machine state features, construct the hybrid flow-shop scheduling environment and initialize it, initialize an experience memory bank D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) so as to interact with the agent; go to Step 2;
    Step 2: the agent either randomly selects a behavior a_t with probability ε or selects the currently optimal behavior a_t according to the state value obtained after executing the behavior; after executing the optimal behavior it obtains the reward r_{t+1} and the next state s_{t+1}; the state features of the current state, the reward r_{t+1} obtained by executing the behavior, the state features of the next state s_{t+1}, and whether the terminal state has been reached are jointly recorded as a single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end); the obtained single-step state transition is stored in the memory bank D and, according to the proportion computed from the TD-error, in the priority queue P; go to Step 3;
    Step 3: determine whether the number of single-step state transitions in the memory bank D has reached the preset threshold Batch_Size:
    if the threshold Batch_Size has been reached, go to Step 4;
    if the threshold Batch_Size has not been reached, repeat Step 2;
    Step 4: randomly extract a number of single-step state transitions from the memory bank D, compute the target value of the current state from the next state and the reward obtained by executing the corresponding behavior, compute the mean-squared-error cost between the target value and the value output by the network, update the parameters with the mini-batch gradient descent algorithm, and go to Step 5;
    Step 5: determine whether the agent has reached the terminal state; if it has, go to Step 6; if not, repeat Step 2;
    Step 6: determine whether the scheduling system has experienced Max_Episode complete state-transition sequences:
    if it has, go to Step 7;
    if it has not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat Step 2;
    Step 7: output the behavior-policy combination a_1, a_2, … corresponding to the optimal state sequence.
  2. The temporal-difference-based hybrid flow-shop scheduling method according to claim 1, characterized in that, in said Step 1, the machine state features are as follows:
    the k-th feature of the i-th machine M_i in the hybrid flow shop is denoted f_{i,k}, and l denotes the total number of processes; for the machines of the first l−1 processes, 13 real-valued features f_{i,k} with 1 ≤ k ≤ 13 are defined, and for the machine of the l-th process, 9 real-valued features f_{i,k} with 1 ≤ k ≤ 9 are defined; the defined state feature set jointly reveals the global and local information of the environment;
    the state features are defined in Table 1:
    Table 1  Machine state feature definition table
    (the content of Table 1 is provided as an image in the original publication)
    The parameters used in the table are explained uniformly here: q denotes the q-th process, m the total number of machines, l the total number of processes, Q_q the waiting queue of the q-th process, n the number of workpieces to be processed in the q-th process, p_q the average processing time of all workpieces to be processed in the q-th process, p_{q,j} the processing time of the j-th workpiece in the q-th process, and J_j a workpiece in the waiting queue Q_q.
  3. The temporal-difference-based hybrid flow-shop scheduling method according to claim 1, characterized in that, in Step 2, the agent either randomly selects a behavior a_t with probability ε or selects the currently optimal behavior a_t according to the state value obtained after executing the behavior, obtains the reward r_{t+1} and the next state s_{t+1} after executing the optimal behavior, records the state features of the current state, the reward r_{t+1} obtained by executing the behavior, the state features of the next state s_{t+1}, and whether the terminal state has been reached as a single-step state transition (φ_t, r_{t+1}, φ_{t+1}, is_end), stores the obtained single-step state transition in the memory bank D, and stores the proportion computed from the TD-error in the priority queue P, specifically as follows:
    Step 21: adopt the ε-greedy strategy; with a small value of ε, with probability 1−ε greedily select, within the current set of optional behaviors, the behavior for which the sum of the reward obtained by executing the behavior and the state value of the next state computed by the state-value convolutional neural network is maximal,

    a_t = argmax_{a∈A(s)} [ r(s, a) + γ·V(φ_{i+1}) ]

    where A(s) is the set of optional behaviors, γ is the decay coefficient, r(s, a) is the reward obtained by the agent for executing behavior a, φ_{i+1} is the state feature of the state reached by executing behavior a, and V(φ_{i+1}) is the state value of the next state computed by the state-value network; with probability ε, a behavior is instead selected at random from the whole set of optional behaviors;
    Step 22: if at the current moment the scheduling system needs to designate workpieces for processing for several processes, then after a behavior has been selected for one process according to Step 21, the scheduling system tentatively executes this behavior in a forward-looking manner, so that the system state moves to a temporary state, and Step 21 is repeated to select behaviors for the remaining machines until all behaviors have been selected; the behavior executed by the scheduling system in the current state is then a multi-dimensional behavior;
    Step 23: after the multi-dimensional behavior has been obtained, the scheduling system executes it, the agent obtains the reward r_{t+1} and the next state s_{t+1}, and the single-step state transition is stored in the memory bank D; the TD-error ξ_i = R_{t+1} + γV(S_{t+1}) − V(S_t) is computed, where γ is the decay coefficient, R_{t+1} is the reward within the single-step state transition, V(S_{t+1}) is the state value of the next state and V(S_t) is the value of the current state; the priority probability p_i = |ξ_i| + β is computed and stored in the priority queue P, where β is a small positive constant.
  4. The temporal-difference-based hybrid flow-shop scheduling method according to claim 3, characterized in that the above set of optional behaviors is defined as in Table 2:
    Table 2  Candidate behavior set for each machine
    (the content of Table 2 is provided as an image in the original publication)
    Since parallel machines exist at some processes of the production flow, idle machines are selected according to the principle of minimal machine load at the bottleneck process in order to balance machine utilization;
    Behavior 14: among the parallel machines, select the machine with the shortest accumulated processing time to process the workpiece,

    M* = argmin_{i∈I} Σ_{j∈J} p_{i,j}

    where I is the set of idle machines in the process, J is the set of workpieces already processed by machine M_i, and p_{i,j} is the processing time of workpiece j on machine i.
  5. The temporal-difference-based hybrid flow-shop scheduling method according to claim 1, characterized in that, in Step 4, a number of single-step state transitions are randomly extracted from D, the target value of the current state is computed from the next state and the reward obtained by executing the corresponding behavior, the mean-squared-error cost between the target value and the value output by the network is computed, and the parameters are updated with the mini-batch gradient descent algorithm, specifically as follows:
    Step 41: extract a number of single-step state transitions from D according to the proportional weights computed from the TD-error, and compute the current target value y_i with the formula

    y_i = r_{i+1} + γ·V(φ_{i+1}; θ⁻)

    where γ is the decay coefficient, r_{i+1} is the reward of the behavior within the single-step state transition, φ_{i+1} is the state feature of the next state s_{t+1} within the single-step state transition, and V(φ_{i+1}; θ⁻) is the state value of the next state computed by the target network;
    Step 42: then compute the mean-squared-error cost loss between the target value and the value output by the network,

    loss = (1/h)·Σ_{i=1}^{h} (y_i − V(φ_i; θ))²

    where h is Batch_Size;
    Step 43: update the parameters θ of the state-value network with the mini-batch gradient descent algorithm, and replace the target network parameters every T steps.
PCT/CN2021/133905 2020-12-25 2021-11-29 Temporal difference-based hybrid flow-shop scheduling method WO2022135066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011568657.XA CN112734172B (en) 2020-12-25 2020-12-25 Hybrid flow shop scheduling method based on time sequence difference
CN202011568657.X 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022135066A1 true WO2022135066A1 (en) 2022-06-30

Family

ID=75616847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133905 WO2022135066A1 (en) 2020-12-25 2021-11-29 Temporal difference-based hybrid flow-shop scheduling method

Country Status (2)

Country Link
CN (1) CN112734172B (en)
WO (1) WO2022135066A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734172B (en) * 2020-12-25 2022-04-01 南京理工大学 Hybrid flow shop scheduling method based on time sequence difference
CN113406939A (en) * 2021-07-12 2021-09-17 哈尔滨理工大学 Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
CN113515097B (en) * 2021-07-23 2022-08-19 合肥工业大学 Two-target single machine batch scheduling method based on deep reinforcement learning
CN113759841B (en) * 2021-08-26 2024-01-12 山东师范大学 Multi-objective optimized machine tool flexible workshop scheduling method and system
CN114219274B (en) * 2021-12-13 2024-08-02 南京理工大学 Workshop scheduling method based on deep reinforcement learning and adapted to machine state
CN114580937B (en) * 2022-03-10 2023-04-28 暨南大学 Intelligent job scheduling system based on reinforcement learning and attention mechanism
CN114625089B (en) * 2022-03-15 2024-05-03 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114862170B (en) * 2022-04-27 2024-04-19 昆明理工大学 Learning type intelligent scheduling method and system for manufacturing process of communication equipment
CN115793583B (en) * 2022-12-02 2024-06-25 福州大学 New order insertion optimization method for flow shop based on deep reinforcement learning
CN116050803B (en) * 2023-02-27 2023-07-25 湘南学院 Dynamic scheduling method for automatic sorting of customized furniture plates
CN116414093B (en) * 2023-04-13 2024-01-16 暨南大学 Workshop production method based on Internet of things system and reinforcement learning
CN117669988B (en) * 2023-12-26 2024-07-05 中建八局第一数字科技有限公司 Q-Learning algorithm improvement NEH-based prefabricated part production scheduling method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109270904A (en) * 2018-10-22 2019-01-25 中车青岛四方机车车辆股份有限公司 A kind of flexible job shop batch dynamic dispatching optimization method
CN110163409B (en) * 2019-04-08 2021-05-18 华中科技大学 Convolutional neural network scheduling method applied to replacement flow shop
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200257968A1 (en) * 2019-02-08 2020-08-13 Adobe Inc. Self-learning scheduler for application orchestration on shared compute cluster
CN111862579A (en) * 2020-06-10 2020-10-30 深圳大学 Taxi scheduling method and system based on deep reinforcement learning
CN112734172A (en) * 2020-12-25 2021-04-30 南京理工大学 Hybrid flow shop scheduling method based on time sequence difference

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TOM SCHAUL, QUAN JOHN, ANTONOGLOU IOANNIS, SILVER DAVID: "Prioritized experience replay", ARXIV:1511.05952V4, 25 February 2016 (2016-02-25), XP055348246, Retrieved from the Internet <URL:https://arxiv.org/abs/1511.05952v4> [retrieved on 20170221] *
XIAO PENGFEI: "Study on Non-permutation Flow Shop Scheduling Problem Based on Deep Temporal Difference Reinforcement Learning Network", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, no. 3, 15 March 2020 (2020-03-15), CN , XP055945790, ISSN: 1674-0246 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115333143B (en) * 2022-07-08 2024-05-07 国网黑龙江省电力有限公司大庆供电公司 Deep learning multi-agent micro-grid cooperative control method based on double neural networks
CN115333143A (en) * 2022-07-08 2022-11-11 国网黑龙江省电力有限公司大庆供电公司 Deep learning multi-agent micro-grid cooperative control method based on double neural networks
CN115469612A (en) * 2022-08-30 2022-12-13 安徽工程大学 Dynamic scheduling method and system for job shop
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115361301A (en) * 2022-10-09 2022-11-18 之江实验室 Distributed computing network cooperative traffic scheduling system and method based on DQN
US12021751B2 (en) 2022-10-09 2024-06-25 Zhejiang Lab DQN-based distributed computing network coordinate flow scheduling system and method
CN115719108A (en) * 2022-11-03 2023-02-28 吉林师范大学 Resource symmetric distributed workshop comprehensive scheduling method
CN115857451A (en) * 2022-12-02 2023-03-28 武汉纺织大学 Flow shop processing scheduling method based on reinforcement learning
CN115857451B (en) * 2022-12-02 2023-08-25 武汉纺织大学 Flow shop processing scheduling method based on reinforcement learning
CN116774651A (en) * 2023-04-28 2023-09-19 兰州理工大学 Distributed flexible flow shop scheduling method and system with time-of-use electricity price constraint
CN116259806A (en) * 2023-05-09 2023-06-13 浙江韵量氢能科技有限公司 Fuel cell stack capable of removing gas impurities and method for removing gas impurities
CN116542504B (en) * 2023-07-07 2023-09-22 合肥喆塔科技有限公司 Parameter-adaptive semiconductor workpiece production scheduling method, equipment and storage medium
CN116542504A (en) * 2023-07-07 2023-08-04 合肥喆塔科技有限公司 Parameter-adaptive semiconductor workpiece production scheduling method, equipment and storage medium
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN116957172B (en) * 2023-09-21 2024-01-16 山东大学 Dynamic job shop scheduling optimization method and system based on deep reinforcement learning
CN116957172A (en) * 2023-09-21 2023-10-27 山东大学 Dynamic job shop scheduling optimization method and system based on deep reinforcement learning
CN117787476A (en) * 2023-12-07 2024-03-29 聊城大学 Quick evaluation method for blocking flow shop scheduling based on key machine
CN117422206A (en) * 2023-12-18 2024-01-19 中国科学技术大学 Method, equipment and storage medium for improving engineering problem decision and scheduling efficiency
CN117422206B (en) * 2023-12-18 2024-03-29 中国科学技术大学 Method, equipment and storage medium for improving engineering problem decision and scheduling efficiency
CN118171892A (en) * 2024-05-11 2024-06-11 浙江大学 Workshop scheduling method and device considering skill level and fatigue degree of workers
CN118536783A (en) * 2024-07-26 2024-08-23 浙江大学 Logistics robot scheduling method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN112734172B (en) 2022-04-01
CN112734172A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022135066A1 (en) Temporal difference-based hybrid flow-shop scheduling method
CN108694502B (en) Self-adaptive scheduling method for robot manufacturing unit based on XGboost algorithm
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
WO2021109644A1 (en) Hybrid vehicle working condition prediction method based on meta-learning
CN116542445A (en) Intelligent scheduling method and system for equipment manufacturing workshop based on deep reinforcement learning
CN114565247B (en) Workshop scheduling method, device and system based on deep reinforcement learning
CN112947300A (en) Virtual measuring method, system, medium and equipment for processing quality
CN114399227A (en) Production scheduling method and device based on digital twins and computer equipment
CN115168027A (en) Calculation power resource measurement method based on deep reinforcement learning
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
CN109242099A (en) Training method, device, training equipment and the storage medium of intensified learning network
WO2024113585A1 (en) Intelligent interactive decision-making method for discrete manufacturing system
CN112836974A (en) DQN and MCTS based box-to-box inter-zone multi-field bridge dynamic scheduling method
CN113406939A (en) Unrelated parallel machine dynamic hybrid flow shop scheduling method based on deep Q network
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN116932198A (en) Resource scheduling method, device, electronic equipment and readable storage medium
CN114384931B (en) Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN114219274A (en) Workshop scheduling method adapting to machine state based on deep reinforcement learning
CN112163788B (en) Scheduling method of Internet pile-free bicycle based on real-time data
CN106611381A (en) Algorithm for analyzing influence of material purchase to production scheduling of manufacturing shop based on cloud manufacturing
CN117647960A (en) Workshop scheduling method, device and system based on deep reinforcement learning
CN117827434A (en) Mixed elastic telescoping method based on multidimensional resource prediction
CN112488543A (en) Intelligent work site shift arrangement method and system based on machine learning
CN104698838A (en) Discourse domain based dynamic division and learning fuzzy scheduling rule mining method
CN116562584A (en) Dynamic workshop scheduling method based on Conv-lasting and generalization characterization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21909065

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21909065

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.01.2024)