A hybrid flow shop scheduling method based on temporal difference learning
Technical Field
The invention belongs to the technical field of hybrid flow shop scheduling control, and in particular relates to a hybrid flow shop scheduling method based on temporal difference learning.
Background Art
The hybrid flow-shop scheduling problem (HFSP), also known as the flexible flow-shop scheduling problem, was first proposed by Salvador in 1973. It can be regarded as a combination of the classical flow-shop scheduling problem and the parallel-machine scheduling problem: at least one stage of the workpieces' processing route contains parallel machines, so a machine assignment must be made at the same time as the job sequence is determined. In the HFSP, the number of processors in at least one stage is greater than 1, which greatly increases the difficulty of solving the problem; even the two-stage HFSP with 2 processors in one stage and 1 in the other has been proved to be NP-hard.
At present, exact algorithms, heuristics and meta-heuristics are the three classical families of methods for solving flow-shop scheduling problems. Exact algorithms, including mathematical programming and branch-and-bound, can obtain optimal solutions to small-scale problems; for large-scale practical scheduling problems, heuristics and meta-heuristics have attracted researchers' attention because they obtain near-optimal solutions in relatively short time. However, heuristics and meta-heuristics design rules and algorithms for specific instances and do not adapt to complex, changeable real production environments. Reinforcement learning, by contrast, can produce scheduling policies that adapt to the actual production state. Wei Y. and Zhao M. applied Q-learning to the selection of combined dispatching rules in the job shop by defining a "production pressure" feature and a two-step dispatching rule, but the tabular reinforcement learning model used in that method cannot describe a real, complex production process. Zhang Zhicong and Zheng Li defined 15 state features for each machine and trained a linear state-value function approximator with the TD method to solve the NPFS problem, but a linear function approximator has limited fitting and generalization ability.
Summarizing the existing results, research on the hybrid flow shop scheduling problem mainly suffers from the following issues:
(1) Traditional scheduling algorithms cannot effectively learn from historical data, and their poor real-time performance makes it difficult to cope with large-scale, complex and changeable production scheduling environments.
(2) Although research on the traditional HFSP is mature, there is little work on solving the hybrid flow shop problem with reinforcement learning, and the existing work struggles to characterize the processing environment and relies on function approximators of limited capacity.
(3) Deep reinforcement learning can overcome the limited capacity of function approximators: the weight-sharing strategy of a convolutional neural network reduces the number of parameters to train, and shared weights let a filter detect a signal's characteristics regardless of its position, which strengthens the generalization ability of the trained model. However, there is still little research, at home or abroad, on applying deep reinforcement learning to shop scheduling problems.
SUMMARY OF THE INVENTION
The purpose of the present invention is to provide a hybrid flow shop scheduling method based on temporal difference learning, in order to solve the hybrid flow shop scheduling problem with related parallel machines.
The technical solution that achieves the purpose of the present invention is as follows: the hybrid flow shop scheduling method based on temporal difference learning takes minimizing the weighted mean completion time as the scheduling objective, combines a neural network with reinforcement learning, trains the model with the temporal difference (TD) method, refines candidate scheduling actions from existing scheduling knowledge and empirical rules, and couples them with the online evaluate-and-act mechanism of reinforcement learning, so that an optimal combined action policy is selected for every scheduling decision of the scheduling system. The method comprises the following steps:
Step 1: Obtain the production constraints and the objective function from the production characteristics of the hybrid flow shop, introduce the machine state features, and construct the hybrid flow shop scheduling environment. Perform the initial settings: initialize an experience replay memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) so as to interact with the agent; go to Step 2.
Step 2: With probability ε the agent randomly selects an action a_t, otherwise it selects the currently optimal action a_t according to the state value obtained after executing each candidate action. After executing the chosen action it receives the reward r_{t+1} and the next state s_{t+1}. The state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state has been reached are recorded as a single-step transition (φ_t, r_{t+1}, φ_{t+1}, is_end). The transition is stored in the memory D, and a priority computed from its TD-error is stored in the priority queue P; go to Step 3.
Step 3: Judge whether the number of single-step transitions in the memory D has reached the set threshold Batch_Size:
If the set threshold Batch_Size has been reached, go to Step 4;
If the set threshold Batch_Size has not been reached, repeat Step 2.
Step 4: Randomly extract a number of single-step transitions from D, compute the target value of the current state from the next state and the reward obtained by executing the corresponding action, compute the mean-squared-error cost between the target value and the value output by the network, and update the parameters with the mini-batch gradient descent algorithm; go to Step 5.
Step 5: Judge whether the agent has reached the terminal state; if so, go to Step 6; if not, repeat Step 2.
Step 6: Judge whether the scheduling system has experienced Max_Episode complete state transition sequences:
If so, go to Step 7;
If not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat Step 2.
Step 7: Output the action policy combination a_1, a_2, … corresponding to the optimal state sequence.
Compared with the prior art, the present invention has the following significant advantages:
(1) The invention proposes a deep reinforcement learning algorithm based on TD learning. It adopts a convolutional neural network with a dual-network structure that separates action selection from value estimation, and exploits the deep convolutional computation of the CNN, which effectively avoids over-estimation.
(2) After reinforcement learning is applied to the hybrid flow shop scheduling problem, the action space is a multi-dimensional discrete space, so Q-learning based on a one-dimensional discrete action-value function is no longer suitable. The invention therefore designs an algorithm model based on state-value updates to handle the multi-dimensional discrete space, so that it can solve the hybrid flow shop scheduling problem. Shallow-sampling TD learning is used to estimate state values: it does not depend on a complete state sequence and selects the optimal action by look-ahead trials, which in principle matches the actual scheduling process more closely and is better suited to large-scale or dynamic problems.
(3) The invention introduces a stochastic prioritized sampling method when selecting training samples, which effectively alleviates the frequent high errors and the over-fitting that greedy prioritization causes during function approximation.
Description of the Drawings
Fig. 1 compares the network structures and fitted value functions of the proposed CTDN algorithm and DQN.
Fig. 2 is a running-model diagram of the CTDN algorithm on a hybrid flow shop of scale 4×4×3.
Fig. 3 is the structure of the convolutional neural network used in the present invention.
Fig. 4 is the optimal scheduling Gantt chart of the small-scale problem.
Fig. 5 is the Gantt chart of instance tai_20_10_2.
Fig. 6 is the iteration curve of instance tai_20_10_2.
Fig. 7 is the flow chart of the temporal-difference-based hybrid flow shop scheduling method of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
With reference to Fig. 7, the steps of the temporal-difference-based hybrid flow shop scheduling method of the present invention are as follows:
Step 1: Obtain the production constraints and the objective function from the production characteristics of the hybrid flow shop, introduce the machine state features, and construct the hybrid flow shop scheduling environment. Perform the initial settings: initialize an experience replay memory D with capacity N, and randomly initialize the state-value deep neural network V(θ) and the target network V(θ⁻) so as to interact with the agent; go to Step 2.
Further, the objective function of the scheduling system described in Step 1 is to minimize the weighted mean completion time, min f = (1/num)·Σ_{j=1}^{num} w_j·c_j, where num is the total number of workpieces, w_j is the weight of workpiece j, i.e. the priority of its order, and c_j is the completion time of workpiece j. The mean completion time is an indicator of the work-in-process inventory level and of the processing cycle of a batch of workpieces, and is therefore of practical significance to enterprises.
Further, the machine state features described in Step 1 are defined as shown in Table 3. By introducing appropriate parameters, selecting features that properly describe the state, and constructing functions to compute them approximately, the resulting state characterizes the machine and workpiece information of the shop at a given moment. The k-th feature of the i-th machine M_i in the hybrid flow shop is denoted f_{i,k}, and l denotes the total number of stages. For the machines belonging to the first l-1 stages, 13 real-valued features f_{i,k} (1≤k≤13) are defined; for the machines belonging to the l-th stage, 9 real-valued features f_{i,k} (1≤k≤9) are defined. Together, the defined state feature set reveals the global and local information of the environment, as shown in Table 3.
The definitions of the state features are given in Table 3:
Table 3 Machine state feature definitions
The parameters used in Table 3 are explained here: i denotes the i-th machine; q denotes the q-th stage; m is the total number of machines; l is the total number of stages; Q_q is the waiting queue of the q-th stage; n is the number of workpieces waiting to be processed at the q-th stage; p_q is the mean processing time of all workpieces waiting at the q-th stage; and p_{q,j} is the processing time of the j-th workpiece at the q-th stage.
State feature 1 characterizes the distribution of workpieces over the stages of the production line; state feature 2 characterizes the current workload of the equipment at each stage; state feature 3 characterizes the total amount of work each stage's machines still have to complete from the current moment onward; state features 4 and 5 describe the extreme values of the processing times in the current waiting queues; state feature 6 is the elapsed processing time of the workpiece on a machine, which indicates whether the machine is running or idle and how far the workpiece has progressed; state features 7 and 8 are the extreme values of the remaining completion times of the workpieces in the waiting queues; state feature 9 characterizes the utilization of each machine from the start of processing up to the current moment; state features 10 and 11 are the extreme values of the ratio of a queued workpiece's processing time at the current stage to its processing time at the next stage; state features 12 and 13 are the extreme values of the processing time required by the queued workpieces' subsequent stages.
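Two of the feature families above can be illustrated with a minimal sketch; the helper names and the toy shop snapshot are hypothetical, and the authoritative feature formulas are those of Table 3:

```python
def queue_distribution(queue_lengths, total_jobs):
    # State feature 1 style: fraction of all workpieces waiting at each stage.
    return [q / total_jobs for q in queue_lengths]

def machine_utilization(busy_time, elapsed_time):
    # State feature 9 style: busy time divided by elapsed time since the start.
    return 0.0 if elapsed_time == 0 else busy_time / elapsed_time

# Toy snapshot: 4 workpieces spread over 3 stage queues; one machine busy for
# 30 of the first 40 time units.
dist = queue_distribution([2, 1, 1], total_jobs=4)
util = machine_utilization(busy_time=30.0, elapsed_time=40.0)
```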
Step 2: With probability ε the agent randomly selects an action a_t, otherwise it selects the currently optimal action a_t according to the state value obtained after executing each candidate action. After executing the chosen action it receives the reward r_{t+1} and the next state s_{t+1}. The state features of the current state, the reward r_{t+1}, the state features of the next state s_{t+1}, and whether the terminal state has been reached are recorded as a single-step transition (φ_t, r_{t+1}, φ_{t+1}, is_end). The transition is stored in the memory D, and a priority computed from its TD-error is stored in the priority queue P; go to Step 3.
Further, the specific sub-steps of Step 2 are as follows:
Step 21: To guarantee continued exploration, the ε-greedy strategy is adopted. A small ε is set; with probability 1-ε the agent greedily selects, from the current candidate action set A(s), the action that maximizes the sum of the reward obtained by executing it and the discounted value of the resulting next state, i.e. a = argmax_{a∈A(s)} [r(a) + γ·V(φ_{i+1})], where A(s) is the candidate action set, γ is the discount factor, r(a) is the reward the agent obtains by executing action a, φ_{i+1} is the state feature of the state reached by executing a, and V(φ_{i+1}) is the value of that next state computed by the state-value convolutional neural network. With probability ε the agent instead selects an action uniformly at random from the full candidate set.
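The selection rule of Step 21 can be sketched as follows; this is an illustrative sketch, and the toy reward and value tables stand in for the agent's reward signal and the state-value network, which are not part of the original text:

```python
import random

def select_action(candidates, reward_fn, next_value_fn, eps, gamma=0.95):
    # epsilon-greedy over the candidate set A(s): explore uniformly with
    # probability eps, otherwise maximize r(a) + gamma * V(phi(s')).
    if random.random() < eps:
        return random.choice(candidates)
    return max(candidates, key=lambda a: reward_fn(a) + gamma * next_value_fn(a))

# Toy example: three candidate actions with made-up rewards and next-state values.
rewards = {0: 0.0, 1: 1.0, 2: 0.2}
next_values = {0: 2.0, 1: 0.5, 2: 1.0}
best = select_action([0, 1, 2], rewards.get, next_values.get, eps=0.0)
```

With eps=0.0 the choice is purely greedy: action 0 scores 0.0 + 0.95·2.0 = 1.9, which beats actions 1 and 2.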
Step 22: If the scheduling system must assign workpieces to several stages at the current moment, an action is selected for one stage according to Step 21, the scheduling system executes that action in a look-ahead manner, and the system state moves to a temporary state; Step 21 is then repeated to select actions for the remaining machines until all selections are complete. The action the scheduling system executes in the current state is therefore a multi-dimensional action.
Step 23: After the multi-dimensional action is obtained, the scheduling system executes it; the agent receives the reward r_{t+1} and the next state s_{t+1}, and the single-step transition is stored in the memory D. The TD-error is then computed as ξ_i = R_{t+1} + γ·V(S_{t+1}) - V(S_t), where γ is the discount factor, R_{t+1} is the reward within the single-step transition, V(S_{t+1}) is the value of the next state, and V(S_t) is the value of the current state. The priority is then computed as p_i = |ξ_i| + β and stored in the priority queue P, where ξ_i is the TD-error computed above and β is a small positive constant that allows edge cases whose TD-error is 0 still to be sampled.
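The TD-error and priority computations of Step 23 can be sketched directly; the function names are illustrative, and the scalar inputs stand in for the outputs of the state-value network:

```python
def td_error(r_next, v_next, v_curr, gamma=0.95):
    # xi_i = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
    return r_next + gamma * v_next - v_curr

def priority(xi, beta=1e-4):
    # p_i = |xi_i| + beta; beta > 0 keeps zero-error transitions sampleable.
    return abs(xi) + beta

xi = td_error(r_next=1.0, v_next=2.0, v_curr=2.0)  # 1.0 + 0.95*2.0 - 2.0
p = priority(xi)
```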
The definition of the reward R in Step 21 is directly or indirectly related to the objective function of the scheduling system. To let the scheduling system respond to the urgency of orders, the scheduling objective adopted by the invention is to minimize the weighted mean completion time, so the agent obtains a larger reward for a shorter weighted mean completion time.
Considering that the weighted mean completion time is closely related to the workpiece states, an indicator function δ_j(τ) of the state of workpiece j is defined: δ_j(τ) = 1 if workpiece j has not yet been completed at time τ, and δ_j(τ) = 0 otherwise.
The reward function is defined as r_u = -Σ_{j=1}^{num} w_j·δ_j(t)·(t_u - t_{u-1}), where num is the total number of workpieces, w_j is the weight of workpiece j, and t is the time node of the scheduling system; r_u is thus the negative weighted time (the sum of waiting time and processing time) accumulated by the unfinished workpieces between two adjacent decision points (the (u-1)-th and the u-th). The reward function has the property that minimizing the objective function is equivalent to maximizing the cumulative reward R obtained over a complete state sequence: summing r_u over all decision points gives R = Σ_u r_u = -Σ_{j=1}^{num} w_j·C_j, where C_j is the total completion time of the j-th workpiece. The smaller the weighted mean completion time, the larger the total reward. The reward function defined above therefore links the reward directly to the scheduling objective and reflects the long-term influence of the actions on the objective function.
The candidate action set of each machine in Step 21 is defined as shown in Table 4. A candidate action set is defined for each machine from simple constructive heuristics; using priority dispatching rules inside reinforcement learning overcomes their short-sighted nature. Both state-dependent and state-independent actions should be adopted, to make full use of existing scheduling rules and theory as well as the agent's ability to learn from experience. The invention therefore selects 13 actions commonly used for the weighted completion time objective, as listed in Table 4.
Table 4 Candidate action set for each machine
Since parallel machines exist at some stages of the production process, the definition of an action must consider not only which workpiece to select but also which idle machine the selected workpiece is assigned to. The scheduling problem studied in the invention is the identical parallel machine case: all machines of a parallel stage need the same processing time for the same workpiece, so in the ideal case the choice of idle machine does not affect the workpiece's processing cycle. To balance machine utilization, the idle machine is therefore selected according to the principle of minimum machine load at the bottleneck stage.
Action 14: among the parallel machines, select the machine with the shortest total machining time to process the workpiece.
Here I is the set of idle machines at the stage, and J is the set of workpieces already processed by machine M_i. For a stage with only one processing machine, the set of actions available to the machines of the first l-1 stages is {a_k | 1≤k≤13}, and the set available to the machine of the l-th stage is {a_k | 1≤k≤8 or k=13}. For a stage with parallel machines that is not the last stage, the scheduling system takes actions from {(a_14, a_k) | 1≤k≤13}; if it is the last stage, from {(a_14, a_k) | 1≤k≤8 or k=13}; idle machines that are not selected continue to take action a_13.
Step 3: Judge whether the number of single-step transitions in the memory D has reached the set threshold Batch_Size:
If the set threshold Batch_Size has been reached, go to Step 4;
If the set threshold Batch_Size has not been reached, repeat Step 2.
Step 4: Randomly extract a number of single-step transitions from D, compute the target value of the current state from the next state and the reward obtained by executing the corresponding action, compute the mean-squared-error cost between the target value and the value output by the network, and update the parameters with the mini-batch gradient descent algorithm; go to Step 5.
Further, the specific sub-steps of Step 4 are as follows:
Step 41: Extract a number of single-step transitions from D according to the proportional weights computed from the TD-errors, and compute the current target value as y_i = r_{i+1} + γ·V(φ_{i+1}; θ⁻), where y_i is the target value of the current state, γ is the discount factor, r_{i+1} is the reward of the action within the single-step transition, φ_{i+1} is the state feature of the next state s_{t+1} within the transition, and V(φ_{i+1}; θ⁻) is the value of the next state computed by the target network.
Step 42: Compute the mean-squared-error cost between the target values and the network outputs, loss = (1/h)·Σ_{i=1}^{h} (y_i - V(φ_i; θ))², where loss is the mean-squared-error cost, h is Batch_Size, y_i is the target value obtained above, φ_i is the state feature of the current state within the transition, and V(φ_i; θ) is the value computed by the state-value network. The network parameters and the priority queue are updated with the mini-batch gradient descent algorithm.
Step 43: Update the parameters of the state-value network with the mini-batch gradient descent algorithm, and copy them to the target network every T steps.
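The target and loss computations of Steps 41 and 42 can be sketched as follows; this is an illustrative sketch in which scalar values stand in for the outputs of the CNN and the target network, and the terminal-state handling (y_i = r when is_end is set) is a common convention rather than a detail stated in the original text:

```python
def td_targets(batch, gamma=0.95):
    # y_i = r_{i+1} + gamma * V(phi_{i+1}; theta^-); terminal states keep y_i = r.
    return [r if is_end else r + gamma * v_next for r, v_next, is_end in batch]

def mse_loss(targets, outputs):
    # loss = (1/h) * sum_i (y_i - V(phi_i; theta))^2
    h = len(targets)
    return sum((y - v) ** 2 for y, v in zip(targets, outputs)) / h

# Toy mini-batch of (r_{i+1}, V_target(next state), is_end) triples.
batch = [(1.0, 2.0, False), (0.5, 0.0, True)]
ys = td_targets(batch)            # approx [2.9, 0.5]
loss = mse_loss(ys, [2.9, 0.0])   # approx (0 + 0.25) / 2
```

In the full algorithm, loss would be back-propagated through V(·; θ) by the mini-batch gradient descent step, with θ⁻ refreshed from θ every T steps.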
When Step 41 samples according to the probability distribution of prioritized replay, the proportion is first computed as P(i) = p_i / Σ_k p_k, where p_i is the priority of sample i; then Batch_Size samples are randomly drawn from D according to these proportional weights.
Step 5: Judge whether the agent has reached the terminal state; if so, go to Step 6; if not, repeat Step 2.
Step 6: Judge whether the scheduling system has experienced Max_Episode complete state transition sequences:
If so, go to Step 7;
If not, initialize the scheduling environment, reset the states of the machines and workpieces, and repeat Step 2.
Step 7: Output the action policy combination a_1, a_2, … corresponding to the optimal state sequence.
The present invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the DQN algorithm has several nodes in the output layer of its deep neural network, each directly corresponding to one action value. A one-dimensional action output cannot express a multi-dimensional action space, and the off-policy Q-learning it uses substitutes the optimal value for the actually experienced value when evaluating actions, which easily causes over-estimation. The invention therefore proposes to use TD learning instead of Q-learning and to compute action values indirectly from state values, which suits a multi-dimensional action space. A convolutional neural network replaces the deep BP neural network: the CNN weight-sharing strategy reduces the number of parameters to train, and the pooling operation lowers the spatial resolution of the network, eliminating small shifts and distortions of the signal, so the requirement of translation invariance on the input data is relaxed. The two algorithms differ in their network structures and in the value functions they fit.
To better understand the state transition mechanism, the invention takes a hybrid flow shop scheduling problem of scale n=4, m=4, l=3 as an example to illustrate how the algorithm runs. As shown in Fig. 2, triangles represent workpieces, cuboids represent machines, and rectangles represent the waiting queues before each stage.
At the start, the system is in the initial state s_0: all machines are idle and all workpieces are in the waiting queue Q_1 of the first stage. Once the system runs, the machine of the first stage selects an action a_k, i.e. it selects some workpiece from this stage's waiting queue for processing, while the machines of the other stages select action a_13 because their waiting queues are empty. When a machine finishes processing a workpiece, the system moves to a new state s_t and a state transition is triggered: the system selects a feasible action for each machine, and when another machine finishes processing, the system moves to the next state s_{t+1} and the agent receives the reward r_{t+1}. When a workpiece enters a parallel-machine stage, the system selects a workpiece from the waiting queue according to the current state and selects a machine from the stage's idle-machine queue to process it. Since every machine selects an action simultaneously at each decision point, the system in fact executes, in each state, one multi-dimensional action (a_1, a_2, …, a_m) composed of m sub-actions. When the system reaches the terminal state, every waiting queue is empty, i.e. all workpieces have been processed, and the system obtains a scheduling plan.
实施例Example
参数选择可能影响求解质量,有一般性原则可以遵循。折扣因子γ衡量后续状态值对总回报的权重,因此一般取值接近1,设γ=0.95;ε-贪心策略中应先让ε从大变小,以便在初始阶段充分探索策略空间,结束阶段利用所得最优策略,因此初始ε=1,并以0.995的折扣率指数衰减;设学习率α=0.02,最大交互次数MAX_EPISODE=1000;记忆体D容量N=6000,采样批量BATCH_SIZE=256;智能体卷积神经网络结构如图3所示,网络参数采取随机初始化策略。Parameter selection may affect solution quality, and some general principles can be followed. The discount factor γ weighs the contribution of subsequent state values to the total return, so it is usually set close to 1; here γ=0.95. In the ε-greedy strategy, ε should decay from large to small so that the policy space is explored fully in the initial stage and the learned optimal policy is exploited in the final stage; hence the initial ε=1 with an exponential decay rate of 0.995. The learning rate is set to α=0.02 and the maximum number of episodes to MAX_EPISODE=1000; the replay memory D has capacity N=6000 with sampling batch BATCH_SIZE=256. The structure of the agent's convolutional neural network is shown in Figure 3, and the network parameters are randomly initialized.
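As a sketch of how these hyperparameters interact (the numeric values are taken from the embodiment above; the helper names and loop skeleton are illustrative assumptions, not the invention's exact implementation), ε-greedy selection with exponential decay could look like:

```python
import random
from collections import deque

# Hyperparameters as stated in the embodiment (Figure 3 network omitted)
GAMMA = 0.95          # discount factor, close to 1
ALPHA = 0.02          # learning rate
MAX_EPISODE = 1000    # maximum number of training episodes
EPS_DECAY = 0.995     # exponential decay rate of epsilon
BATCH_SIZE = 256      # sampling batch size

memory = deque(maxlen=6000)   # replay memory D with capacity N=6000

def select_action(q_values, eps):
    """epsilon-greedy: explore a random action with probability eps,
    otherwise exploit the action with the largest Q-value."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon starts at 1 and decays exponentially toward pure exploitation
epsilon = 1.0
for episode in range(MAX_EPISODE):
    # ... run one episode, store transitions in `memory`, sample
    # BATCH_SIZE of them, and update the network with rate ALPHA ...
    epsilon *= EPS_DECAY
```

After 1000 episodes, ε has decayed to 0.995^1000 ≈ 0.007, so the agent acts almost entirely greedily by the end of training, matching the explore-then-exploit principle stated above.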
(1)小规模问题(1) Small-scale problems
小规模问题以某10×8×6的调度问题为例检验算法的可行性。实例中包含10个工件、8台机器,每个工件需经过6道生产工序,在第三道、第五道工序存在并行机,各有相同的两台设备可供调度。该实例具体数据如表5所示。其中,工件优先级基准为1,为了测试设置的工件优先级对调度方案的影响,考虑对Job3、Job5、Job8的优先级权重系数随机取不同数值,分别为1.2、1.5、1.3,以测试权重对调度结果的影响效果。A small-scale 10×8×6 scheduling problem is taken as an example to verify the feasibility of the algorithm. The instance contains 10 workpieces and 8 machines, and each workpiece must pass through 6 production processes; parallel machines exist in the third and fifth processes, each with two identical devices available for scheduling. The specific data of this instance are shown in Table 5. The baseline workpiece priority is 1; to test the impact of the configured workpiece priorities on the scheduling scheme, the priority weight coefficients of Job3, Job5, and Job8 are randomly set to different values, namely 1.2, 1.5, and 1.3, respectively.
表5 10×8×6的调度问题实例数据Table 5 Instance data of 10×8×6 scheduling problem
机器的分布情况为{1,2,[3,4],5,[6,7],8}。采用本发明算法与部分传统算法求解实例的结果如表6所示,表中的较优解加粗表示。由表6可见,本发明算法相较于传统算法能够获得较优解,其解对应的甘特图如图4所示,图中红色竖直线表示调度系统的各个决策节点。本算法最优解相较于IDE算法和HOMA算法效率分别提升4.3%和3.9%。The distribution of machines is {1,2,[3,4],5,[6,7],8}. Table 6 shows the results of solving the instance with the algorithm of the present invention and several traditional algorithms; the better solutions in the table are shown in bold. As can be seen from Table 6, the algorithm of the present invention obtains better solutions than the traditional algorithms; the Gantt chart corresponding to its solution is shown in Figure 4, where the red vertical lines represent the decision points of the scheduling system. Compared with the IDE algorithm and the HOMA algorithm, the optimal solution of this algorithm improves efficiency by 4.3% and 3.9%, respectively.
表6 小规模测试实例结果对比图Table 6 Comparison of results of small-scale test examples
由图可知,工件优先级高的Job5、Job8、Job3先被加工,工件优先级越高的工件,将会优先进行加工,可见上文设定的报酬函数能够反映目标函数。As can be seen from the figure, Job5, Job8, and Job3, which have higher priorities, are processed first: the higher a workpiece's priority, the earlier it is processed, showing that the reward function defined above indeed reflects the objective function.
(2)大规模问题(2) Large-scale problems
本发明随机从[OR_Library]实例集中选取15个示例用于实验测试,并与候鸟优化算法(MBO)及比较算法进行对比,如表7所示,表中较优结果用加粗字体表示。The present invention randomly selects 15 instances from the [OR_Library] instance set for experimental testing and compares the results with the migrating birds optimization (MBO) algorithm and other comparison algorithms, as shown in Table 7, where the better results are shown in bold.
表7 大规模实例对比结果Table 7 Comparison results of large-scale instances
由表7可知,相比于其它算法,本发明提出的CTDN算法可以获得较优的解,某些实例的解已经低于原实例的上界。深度神经网络需要花费一定时间进行训练,但训练完成的网络可以快速根据输入的状态价值在极短时间内得出最优行为。As can be seen from Table 7, compared with the other algorithms, the CTDN algorithm proposed in the present invention obtains better solutions, and for some instances the solution is already below the upper bound of the original instance. The deep neural network takes some time to train, but once trained it can derive the optimal action from the input state value in a very short time.
图5为实例tai_20_10_2在本发明算法下求得最优策略对应的甘特图。图中红色竖直虚线代表调度决策点,即工件完成每道工序的时间点。FIG. 5 is a Gantt chart corresponding to the optimal strategy obtained by the example tai_20_10_2 under the algorithm of the present invention. The red vertical dotted line in the figure represents the scheduling decision point, that is, the time point when the workpiece completes each process.
图6为实例tai_20_10_2下加权平均完工时间随着训练进行的变化图。从图中趋势可以看出,调度目标值随着episode的不断循环逐渐减小。一开始,智能体处于完全陌生的环境,通过自主的随机行为选择不断地进行学习试错;随着ε值不断衰减,智能体倾向于采取模型给出的最优选择,从而使得系统不断向目标方向迈进,在900次迭代内能获得较优解。Figure 6 shows how the weighted average completion time of instance tai_20_10_2 changes as training proceeds. The trend in the figure shows that the scheduling objective value gradually decreases as the episodes iterate. At the beginning, the agent is in a completely unfamiliar environment and keeps learning by trial and error through autonomous random action selection; as the value of ε decays, the agent increasingly takes the optimal choice given by the model, driving the system steadily toward the objective, and a better solution is obtained within 900 iterations.
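The temporal-difference learning that underlies this training curve can be sketched as follows. This is a generic one-step TD (Q-learning-style) target computation for a replay minibatch under the γ=0.95 of the embodiment, not the invention's exact network update; the transition-tuple layout and function names are assumptions for illustration.

```python
def td_targets(batch, max_q_next, gamma=0.95):
    """One-step TD targets for a sampled minibatch.

    batch       -- list of transitions (s, a, r, s_next, done)
    max_q_next  -- max over a' of Q_target(s_next, a') for each transition
                   (supplied directly here; normally from the target network)
    """
    targets = []
    for (s, a, r, s_next, done), qn in zip(batch, max_q_next):
        # terminal states bootstrap nothing: the target is just the reward
        targets.append(r if done else r + gamma * qn)
    return targets
```

Minimizing the gap between the network's Q(s, a) and these targets over batches drawn from the replay memory is what drives the objective value in Figure 6 downward as episodes accumulate.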