WO2022135066A1

WO2022135066A1 - Temporal difference-based hybrid flow-shop scheduling method

Info

Publication number: WO2022135066A1
Application number: PCT/CN2021/133905
Authority: WO
Inventors: 陆宝春; 陈志峰; 顾钱; 翁朝阳; 张卫; 张哲�
Original assignee: 南京理工大学
Priority date: 2020-12-25
Filing date: 2021-11-29
Publication date: 2022-06-30
Also published as: CN112734172B; CN112734172A

Abstract

Disclosed is a temporal difference-based deep reinforcement learning algorithm, which is used to solve a hybrid flow-shop scheduling problem of a related parallel machines. The algorithm combines a convolutional neural network with TD learning in reinforcement learning, and performs behavior selection according to inputted state features so as to more fully comply with a scheduling decision-making process of an actual order response-type production and fabrication system. By means of transforming the scheduling problem into a multi-stage decision-making problem, a convolutional neural network model is used to fit a state value function, fabrication system processing state feature data is inputted into the model, a temporal difference method is used to train the model, a heuristic algorithm or an allocation rule is used as candidate scheduling decision behavior, and an optimal combined behavior strategy is selected for each scheduling decision by combining with a reinforcement learning reward and punishment mechanism. Compared to the prior art, the algorithm proposed by the present invention has the advantages of strong real-time performance and high flexibility.

Description

A hybrid flow shop scheduling method based on time series difference

technical field

The invention belongs to a mixed flow shop scheduling control technology, in particular to a mixed flow shop scheduling method based on time sequence difference.

Background technique

The hybrid flow-shop scheduling problem (HFSP), also known as the flexible flow-shop scheduling problem, was first proposed by Salvador in 1973. This problem can be regarded as a combination of the classical flow-shop scheduling problem and the parallel machine scheduling problem. , which is characterized in that the workpiece has a parallel machine stage in the processing process, and the machine allocation is performed while the workpiece processing sequence is determined. In the HFSP problem, the number of processors in at least one stage is greater than 1, which greatly increases the difficulty of solving HFSP. It has been proved that the two-stage HFSP with 2 and 1 processors is NP-hard problem.

At present, exact algorithms, heuristics and meta-heuristics are three classical methods for solving the flow shop scheduling problem. Exact algorithms include mathematical modeling and branch-and-bound methods, which can obtain optimal solutions for small-scale problems; for large-scale practical scheduling problems, heuristic algorithms or meta-heuristic algorithms are favored because they can obtain near-optimal solutions in a relatively short time. researchers' attention. However, heuristic algorithms or meta-heuristic algorithms are designed for specific instances with corresponding rules and algorithms, and are not suitable for complex and changeable actual production environments. Reinforcement learning algorithms can generate scheduling policies that adapt to the actual production state. Wei Y, Zhao M used Q-learning for the selection of combination assignment rules in the job shop by defining the eigenvalues of "production pressure" and two-step scheduling rules, but the tabular reinforcement learning model adopted by this method cannot describe the actual complex processing process. Zhang Zhicong and Zheng Li defined 15 state features for each machine, and used the TD method to train a linear state-value function generalizer to solve the NPFS problem, but the linear function generalizer has limited fitting and generalization capabilities.

Summarizing and analyzing the existing research results, the research on the scheduling problem of mixed flow workshop mainly has the following problems:

(1) The traditional scheduling algorithm cannot effectively use historical data for learning, and has poor real-time performance, making it difficult to cope with the large-scale complex and changeable actual production scheduling environment.

(2) At present, although the traditional research on HFSP is very mature, there are few researches on using reinforcement learning to solve the mixed flow shop problem, and there are problems such as difficulty in characterizing the processing environment and limited functions of function generalizers.

(3) The deep reinforcement learning algorithm can solve the problem of the limited function of the function generalizer. The weight sharing strategy of the convolutional neural network reduces the parameters that need to be trained. The same weight can make the filter not affected by the signal position to detect the signal The characteristics of the trained model make the generalization ability of the trained model stronger, but there are few researches on the deep reinforcement learning algorithm to solve the workshop scheduling problem at home and abroad.

SUMMARY OF THE INVENTION

The purpose of the present invention is to provide a mixed flow shop scheduling method based on time sequence difference to solve the mixed flow shop scheduling problem of related parallel machines.

The technical solution to achieve the purpose of the present invention is: a hybrid flow shop scheduling method based on time series difference described in the present invention takes minimizing the weighted average completion time as the scheduling goal, combines neural network and reinforcement learning, and adopts time series difference method for training. The model uses the existing scheduling knowledge and empirical rules to refine the scheduling decision candidate behavior, and combines the online evaluation-execution mechanism of reinforcement learning to select the optimal combined behavior strategy for each scheduling decision of the scheduling system, which includes the following steps:

Step 1: Obtain the production constraints and objective functions according to the production characteristics of the mixed flow workshop, and introduce the machine state characteristics, build the mixed flow workshop scheduling environment, and initialize the settings, initialize the experience memory bank D with a capacity of N, and randomly initialize the state value depth Neural network V(θ) and target network V(θ ^- ) to realize interaction with the agent, go to step 2;

Step 2: The agent randomly selects a behavior a _t with the probability of ε or selects the current optimal behavior a _t according to the state value after executing the behavior, and obtains the reward r _t+1 and the next state s _t+ after executing the optimal behavior _1. Record the state characteristics of the current state, the state characteristics of the next state s _t+1 _, and the state characteristics of the next state s t+1, and whether the terminal state is reached (φ _t , r t+1 , φ t , r _t+1 , φ _t+1 ,is_end), store the obtained single-step state transition in the memory bank D, and store it in the priority queue P according to the TD-error calculation ratio, and go to step 3;

Step 3: Determine whether the number of single-step state transitions in memory D reaches the set threshold Batch_Size:

If the set threshold Batch_Size is reached, go to step 4;

If the set threshold Batch_Size is not reached, repeat step 2;

Step 4: Randomly extract a certain number of single-step state transitions from D, use the next state and the reward obtained by executing the corresponding behavior to calculate the target value of the current state, calculate the mean squared cost between the target value and the network output value, use The mini-batch gradient descent algorithm updates the parameters, and goes to step 5;

Step 5: Determine whether the current agent has reached the end state, if so, go to Step 6; if not, repeat Step 2;

Step 6: Determine whether the scheduling system has experienced Max_Episode complete state transition sequence:

If so, go to step 7;

If it is not reached, initialize the scheduling environment, reset the state of the machine and the workpiece, and repeat step 2;

Step 7: Output the behavior strategy combination a ₁ , a ₂ , . . . corresponding to the optimal state sequence.

Compared with the prior art, the present invention has the following significant advantages:

(1) The present invention proposes a deep reinforcement learning algorithm based on TD learning, which adopts a convolutional neural network with a dual network structure, separates action selection and value estimation, and utilizes the advantages of deep convolution calculation of CNN, which can effectively avoid excessively high estimate.

(2) After applying reinforcement learning to the scheduling problem of mixed flow shop, its behavior space is a multi-dimensional discrete space, so it is not suitable to continue to use Q-learning based on one-dimensional discrete behavior value function. Therefore, the present invention designs an algorithm model based on state value update to solve the multi-dimensional discrete space, so that it can solve the mixed flow shop scheduling problem. Using shallow sampling TD learning to solve the state value, it does not depend on the complete state sequence, and selects the optimal action through a probing attempt, which is more in line with the actual scheduling process in principle, and can be used to solve large-scale problems or dynamic problems. more appropriate.

(3) The present invention introduces a random priority sampling method when selecting samples for training, which can effectively solve the problem that the algorithm frequently has a higher error and overfitting in the function approximation process due to greedy priority.

Description of drawings

FIG. 1 is a comparison diagram of the network structures and fitting functions of the algorithms CTDN and DQN proposed by the present invention.

Figure 2 is a diagram of the operation model of the CTDN algorithm in a mixed flow shop with a scale of 4 × 4 × 3.

FIG. 3 is a structural diagram of a convolutional neural network used in the present invention.

Figure 4 is an optimal scheduling Gantt chart for a small-scale problem.

Figure 5 is an example tai_20_10_2 Gantt chart.

Figure 6 is an iterative graph of the instance tai_20_10_2 running.

FIG. 7 is a flow chart of the method for scheduling a mixed flow shop based on timing difference of the present invention.

Detailed ways

The present invention will be described in further detail below with reference to the accompanying drawings.

With reference to Fig. 7 , a method for scheduling a mixed flow shop based on timing difference according to the present invention, the steps are as follows:

Step 1: Obtain the production constraints and objective functions according to the production characteristics of the mixed flow workshop, and introduce the machine state characteristics, build the mixed flow workshop scheduling environment, and initialize the settings, initialize the experience memory bank D with a capacity of N, and randomly initialize the state value depth Neural network V(θ) and target network V(θ ^- ) to realize interaction with the agent, go to step 2.

Further, the objective function of the scheduling system described in step 1 is to minimize the weighted average completion time, and its minimum weighted average completion time objective function

Where w _j is the weight value of workpiece j, that is, the priority of the order, and c _j is the completion time of workpiece j. The average completion time index can be used to measure the inventory level of intermediate products and the processing cycle of a batch of workpieces, which is of great practical significance to enterprises.

Further, the definition of the state characteristics of the machine described in step 1 is shown in the table. By introducing appropriate parameters, selecting the characteristics that describe the state appropriately, and constructing a certain function to approximate the calculated state, it represents the machine and workpiece in a certain state. information. The k-th feature of the _i -th machine Mi in the mixed flow workshop is denoted as f _i,k , and l represents the total number of processes. For the machines belonging to the first l-1 process, a total of 13 real-valued features f _i,k are defined, where 1≤k≤13, a total of 9 real-valued features f _i,k are defined for the machine belonging to the lth process, among which 1≤k≤9, the defined state feature set jointly reveals the global and local information of the environment ,as shown in Table 3.

The definition of state characteristics is shown in Table 3:

Table 3 Machine state feature definition table

Here is a unified description of the parameters used in Table 1: i represents the ith machine, q represents the qth process, m represents the total number of machines, l represents the total number of processes, Q _q represents the waiting queue of the qth process, n Indicates that there are n workpieces to be processed in the _qth process, pq represents the average processing time of all the workpieces to be processed in the qth process, and _pq,j represents the processing time of the jth workpiece in the qth process.

State feature 1 represents the distribution of workpieces in each process of the production line; state feature 2 represents the workload of the equipment in each process at the current moment; state feature 3 represents the work to be completed by the machine in each process from the current moment The total amount; status features 4 and 5 describe the maximum value of the processing time of each process in the current waiting queue; status feature 6 represents the processed time of the work-in-process in the equipment, thereby characterizing the operation or idleness of the equipment and the processing progress of the workpiece; status features 7 and 8 represent the maximum value of the remaining completion time in the workpiece waiting queue; state feature 9 represents the utilization rate of each machine from the start of processing to the current moment; state features 10 and 11 represent the processing time of the workpiece in the workpiece waiting queue in the current process. The maximum value of the ratio of the processing time in the next process; the state features 12 and 13 represent the maximum value of the processing time required for the subsequent process of the workpiece.

Further, the specific steps of the step 2 are as follows:

Step 21: In order to ensure continuous exploration, the ε-greedy strategy is adopted. By setting a small ε value, the probability of 1-ε is used to make a greedy selection. Under the current set of optional behaviors, the selection is based on the state value convolution. The behavior corresponding to the state value of the next state obtained by the neural network and the maximum value of the sum of the rewards obtained by performing the behavior

where A(s) is the set of optional behaviors, γ is the attenuation coefficient,

is the reward obtained by the agent for executing behavior a, φ _i+1 represents the state characteristic of the execution behavior a reaching state θ ^- , V(φ _i+1 ) represents the state value of the next state obtained according to the state value network, and using The probability of ε randomly selects an action from the set of all optional actions;

Step 22: If the scheduling system needs to specify workpiece processing for multiple processes at the current moment, select the behavior for a process according to Step 21

After that, schedule the system's forward-looking execution behavior

Then the scheduling system state transitions to the temporary state

Repeat step 21 to select behaviors for the machine until all selections are completed; then, the behaviors executed by the scheduling system in the current state are multidimensional behaviors;

Step 23: After obtaining the multi-dimensional behavior, the scheduling system executes the multi-dimensional behavior, the agent gets the reward r _t+1 and the next state s _t+1 , stores the single-step state transfer in the memory D, and then calculates the TD-error first , its calculation formula is ξ _i =R _t+1 +γV(S _t+1 )-V(S _t ), where γ is the decay coefficient, R _t+1 is the reward in the single-step state transition, V(S _{t +1} ) is the state value of the next state, V(S _t ) is the current state value, and then the priority probability is calculated according to p _i =|ξ _i |+β and stored in the priority queue P, where ξ _i is calculated above TD-error, β is a small positive number, this is to enable some special edge examples with TD-error of 0 to be extracted.

The definition of the reward R in step 21 is directly or indirectly related to the objective function of the scheduling system. In order to enable the scheduling system to respond to the urgency of the order, the scheduling objective adopted by the present invention is to minimize the weighted average completion time, and the agent can obtain greater rewards due to the shorter weighted average completion time.

Considering that the weighted average makepan is closely related to the workpiece state, an indicative function δ _j (τ) representing the workpiece state is defined as follows:

The reward function is defined as follows:

In the formula, num is the total number of workpieces, w _j is the weight value of workpiece j, and t is the moment node of the scheduling system. r _u represents the weighted completion time (the sum of the waiting time and the processing time) of each workpiece between two adjacent decision points (the u-1th decision point and the uth decision point). The reward function has this property: minimizing the objective function is equivalent to maximizing the cumulative reward R obtained by a complete sequence of states. The proof process is as follows:

In the formula: C _j represents the total completion time of the jth workpiece. It can be seen from the formula that the smaller the average weighted completion time, the greater the total reward. Therefore, the reward function defined above can directly link the reward function and the scheduling objective, and directly reflect the long-term impact of behavior on the objective function.

The definition of machine optional behavior set in step 21 is shown in Table 2. Defining candidate behavior sets for each machine based on a simple construction heuristic algorithm, and prioritizing rules for reinforcement learning can overcome the short-sighted nature. State-related or unrelated behavior should be adopted to take full advantage of existing scheduling rules, theories, and the agent's ability to learn from experience. Therefore, the present invention selects 13 behaviors commonly used in minimizing the weighted makepan objective, as shown in Table 4.

Table 4 Candidate behavior set for each machine

Since there are parallel machines in some processes in the production process, the definition of behavior should not only consider which workpiece to select, but also which idle machine to allocate the selected workpiece for processing. The scheduling problem studied in the present invention is the same parallel machine scheduling problem, that is, all machines in this parallel machine process have the same processing time for the same workpiece, so the selection of idle machines will not affect the processing cycle of the workpiece under ideal conditions. In order to balance machine utilization, idle machines are selected according to the principle of minimum machine load in the bottleneck process.

In act 14, the machine with the shortest total machining time in the parallel machine is selected to machine the workpiece.

In the formula, I is the set of idle machines in the process, and J is the set of _workpieces processed by the machine Mi. For a process with only one processing machine, the set of actions that can be taken by the machine belonging to the first l-1 process is { _ak |1≤k≤13}, and the set of actions that the machine belonging to the lth process can take is { _ak |1≤k≤8,13}. For a process with a parallel machine, if it is not the last process, the set of actions taken by the scheduling system is {(a ₁₄ , _ak )|1≤k≤13}; if it is the last process, the set of actions taken by the scheduling system is {(a ₁₄ , a _k )|1≤k≤8,13}, and continue to take behavior a ₁₃ for idle machines that are not selected.

If the set threshold Batch_Size is reached, go to step 4;

If the set threshold Batch_Size is not reached, repeat step 2;

Further, the specific steps of step 4 are as follows:

Step 41: Extract a certain number of single-step state transitions from D according to the proportional weight calculated by TD-error, using the formula

Calculate the current target value, where y _i represents the obtained current state value, γ represents the decay coefficient, r _i+1 represents the reward of the behavior within the single-step state transition, and φ _i+1 represents the next state s within the single-step state transition The state feature of _t+1 , V(φ _i+1 ; θ ⁻ ) represents the state value of the next state obtained from the target network;

Step 42: Calculate the mean square error cost between the target value and the network output value,

where loss is the required mean square error cost, h is Batch_Size, y _i represents the current state value obtained above, φ _i+1 represents the state feature of the next state s _t+1 in the single-step state transition, V(φ _{i +1} ; θ) represents the state value of the next state obtained according to the state value network, and uses the mini-batch gradient descent algorithm to update the network parameters and the priority queue;

Step 43: Use the mini-batch gradient descent algorithm to update the state value network parameters, and replace the target network value every T steps.

Wherein step 41 adopts the probability distribution sampling of priority playback, first according to the formula

Calculate the proportion, where pi represents the priority probability of the sample, h is the _{Batch_Size} , and then randomly select Batch_Size samples from D according to the proportion weight.

If so, go to step 7;

Below in conjunction with accompanying drawing, the present invention is further introduced:

As shown in Figure 1, the DQN algorithm has several nodes in the output layer of the deep neural network, each node directly corresponds to a certain behavior value, and the one-dimensional behavior cannot express a multi-dimensional behavior space, and the Q-learning using different strategies is used in the evaluation. Replacing the actual interaction value with the optimal value in the behavioral value can easily lead to overestimation. Therefore, it is proposed to use TD learning instead of Q learning, and calculate the behavior value indirectly based on the state value, which is suitable for multi-dimensional behavior space. In addition, the deep BP neural network is replaced by the convolutional neural network, and the CNN weight sharing strategy is used to reduce the parameters that need to be trained. Translation invariance is not very demanding. The difference between the two is reflected in the network structure and the fitted value function.

In order to better understand the state transition mechanism, the present invention takes a mixed flow shop scheduling problem with a scale of n=4, m=4, l=3 as an example to illustrate the running process of the algorithm. As shown in Figure 2, the triangle in the figure represents the workpiece, the cuboid represents the machine, and the rectangle represents the waiting queue before each process.

At the beginning of the system, the initial state is s ₀ , at this time all machines are in idle state, and all workpieces are in the waiting queue Q ₁ of the first process. After the system is running, the machine of the first process selects an action a _k , that is, selects this process to wait for a workpiece in the queue to be processed, and the machines of other processes select action a ₁₃ because the waiting queue is empty. When a machine completes the processing of the workpiece, the system transfers to a new state s _t , the state transition is triggered, the system selects a feasible behavior for each machine, and then when another machine completes the processing, the system transfers to the next state s _{t+ 1} , the agent gets the reward r _t+1 . When the workpiece enters the parallel machining process, the system selects the workpiece from the waiting queue according to the current state, and selects the machine for processing from the idle machine queue of the process. Since each machine simultaneously chooses a behavior to execute at each decision point, the system actually implements a multi-dimensional behavior (a ₁ , a ₂ ,...am ) composed of _m sub-behaviors once in the state. When the system reaches the termination state, it means that each waiting queue is empty, that is, all workpieces are processed, and the system obtains a scheduling plan.

Example

Parameter selection may affect solution quality, and there are general principles to follow. The discount factor γ measures the weight of the subsequent state value to the total return, so the value is generally close to 1, and γ = 0.95; in the ε-greedy strategy, ε should be reduced from large to small, so as to fully explore the strategy space in the initial stage, and at the end stage Using the obtained optimal strategy, so the initial ε=1, and exponential decay with a discount rate of 0.995; set the learning rate α=0.02, the maximum number of interactions MAX_EPISODE=1000; memory D capacity N=6000, sampling batch BATCH_SIZE=256; intelligent The structure of the body convolutional neural network is shown in Figure 3, and the network parameters adopt a random initialization strategy.

(1) Small-scale problems

Small-scale problems take a 10 × 8 × 6 scheduling problem as an example to test the feasibility of the algorithm. The example includes 10 workpieces and 8 machines. Each workpiece needs to go through 6 production processes. There are parallel machines in the third and fifth processes, and each has the same two devices for scheduling. The specific data of this example are shown in Table 5. Among them, the workpiece priority benchmark is 1. In order to test the impact of the set workpiece priority on the scheduling scheme, the priority weight coefficients of Job3, Job5, and Job8 are considered to randomly select different values, which are 1.2, 1.5, and 1.3, respectively, to test the weight. The effect on the scheduling result.

Table 5 Instance data of 10×8×6 scheduling problem

The distribution of machines is {1,2,[3,4],5,[6,7],8}. Table 6 shows the results of solving examples using the algorithm of the present invention and some traditional algorithms, and the better solutions in the table are shown in bold. It can be seen from Table 6 that the algorithm of the present invention can obtain a better solution than the traditional algorithm, and the Gantt chart corresponding to the solution is shown in Figure 4. The red vertical line in the figure represents each decision node of the scheduling system. Compared with the IDE algorithm and the HOMA algorithm, the optimal solution of this algorithm is improved by 4.3% and 3.9% respectively.

Table 6 Comparison of results of small-scale test examples

It can be seen from the figure that Job5, Job8, and Job3 with high workpiece priority are processed first, and workpieces with higher workpiece priority will be processed first. It can be seen that the reward function set above can reflect the objective function.

(2) Large-scale problems

The present invention randomly selects 15 examples from the [OR_Library] example set for experimental testing, and compares them with the Migratory Bird Optimization Algorithm (MBO) and the comparison algorithm, as shown in Table 7, and the better results in the table are represented by bold fonts.

Table 7 Comparison results of large-scale instances

It can be seen from Table 7 that compared with other algorithms, the CTDN algorithm proposed by this algorithm can obtain better solutions, and the solutions of some instances are already lower than the upper bound of the original instance. A deep neural network takes a certain amount of time to train, but the trained network can quickly obtain the optimal behavior in a very short time according to the state value of the input.

FIG. 5 is a Gantt chart corresponding to the optimal strategy obtained by the example tai_20_10_2 under the algorithm of the present invention. The red vertical dotted line in the figure represents the scheduling decision point, that is, the time point when the workpiece completes each process.

Figure 6 is a graph showing the variation of the weighted average completion time with the training progress under the instance tai_20_10_2. As can be seen from the trend in the figure, the scheduling target value gradually decreases with the continuous cycle of the episode. At the beginning, the agent is in a completely unfamiliar environment, and continues to learn trial and error through independent random behavior selection. The goal is to move forward, and within 900 iterations, a better solution can be obtained.

Claims

A hybrid flow shop scheduling method based on time series difference, which is characterized in that: taking minimizing the weighted average completion time as the scheduling goal, combining neural network and reinforcement learning, using the time series difference method to train the model, using the existing scheduling knowledge and empirical rules The candidate behavior of scheduling decision is refined, combined with the online evaluation-execution mechanism of reinforcement learning, so as to select the optimal combined behavior strategy for each scheduling decision of the scheduling system, which includes the following steps:

Step 1: Obtain the production constraints and objective functions according to the production characteristics of the mixed flow workshop, and introduce the machine state characteristics, build the mixed flow workshop scheduling environment, and initialize the settings, initialize the experience memory bank D with a capacity of N, and randomly initialize the state value depth Neural network V(θ) and target network V(θ - ) to realize interaction with the agent, go to step 2;

Step 2: The agent randomly selects a behavior a t with the probability of ε or selects the current optimal behavior a t according to the state value after executing the behavior, and obtains the reward r t+1 and the next state s t+ after executing the optimal behavior 1 , the state characteristics of the current state, the state characteristics of the next state s t+1 , the state characteristics of the next state s t+1, and whether the terminal state is reached are collectively recorded as a single-step state transition (φ t , r t+1 ) , φ t+1 , is_end), store the obtained single-step state transfer in the memory bank D, and store it in the priority queue P according to the calculation ratio of TD-error, and go to step 3;

Step 3: Determine whether the number of single-step state transitions in memory D reaches the set threshold Batch_Size:

If the set threshold Batch_Size is reached, go to step 4;

If the set threshold Batch_Size is not reached, repeat step 2;

Step 4: Randomly extract a certain number of single-step state transitions from memory D, calculate the target value of the current state with the next state and the reward obtained by performing the corresponding behavior, and calculate the mean square error cost between the target value and the network output value , use the mini-batch gradient descent algorithm to update the parameters, and go to step 5;

Step 5: Determine whether the current agent has reached the end state, if so, go to Step 6; if not, repeat Step 2;

Step 6: Determine whether the scheduling system has experienced Max_Episode complete state transition sequence:

If so, go to step 7;

If it is not reached, initialize the scheduling environment, reset the state of the machine and the workpiece, and repeat step 2;

Step 7: Output the behavior strategy combination a 1 , a 2 , . . . corresponding to the optimal state sequence.
The method for scheduling a mixed flow shop based on timing difference according to claim 1, wherein: in the step 1, the machine state characteristics are as follows:

The k-th feature of the i -th machine Mi in the mixed flow workshop is denoted as f i,k , and l represents the total number of processes. For the machines belonging to the first l-1 process, a total of 13 real-valued features f i,k are defined, where 1≤k≤13, a total of 9 real-valued features f i,k are defined for the machine belonging to the lth process, among which 1≤k≤9, the defined state feature set jointly reveals the global and local information of the environment ;

The definition of state characteristics is shown in Table 1:

Table 1 Definition table of machine state characteristics

Here is a unified description of the parameters used in the table: q represents the qth process, m represents the total number of machines, l represents the total number of processes, Q q represents the waiting queue of the qth process, and n represents the qth process. There are n pieces in total The workpiece to be processed, p q represents the average processing time of all workpieces to be processed in the qth process, p q,j represents the processing time of the jth workpiece in the qth process, and J j represents the workpiece in the waiting queue Q q .
The hybrid flow shop scheduling method based on time series difference according to claim 1, characterized in that: in step 2, the agent randomly selects a behavior a t with the probability of ε or selects the current optimum according to the state value after executing the behavior Behavior at , get reward r t+1 and the next state s t +1 after executing the optimal behavior, compare the state characteristics of the current state, execute the behavior to get reward r t+1 and the state of the next state s t+1 Features, and whether it reaches the end state is recorded as a single-step state transition (φ t , r t+1 , φ t+1 , is_end), and the obtained single-step state transition is stored in memory D, and the ratio is calculated according to TD-error Stored in the priority queue P, as follows:

Step 21: Adopt the ε-greedy strategy, by setting a small ε value, use the probability of 1-ε to select the next state obtained by the convolutional neural network according to the state value under the current optional behavior set The behavior corresponding to the maximum value of the state value and the sum of the rewards for performing the behavior
where A(s) is the set of optional behaviors, γ is the attenuation coefficient,
is the reward obtained by the agent performing behavior a, φ i+1 represents the state characteristic of the execution behavior a reaching state θ - , V(φ i+1 ) represents the state value of the next state obtained according to the state value network, and using The probability of ε randomly selects an action from the set of all optional actions;

Step 22: If the scheduling system needs to specify workpiece processing for multiple processes at the current moment, select the behavior for a process according to Step 21
After that, schedule the system's forward-looking execution behavior
Then the scheduling system state transitions to the temporary state
Repeat step 21 to select behaviors for the machine until all selections are completed; then, the behaviors executed by the scheduling system in the current state are multi-dimensional behaviors;

Step 23: After obtaining the multi-dimensional behavior, the scheduling system executes the multi-dimensional behavior, the agent gets the reward r t+1 and the next state s t+1 , stores the single-step state transfer in the memory D, and calculates TD-errorξ i = R t+1 +γV(S t+1 )-V(S t ), where γ is the decay coefficient, R t+1 is the reward within a single-step state transition, and V(S t+1 ) is the next state’s State value, V(S t ) is the current state value, calculate the priority probability p i = |ξ i |+β, and store it in the priority queue P, where β is a small positive number.
The hybrid flow shop scheduling method based on timing difference according to claim 3, is characterized in that: above-mentioned optional behavior set, specific behavior definition is as shown in Table 2:

Table 2 Candidate behavior set for each machine

Since there are parallel machines in some processes in the production process, in order to balance the machine utilization, idle machines are selected according to the principle of minimum machine load in the bottleneck process;

Behavior 14, select the machining workpiece with the shortest total machining time in the parallel machine;

where I is the set of idle machines in the process, J is the set of workpieces processed by machine Mi, and p i ,j is the processing time of workpiece j on machine i.
The hybrid flow shop scheduling method based on time series difference according to claim 1, wherein in step 4, a certain number of single-step state transitions are randomly extracted from D, and the next state and the reward obtained by executing the corresponding behavior are used to Calculate the target value of the current state, calculate the mean squared cost between the target value and the network output value, and use the mini-batch gradient descent algorithm to update the parameters, as follows:

Step 41: Extract a certain number of single-step state transitions from D according to the proportional weight calculated by TD-error, and use the following formula to calculate the current target value y i

where γ is the decay coefficient, r i+1 is the reward of the behavior within the single-step state transition, φ i+1 is the state feature of the next state s t+1 within the single-step state transition, V(φ i+1 ; θ − ) represents the state value of the next state obtained according to the target network;

Step 42: Recalculate the mean squared cost loss between the target value and the network output value

Where h is Batch_Size,

Step 43: Use the mini-batch gradient descent algorithm to update the state value network θ parameters, and replace the target network value every T steps.