CN116562584A - Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization - Google Patents

Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization

Info

Publication number
CN116562584A
Authority
CN
China
Prior art keywords
scheduling
time
network
workpiece
conv
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310600842.XA
Other languages
Chinese (zh)
Inventor
刘海滨
夏铭浩
李明飞
王龙
董浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310600842.XA priority Critical patent/CN116562584A/en
Publication of CN116562584A publication Critical patent/CN116562584A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06316Sequencing of tasks or work
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04Manufacturing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Primary Health Care (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Manufacturing & Machinery (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a dynamic workshop scheduling method based on Conv-Dueling and generalization characterization. The method first adopts a multidimensional matrix to represent equipment and workpiece states, and designs a composite reward function to guide the convergence of the algorithm. A Conv-Dueling network model is provided that takes the multidimensional state matrix as input and the scheduling rule values as output, selecting the optimal scheduling rule at different rescheduling decision points. The network model consists of a feature extraction network, a state (value) network, and an advantage network, and realizes globally optimal scheduling. Verification under both static and dynamic conditions shows that the network model achieves a good optimization effect. The dynamic workshop scheduling method provided by the invention reduces the maximum completion time, improves the on-time completion rate, and reduces the total delay time while ensuring robustness and stability; its comprehensive scheduling performance is superior to that of existing scheduling methods.

Description

Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization
Technical Field
The invention belongs to the field of dynamic production scheduling decisions and is used for material scheduling tasks in flexible job shops; in particular, it relates to a dynamic scheduling method based on deep reinforcement learning with Conv-Dueling and generalization characterization.
Background
Material scheduling technology in the flexible job shop refers to the real-time monitoring, scheduling, and optimization of shop materials using computer technology and artificial intelligence algorithms. Its purpose is to improve the utilization efficiency of workshop materials and to reduce waste and cost in the production process, thereby maximizing production benefit. Material scheduling techniques can be applied in many fields, including manufacturing, logistics, and warehousing. In the manufacturing industry, material scheduling technology can optimize the production flow, improve production efficiency and quality, and reduce cost. However, because the real environment is complex and contains many disturbance factors, most workshop scheduling algorithms solve poorly and struggle to meet production requirements such as high efficiency, on-time completion, and stability. Therefore, developing a dynamic shop material scheduling algorithm with high scheduling performance is an urgent problem to be solved.
In recent years, most solutions to the multi-objective flexible job shop scheduling problem (MFJSP) of a production shop assume a static production environment in which the processing information of equipment and workpieces in the shop is completely known; the multiple disturbance factors present in actual production are not considered, so a fixed scheduling scheme is output and is not changed during the whole production process. However, many dynamic events disturb actual production, such as the insertion of new orders, equipment faults, and changes in workpiece processing time, which are uncertain and unavoidable. When the original static scheduling scheme is executed, these randomly occurring disturbances cause severe deviation from the expected results, greatly reducing the on-time task completion rate and production efficiency. The dynamic multi-objective flexible job shop scheduling problem (DMFJSP) aims to complete all scheduling tasks rapidly, on time, and with low delay; it is oriented toward complex task information constraints and real-time uncertain disturbance scenarios in a manufacturing shop, and researching a dynamic optimal scheduling solution is of great significance to production and processing in modern manufacturing.
In recent years, more and more scholars have turned to task scheduling algorithms based on artificial neural networks, exploiting the advantages of deep reinforcement learning to improve the robustness of material scheduling systems and to complete scheduling tasks efficiently. In order to select the most appropriate scheduling rule at each rescheduling time point, the dynamic multi-objective flexible job shop scheduling problem can be regarded as a Markov decision process. Under the constraints of processing information such as workpieces, equipment, operations, and processing times, together with uncertain disturbance events, the agent should comprehensively use the current production state information and select an optimal scheduling rule. However, few studies have considered the uncertain dynamic disturbance events that occur in real production environments, or the multiple objectives of production scheduling that must be met while handling such disturbances, so that scheduling tasks can be completed efficiently.
Disclosure of Invention
Aiming at the above problems, the invention provides a dynamic workshop scheduling method based on Conv-Dueling and generalization characterization.
The dynamic workshop scheduling method based on Conv-Dueling and generalization characterization disclosed by the invention comprises the following steps:
Step A: determine the scheduling problem of the dynamic flexible job shop.
The invention addresses a dynamic multi-objective flexible job shop scheduling problem comprising multiple dynamic events and multiple objectives. The disturbance events include the insertion of production orders, variations in operation processing times, and equipment failures. The three objectives are minimizing the maximum completion time (makespan), maximizing the workpiece on-time completion rate, and minimizing the workpiece delay time.
First, a logical scheduling formulation of the JSSP is established, in which lowercase letters denote indexes and uppercase letters denote sets. Assume that the flexible shop scheduling problem contains a set of workpieces J = {J_1, J_2, ..., J_j} and a set of machines M = {M_1, M_2, ..., M_m}, where each workpiece J_j comprises one or more processing operations O = {O_1, O_2, ..., O_i}, e.g., turning, milling, planing, welding. Each workpiece must be processed in a fixed operation sequence; each operation can be processed on several machines and has a different processing time P_ji on different machines. Scheduling reasonably assigns all workpieces to the machines for processing, with the objectives of minimizing the maximum completion time, maximizing the workpiece on-time completion rate, and minimizing the total workpiece delay time.
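For reference, the three objectives can be written compactly as follows; the notation (completion time C_j, due date D_j, n workpieces, indicator function 1[·]) is introduced here only for illustration and is not taken verbatim from the patent:

\min \; C_{\max} = \max_{j} C_j \qquad \text{(makespan)}
\max \; \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[C_j \le D_j] \qquad \text{(on-time completion rate)}
\min \; \sum_{j=1}^{n} \max(0,\; C_j - D_j) \qquad \text{(total workpiece delay time)}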
Workpiece insertion refers to workpieces that must be added beyond the initially planned tasks of the shop production schedule, for example because of production shortfalls, new task requirements, and similar situations.
Equipment failure is an unavoidable and randomly occurring disturbance event in actual production. There are several failure types, each with a different repair time.
Processing time variation refers to operations that finish earlier or later than their specified processing time because of factors such as differences in worker proficiency in operating the equipment or equipment problems during production.
Step B: convert the dynamic flexible job shop scheduling problem.
(a) State feature design
In order to fully exploit deep learning to extract features from the raw input, the proposed state space is composed of a multidimensional matrix containing workpiece and equipment state information. The matrix strengthens the mapping between state features and the action space: it completely expresses the information on which a machine acts, and it supports fast training of the neural network and better convergence, so the agent can more easily make optimal action decisions. The multidimensional state matrix uses different scheduling feature information as different channels of an image; each channel has the machine index as its length, the operation sequence as its width, and the number of workpieces as its height. The scheduling features considered include workpiece, operation, machine, processing time, due date, and current time information. Each element is normalized by the overall maximum processing time. If an operation has been assigned to a machine, that machine is in a processing state; the corresponding element stores the remaining processing time of the operation on that machine, and the remaining elements in the row are 0. The rightmost processing-time channel of the image stores a value obtained as a weighted combination of the processing time, the due date, and the current time, expressing the multiple kinds of time information more completely.
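A minimal sketch of how such a multi-channel state tensor might be assembled, assuming three channels (processing times, remaining time of assigned operations, weighted time information) and illustrative weights; the exact channel layout and weights of the patent are not reproduced here:

import numpy as np

def build_state_matrix(proc_time, remaining, assigned, due_dates, current_time,
                       w=(0.5, 0.3, 0.2)):
    # proc_time, remaining, assigned: arrays of shape (n_jobs, n_ops, n_machines)
    # due_dates: array of shape (n_jobs,); current_time: scalar
    # The channel layout and the weights w are illustrative assumptions.
    t_max = max(float(proc_time.max()), 1e-9)              # normalize by overall max processing time
    ch_proc = proc_time / t_max                             # processing-time channel
    ch_busy = np.where(assigned, remaining / t_max, 0.0)    # remaining time of assigned operations
    due = np.broadcast_to(due_dates[:, None, None], proc_time.shape)
    ch_time = (w[0] * proc_time + w[1] * due + w[2] * current_time) / t_max
    return np.stack([ch_proc, ch_busy, ch_time], axis=0)    # (channels, jobs, ops, machines)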
(b) Action set design
Nine well-performing scheduling rules are designed by comprehensively considering information factors such as processing time, workpiece completion rate, waiting time, due date, arrival time, and idle time (an illustrative sketch of such a rule-based action set is given below).
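The nine rules themselves are listed in Table 1 of the patent and are not reproduced in this text; the snippet below only sketches how a rule-based action set of this kind is commonly encoded, using classic dispatching rules (SPT, LPT, EDD, FIFO, minimum slack) as hypothetical stand-ins rather than the patent's actual nine rules:

from dataclasses import dataclass

@dataclass
class Job:                      # minimal illustrative job record (assumed fields)
    next_op_time: float         # processing time of the next operation
    remaining_time: float       # total remaining processing time
    due_date: float
    arrival_time: float

# Classic dispatching rules used here as stand-ins for the patent's nine rules.
def spt(jobs, now):   return min(jobs, key=lambda j: j.next_op_time)      # shortest processing time
def lpt(jobs, now):   return max(jobs, key=lambda j: j.next_op_time)      # longest processing time
def edd(jobs, now):   return min(jobs, key=lambda j: j.due_date)          # earliest due date
def fifo(jobs, now):  return min(jobs, key=lambda j: j.arrival_time)      # first in, first out
def mslack(jobs, now):                                                    # minimum slack
    return min(jobs, key=lambda j: j.due_date - now - j.remaining_time)

ACTIONS = [spt, lpt, edd, fifo, mslack]   # the patent's action set contains nine such rules
# The agent's discrete action a_t indexes one rule, which then picks the
# waiting operation to dispatch at the current rescheduling point.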
(c) Reward function design
The invention aims to jointly consider minimizing the maximum completion time, minimizing the delay time, and maximizing the on-time completion rate. A composite reward combining a main-line task and branch-line tasks is therefore adopted: the branch-line reward is designed to guide the agent to learn toward the optimal action, while the main-line reward gives positive or negative feedback for success or failure when one training episode is completed, alleviating both the difficulty of convergence with sparse rewards and the tendency of dense rewards to cause local optima. The main-line reward function is given in equation (1).
where R and R_b are reward values set after repeated experiments, c_r is the workpiece on-time completion rate, d_r is the workpiece failure rate, j_t is the current processing time step, Max_t is the processing time step threshold, and r is the target completion rate index.
The branch-line reward is shown in formula (2):
reward2 = -(j_l / m_s) * μ    (2)
where j_l is the number of unfinished workpiece tasks, m_s is the total number of machines, and μ is a weight coefficient.
The total reward is shown in formula (3), where α is a weight coefficient:
reward = reward1 + α * reward2    (3)
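A minimal sketch of the composite reward under stated assumptions: the branch-line reward and the total reward follow equations (2) and (3) literally, while the main-line term reward1 is only stubbed out with an assumed threshold form, since equation (1) is not reproduced in the text; all constants are illustrative:

def branch_reward(unfinished_jobs: int, total_machines: int, mu: float = 1.0) -> float:
    # Branch-line reward, equation (2): reward2 = -(j_l / m_s) * mu
    return -(unfinished_jobs / total_machines) * mu

def main_reward(on_time_rate: float, r_target: float = 0.8,
                R: float = 10.0, R_b: float = -10.0) -> float:
    # Stand-in for equation (1), whose exact form is not given in the text:
    # positive feedback R when the target on-time rate is reached at the end
    # of an episode, negative feedback R_b otherwise (assumed shape).
    return R if on_time_rate >= r_target else R_b

def total_reward(reward1: float, reward2: float, alpha: float = 0.5) -> float:
    # Composite reward, equation (3): reward = reward1 + alpha * reward2
    return reward1 + alpha * reward2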
Step C: the Conv-lasting scheduling algorithm optimally solves the scheduling problem of the large-scale flexible job shop.
In the training phase, the Conv-Dueling network adopts a deep convolutional neural network architecture. Specifically, the Conv-Dueling network takes the multidimensional state matrix containing workpiece and equipment processing information as input and the predicted Q value of each scheduling action as output; it obtains reward feedback, performs continuous trial-and-error learning through interaction with the environment, and finally obtains a globally better solution while maximizing the accumulated reward value.
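Before the step-by-step procedure, a minimal PyTorch sketch of a Conv-Dueling head of the kind described above may be helpful: a feature extraction network followed by separate state-value and advantage streams, combined as Q = V + A - mean(A). The layer sizes, the assumption that the multi-channel state is presented as a 2-D image per channel, and the nine output actions are illustrative choices, not the patent's exact architecture:

import torch
import torch.nn as nn

class ConvDueling(nn.Module):
    # Feature-extraction CNN plus separate value and advantage streams (dueling aggregation).
    def __init__(self, in_channels: int = 3, n_actions: int = 9):
        super().__init__()
        self.features = nn.Sequential(                         # feature extraction network
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        self.value = nn.Sequential(                            # state (value) network
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(                        # advantage network
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)             # predicted Q value per scheduling rule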
step 1: initializing the memory pool capacity as D, batch mini_batch, action cost function q and target cost functionInitializing the parameters of the target network and the estimated network, wherein the learning rate is alpha, the discount rate is gamma.
Step 2: resetting the scheduling context at the beginning of each round to obtain an initial state S 0
Step 3: at time t<At any time of T, the agent selects an action a from the action space according to the observed state t Execution is performed wherein T is equal to the total process time step. The action selection is based on the proposed epsilon-decrementing strategy.
Step 4: after the action is executed, the action with the highest priority in the equipment processing workpiece list is scheduled preferentially, and then the instant rewards r are observed t And next state s t+1
Step 5: data (S) t ,a t ,r t ,S t+1 ) The memory is stored in the memory pool D, the experience memory amount is detected, and if the maximum memory amount of the experience pool is exceeded, new experience is learned instead of old experience. The conversion for a given sample performs a loss calculation from the q value and the target value.
Step 6: network parameters are converted by all samplesIs updated by changing the cumulative weight value of (c). To ensure stable convergence of the training process, the weights of the target q network are replaced by the weights of the target q network periodicallyThe weight of the network.
Step 7: and (3) judging whether all work procedures of the case are scheduled, if yes, entering the next round, and if no, continuing to execute the step (3).
Step 8: and judging whether the round is ended, if so, outputting a better scheduling model, and if not, continuing to execute the step 2.
The beneficial technical effects of the invention are as follows:
(1) The method designs a multidimensional matrix containing workpiece and equipment state information as the state representation, which completely expresses the information on which a machine acts and supports fast training of the neural network with better convergence, so the agent can more easily make optimal action decisions; meanwhile, the method designs the reward function to accelerate algorithm convergence.
(2) The method designs a Conv-Dueling network model that takes the multidimensional state matrix as input and the scheduling rule values as output, and selects the optimal scheduling rule at different rescheduling decision points based on those values. The network model consists of a feature extraction network, a state network, and an advantage network, realizing globally optimal scheduling. The network model achieves good optimization results in both static and dynamic cases.
Drawings
FIG. 1 is a general scheduling flow chart of an implementation of the present invention.
FIG. 2 is a diagram showing the whole training process of Conv-Dueling network model in the implementation of the present invention.
FIG. 3 is a diagram of the state S_0 before a scheduling state transition in accordance with an embodiment of the present invention.
FIG. 4 is a diagram of the state S_1 after scheduling in accordance with an embodiment of the present invention.
FIG. 5 is a graph of the on-time completion rate of a work piece for a scheduling agent learning process.
FIG. 6 is a diagram of the reward and penalty records during the scheduling agent's learning process.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples.
This section innovatively proposes a dynamic scheduling method based on deep-reinforcement-learning Conv-Dueling and generalization characterization, providing a new and effective approach for solving the problem. First, the scheduling problem of the dynamic flexible job shop is modeled digitally, and the state features, action space, and reward function are defined. Second, the network model is trained without supervision using the D3QN algorithm; finally, according to the multi-constraint and multi-disturbance production environment information, the multidimensional workpiece state representation matrix is taken as input, features are extracted and decisions are made by the Conv-Dueling network model, and the optimal scheduling rule is output.
The invention relates to a dynamic scheduling method based on Conv-Dueling and generalization characterization of deep reinforcement learning, which comprises the following steps:
Step A: determine the scheduling problem of the dynamic flexible job shop.
First, a logical scheduling formulation of the JSSP is established, in which lowercase letters denote indexes and uppercase letters denote sets. Assume that the flexible shop scheduling problem contains a set of workpieces J = {J_1, J_2, ..., J_j} and a set of machines M = {M_1, M_2, ..., M_m}, where each workpiece J_j comprises one or more processing operations O = {O_1, O_2, ..., O_i}, e.g., turning, milling, planing, welding. Each workpiece must be processed in a fixed operation sequence; each operation can be processed on several machines and has a different processing time P_ji on different machines. Scheduling reasonably assigns all workpieces to the machines for processing, with the objectives of minimizing the maximum completion time, maximizing the workpiece on-time completion rate, and minimizing the total workpiece delay time.
This research aims to handle unpredictable dynamic disturbance events, such as order insertion, machine failure, and processing time variation, under the multi-constraint relations between processing equipment and workpieces during production scheduling, assigning each operation O_{i,j} of the workpieces to a suitable machine M_m at the appropriate time, so that all scheduling tasks are completed efficiently and superior overall performance is obtained in terms of time, resource utilization, and so on. The overall flow of the production scheduling process is shown in fig. 1. The dynamic disturbance events shown in fig. 1 are handled as follows.
Workpiece insertion refers to workpieces that must be added beyond the initially planned tasks of the shop production schedule, for example because of production shortfalls, new task requirements, and similar situations.
Equipment failure is an unavoidable and randomly occurring disturbance event in actual production. There are several failure types, each with a different repair time.
Processing time variation refers to operations that finish earlier or later than their specified processing time because of factors such as differences in worker proficiency in operating the equipment or equipment problems during production.
And (B) step (B): conversion of scheduling problems
a) State feature design
In order to fully utilize deep learning to extract features from the raw input, the state space proposed by the invention is composed of a multidimensional matrix containing workpiece and equipment state information. The matrix strengthens the mapping between state features and the action space: it completely expresses the information on which a machine acts, and it supports fast training of the neural network and better convergence, so the agent can more easily make optimal action decisions. The multidimensional state matrix uses different scheduling feature information as different channels of an image; each channel has the machine index as its length, the operation sequence as its width, and the number of workpieces as its height. The scheduling features considered include workpiece, operation, machine, processing time, due date, and current time information. Each element is normalized by the overall maximum processing time. If an operation has been assigned to a machine, that machine is in a processing state; the corresponding element stores the remaining processing time of the operation on that machine, and the remaining elements in the row are 0. As shown in fig. 3 and fig. 4, the rightmost processing-time channel of the image stores a value obtained as a weighted combination of the processing time, the due date, and the current time, expressing the multiple kinds of time information more completely.
b) Action space design
In order to overcome the limitation that a single scheduling rule is not suitable for diverse scheduling scenarios, an appropriate scheduling rule is selected through deep reinforcement learning according to the current environment state. Too few scheduling rules make it hard for the agent to achieve globally optimal scheduling in complex and varied environments, while too many scheduling rules make the agent spend excessive time in learning and fail to meet the goal of real-time, efficient scheduling. Therefore, nine well-performing scheduling rules are designed by comprehensively considering information factors such as processing time, workpiece completion rate, waiting time, due date, arrival time, and idle time, as shown in Table 1.
Table 1 action set table.
c) Reward function design
The invention aims to jointly consider minimizing the maximum completion time, minimizing the delay time, and maximizing the on-time completion rate. A composite reward combining a main-line task and branch-line tasks is therefore adopted: the branch-line reward is designed to guide the agent to learn toward the optimal action, while the main-line reward gives positive or negative feedback for success or failure when one training episode is completed, alleviating both the difficulty of convergence with sparse rewards and the tendency of dense rewards to cause local optima. The main-line reward function is given in equation (4).
where R and R_b are reward values set after repeated experiments, c_r is the workpiece on-time completion rate, d_r is the workpiece failure rate, j_t is the current processing time step, Max_t is the processing time step threshold, and r is the target completion rate index.
The branch-line reward is shown in formula (5):
reward2 = -(j_l / m_s) * μ    (5)
where j_l is the number of unfinished workpiece tasks, m_s is the total number of machines, and μ is a weight coefficient.
The total reward is shown in equation (6), where α is a weight coefficient:
reward = reward1 + α * reward2    (6)
Step C: optimized solution of the large-scale flexible job shop scheduling problem with the Conv-Dueling (D3QN) scheduling algorithm
The scheduling agent selects an appropriate scheduling rule according to the workshop environment state, and sorts and assigns the workpieces to the processing equipment. When the environment state of the workshop changes, a corresponding reward value is given according to the reward function; a high reward value means the chosen scheduling rule is efficient in that situation, and a negative reward value means it is not. Through continuous trial-and-error learning and interaction with the environment, the scheduling agent obtains a globally better solution while obtaining the maximum cumulative reward value.
In the training phase, the Conv-Dueling network adopts a deep convolutional neural network architecture. Specifically, the Conv-Dueling network takes the multidimensional state matrix containing workpiece and equipment processing information as input and the predicted Q value of each scheduling action as output. The specific steps are as follows:
step 1: initializing the memory pool capacity as D, batch mini_batch, action cost function q and target cost functionInitializing the parameters of the target network and the estimated network, wherein the learning rate is alpha, the discount rate is gamma.
Step 2: resetting the scheduling context at the beginning of each round to obtain an initial state S 0
Step 3: at time t<At any time of T, the agent selects an action a from the action space according to the observed state t Execution is performed wherein T is equal to the total process time step. The action selection is based on the proposed epsilon-decrementing strategy.
Step 4: after the action is executed, the action with the highest priority in the equipment processing workpiece list is scheduled preferentially, and then the instant rewards r are observed t And next state s t+1
Step 5: data (S) t ,a t ,r t ,S t+1 ) The memory is stored in the memory pool D, the experience memory amount is detected, and if the maximum memory amount of the experience pool is exceeded, new experience is learned instead of old experience. The conversion for a given sample performs a loss calculation from the q value and the target value.
Step 6: the network parameters are updated by accumulated weight value changes over all sample transitions. To ensure stable convergence of the training process, the weights of the target q network are replaced by the weights of the target q network periodicallyThe weight of the network.
Step 7: and (3) judging whether all work procedures of the case are scheduled, if yes, entering the next round, and if no, continuing to execute the step (3).
Step 8: and judging whether the round is ended, if so, outputting a better scheduling model, and if not, continuing to execute the step 2.
(1) Design of experiment
The data instances used in this test were randomly generated: the initial number of workpieces is 20, and the number of operations per workpiece is random. The number of machines is 10 or 20, and the processing capabilities of the machines are random. The number of dynamic disturbance events is 30, 50, or 80, and their types are random. As in the training phase, the relevant parameters of the workpieces and machines are random. The correct choice of hyper-parameters strongly influences the agent's learning ability and the algorithm's performance, but the hyper-parameters have wide ranges and are difficult to tune; the invention sets the relevant parameters according to general principles, as shown in Table 2. In total there are 30 test case combinations and 50 runs. The test code was written in the Python programming language and run with Python 3.8.12.
Table 2 training hyper-parameters
(2) Analysis of experimental results
The reward convergence of D3QN over the first 2000 training episodes is shown in fig. 5; as training proceeds, the rewards of all three algorithms converge to their maxima. The oscillation after convergence is mainly caused by the small probability of random action selection. Among the three algorithms, DQN performs worst, with the lowest learning efficiency and slow convergence. DDQN is clearly improved and converges faster, but is still inferior to the D3QN algorithm. D3QN has the best convergence and stability: the dueling network structure, the deep double-Q network, and the multidimensional state space with the convolutional neural network mitigate the effect of overestimating action values. The convergence of the workpiece on-time completion rate for the different algorithms is shown in fig. 6; all three deep reinforcement learning algorithms converge, at 0.85, 0.8, and 0.7 respectively. D3QN converges more stably to a higher value, while the other two algorithms fail to learn better scheduling rules at the rescheduling points and therefore cannot obtain a higher on-time completion rate. Both figures show that, under multiple constraints and multiple disturbances, the dynamic scheduling method based on deep-reinforcement-learning Conv-Dueling and generalization characterization proposed by the invention learns more efficient scheduling rules.
To verify that the proposed D3QN-based scheduling algorithm outperforms the DDQN-based scheduling algorithm, the following experiment was designed. Under different experimental parameter settings of m, n_add, and E_ave, multiple groups of experimental data were randomly generated to simulate different task scheduling situations in the production process; on each group of experimental data, the proposed scheduling algorithm and each scheduling rule were each repeated 50 times. The means and standard deviations of the total completion time, task completion rate, and total delay time obtained by each method are shown in Table 3, where the best results are indicated in bold. To ensure fairness between the algorithms, DDQN uses the action set and reward function of the proposed scheduling algorithm, and the state features of DDQN are divided into 9 discrete states using a neural network with a self-organizing map (SOM) layer. The table shows that the optimal scheduling rules selected at the rescheduling points by the D3QN-based scheduling algorithm have better expectation and robustness in achieving the multiple objectives than the DDQN-based scheduling algorithm.
TABLE 3 average and standard deviation values of results of the scheduling algorithm and DDQN scheduling algorithm after 50 runs

Claims (4)

1. The dynamic workshop scheduling method based on Conv-Dueling and generalization characterization is characterized by comprising the following steps:
step A, determining the flexible job shop scheduling problem;
firstly, establishing a logical scheduling formulation of the JSSP, wherein lowercase letters represent indexes and uppercase letters represent sets; the flexible shop scheduling problem contains a set of workpieces J = {J_1, J_2, ..., J_j} and a set of machines M = {M_1, M_2, ..., M_m}, wherein each workpiece J_j comprises one or more processing operations O = {O_1, O_2, ..., O_i}; each workpiece is processed in a fixed operation sequence, each operation can be processed on several machines and has a different processing time P_ji on different machines; scheduling reasonably assigns all workpieces to the machines for processing, with the objectives of minimizing the maximum completion time, maximizing the workpiece on-time completion rate, and minimizing the total workpiece delay time;
step B: converting the scheduling problem of the dynamic flexible job shop;
(a) Designing state characteristics;
the state space is formed by a multidimensional matrix containing workpiece and equipment state information; the multidimensional state matrix takes different scheduling feature information as different channels of an image, wherein each channel has the machine index as its length, the operation sequence as its width, and the number of workpieces as its height; the scheduling feature information considered comprises the workpiece, operation, machine, processing time, due date, and current time; each element is normalized by the overall maximum processing time; if an operation has been assigned to a machine, that machine is in a processing state; the rightmost processing-time channel of the image represents a value obtained as a weighted combination of the processing time, the due date, and the current time, expressing the multiple kinds of time information completely;
(b) Designing an action set;
selecting nine scheduling rules by comprehensively considering the processing time, workpiece completion rate, waiting time, due date, arrival time, and idle time information factors, as shown in Table 1;
TABLE 1 action set table
(c) Designing a reward function;
adopting a composite reward combining a main-line task and branch-line tasks, designing the branch-line reward to guide the agent to learn toward the optimal action, and having the main-line reward give positive or negative feedback for success or failure when one training episode is completed;
step C: the Conv-Dueling scheduling algorithm optimally solves the scheduling problem of the large-scale flexible job shop; in the training phase, the Conv-Dueling network adopts a deep convolutional neural network architecture.
2. The dynamic workshop scheduling method based on Conv-Dueling and generalization characterization according to claim 1, characterized in that the main-line reward function is as in equation (1);
where R and R_b are reward values set after repeated experiments, c_r is the workpiece on-time completion rate, d_r is the workpiece failure rate, j_t is the current processing time step, Max_t is the processing time step threshold, and r is the target completion rate index;
the branch-line reward is shown in formula (2):
reward2 = -(j_l / m_s) * μ    (2)
where j_l is the number of unfinished workpiece tasks, m_s is the total number of machines, and μ is a weight coefficient;
the total reward is represented by equation (3), wherein α is a weight coefficient:
reward = reward1 + α * reward2    (3).
3. The dynamic workshop scheduling method based on Conv-Dueling and generalization characterization according to claim 1, wherein the Conv-Dueling network takes the multidimensional state matrix containing workpiece and equipment processing information as input and the predicted Q value of scheduling actions as output, obtains reward feedback, continuously performs trial-and-error learning through interaction with the environment, and finally obtains a globally better solution while maximizing the accumulated reward value.
4. The dynamic workshop scheduling method based on Conv-Dueling and generalization characterization according to claim 1, wherein the Conv-Dueling scheduling algorithm solves the dynamic flexible workshop scheduling problem as follows:
step 1: initializing the memory pool capacity as D, batch mini_batch, action cost function q and target cost functionInitializing parameters of a target network and an estimated network, wherein the learning rate is alpha, the discount rate is gamma;
step 2: resetting the scheduling context at the beginning of each round to obtain an initial state S 0
Step 3: at time t<At any time of T, the agent selects an action a from the action space according to the observed state t Performing, wherein T is equal to the total processing time step; action selection is based on the proposed epsilon-decremental strategy;
step 4: after the action is executed, the action with the highest priority in the equipment processing workpiece list is scheduled preferentially, and then the instant rewards r are observed t And next state s t+1
Step 5: data (S) t ,a t ,r t ,S t+1 ) The experience memory is stored in a memory pool D, the experience memory is detected, and if the maximum memory of the experience pool is exceeded, new experience is learned instead of old experience; performing loss calculation from the q value and the target value by conversion of given samples;
step 6: the network parameters are updated by the accumulated weight value change on all sampling conversions; to ensure stable convergence of the training process, the weights of the target q network are replaced by the weights of the target q network periodicallyA weight of the network;
step 7: judging whether all work procedures of the case are scheduled to be completed, if yes, entering the next round, and if no, continuing to execute the step 3;
step 8: and judging whether the round is ended, if so, outputting the scheduling model, and if not, continuing to execute the step 2.
CN202310600842.XA 2023-05-25 2023-05-25 Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization Pending CN116562584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310600842.XA CN116562584A (en) Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310600842.XA CN116562584A (en) Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization

Publications (1)

Publication Number Publication Date
CN116562584A 2023-08-08

Family

ID=87487858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310600842.XA CN116562584A (en) Dynamic workshop scheduling method based on Conv-Dueling and generalization characterization

Country Status (1)

Country Link
CN (1) CN116562584A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993135A (en) * 2023-09-27 2023-11-03 中南大学 Multi-stage sequencing and reservation scheduling method and device based on waiting time constraint
CN116993135B (en) * 2023-09-27 2024-02-02 中南大学 Multi-stage sequencing and reservation scheduling method and device based on waiting time constraint

Similar Documents

Publication Publication Date Title
CN112734172B (en) Hybrid flow shop scheduling method based on time sequence difference
Leon et al. Strength and adaptability of problem-space based neighborhoods for resource-constrained scheduling
CN108694502B (en) Self-adaptive scheduling method for robot manufacturing unit based on XGboost algorithm
CN110222938B (en) Short-term peak-load regulation scheduling collaborative optimization method and system for cascade hydropower station group
CN116562584A (en) Dynamic workshop scheduling method based on Conv-lasting and generalization characterization
CN111160755B (en) Real-time scheduling method for aircraft overhaul workshop based on DQN
CN101901426A (en) Dynamic rolling scheduling method based on ant colony algorithm
CN112836974B (en) Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS
CN110458326B (en) Mixed group intelligent optimization method for distributed blocking type pipeline scheduling
CN116500986A (en) Method and system for generating priority scheduling rule of distributed job shop
CN105373845A (en) Hybrid intelligent scheduling optimization method of manufacturing enterprise workshop
CN116700176A (en) Distributed blocking flow shop scheduling optimization system based on reinforcement learning
CN109034540B (en) Machine tool sequence arrangement dynamic prediction method based on work-in-process flow
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
WO2024113585A1 (en) Intelligent interactive decision-making method for discrete manufacturing system
CN112488543A (en) Intelligent work site shift arrangement method and system based on machine learning
CN117117850A (en) Short-term electricity load prediction method and system
CN116720703A (en) AGV multi-target task scheduling method and system based on deep reinforcement learning
CN114219274A (en) Workshop scheduling method adapting to machine state based on deep reinforcement learning
Chiu et al. A GA embedded dynamic search algorithm over a Petri net model for an fms scheduling
CN115933568A (en) Multi-target distributed hybrid flow shop scheduling method
CN116011723A (en) Intelligent dispatching method and application of coking and coking mixed flow shop based on Harris eagle algorithm
CN114912826B (en) Flexible job shop scheduling method based on multilayer deep reinforcement learning
CN114545884B (en) Equivalent parallel machine dynamic intelligent scheduling method based on enhanced topological neural evolution
CN114399152B (en) Method and device for optimizing comprehensive energy scheduling of industrial park

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination