CN117519030B - Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning

Info

Publication number: CN117519030B (application number CN202311565429.0A; other version CN117519030A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: product, workpiece, factory, workpieces, key
Legal status: Active (granted)
Inventors: 张梓琪, 李瑛, 钱斌, 胡蓉
Assignee (current and original): Kunming University of Science and Technology

Classifications

    • G05B19/41885: Total factory control, i.e. centrally controlling a plurality of machines, characterised by modeling or simulation of the manufacturing system
    • G05B2219/32339: Object oriented modeling, design, analysis, implementation, simulation language


Abstract

The invention discloses a distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning. On the basis of the distributed assembly flow shop scheduling problem, a mathematical model of the DABFSP that takes machine blocking into account is established and a scheduling optimization algorithm is designed; a Q-learning-based hyper-heuristic evolutionary algorithm (QLHHEA) is applied to solve the DABFSP model. Each low-level heuristic (LLH) is defined as a state and each transition between states is defined as an action; in global exploration, a Q-learning-based high-level strategy automatically selects an appropriate action in a given state and effectively drives the search direction, while in local exploitation, problem-specific LLHs effectively enrich the search behavior. An insertion-based acceleration strategy effectively saves computational cost and improves search efficiency, and the quality of the initial solution is improved. The invention aims to determine, for the DABFSP, the product assignment, the processing order of workpieces and the assembly order of products in each factory, so as to minimize the maximum completion time over all factories.

Description

Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning
Technical Field
The invention discloses a distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning, and belongs to the technical field of production scheduling.
Background
Scheduling aims at reasonably allocating available resources among activities, balancing multiple conflicting objectives, and meeting the requirements of different decision makers within a specific time frame; it is an important component of modern supply chains and manufacturing systems. In recent years, with the rapid development of intelligent manufacturing and Industry 4.0, together with the trend of globalization, enterprises face severe and complex challenges such as dynamic demand, economic efficiency and changing production modes. To remain flexible, adaptable and competitive, manufacturing enterprises have adjusted their manufacturing modes and now acquire and manage manufacturing centers or factories across multiple regions. The traditional centralized single-factory production model has therefore shifted toward an emerging distributed assembly production model, which inevitably requires three important issues to be addressed: how to distribute workpieces to factories, how to arrange the processing of workpieces within each factory, and how to arrange the assembly of products. Developing effective and efficient algorithms and emerging technologies to solve distributed shop scheduling problems (Distributed shop scheduling problems, DSSPs) is therefore of significant interest.
In previous DSSP studies, the distributed assembly permutation flow shop problem (Distributed assembly permutation FSP, DAPFSP) has become a hot research topic. However, many existing DAPFSP studies assume that the buffer capacity between successive processing machines in a flow shop is infinite, meaning that workpieces can be temporarily stored in intermediate buffers. In practice, due to process characteristics or technical specifications, buffers may be unavailable or not allowed between adjacent machines, so a workpiece that has finished processing on a machine must remain on that machine and cannot be released until the next machine becomes idle; this condition is known as blocking. It occurs frequently in actual production because of limited buffer capacity or storage equipment. Blocking increases the completion time of workpieces, so reasonably reducing blocking time is important for improving machine productivity. The distributed assembly blocking flow shop scheduling problem is an important, more realistic class of discrete optimization problems. Therefore, research on the distributed assembly blocking flow shop scheduling problem (DABFSP), which takes the blocking constraint into account, has practical significance and promising application prospects.
The DABFSP can be decomposed into two strongly coupled sub-problems, a workpiece processing sub-problem and a product assembly sub-problem, from which four closely coupled sub-decisions are derived: distributing workpieces to factories, adjusting the processing sequence of workpieces in each factory, arranging products on the assembly machine, and meanwhile reducing the blocking time between machines. In general, large-scale DABFSP instances are difficult to solve by exact mathematical methods such as branch-and-bound and column generation because of their computational complexity. Constructive heuristics can usually build a feasible solution from problem-based rules and constraints and quickly provide a reasonable scheduling plan, but the quality of the resulting solution is hard to guarantee.
In the scheduling field, hybrid intelligent optimization algorithms (Hybrid intelligent optimization algorithms, HIOAs) have become the dominant approach to solving such problems. HIOAs typically use efficient evolutionary mechanisms, specific search strategies, and effective neighborhood operators to produce satisfactory scheduling schemes within a limited time, and have proved effective for strongly coupled, complex scheduling problems. Hyper-heuristic algorithms (Hyper-heuristic algorithms, HHAs) are a class of HIOAs that has attracted wide attention. An HHA typically consists of a high-level strategy and a set of low-level heuristics: instead of searching directly in the solution space of the problem, it mainly applies the high-level strategy (HLS) to manage or manipulate a series of pre-designed low-level heuristics (LLHs), determines a good ordering of LLHs in the strategy space or heuristic search space, and then executes the selected LLHs to search the solution space for better solutions. Because of their ability to automatically select, integrate and exploit simple and efficient heuristics, HHAs are widely applied to combinatorial optimization problems. The search behavior of HHAs can generally be divided into two categories, heuristic generation and heuristic selection: the former uses an appropriate high-level strategy to construct heuristics, while the latter uses an appropriate selection strategy to choose LLHs and evaluate their effectiveness. According to the literature, most studies use evolutionary-algorithm-based or artificial-intelligence-based methods as the high-level selection strategy. Reinforcement learning (Reinforcement learning, RL) integrates perception, adaptive learning and autonomous decision making through a goal-oriented learning mechanism and has strong learning ability. RL aims to accumulate experience and perform optimal search behavior through dynamic interaction between an agent and its environment. In general, RL has two key features: trial-and-error search and delayed rewards. In the RL framework, an agent with a well-defined goal (i.e., maximizing the cumulative reward) perceives the possible states of the environment and then takes actions to change them. Q learning is a relatively successful learning strategy that allows an agent to determine the best search behavior by learning an action-value function that expresses the expected utility of applying an appropriate action in a given state. A Q-learning-based hyper-heuristic can therefore provide appropriate strategies to select suitable search behaviors and guide the search toward promising areas. These advantages inspire the design of the Q-learning-based hyper-heuristic evolutionary algorithm (QLHHEA) and its use for solving the DABFSP.
Disclosure of Invention
The invention aims to provide a distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning, which determines, for the DABFSP, the product assignment, the processing sequence of workpieces and the assembly sequence of products in each factory, so that the maximum completion time over all factories is minimized. Twelve effective low-level heuristics are designed and all defined as states, transitions between states are defined as actions, and a Q-learning-based evolution framework is used as the HLS to regulate the LLHs that search the solution space. Two insertion-based acceleration strategies are formulated to reduce the time complexity of evaluating solutions, save computational cost and speed up the search. In addition, a construction heuristic based on problem characteristics is developed to produce high-quality initial solutions.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
A distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning specifically comprises the following steps:
Step 1: The population and the Q table are initialized: the individuals of the low-level population are generated by the construction heuristic, the individuals of the high-level population are generated randomly, and the two populations have the same scale (size popsize); the related parameters are set; the q-values of all state-action pairs are set to zero.
Step 2: Each individual is decoded by a forward or backward calculation method and the global optimal solution π_best is obtained; the two insertion-based acceleration strategies are executed when evaluating solutions so as to save the computational cost of solution evaluation.
Step 3: The LLHs in each high-level individual (composed of the 12 efficient heuristics), where each LLH is a state and the conversion from one LLH to another is defined as an action, are executed in turn on the feasible scheduling solution of the corresponding low-level individual; if the fitness of the new solution is better, the old solution is replaced by the new one and the global optimal solution is updated. The contribution rate (CR) of each high-level individual is calculated, the high-level individuals with the highest contribution rates are selected according to the given proportion, and the Q table is updated for them by the update mechanism; meanwhile, count = 0 is set.
Step 4: The updated Q table is sampled to generate new high-level individuals, i.e., the Q-learning-based high-level strategy manipulates the low-level heuristics to search the solution space.
Step 4.1: The action selection policy is used to select the state s_t and to obtain the action a_t and the next state s_t+1.
Step 4.2: The state s_t+1 is applied to π_best to yield π'_best; the fitness values C(π_best) and C(π'_best) and the improvement rate IR are calculated, the reward r(s_t, a_t) is obtained, and the q-value Q_t+1(s_t, a_t), the probability ε_t and π_best are updated.
Step 4.3: If C(π'_best) < C(π_best), the global optimal solution π_best is updated to π'_best; otherwise, jump to step 4.1.
Step 4.4: If count = popsize, go to step 3; otherwise go to step 4.
Step 5: Check whether the stopping condition is met; if not, jump to step 4, otherwise output π_best.
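To make Steps 1-5 concrete, the following is a minimal, hedged sketch of the overall flow in Python; the arguments `evaluate`, `construct_initial_solution` and `llh_pool` are illustrative placeholders for the decoder, the construction heuristic of steps 1.1-1.5 and the 12 LLHs, and the loop structure is a simplification of the patented procedure rather than a faithful implementation.

```python
import random

def qlhhea(evaluate, construct_initial_solution, llh_pool, popsize=30,
           max_iters=200, epsilon=0.15, lr=0.5, gamma=0.7):
    """Illustrative skeleton of Steps 1-5; not the patent's exact procedure."""
    n_states = len(llh_pool)
    # Step 1: low-level population from the construction heuristic,
    # high-level population as random LLH index sequences, Q table of zeros.
    low_pop = [construct_initial_solution() for _ in range(popsize)]
    high_pop = [[random.randrange(n_states) for _ in range(n_states)]
                for _ in range(popsize)]
    q = [[0.0] * n_states for _ in range(n_states)]

    # Step 2: decode/evaluate and record the global best solution pi_best.
    pi_best = min(low_pop, key=evaluate)

    for _ in range(max_iters):                          # Step 5: stopping condition
        # Step 3: run each high-level individual (LLH sequence) on a low-level solution.
        for hi, lo in zip(high_pop, low_pop):
            sol = lo
            for state in hi:
                cand = llh_pool[state](sol)
                if evaluate(cand) < evaluate(sol):      # keep improving moves only
                    sol = cand
            if evaluate(sol) < evaluate(pi_best):
                pi_best = sol

        # Step 4: Q-learning high-level strategy drives further LLH selection.
        s = random.randrange(n_states)
        for _ in range(popsize):
            if random.random() < epsilon:               # epsilon-greedy (Step 4.1)
                a = random.randrange(n_states)
            else:
                a = max(range(n_states), key=lambda j: q[s][j])
            cand = llh_pool[a](pi_best)                 # apply the chosen LLH (Step 4.2)
            c_old, c_new = evaluate(pi_best), evaluate(cand)
            reward = max(0.0, (c_old - c_new) / c_old)  # improvement rate IR
            q[s][a] += lr * (reward + gamma * max(q[a]) - q[s][a])
            if c_new < c_old:                           # Step 4.3
                pi_best = cand
            s = a
    return pi_best
```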
Preferably, the individuals in the lower population in the step (1) of the present invention are generated by adopting a construction heuristic method, which specifically comprises the following steps:
Step 1.1: for product P h (h=1, 2,.,. S.) the processing time I (h, I) for each workpiece on all machines was calculated by the following formula:
Where i is the work index, i=1, 2,..n l; j is the index of the machine and, j=1, 2, m; h is the product index, h=1, 2,..s; m is the number of machines in the production stage; i (h, I) is the processing time of each workpiece on all machines; p h,i,j is the processing time of the ith workpiece belonging to product P h on machine M j.
The values I(h, i) are arranged in ascending order to obtain an initial workpiece sequence λ_h formed by the ω_h workpieces of product P_h; if two workpieces have the same I(h, i), the one with the smaller p_{h,i,1} is placed first. The quality of the workpiece order is then improved by the NEH heuristic, and the earliest completion time e_h of product P_h is determined.
Step 1.2: The product sequence λ is obtained by arranging the products in ascending order of their earliest completion times e_h.
Step 1.3: The first s products are extracted from the product sequence λ and distributed to the factories, at least one to each factory; the part of the product sequence assigned to factory f' is recorded as that factory's partial product sequence, and s is assigned to the variable k, i.e., k = s.
Step 1.4: if there are still products to be dispensed in the product sequence λ and k < S, assuming δ f′ products have been dispensed to the factory f', then the first product P h, S < h+.s, where h is the product index, h=1, 2,..s, performs the following steps:
step 1.4.1: for F' =1, 2, F, insert lambda h as a block To ensure that all workpieces of the same product are not separated; the cost function σ (h, l ', f') of the non-dispensed product P h inserted into the void l 'is calculated by the following formula, l' =1, 2.
Where i is the work index, i=1, 2,..n l; j is the index of the machine and, j=1, 2, m; n is the number of work pieces; f is the number of factories; m is the number of machines in the production stage; s is the number of products; h is the product index; l is the index of the product in factory f, l=1, 2,..delta fh is the work order of product P l in factory f, Is a product in a factory f The departure time of the workpiece V l,i on the machine M j in the factory f; /(I)For the processing time of the workpiece V l,i on the machine M j in the factory f.
Selecting the product with the smallest sigma (h, l ', f'); if σ (h, l ', f') is the same, the earliest finishing time is selectedThe shortest product; let/>Is the corresponding vacancy.
Step 1.4.2: finding the plant f * with the shortest finishing time, i.eProduct P h was dispensed to the factory and lambda h was inserted into/>Vacancies/>
Step 1.5: let k=k+1, repeat step 1.4 until all products have been traversed.
Preferably, the invention decodes each individual by a forward or backward calculation method to obtain the global optimal solution π_best, with the following specific steps:
When a high-level individual is decoded, the LLHs contained in it are executed in turn, and each LLH searches the solution space for a better solution; if the obtained candidate solution has better fitness than the original solution, the new solution replaces the old one and the remaining LLHs are executed on it, otherwise the next LLH is executed, until all LLHs in the high-level individual have been executed; during this process, the effectiveness of each high-level individual is evaluated by its contribution rate.
When a low-level individual is decoded, a forward or backward calculation method is adopted, and the two insertion-based acceleration strategies are executed before the calculation, specifically: each product is inserted into all possible positions and all resulting solutions are evaluated, and the best one is selected; then the workpieces belonging to the same product in that solution are inserted into all positions within the product and all resulting solutions are evaluated; finally, the best solution is selected and the forward or backward decoding calculation is executed.
Preferably, the two insertion-based acceleration strategies of the present invention specifically comprise the following steps:
Step 2.1: Acceleration strategy based on product insertion: each product is inserted into all possible positions and all resulting solutions are evaluated; assuming that δ_f' products have been assigned to factory f', product P_l' has δ_f' + 1 insertable positions in total; the acceleration strategy for evaluating all insertion solutions based on product insertion is as follows:
Step 2.1.1: In factory f', the time at which the i-th workpiece of each already scheduled product leaves machine M_j and the time at which that product leaves the assembly machine M_A are calculated.
Step 2.1.2: On the basis of step 2.1.1, the duration (tail value computed in the backward direction) of the i-th workpiece on machine M_j and the duration of the product on the assembly machine M_A are calculated.
Step 2.1.3: Assume that product P_l' is inserted into a given slot; its departure times are then obtained from the quantities of steps 2.1.1 and 2.1.2 by the corresponding insertion formulas.
Step 2.1.4: The maximum completion time of factory f' after product P_l' is inserted is calculated from the quantities obtained above.
Step 2.1.5: steps 2.1.3 and 2.1.4 are repeated until all insertion positions have been tried.
Step 2.2: Acceleration strategy based on workpiece insertion: a workpiece belonging to a product is re-inserted within that product; assume that the first n_[l'] - 1 workpieces of product P_[l'] have already been scheduled, so that the i'-th workpiece has n_[l'] possible insertion positions; the acceleration strategy for evaluating all insertion solutions based on workpiece insertion is as follows:
Step 2.2.1: The departure times (and corresponding tail values) of the already scheduled workpieces are calculated.
Step 2.2.2: Suppose that workpiece V_[l'],i' is inserted into a position within product P_[l']; the time at which workpiece V_[l'],i' leaves machine M_j is then calculated by the corresponding recursion.
Step 2.2.3: After workpiece V_[l'],i' is inserted, the completion time of product P_[l'] is recalculated accordingly.
Step 2.2.4: The completion time of factory f' after the workpiece insertion is calculated.
In the above formulas: i is the workpiece index, i = 1, 2, ..., n_l; j is the machine index, j = 1, 2, ..., m; δ_f is the number of products in factory f; f is the factory index, f = 1, 2, ..., F; M is the machine set, M = {M_1, M_2, ..., M_m}; P_l is a product in factory f; l is the product index in factory f, l = 1, 2, ..., δ_f; the remaining quantities are the departure time of product P_l on machine M_j in factory f, the departure time of workpiece V_l,i on machine M_j in factory f, the assembly time of product P_l on the assembly machine M_A in factory f, the processing time of workpiece V_l,i on machine M_j in factory f, and the durations of workpiece V_l,i and of product P_l on machine M_j in factory f, as defined in the model.
Step 2.2.5: Steps 2.2.2 to 2.2.4 are repeated until all possible insertion positions have been considered.
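Both acceleration strategies follow the same pattern: each candidate insertion position is evaluated incrementally, reusing the quantities computed in steps 2.1.1-2.1.2 or 2.2.1, and only the best position is kept. The following is a hedged sketch of that pattern, where `makespan_after_insert(seq, pos, item)` stands for the incremental recursions of the patent and is an assumed helper, not a function defined above.

```python
def best_insertion(seq, item, makespan_after_insert):
    """Try every insertion position of `item` in `seq` and keep the best one.

    `makespan_after_insert(seq, pos, item)` is assumed to implement the
    incremental evaluation of the patent (head values before `pos` are reused
    and only the affected tail is recomputed), which is what keeps this loop cheap.
    """
    best_pos, best_cmax = 0, float("inf")
    for pos in range(len(seq) + 1):                 # all candidate positions
        cmax = makespan_after_insert(seq, pos, item)
        if cmax < best_cmax:
            best_pos, best_cmax = pos, cmax
    return seq[:best_pos] + [item] + seq[best_pos:], best_cmax
```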
Preferably, the high-level individuals in step 3 of the present invention are constructed from 12 heuristics (LLHs), which can be divided into two categories, one based on critical paths and the other based on non-critical paths; the factory through which the critical path passes is called the key factory, and the products and workpieces assigned to the key factory are called key products and key workpieces; the designed heuristics are as follows (a sketch of one such operator is given after the list):
LLH 1: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted before the position of each of the other workpieces, until all workpieces of the product have been selected.
LLH 2: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted after the position of each of the other workpieces, until all workpieces of the key product have been selected.
LLH 3: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; the position of this workpiece is swapped with that of every other workpiece, until all workpieces of the key product have been selected.
LLH 4: A key product is randomly selected from a key factory, and two different key workpieces are randomly selected from the workpiece set of that product; the subsequence between the two selected key workpieces is reversed.
LLH 5: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted before the position of each of the other workpieces, until all workpieces of the non-critical product have been selected.
LLH 6: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted after the position of each of the other workpieces, until all workpieces of the non-critical product have been selected.
LLH 7: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; the position of the selected workpiece is swapped with that of every other workpiece, until all workpieces of the non-critical product have been selected.
LLH 8: A product is randomly selected from a non-critical factory, two different workpieces are randomly selected from the workpiece set of that non-critical product, and the subsequence between the two selected workpieces is reversed.
LLH 9: A key product is randomly selected from a key factory; it is inserted before or after the position of each other product in the key factory, until all key products have been selected and all insertion-based operations have been performed.
LLH 10: A key product is randomly selected from a key factory, and its position is exchanged with that of the other products, until all key products have been selected and all swap-based operations have been performed.
LLH 11: A product is randomly selected from a non-critical factory, and the selected product is inserted before or after the positions of all other products.
LLH 12: A product is randomly selected from a non-critical factory, and the position of the selected product is exchanged with the positions of all other products.
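As a concrete illustration of the operator style used above, the following is a hedged sketch of LLH 4 (reversing the subsequence between two randomly chosen key workpieces of one key product); the list-based solution layout and helper names are assumptions, not the patent's data structures.

```python
import random

def llh4_reverse_key_subsequence(factory_seq, product_workpieces):
    """LLH 4 sketch: pick two distinct workpieces of one key product in the key
    factory's sequence and reverse the subsequence between their positions.

    factory_seq        -- workpiece ids in processing order in the key factory
    product_workpieces -- workpiece ids belonging to the chosen key product
    """
    if len(product_workpieces) < 2:
        return list(factory_seq)
    w1, w2 = random.sample(product_workpieces, 2)
    i, j = sorted((factory_seq.index(w1), factory_seq.index(w2)))
    return factory_seq[:i] + factory_seq[i:j + 1][::-1] + factory_seq[j + 1:]

# Example: product P_3's workpieces [5, 14, 4] inside factory sequence pi_1
print(llh4_reverse_key_subsequence([1, 6, 2, 3, 8, 5, 14, 4], [5, 14, 4]))
```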
Preferably, in step 3 of the present invention, since the Q table records the knowledge learned by the agent from the environment, the q-value Q(s_t, a_t) reflects the agent's preference for executing action a_t ∈ A in state s_t ∈ S; for each state-action pair (s_t, a_t), Q(s_t, a_t) is updated by weighting the immediate reward r(s_t, a_t) and the discounted q-value, and can be calculated by the following formula:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + λ [ r(s_t, a_t) + γ max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) ]

In the above formula, s is a state, a is an action, and r(s, a) is the immediate reward for executing action a in state s; S is the state space, S = {s | s_1, s_2, ..., s_T}, t = 1, ..., T, where s_t denotes a selectable state; A is the action set, A = {a | a_1, a_2, ..., a_T}, t = 1, ..., T, where a_t denotes a selectable action; (s_t, a_t) is a selectable state-action pair; Q(s_t, a_t) expresses the preference for executing the state-action pair (s_t, a_t); λ is the learning rate coefficient and γ is the discount factor.
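A minimal sketch of the tabular update in the formula above; storing the Q table as a nested list indexed by the 12 states and 12 actions is an implementation assumption.

```python
def update_q(q, s_t, a_t, reward, s_next, lr=0.5, gamma=0.7):
    """Q(s_t, a_t) <- Q(s_t, a_t) + lr * (reward + gamma * max_a Q(s_next, a) - Q(s_t, a_t)).

    q is a square table (list of lists) of q-values; lr is the learning rate
    (lambda in the text) and gamma is the discount factor."""
    best_next = max(q[s_next])
    q[s_t][a_t] += lr * (reward + gamma * best_next - q[s_t][a_t])
    return q[s_t][a_t]

# Example: 12 states x 12 actions, all zero at the first iteration
q_table = [[0.0] * 12 for _ in range(12)]
print(update_q(q_table, s_t=0, a_t=3, reward=0.05, s_next=3))   # -> 0.025
```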
Preferably, the action selection policy in step 4.1 of the present invention is specifically as follows:
For a particular state s_t at time step t, a modified ε-greedy strategy is used to select either a random action with probability ε_t or the action yielding the maximum reward (largest q-value) with probability 1 - ε_t; the action selection policy is used to determine appropriate actions for the initial state s_0 and all subsequent states. At the initial iteration, the q-values of all state-action pairs are set to 0, so that the initial state and action are selected randomly; it is desirable first to find regions that contain good solutions, while lower ε_t values facilitate depth-first search around promising regions. Therefore, an adaptive adjustment method for ε at the current time step T_cur, denoted ε_t, is given by the following formula:
where ε_0 is the initial value (i.e., ε_0 = 0.15), ε_f is the final value (i.e., ε_f = 0.01), T_total is the total number of time steps and T_cur is the current number of time steps. According to the above description, the action selection policy first sets state s_{t+1} to 0; if a random number equal to 1 is generated, s_{t+1} is determined by the random action a_t obtained by the ε-greedy policy; otherwise, state s_{t+1} is generated randomly.
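A hedged sketch of the modified ε-greedy selection follows; the linear decay from ε_0 = 0.15 to ε_f = 0.01 is an assumed form of equation (29), which is not reproduced in the text above.

```python
import random

def epsilon_t(t_cur, t_total, eps0=0.15, epsf=0.01):
    """Assumed linear schedule for equation (29): decay epsilon from eps0 to epsf."""
    return eps0 - (eps0 - epsf) * t_cur / t_total

def select_action(q_row, t_cur, t_total):
    """Modified epsilon-greedy: a random action with probability eps_t,
    otherwise the action with the largest q-value in the current state's row."""
    if random.random() < epsilon_t(t_cur, t_total):
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

# Example: 12 q-values (one per LLH) for the current state
print(select_action([0.0] * 12, t_cur=10, t_total=100))
```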
Preferably, the reward function of step 4.2 of the present invention is designed as follows:
In the above formula, r(s, a) is the immediate reward for executing action a in state s and is determined by the improvement rate IR of the solution π; IR can be calculated as [C_max(π) - C_max(π')] / C_max(π), where C_max(π) is the maximum completion time of the scheduling solution π and π' is the new candidate solution obtained after executing the action.
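A small sketch of the reward computed from the improvement rate IR; since the piecewise form of the reward formula is not reproduced above, clamping non-improving actions to zero reward is an assumption.

```python
def reward(c_max_old, c_max_new):
    """r(s, a) derived from IR = (Cmax(pi) - Cmax(pi')) / Cmax(pi);
    a non-improving action is assumed to earn zero reward."""
    ir = (c_max_old - c_max_new) / c_max_old
    return ir if ir > 0 else 0.0

print(reward(120, 114))   # improving action  -> 0.05
print(reward(120, 125))   # worsening action  -> 0.0
```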
The invention establishes a DABFSP scheduling model based on the scheduling problem with the blocking constraint; several properties are deduced from the characteristics of the DABFSP, and two acceleration strategies based on these properties are proposed to reduce the time complexity of evaluating solutions.
The heuristic method for constructing the initial population based on the problem characteristics ensures the quality and diversity of the initial population.
According to the neighborhood structure of the specific problem, twelve low-level heuristics are designed to construct the LLH pool; the switching between LLHs is defined as the available action.
The invention embeds the Q-learning algorithm into QLHHEA as the HLS, thereby controlling the selection of LLHs so that an appropriate search strategy is chosen in the solution space of the problem.
The beneficial effects of the invention are as follows:
Compared with other high-level strategies, the Q-learning-based HLS applied by the method allows appropriate actions to be selected automatically in a given state, which effectively enriches the search behavior; QLHHEA, as a knowledge-driven learning paradigm, makes decisions in the strategy space through trial-and-error feedback and guides the search toward promising regions of the solution space, which is an advantage over most hybrid intelligent optimization algorithms that search the solution space randomly. The new definition of states and actions provides insight into the connection relationships between LLHs and focuses on mining the latent patterns and key features hidden in high-level individuals.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is an exemplary diagram of forward and backward calculation methods; wherein a is a forward calculation method schematic diagram, and b is a backward calculation method schematic diagram;
FIG. 3 is an exemplary diagram of an acceleration strategy based on product insertion;
FIG. 4 is an exemplary graph of an acceleration strategy based on workpiece insertion; wherein a is a schematic drawing of the workpiece 14 inserted into the first position of the product P3, b is a schematic drawing of the workpiece 14 inserted into the second position of the product P3, c is a schematic drawing of the workpiece 14 inserted into the third position of the product P3;
FIG. 5 is a directional fully connected network of selectable states and available actions; wherein a is a state selection directional full-connection network, b is an action selection directional full-connection network;
FIG. 6 is a schematic diagram of the present invention QLHHEA.
Detailed Description
The invention is further described below with reference to the drawings and embodiments, but the scope of the invention is not limited to this description.
The DABFSP is a two-stage scheduling problem divided into a processing stage and an assembly stage. There are F identical factories, each having a flow shop for processing workpieces and an assembly machine for product assembly. There are n workpieces and S products; each product consists of different workpieces, and each workpiece belongs to exactly one product. In the processing stage, all workpieces are distributed to the factories according to an appropriate assignment rule; each workpiece is processed on m machines along the same route (M_1, M_2, ..., M_m). There is no buffer between any two machines, so a workpiece that has just been processed cannot leave the current machine immediately unless the next machine is idle and available. All workpieces belonging to the same product must be processed together, and workpieces of different products are not allowed to be interleaved; that is, workpieces belonging to the same product in π_f cannot be mixed with workpieces belonging to other products, which means that the product sequence of factory f is implicit in its workpiece order π_f. Once all workpieces of product P_h have been completed in the processing stage and the assembly machine is idle, they can be delivered to the assembly machine immediately. In the assembly stage, the workpieces processed by each factory are gathered and assembled into the S final products on the assembly machines.
The distributed assembly blocking flow shop problem (DABFSP) described in this embodiment satisfies the following assumptions:
(1) The processing times of the workpieces and the assembly times of the products are predetermined positive integers; the processing time includes setup time and transportation time; the release times of the workpieces are not considered.
(2) Each workpiece can be processed on at most one machine at a time, and each machine cannot process more than one workpiece at a time; all workpieces and machines are available from time zero.
(3) Preemption is not allowed; once a workpiece machining operation or a product assembly operation is started, it must be carried out without interruption.
(4) Each workpiece can be distributed to any one factory, and all operations of each workpiece must be completed in the same factory; the work pieces that have been assigned to the factories cannot be transferred to other factories any more.
On the premise that the above assumptions are met, the DABFSP model is constructed as follows:
the start time of each workpiece and product in each factory is determined by formulas (1) - (3).
The departure time of the first workpiece from machines M 1 to M m-1 for the first product in each plant is calculated by equation (4).
The start time of each workpiece of each product in the factory on the first machine is calculated by equation (5).
Calculation by equation (6) ensures that the departure time of each workpiece must be greater than its completion time and the completion time of its previous workpiece.
The workpiece departure time on the last machine of each factory is calculated by equation (7).
The assembly completion time of the product on the assembly machine is determined by equation (8).
In the recursive formulae (10)-(19), for each factory the last product and its workpieces are processed first, then the previous product, and so on until the first product; thus, C_max(π) can be calculated from equation (19) by this backward recursion.
The goal of the DABFSP with minimized maximum completion time is to find the optimal schedule π* among all feasible schedules Π, i.e., equation (20): C_max(π*) = min_{π ∈ Π} C_max(π).
In the above formulas: n is the number of workpieces; m is the number of machines in the production stage; F is the number of factories; S is the number of products; f is the factory index, f = 1, 2, ..., F; l is the product index in factory f, l = 1, 2, ..., δ_f; δ_f is the number of products in factory f; i is the workpiece index, i = 1, 2, ..., n_l; j is the machine index, j = 1, 2, ..., m; h is the product index, h = 1, 2, ..., S; J is the workpiece set, J = {J_1, J_2, ..., J_n}; M is the machine set, M = {M_1, M_2, ..., M_m}; P is the product set, P = {P_1, P_2, ..., P_S}; V_l is the workpiece set belonging to product P_l; n_f is the number of workpieces in factory f; ω_h is the number of workpieces belonging to product P_h; n_l is the number of workpieces belonging to product P_l in factory f; λ is the total product order, λ = [λ_1, λ_2, ..., λ_S]; λ_h is the workpiece order of product P_h; π_f is the workpiece order in factory f, π_f = [π_f(1), π_f(2), ..., π_f(n_f)]; p_{l,i,j} is the processing time of workpiece V_l,i on machine M_j; O_{l,i,j} is the operation of workpiece V_l,i on machine M_j; the processing time of workpiece V_l,i on machine M_j in factory f, the assembly time of product P_l on the assembly machine M_A in factory f, the departure time of workpiece V_l,i on machine M_j in factory f, the duration of workpiece V_l,i on machine M_j in factory f, the departure time of product P_l on machine M_j in factory f and the duration of product P_l on machine M_j in factory f are defined accordingly; π is a complete DABFSP solution, π = [π_1, π_2, ..., π_F]; Π is the set of feasible solutions; the completion time of factory f and C_max(π), the total completion time of the complete solution π, are used as above.
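For a single factory's processing stage, the blocking constraint is commonly expressed through a departure-time recursion in which a workpiece may not leave a machine until the next machine is free. The sketch below is a generic forward calculation of that kind; it does not reproduce formulas (1)-(19) or the assembly stage of the patent.

```python
def blocking_flowshop_departures(seq, p):
    """Forward recursion for one blocking flow shop (processing stage only).

    seq -- processing order of workpieces (indices into p)
    p   -- p[job][machine]: processing times
    Returns d, where d[i][j] is the departure time of the i-th workpiece in seq
    from machine j (column 0 is its entry time to machine 1), and the stage makespan.
    """
    m = len(p[seq[0]])
    d = [[0] * (m + 1) for _ in range(len(seq))]
    for i, job in enumerate(seq):
        for j in range(1, m + 1):
            finish = d[i][j - 1] + p[job][j - 1]         # processing ends on machine j
            if i > 0 and j < m:
                finish = max(finish, d[i - 1][j + 1])    # blocked until machine j+1 is free
            d[i][j] = finish
        if i + 1 < len(seq):
            d[i + 1][0] = d[i][1]                        # next workpiece enters machine 1
    return d, d[-1][m]

# Small example: 3 workpieces on 2 machines
_, cmax = blocking_flowshop_departures([0, 1, 2], [[3, 2], [2, 4], [1, 1]])
print(cmax)   # -> 10
```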
Example 1
A distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning comprises the following specific steps:
Step 1: The population and the Q table are initialized: the individuals of the low-level population are generated by the construction heuristic, the individuals of the high-level population are generated randomly, and the two populations have the same scale (size popsize); the related parameters are set; the q-values of all state-action pairs are set to zero.
Step 2: Each individual is decoded by a forward or backward calculation method and the global optimal solution π_best is obtained; the two insertion-based acceleration strategies are executed before the calculation so as to save the computational cost of solution evaluation.
Step 3: The LLHs in each high-level individual (composed of the 12 efficient heuristics), where each LLH is a state and the conversion from one LLH to another is defined as an action, are executed in turn on the feasible scheduling solution of the corresponding low-level individual; if the fitness of the new solution is better, the old solution is replaced by the new one and the global optimal solution is updated. The contribution rate (CR) of each high-level individual is calculated, the high-level individuals with the highest contribution rates are selected according to the given proportion, and the Q table is updated for them by the update mechanism; meanwhile, count = 0 is set.
Step 4: sampling the updated Q table generates a new higher-level individual.
Step 4.1: state s t is selected, action a t and next state s t+1 are obtained.
Step 4.2: applying state s t+1 to pi best to yield pi' best; the adaptation values for pi best (C (pi best))、C(π′best), IR, get prize r (s t,at), update Q value (Q t+1(st,at)), probabilities ε t, and pi best are calculated.
Step 4.3: if C (pi 'best)<C(πbest), updating the global optimal solution pi best to pi' best, otherwise jumping to step 4.1.
Step 4.4: if count= popsize, then go to step 3, otherwise go to step 4.
Step 5: checking whether the stopping condition is met, if not, jumping to the step 4, otherwise outputting pi best.
Further, the coding method of solutions (individuals) in the upper and lower layer populations is as follows:
QLHHEA is a two-level framework involving high-level individuals consisting of LLHs and the possible scheduling schemes in the problem solution space (low-level individuals). Each high-level individual is a randomly composed sequence of LLHs whose length depends on the number of designed LLHs, and the same LLH is allowed to appear more than once. A feasible scheduling scheme of the problem is the total workpiece sequence π, which consists of F subsequences, i.e., π = [π_1, π_2, ..., π_f, ..., π_F], f = 1, 2, ..., F. Each subsequence π_f = [π_f(1), π_f(2), ..., π_f(n_f)] represents the processing order of the n_f workpieces assigned to factory f; the makespan value C_max(π) is used as the fitness to evaluate each feasible solution π.
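A sketch of the two-level encoding described above, using plain Python containers as an illustrative representation (the patent does not prescribe these data structures):

```python
from dataclasses import dataclass, field
from typing import List
import random

@dataclass
class HighLevelIndividual:
    """A sequence of LLH indices; repeats are allowed and the length equals
    the number of designed LLHs (12 here)."""
    llh_sequence: List[int] = field(
        default_factory=lambda: [random.randrange(12) for _ in range(12)])

@dataclass
class LowLevelIndividual:
    """A feasible schedule pi = [pi_1, ..., pi_F]: one workpiece order per factory."""
    factory_sequences: List[List[int]]
    cmax: float = float("inf")       # fitness: makespan of the complete solution

# Example with F = 2 factories, matching the decoding illustration
pi = LowLevelIndividual(factory_sequences=[[1, 6, 2, 3, 8, 5, 14, 4],
                                           [9, 11, 10, 7, 13, 15, 12, 16]])
print(sum(len(seq) for seq in pi.factory_sequences))   # n = 16 workpieces
```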
Further, the individuals in the lower-layer population in the step 1 are generated by adopting a construction heuristic method, and specifically include the following steps:
Step 1.1: for product P h (h=1, 2,., S), the initial work sequence λ h of ω h pieces of belonging P h is obtained by calculating the processing time I (h, I) of each piece on all machines from formula (21) and then arranging in ascending order; if I (h, I) of the two workpieces are the same, p h,i,1 is selected to be smaller; improving the work order quality by NEH heuristic (conventional method), and further determining the earliest completion time e h of the product P h:
/>
Step 1.2: The product sequence λ is obtained by arranging the products in ascending order of their earliest completion times e_h.
Step 1.3: The first s products are extracted from λ and distributed to the factories, at least one to each factory; the part of the product sequence assigned to factory f' is recorded as that factory's partial product sequence, and k = s is set.
Step 1.4: If there are still unassigned products in λ and k < S, then, assuming that δ_f' products have already been assigned to factory f', the first unassigned product P_h (s < h ≤ S) is selected and the following steps are performed:
Step 1.4.1: For f' = 1, 2, ..., F, λ_h is inserted as a whole block into the partial product sequence of factory f' as far as possible, so that all workpieces of the same product are kept together; the cost function σ(h, l', f') of inserting the unassigned product P_h into slot l' (l' = 1, 2, ..., δ_f') is calculated by formula (22), and the slot with the smallest σ(h, l', f') is selected; if several slots have the same σ(h, l', f'), the one giving the earliest (shortest) completion time is selected, and the corresponding slot is recorded.
Step 1.4.2: The factory f* with the shortest completion time is found; product P_h is assigned to that factory and λ_h is inserted into the recorded slot.
Step 1.5: let k=k+1, repeat step 1.4 until all products have been traversed.
Further, in step 2, each individual is decoded by a forward or backward calculation method to obtain the global optimal solution π_best, with the following specific steps:
When a high-level individual is decoded, the LLHs contained in it are executed in turn to search the solution space for a better solution; if the obtained candidate solution has better fitness than the original solution, the new solution replaces the old one and the remaining LLHs are executed, otherwise the next LLH is executed, until all LLHs in the high-level individual have been executed; during this process, the effectiveness of each high-level individual is evaluated by its contribution rate (CR).
When a low-level individual is decoded, a forward or backward calculation method is adopted; regardless of which is used, before the calculation each product is first inserted into all possible positions and all resulting solutions are evaluated (the product-insertion acceleration strategy) and a better solution is selected, then the workpieces belonging to the same product in that solution are inserted into all positions within the product (the workpiece-insertion acceleration strategy) and all resulting solutions are evaluated, and finally the better solution is selected and the forward or backward decoding calculation is executed. For a clearer description, examples of the two calculation methods are given in FIG. 2 ((a) forward calculation and (b) backward calculation), where n = 16, m = 3, F = 2, S = 5; Table 1 provides the processing times of the workpieces and products. As shown in FIG. 2, products P_1, P_3 and P_5 are assigned to factory 1 and products P_2 and P_4 to factory 2; the processing sequences of the workpieces in the two factories are π_1 = [1,6,2,3,8,5,14,4] and π_2 = [9,11,10,7,13,15,12,16], respectively; the times at which products P_3 and P_4 leave the processing machines, and hence the maximum completion time, are obtained accordingly.
TABLE 1 processing time and assembly time for workpieces and products
Further, the two insertion-based acceleration strategies are as follows:
Step 2.1: Acceleration strategy based on product insertion: each product is inserted into all possible positions and all resulting solutions are evaluated; assuming that δ_f' products have been assigned to factory f', product P_l' has δ_f' + 1 insertable positions in total; the acceleration strategy for evaluating all insertion solutions based on product insertion is as follows:
Step 2.1.1: In factory f', the time at which the i-th workpiece of each already scheduled product leaves machine M_j and the time at which that product leaves the assembly machine M_A are calculated, as in formulas (1)-(8).
Step 2.1.2: On the basis of step 2.1.1, the duration (tail value computed in the backward direction) of the i-th workpiece on machine M_j and the duration of the product on the assembly machine M_A are calculated, as in formulas (10)-(18).
Step 2.1.3: Assume that product P_l' is inserted into a given slot; its departure times are then obtained from the quantities of steps 2.1.1 and 2.1.2 by the corresponding insertion formulas.
Step 2.1.4: The maximum completion time of factory f' after product P_l' is inserted is calculated from the quantities obtained above.
Step 2.1.5: steps 2.1.3 and 2.1.4 are repeated until all insertion positions have been tried.
It is evident that steps 2.1.1 and 2.1.2 are computed only once, while steps 2.1.3 and 2.1.4 are repeated to check all positions, so the time complexity of the latter is O(δ_f' m n_l'); in summary, the total time complexity of evaluating the δ_f' + 1 insertion solutions is much lower than that of evaluating them from scratch without the insertion-based acceleration strategy. Referring to FIG. 2, products P_1, P_3 and P_5 are in factory 1 and products P_2 and P_4 in factory 2; product P_5 is extracted from factory 1 and inserted between product P_2 and product P_4 in factory 2; a schematic of the product-insertion acceleration strategy is shown in FIG. 3.
Step 2.2: acceleration strategy based on workpiece insertion: inserting workpieces belonging to the same product into the product; assume thatIn item 1/>If the first n [l′] -1 workpieces have completed scheduling, then the i' th workpiece may have n [l′] insert positions; the acceleration strategy for evaluating all insertion solutions based on workpiece insertion is as follows:
Step 2.2.1: calculation of
Step 2.2.2: suppose that workpiece V [l′],i′ is inserted into a productThe time for workpiece V [l′],i′ to leave machine M j is calculated as follows:
step 2.2.3: if the workpiece V [l′],i′ is inserted into Then product/>Completion time/>It must be calculated as follows:
step 2.2.4: finishing time of factory f' after workpiece insertion The calculation is as follows:
Step 2.2.5: steps 2.2.2 to 2.2.4 are repeated until the possible insertion positions are taken into account.
Obviously, step 2.2.1 is computed only once, while steps 2.2.2 and 2.2.4 require n_[l'] iterations to evaluate all post-insertion solutions, so their time complexity is O(m n_l'); the total time complexity is therefore much lower than that of a full re-evaluation without the workpiece-insertion acceleration strategy. Referring to FIG. 2, product P_3 in factory 1 contains 3 workpieces (workpieces 5, 14 and 4); assuming that one workpiece (say, workpiece 14) is extracted from product P_3 and then reinserted until all insertion positions in factory 1 have been traversed, while factory 2 remains unchanged, the workpiece-insertion acceleration strategy is as shown in FIG. 4 ((a) workpiece 14 inserted into the first position of product P_3, (b) workpiece 14 inserted into the second position of product P_3, (c) workpiece 14 inserted into the third position of product P_3).
Further, 12 efficient heuristics are designed to build the high-level individuals; based on three easy-to-implement operations (i.e., insert, swap and reverse), 12 simple and efficient low-level heuristics (LLHs) were designed, which can be divided into two categories, one based on critical paths and the other based on non-critical paths; the factory through which the critical path passes is called the key factory, and the products and workpieces assigned to the key factory are called key products and key workpieces; the designed LLHs are as follows:
LLH 1: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted before the position of each of the other workpieces, until all workpieces of the product have been selected.
LLH 2: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted after the position of each of the other workpieces, until all workpieces of the key product have been selected.
LLH 3: A key product is randomly selected from a key factory, and a key workpiece is randomly selected from the workpiece set of that product; the position of this workpiece is swapped with that of every other workpiece, until all workpieces of the key product have been selected.
LLH 4: A key product is randomly selected from a key factory, and two different key workpieces are randomly selected from the workpiece set of that product; the subsequence between the two selected key workpieces is reversed.
LLH 5: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted before the position of each of the other workpieces, until all workpieces of the non-critical product have been selected.
LLH 6: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; this workpiece is inserted after the position of each of the other workpieces, until all workpieces of the non-critical product have been selected.
LLH 7: A product is randomly selected from a non-critical factory, and a workpiece is randomly selected from the workpiece set of that product; the position of the selected workpiece is swapped with that of every other workpiece, until all workpieces of the non-critical product have been selected.
LLH 8: A product is randomly selected from a non-critical factory, two different workpieces are randomly selected from the workpiece set of that non-critical product, and the subsequence between the two selected workpieces is reversed.
LLH 9: A key product is randomly selected from a key factory; it is inserted before or after the position of each other product in the key factory, until all key products have been selected and all insertion-based operations have been performed.
LLH 10: A key product is randomly selected from a key factory, and its position is exchanged with that of the other products, until all key products have been selected and all swap-based operations have been performed.
LLH 11: A product is randomly selected from a non-critical factory, and the selected product is inserted before or after the positions of all other products.
LLH 12: A product is randomly selected from a non-critical factory, and the position of the selected product is exchanged with the positions of all other products.
Further, the proposed QLHHEA contains five key components: state, action, action selection strategy, reward function and update mechanism, designed and described as follows:
(1) Definition of states: a state tends to reflect key features of the external environment; here a state refers to an LLH, and the state space (also called the state set S) formed by the 12 LLHs is divided into two subsets, namely 8 workpiece-based LLHs (S_job) and 4 product-based LLHs (S_product), as described below:
S_product = {CPI, CPS, NPI, NPS}
FIG. 5 (where "transfer to" represents a transition between two states; (a) is the S_job fully connected network and (b) is the S_product fully connected network) shows two directed fully connected networks over the 12 LLHs mentioned above, visualizing the relationship between selectable states and available actions.
(2) Definition of actions: an action refers to a state transition within S, i.e., the act of moving from one state to another; thus, the action set A consisting of the available actions can be represented by a directed connection network G = {V, E}, where V is a set of v nodes and E is a set of e directed edges connecting these nodes; node v_i' ∈ V represents a particular state; each edge e_i'j' ∈ E represents a precedence dependency between the two states s_i' and s_j' (i.e., LLH_j' can be executed after LLH_i' is completed); the weight of each edge represents the transition probability from LLH_i' to LLH_j'.
(3) Action selection strategy: the choice of an action generally affects not only the current reward but also the subsequent states and rewards. To reasonably balance exploitation and exploration, for a particular state s_t at time step t, an ε-greedy strategy (ε_t-greedy) is used to select either a random action with probability ε_t or the action producing the largest reward (i.e., the largest q-value) with probability 1 - ε_t (see Algorithm 1); the pseudo-code of the action selection policy SelectionAnAction(s_t, ε_t, type) is shown in Algorithm 2 and is used to determine appropriate actions for the initial state s_0 and all subsequent states. At the initial iteration, the q-values of all state-action pairs are set to 0, so that the initial state and action are selected randomly; it is desirable first to find regions that contain good solutions, while lower ε_t values facilitate depth-first search around promising regions; therefore, an adaptive adjustment method for ε at the current time step T_cur, denoted ε_t, is given by formula (29):
In formula (29), ε_0 is the initial value (i.e., ε_0 = 0.15) and ε_f is the final value (i.e., ε_f = 0.01); T_total is the total number of time steps and T_cur is the current number of time steps. According to the above description, the action selection policy first sets state s_{t+1} to 0; if a random number equal to 1 is generated, s_{t+1} is determined by the random action a_t obtained by the ε-greedy policy; otherwise, state s_{t+1} is generated randomly.
(4) Reward function: the reward function is a key component of the Q-learning-based high-level strategy; its design directly influences the agent's ability to acquire the required skills and also has an important influence on the convergence speed and final performance of QLHHEA. Since the DABFSP is a single-objective minimization problem, an appropriate action should reduce the fitness of the feasible solution. The immediate reward of executing action a in state s is denoted r(s, a) and is determined by the improvement rate IR of the solution π; IR can be calculated as [C_max(π) - C_max(π')] / C_max(π), where π' is the new candidate solution obtained after executing the action; the reward function r(s, a) is designed as shown in formula (30):
(5) Update mechanism: since the Q table records the knowledge that the agent learns from the environment, the q-value Q(s_t, a_t) reflects the agent's preference for executing action a_t ∈ A in state s_t ∈ S; for each state-action pair (s_t, a_t), Q(s_t, a_t) is updated by weighting the immediate reward r(s_t, a_t) and the discounted q-value, and can be calculated by equation (33).
Further, the Q-learning-based high-level strategy manipulates the low-level heuristics to search the solution space, as follows:
RL aims to obtain better behavior through dynamic interaction between the agent and the environment; in general, RL has two key features: trial-and-error search and delayed rewards. In the RL framework, an agent with a well-defined goal (i.e., maximizing the cumulative reward) can perceive aspects of the environment and then take actions to change it. Q learning is a sequential decision strategy based on the Markov decision process (MDP) and is one of the most successful strategies in reinforcement learning. The MDP is used to search for policies in specific scenarios and aims to model stochastic policies and rewards through interaction between a dynamic environment and an agent whose system states satisfy the Markov property. An MDP can be defined and described as a quadruple (S, A, P, R), where S is the state space and A is the action set, with S = {s | s_1, s_2, ..., s_T} representing the set of selectable states and A = {a | a_1, a_2, ..., a_T} representing the set of available actions; P: S × A × S → R is the state transition probability, i.e., the likelihood that the agent transitions to another state after taking an action in the current state; R: S × A → R is the reward function, indicating the reinforcement or reward obtained immediately after an appropriate action is taken. The goal of the agent is thus to find the optimal policy ω by trial and error so as to maximize the expected discounted reward, as in equation (31).
In equation (31), V*(s) is the value function under the optimal policy ω; r(s_t, a_t) represents the immediate reward obtained by the agent for taking action a_t in state s_t at time step t; γ ∈ [0, 1] is the discount factor balancing the current and future rewards of the state-action pair (s_t, a_t). In Q learning, the agent perceives the signal of state s_t and makes a decision from the action set A through the ε-greedy policy; after an action a_t is performed, state s_t may change to another state s_{t+1} and a reward r_t is returned by the designed reward function; thus, a state-action trajectory (s_0, a_0, s_1, a_1, ...) is obtained over the time steps; the state-action value Q_ω(s_t, a_t) is defined in equation (32).
By solving the recursive Bellman optimality equation for the state-action pair (s_t, a_t), the optimal Q_ω(s_t, a_t) can be found, i.e., the optimal policy ω is determined by selecting, each time, the available action with the largest q-value; the q-values of all state-action pairs are stored in a Q table and updated using equation (33).
In formula (33), Q_ω(s_t, a_t) represents the q-value of taking action a_t in state s_t; max_a Q(s_{t+1}, a) is the maximum q-value over all actions in state s_{t+1}; λ ∈ [0, 1] is the learning rate balancing exploration and exploitation.
Further, the following experiments test the effectiveness of the proposed Q-learning-based hyper-heuristic evolutionary algorithm in solving DABFSP; QLHHEA is compared with the iterated greedy algorithm IG, the effective biogeography-based hybrid optimization algorithm HBBO, the backtracking-search-based hyper-heuristic BS-HH, and the matrix-cube-based estimation of distribution algorithm MCEDA with iterated local search ILS; all algorithms were re-implemented in Pascal and compiled with Embarcadero RAD Studio (XE8); each algorithm runs independently on a PC with an Intel(R) Core(TM) i7-8700M@3.2GHz processor and 32 GB RAM under the Windows 7 operating system.
The experimental dataset uses the benchmark provided by Hatami et al. (available at http://soa.iti.es/S), with n = {8, 12, 16, 20, 24}, m = {2, 3, 4}, F = {2, 3, 4} and S = {2, 3, 4}. The processing times of the workpieces in the processing stage are uniformly distributed in [1, 99], and the assembly time of each product P_l in the assembly stage lies in the interval [n_l, 99·n_l]; the termination condition for all algorithms is a maximum CPU running time of 30·n·m milliseconds.
The invention uses four controllable parameters: the population size popsize, the proportion of high-quality high-level individuals, the learning rate λ and the discount factor γ; the candidate levels of the parameters are set as follows: popsize ∈ {10, 30, 60, 90}, four levels for the high-quality proportion, λ ∈ {0.1, 0.3, 0.5, 0.7} and γ ∈ {0.3, 0.5, 0.7, 0.9}; according to the number of parameters and levels, there are 4 × 4 × 4 × 4 = 256 parameter groups in total; a full factorial experimental design is adopted to analyse parameter sensitivity, and each parameter group is run 10 times independently, so that the best parameter combination is determined as popsize = 30, λ = 0.5 and γ = 0.7, together with the corresponding best level of the high-quality proportion.
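A minimal sketch of enumerating the full factorial design is given below; the level values of the high-quality proportion are not recoverable from the text, so a placeholder list stands in for them, and all variable names are illustrative:

```python
from itertools import product

popsize_levels = [10, 30, 60, 90]
proportion_levels = [None] * 4        # the four levels of the high-quality proportion are stand-ins
lambda_levels = [0.1, 0.3, 0.5, 0.7]
gamma_levels = [0.3, 0.5, 0.7, 0.9]

# Full factorial design: every combination of the four parameters' levels
combinations = list(product(popsize_levels, proportion_levels, lambda_levels, gamma_levels))
assert len(combinations) == 4 ** 4 == 256
```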
To evaluate the performance of each algorithm, the experimental results are measured by the average relative percentage deviation (ARPD):
where R is the total number of runs, C_i is the makespan obtained by a particular algorithm in the i-th run, and C_best is the best makespan value found by all algorithms.
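Written out from the definitions above, and consistent with the standard form of this metric, the ARPD over R independent runs is:

\mathrm{ARPD} = \frac{100\%}{R}\sum_{i=1}^{R}\frac{C_i - C_{\mathrm{best}}}{C_{\mathrm{best}}}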
The test results are shown in Table 2.
Table 2 QLHHEA results of comparison with 4 advanced algorithms
n×m×F×S IG3 HBBO BS-HH MCEDA QLHHEA
8×2×2×2 0.068 0.045 0.032 0.013 0.000
12×2×2×2 0.096 0.063 0.048 0.023 0.002
16×2×2×2 0.125 0.091 0.066 0.031 0.004
20×2×2×2 0.157 0.114 0.084 0.047 0.009
24×2×2×2 0.133 0.090 0.065 0.033 0.006
8×3×3×3 0.094 0.055 0.034 0.014 0.000
12×3×3×3 0.132 0.083 0.061 0.032 0.003
16×3×3×3 0.151 0.101 0.056 0.041 0.005
20×3×3×3 0.174 0.124 0.075 0.054 0.008
24×3×3×3 0.186 0.143 0.093 0.063 0.013
8×4×4×4 0.147 0.096 0.067 0.037 0.005
12×4×4×4 0.183 0.125 0.085 0.056 0.012
16×4×4×4 0.165 0.118 0.074 0.048 0.008
20×4×4×4 0.172 0.142 0.093 0.085 0.016
24×4×4×4 0.246 0.184 0.128 0.112 0.025
Average 0.149 0.105 0.071 0.046 0.008
As seen in the table, the overall average ARPD of QLHHEA (0.008) across all instance sizes is smaller than the average ARPDs obtained by the other algorithms (0.149, 0.105, 0.071 and 0.046); the results of BS-HH and MCEDA are also competitive on most instance groups, but both remain inferior to QLHHEA; the main reason may be that BS-HH only uses a backtracking search similar to differential evolution as its high-level strategy (HLS) to operate a series of LLHs and does not use efficient advanced strategies, so it cannot reasonably record promising patterns in high-level individuals and therefore fails to obtain better results; although MCEDA can learn structural features or patterns in high-quality solutions, it still lacks a goal-oriented search strategy to drive the search direction, and the accuracy of its probability model directly affects its performance; in addition, the results of IG3 remain inferior to the population-based HIOAs, i.e. BS-HH and MCEDA, indicating that population-based search has clear advantages over single-point search; the results in Table 2 demonstrate the superiority of QLHHEA.
Aiming at the distributed assembly blocking flow shop scheduling problem commonly encountered in actual production, the invention proposes a Q-learning-based hyper-heuristic evolutionary algorithm; this is the first study applying QLHHEA to solve DABFSP, and its main contributions are as follows: (1) a forward and backward calculation method based on the problem characteristics is provided; (2) two acceleration strategies based on problem properties are provided, reducing the computational cost and improving search efficiency; (3) a construction heuristic is proposed for the specific problem to produce a high-quality initial population; (4) 12 simple and efficient low-level heuristics are defined as selectable states, and the transitions between them are defined as available actions; (5) a Q-learning-based high-level strategy is designed to guide the search behaviour towards better scheduling schemes; (6) finally, experimental comparisons confirm the advantages of the proposed evolutionary framework and its key components, indicating that QLHHEA is an efficient algorithm for solving DABFSP.
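To make the evolutionary framework above concrete, the following heavily simplified sketch shows a QLHHEA-style main loop: a permutation stands in for a decoded schedule, a toy evaluator stands in for the forward/backward makespan calculation, only two generic moves stand in for the twelve LLHs, and the state is reduced to the index of the last applied LLH. All names, the fitness function and the state encoding are illustrative assumptions, not the patented procedures; in the full algorithm the reward and update follow equations (30) and (33), and the acceleration strategies replace the naive re-evaluation.

```python
import random
from collections import defaultdict

def evaluate(schedule):
    """Stand-in fitness: the real method decodes a DABFSP schedule into C_max."""
    return sum(abs(a - b) for a, b in zip(schedule, sorted(schedule)))

def llh_insert(s):
    """Remove a random element and re-insert it at another random position."""
    s = s[:]
    i, j = random.sample(range(len(s)), 2)
    s.insert(j, s.pop(i))
    return s

def llh_swap(s):
    """Exchange two randomly chosen elements."""
    s = s[:]
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

LLHS = [llh_insert, llh_swap]   # the patent defines 12 problem-specific LLHs

def qlhhea(n_jobs=10, iters=200, lam=0.5, gamma=0.7, eps=0.15):
    best = random.sample(range(n_jobs), n_jobs)   # stands in for the construction heuristic
    Q = defaultdict(float)
    state = 0
    for _ in range(iters):
        # epsilon-greedy choice of the next LLH (the "action")
        if random.random() < eps:
            action = random.randrange(len(LLHS))
        else:
            action = max(range(len(LLHS)), key=lambda a: Q[(state, a)])
        candidate = LLHS[action](best)
        c_old, c_new = evaluate(best), evaluate(candidate)
        reward = max(0.0, (c_old - c_new) / c_old) if c_old else 0.0
        next_state = action                        # transitions between LLHs act as states
        best_next = max(Q[(next_state, a)] for a in range(len(LLHS)))
        Q[(state, action)] += lam * (reward + gamma * best_next - Q[(state, action)])
        if c_new < c_old:
            best = candidate
        state = next_state
    return best, evaluate(best)

if __name__ == "__main__":
    random.seed(0)
    print(qlhhea())
```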

Claims (8)

1. A distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning is characterized by comprising the following steps:
Step 1: initializing the populations and the Q table, wherein the individuals in the low-level population are generated by a construction heuristic method, the individuals in the high-level population are randomly generated, and the two populations have the same size; setting the related parameters; setting the q-values of all state-action pairs to zero;
Step 2: decoding each individual by a forward or backward calculation method to obtain the global optimal solution π_best, while executing the two insertion-based acceleration strategies before the calculation to save the computational cost of evaluating solutions;
Step 3: sequentially executing the LLHs contained in a high-level individual on the feasible scheduling solution in a low-level individual, wherein an LLH is defined as a low-level heuristic; if the fitness value of the new solution is better, replacing the old solution with the new solution and updating the global optimal solution; calculating the contribution rate of each high-level individual, selecting the high-level individuals with high contribution rates according to the contribution rate, and updating the Q table with the update mechanism for each selected high-level individual; meanwhile setting count = 0;
Step 4: sampling the updated Q table to generate new high-level individuals, namely, operating the low-level heuristics with the Q-learning-based high-level strategy to search the solution space;
Step 4.1: selecting state s_t using the action selection policy, and obtaining action a_t and the next state s_t+1;
Step 4.2: applying state s_t+1 to π_best to obtain π′_best; calculating the fitness values C(π_best) and C(π′_best) and the improvement rate IR, obtaining the reward r(s_t, a_t), and updating the q-value Q_t+1(s_t, a_t), the probability ε_t and π_best;
Step 4.3: if C(π′_best) < C(π_best), updating the global optimal solution π_best to π′_best; otherwise, jumping to step 4.1;
Step 4.4: if count = popsize, jumping to step 3; otherwise jumping to step 4;
Step 5: checking whether the stopping condition is met; if not, jumping to step 4; otherwise outputting π_best.
2. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: the individuals in the low-level population in the step (1) are generated by adopting a construction heuristic method, and the method specifically comprises the following steps:
Step 1.1: for each product P_h (h = 1, 2, ..., S), the processing time I(h, i) of each workpiece on all machines is calculated by the following formula (an illustrative sketch of this ordering step is given after step 1.5):
where i is the workpiece index, i = 1, 2, ..., n_l; j is the machine index, j = 1, 2, ..., m; h is the product index, h = 1, 2, ..., S; m is the number of machines in the production stage; I(h, i) is the processing time of each workpiece on all machines; p_h,i,j is the processing time of the i-th workpiece belonging to product P_h on machine M_j;
The values I(h, i) are arranged in ascending order to obtain the initial workpiece sequence λ_h formed by the ω_h workpieces of product P_h; if two workpieces have the same I(h, i), the one with the smaller p_h,i,1 is placed first; the quality of the workpiece sequence is then improved with the NEH heuristic, and the earliest completion time e_h of product P_h is thereby determined;
Step 1.2: the product sequence λ is obtained by arranging the earliest completion times e_h in ascending order;
Step 1.3: extracting the first s products from the product sequence λ and distributing them to the factories, at least one for each factory; the part of the product sequence assigned to factory f′ is recorded accordingly; s is assigned to the variable k, i.e. k = s;
Step 1.4: if there are still products to be distributed in the product sequence λ and k < S, assuming δ_f′ products have already been assigned to factory f′, then for the next product P_h (s < h ≤ S, where h is the product index), the following steps are performed:
Step 1.4.1: for f′ = 1, 2, ..., F, λ_h is inserted as a whole block to ensure that all workpieces of the same product are not separated; the cost function σ(h, l′, f′) of inserting the unassigned product P_h into slot l′ is calculated by the following formula, l′ = 1, 2, ..., δ_f′;
where i is the workpiece index, i = 1, 2, ..., n_l; j is the machine index, j = 1, 2, ..., m; n is the number of workpieces; F is the number of factories; m is the number of machines in the production stage; S is the number of products; h is the product index; l is the index of the product in factory f, l = 1, 2, ..., δ_f; λ_h is the workpiece sequence of product P_l in factory f; the formula further uses the departure time of workpiece V_l,i on machine M_j in factory f and the processing time of workpiece V_l,i on machine M_j in factory f;
the slot with the smallest σ(h, l′, f′) is selected; if the σ(h, l′, f′) values are equal, the slot giving the shortest completion time is selected; the corresponding vacancy is recorded;
Step 1.4.2: finding the factory f* with the shortest completion time; product P_h is assigned to that factory and λ_h is inserted into the corresponding vacancy;
Step 1.5: let k = k + 1 and repeat step 1.4 until all products have been traversed.
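As referenced in step 1.1, a minimal sketch of that ordering step is given below; the claimed formula for I(h, i) is not reproduced above, so summing the processing times over all machines is an assumption consistent with its textual description, ties are broken by the time on the first machine as stated, and the subsequent NEH improvement is omitted. All names are illustrative.

```python
def initial_work_order(p, tie_break_machine=0):
    """Hedged sketch of step 1.1: rank the workpieces of one product.

    p[i][j] is the processing time of workpiece i on machine j. The assumed
    I(h, i) is the sum of the times over all machines; ties are broken by the
    time on the first machine.
    """
    total = [sum(row) for row in p]                       # assumed I(h, i)
    order = sorted(range(len(p)),
                   key=lambda i: (total[i], p[i][tie_break_machine]))
    return order

# Example: three workpieces on two machines -> ascending order of total time
print(initial_work_order([[4, 6], [3, 7], [5, 2]]))       # -> [2, 1, 0]
```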
3. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: each individual is decoded by a forward or backward calculation method to obtain the global optimal solution π_best, specifically comprising the following steps:
When decoding the high-level individuals, the LLHs contained in each high-level individual are executed sequentially to search the solution space for a better solution; if the obtained candidate solution has better fitness than the original solution, the new solution replaces it and the remaining LLHs are executed; otherwise the next LLH is executed, until all LLHs in the high-level individual have been executed; during this process, the effectiveness of each high-level individual is assessed by its contribution rate;
When decoding a low-level individual, the forward or backward calculation methods are adopted, and the two insertion-based acceleration strategies are executed before the calculation, specifically: each product is inserted into all possible positions and all resulting solutions are evaluated to select the best one; then the workpieces belonging to the same product in that solution are inserted into all positions within the product and all resulting solutions are evaluated; finally the best solution is selected and the forward or backward decoding calculation is performed.
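A naive sketch of the "insert at every position and keep the best" idea underlying this claim is given below; the real acceleration strategies of claim 4 avoid re-evaluating whole schedules, and the evaluator and all names here are illustrative stand-ins.

```python
def best_insertion(seq, item, evaluate):
    """Insert `item` at every possible position of `seq`, evaluate each candidate
    with the supplied fitness function, and return the best candidate."""
    candidates = [seq[:k] + [item] + seq[k:] for k in range(len(seq) + 1)]
    return min(candidates, key=evaluate)

# Toy usage with a stand-in evaluator (sum of adjacent differences)
toy_eval = lambda s: sum(abs(a - b) for a, b in zip(s, s[1:]))
print(best_insertion([1, 4, 9], 5, toy_eval))   # -> [1, 4, 5, 9]
```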
4. A distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 3, wherein: the two acceleration strategies based on the insertion specifically comprise the following steps:
Step 2.1: acceleration strategy based on product insertion: each product is inserted into all possible positions to evaluate all resulting solutions; assuming δ_f′ products have already been assigned to factory f′, product P_l′ has δ_f′ + 1 insertable positions in total; the acceleration strategy for evaluating all insertion solutions based on product insertion is as follows:
Step 2.1.1: in factory f′, the time at which the i-th workpiece of a product leaves machine M_j and the time at which the product leaves the assembly machine M_A are calculated, respectively;
Step 2.1.2: on the basis of step 2.1.1, the processing time of the i-th workpiece on machine M_j and the assembly time of the product on the assembly machine M_A are obtained, respectively;
Step 2.1.3: assuming product P_l′ is inserted, the following formula holds:
Step 2.1.4: the maximum completion time of factory f′ after product P_l′ is inserted is calculated as follows:
Step 2.1.5: repeating steps 2.1.3 and 2.1.4 until all insertion positions have been tried;
Step 2.2: acceleration strategy based on workpiece insertion: the workpieces belonging to the same product are inserted within that product; assuming the first n_[l′] − 1 workpieces of the product have already been scheduled, the i′-th workpiece has n_[l′] possible insertion positions; the acceleration strategy for evaluating all insertion solutions based on workpiece insertion is as follows:
Step 2.2.1: calculation of
Step 2.2.2: assuming workpiece V_[l′],i′ is inserted into product P_[l′], the time at which workpiece V_[l′],i′ leaves machine M_j is calculated as follows:
Step 2.2.3: if workpiece V_[l′],i′ is inserted, the completion time of product P_[l′] must be recalculated as follows:
Step 2.2.4: the completion time of factory f′ after the workpiece insertion is calculated as follows:
in the above formulas, i is the workpiece index, i = 1, 2, ..., n_l; j is the machine index, j = 1, 2, ..., m; δ_f is the number of products in factory f; f is the factory index, f = 1, 2, ..., F; M is the machine set, M = {M_1, M_2, ..., M_m}; P_l is a product in factory f; the formulas further use the departure time of product P_l on machine M_j in factory f, the departure time of workpiece V_l,i on machine M_j in factory f, the assembly time of product P_l on machine M_A in factory f, the processing time of workpiece V_l,i on machine M_j in factory f, the duration of workpiece V_l,i on machine M_j in factory f, and the duration of product P_l on machine M_j in factory f; l is the product index in factory f, l = 1, 2, ..., δ_f;
Step 2.2.5: steps 2.2.2 to 2.2.4 are repeated until all possible insertion positions have been considered.
5. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: the high-level individuals in step 3 are constructed from 12 low-level heuristics (LLHs), which fall into two categories, one based on the critical path and the other based on non-critical paths; the factory through which the critical path passes is named the critical factory, and the products and workpieces assigned to the critical factory are named critical products and critical workpieces; the designed heuristics are as follows (a generic sketch of the underlying insertion, swap and inversion moves is given after the list):
LLH 1: randomly selecting a key product from a key factory, and randomly selecting a key workpiece from a workpiece set of the product; inserting the workpiece into the position of other workpieces until all workpieces of the product have been selected;
LLH 2: randomly selecting a key product from a key factory and randomly selecting a key workpiece from a workpiece set of the product; inserting the workpiece into the position of each other workpiece until all the workpieces of the key product are selected;
LLH 3: randomly selecting a key product from a key factory and randomly selecting a key workpiece from a workpiece set of the product; exchanging the position of the workpiece with all other workpieces until all workpieces of the key product are selected;
LLH 4: randomly selecting a key product from a key factory, and randomly selecting two different key workpieces from a workpiece set of the product; performing inverse operation on the subsequences between the two selected key workpieces;
LLH 5: randomly selecting a product from a non-critical factory and randomly selecting a workpiece from a set of workpieces for the product; inserting the workpiece into the position of other workpieces until all workpieces of non-critical products are selected;
LLH 6: randomly selecting a product from a non-critical factory and randomly selecting a workpiece from a set of workpieces for the product; inserting the workpiece into the position of each other workpiece until all workpieces of the non-critical product are selected;
LLH 7: randomly selecting a product from a non-critical factory and randomly selecting a workpiece from a set of workpieces for the product; exchanging the position of the selected workpiece with all other workpieces until all workpieces of non-critical products are selected;
LLH 8: randomly selecting a product from a non-critical factory, randomly selecting two different workpieces from a workpiece set of the non-critical product, and then inverting a subsequence between the selected two non-critical workpieces;
LLH 9: randomly selecting a key product from a key factory; inserting each product before or after its location in the key factory until all key products are selected and all insert-based operations are performed;
LLH 10: randomly selecting a key product from the key factories, exchanging the position of the product with other products until all key products are selected and all exchange-based operations are performed;
LLH 11: randomly selecting one product from a non-critical factory, and inserting the selected product before or after the position of all other products;
LLH 12: one product is randomly selected from the non-critical factories, and the location of the selected product is exchanged with the locations of all other products.
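As noted before the list, the twelve LLHs combine three elementary permutation moves, insertion, swap and sub-sequence inversion, with the critical/non-critical factory distinction; a minimal sketch of the three elementary moves on a generic sequence, without the critical-path bookkeeping, is given below (all names are illustrative).

```python
def insert_move(seq, i, j):
    """Remove the element at position i and re-insert it at position j (LLH1/2/5/6/9/11 style)."""
    s = list(seq)
    s.insert(j, s.pop(i))
    return s

def swap_move(seq, i, j):
    """Exchange the elements at positions i and j (LLH3/7/10/12 style)."""
    s = list(seq)
    s[i], s[j] = s[j], s[i]
    return s

def inverse_move(seq, i, j):
    """Reverse the sub-sequence between positions i and j inclusive (LLH4/8 style)."""
    s = list(seq)
    lo, hi = min(i, j), max(i, j)
    s[lo:hi + 1] = reversed(s[lo:hi + 1])
    return s

jobs = [1, 2, 3, 4, 5]
print(insert_move(jobs, 0, 3), swap_move(jobs, 1, 4), inverse_move(jobs, 1, 3))
```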
6. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: in step 3, the update mechanism records the knowledge the agent learns from the environment, so the q-value Q(s_t, a_t) reflects the agent's preference for performing action a_t ∈ A in state s_t ∈ S; for each state-action pair (s_t, a_t), Q(s_t, a_t) is updated by weighting the immediate reward r(s_t, a_t) and the discounted q-value, and can be calculated by the following formula:
In the above formula, s is a state, a is an action, and r(s, a) is the immediate reward for executing action a in state s; S is the state space, S = {s | s_1, s_2, ..., s_T}, t = 1, ..., T, where s_t is a selectable state; A is the action set, A = {a | a_1, a_2, ..., a_T}, t = 1, ..., T, where a_t is a selectable action; (s_t, a_t) is a selectable state-action pair; Q(s_t, a_t) records the preference for performing the state-action pair (s_t, a_t); λ is the learning rate coefficient.
7. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: the action selection strategy in step 4.1 is specifically as follows:
For a particular state s_t at time step t, a modified ε-greedy strategy is used to select either a random action with probability ε_t or the action yielding the maximum reward with probability 1 − ε_t; the action selection policy is used to determine appropriate actions for the initial state s_0 and all subsequent states; at the initial iteration, the q-values of all state-action pairs are set to 0, so the initial state and action are selected randomly; a larger ε_t favours exploration, while a lower ε_t value facilitates a depth-first search around promising regions; therefore, an ε adaptive adjustment method is provided for the T_cur-th iteration, denoted ε_Tcur, as in the following formula:
where ε_0 is the initial value (ε_0 = 0.15), ε_f is the final value (ε_f = 0.01), T_total is the total number of time steps and T_cur is the current time step; according to the above description, the action selection policy first sets state s_t+1 to 0; if a random number equal to 1 is generated, s_t+1 is determined by the random action a_t generated by the ε-greedy strategy with selection probability ε_Tcur; otherwise state s_t+1 is generated randomly.
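The adjustment formula for ε_Tcur is not reproduced above; a linear decay from ε_0 = 0.15 to ε_f = 0.01 over the time-step budget is a common choice and is assumed in the sketch below (the function name and the linear schedule are illustrative assumptions).

```python
def epsilon_linear(t_cur: int, t_total: int, eps0: float = 0.15, epsf: float = 0.01) -> float:
    """Assumed linear decay of the exploration rate from eps0 to epsf.

    The exact expression for epsilon_{T_cur} in the claim is not reproduced in
    the text, so this schedule is an illustrative stand-in.
    """
    frac = min(max(t_cur / t_total, 0.0), 1.0)
    return eps0 - (eps0 - epsf) * frac

# Example: exploration shrinks as the search proceeds
print(epsilon_linear(0, 100), epsilon_linear(50, 100), epsilon_linear(100, 100))
```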
8. The distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning according to claim 1, wherein: the bonus function of step 4.2 is designed as follows:
In the above formula, r(s, a) is the immediate reward for executing action a in state s and is determined by the improvement rate IR of the solution π; IR is calculated as [C_max(π) − C_max(π′)]/C_max(π), where C_max(π) is the maximum completion time of the scheduling solution π and π′ is the new candidate solution obtained after executing the action.
CN202311565429.0A 2023-11-22 2023-11-22 Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning Active CN117519030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311565429.0A CN117519030B (en) 2023-11-22 2023-11-22 Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311565429.0A CN117519030B (en) 2023-11-22 2023-11-22 Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning

Publications (2)

Publication Number Publication Date
CN117519030A CN117519030A (en) 2024-02-06
CN117519030B true CN117519030B (en) 2024-04-26

Family

ID=89764162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311565429.0A Active CN117519030B (en) 2023-11-22 2023-11-22 Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning

Country Status (1)

Country Link
CN (1) CN117519030B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344011A (en) * 2022-03-17 2022-11-15 兰州理工大学 Energy-saving distributed two-stage assembly zero-waiting flow shop scheduling problem optimization system
CN116300738A (en) * 2023-03-12 2023-06-23 兰州理工大学 Complex workshop scheduling optimizer based on improved meta heuristic algorithm
CN116466659A (en) * 2023-03-21 2023-07-21 昆明理工大学 Distributed assembly flow shop scheduling method based on deep reinforcement learning
CN116700176A (en) * 2023-06-19 2023-09-05 兰州理工大学 Distributed blocking flow shop scheduling optimization system based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fang-Chun Wu, Bin Qian. "A Q-Learning-Based Hyper-Heuristic Evolutionary Algorithm for the Distributed Flexible Job-Shop Scheduling Problem". Lecture Notes in Computer Science, 2023, full text. *
Jin-Han Zhu, Rong Hu. "Hyper-heuristic Q-Learning Algorithm for Flow-Shop Scheduling Problem with Fuzzy Processing Times". Lecture Notes in Computer Science, 2023, full text. *

Also Published As

Publication number Publication date
CN117519030A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
Chamnanlor et al. Embedding ant system in genetic algorithm for re-entrant hybrid flow shop scheduling problems with time window constraints
Wang et al. Application of reinforcement learning for agent-based production scheduling
CN113792924A (en) Single-piece job shop scheduling method based on Deep reinforcement learning of Deep Q-network
CN111985672B (en) Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
CN101901425A (en) Flexible job shop scheduling method based on multi-species coevolution
CN112561225B (en) Flexible job shop scheduling method based on marker post co-evolution algorithm
Vidal et al. Machine scheduling in custom furniture industry through neuro-evolutionary hybridization
CN116690589B (en) Robot U-shaped dismantling line dynamic balance method based on deep reinforcement learning
CN107831740A (en) A kind of Optimization Scheduling during the distributed manufacturing applied to notebook part
CN115130789A (en) Distributed manufacturing intelligent scheduling method based on improved wolf optimization algorithm
CN111144710A (en) Construction and dynamic scheduling method of sustainable hybrid flow shop
Wei et al. A multi-objective migrating birds optimization algorithm based on game theory for dynamic flexible job shop scheduling problem
CN114926033A (en) Flexible job shop dynamic event scheduling method based on improved NSGAII
CN115933568A (en) Multi-target distributed hybrid flow shop scheduling method
CN116027753A (en) Multi-target mixed flow shop scheduling method and system
CN115983423A (en) Feeding and discharging scene scheduling optimization method considering double resource constraints
Cao et al. An adaptive multi-strategy artificial bee colony algorithm for integrated process planning and scheduling
CN117519030B (en) Distributed assembly blocking flow shop scheduling method based on hyper-heuristic reinforcement learning
CN117369378A (en) Mixed flow manufacturing shop scheduling method and system based on Monte Carlo tree search algorithm
CN116774657A (en) Dynamic scheduling method for remanufacturing workshop based on robust optimization
CN116822217A (en) Man-machine double-resource constraint multi-target production scheduling method considering man-hour uncertainty
CN115730799A (en) Method, system and equipment for scheduling production tasks of flexible assembly job workshop
Yan et al. A job shop scheduling approach based on simulation optimization
CN113505910B (en) Mixed workshop production scheduling method containing multi-path limited continuous output inventory
CN110689216A (en) Robot assembly unit JIT energy-saving production optimization method based on mixed fruit fly algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant