CN115204497A - Prefabricated part production scheduling optimization method and system based on reinforcement learning - Google Patents

Prefabricated part production scheduling optimization method and system based on reinforcement learning

Info

Publication number
CN115204497A
Authority
CN
China
Prior art keywords
reinforcement learning
prefabricated part
scheduling
model
production
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846471.9A
Other languages
Chinese (zh)
Inventor
邓晓平
刘福磊
李成栋
房海波
侯和涛
彭伟
刘洪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd
Shandong Jianzhu University
Original Assignee
Shandong University
Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University, Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd, Shandong Jianzhu University filed Critical Shandong University
Priority to CN202210846471.9A
Publication of CN115204497A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a prefabricated part production scheduling optimization method and system based on reinforcement learning. Real-time production data and historical production data are acquired to establish a prefabricated part scheduling model; an optimization target of the prefabricated part scheduling model is determined; the solution of the optimization target of the scheduling model is converted into a solution based on a deep reinforcement learning model; an experience replay pool is established based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and data are randomly extracted in batches from the experience replay pool to iteratively update the deep reinforcement learning model, yielding a trained deep reinforcement learning model; finally, the order information of the prefabricated parts is input into the trained deep reinforcement learning model, which outputs an optimal scheduling strategy. The method does not depend on a model and can dynamically adapt to system disturbances caused by external factors in the actual system, such as design changes and rush orders.

Description

Prefabricated part production scheduling optimization method and system based on reinforcement learning
Technical Field
The invention belongs to the technical field of prefabricated part production, and particularly relates to a prefabricated part production scheduling optimization method and system based on reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Prefabricated (assembled) buildings separate the production of prefabricated parts from their installation, which allows industrialized production and management to be fully applied. They save energy, protect the environment, build quickly and deliver high project quality, can effectively promote the intensive transformation of the construction industry, and are the future development direction of China's construction industry. With the increasing popularity of prefabricated structures, the demand for prefabricated components keeps growing. At present, however, prefabricated part production is usually arranged according to manual experience and the preferences of production managers. Unreasonable production plans often cause the production schedule to mismatch the construction schedule, equipment resources are under-utilized, and supply delays occur. In addition, stockpiling prefabricated components sometimes wastes inventory resources. It is therefore necessary to develop a production scheduling optimization method suited to prefabricated parts, to improve production efficiency and the ability to cope with emergencies, and to reduce production costs.
At the present stage, most domestic prefabricated parts are produced in batches against orders, according to the on-site assembly requirements of prefabricated buildings. The production modes are mainly the flow-line mode and the fixed mold table mode. In flow-line production, prefabricated parts move in sequence through the workstations of each process, and each process has dedicated workers performing the corresponding operations; this mode mainly produces slab components such as laminated slabs, inner walls and outer walls. In fixed mold table production, all processes are carried out on a prefabricated part at a fixed position, and the workers in charge of each process may be the same or different; this mode mainly produces special-shaped parts such as stairs. The flow line has a higher degree of automation, the specialized division of labor makes production more efficient and flexible, and the overall operating efficiency is higher; the fixed mode is easy to manage and simple to schedule, but utilizes production resources very inefficiently. The flow-line mode is currently the mainstream production mode for assembled concrete prefabricated parts.
In the flow-line mode, the production process of a prefabricated part mainly comprises six stages: mold assembly, embedding, pouring, curing, demolding and repairing. The main idea of existing scheduling optimization methods for flow-line prefabricated part production is to first establish a constrained production scheduling mathematical model combining the process characteristics and the indicators of interest, define an objective function, then solve the model with heuristic rules derived from experience, dynamic programming, or swarm intelligence algorithms such as genetic algorithms and particle swarm optimization, and finally present the schedule as a bar chart or Gantt chart. Such methods have the following problems:
(1) The final scheduling scheme depends entirely on the design of the model and cannot dynamically adapt to system disturbances caused by external factors in the actual system, such as design changes and rush orders.
(2) Dynamic programming suffers from the curse of dimensionality and cannot effectively handle large-scale problems; swarm intelligence algorithms optimize slowly and cannot meet the system's real-time control requirements.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a prefabricated part production scheduling optimization method and system based on reinforcement learning.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions: a prefabricated part production scheduling optimization method based on reinforcement learning comprises the following steps:
acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
determining an optimization target of the prefabricated part scheduling model;
converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model with data randomly extracted from the experience replay pool to obtain a trained deep reinforcement learning model;
and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
Further, the optimization target is to minimize the tardiness penalty or, when there is no tardiness, to minimize the maximum completion time. When the optimization target is to minimize the tardiness penalty P_min:

P_min = min Σ_{l=1}^{F} Σ_{i=1}^{n_l} w_{π_l^i} · max(0, C_{π_l^i,6} − d_{π_l^i})

wherein l_{1~F}: {l_1, l_2, l_3, ..., l_F} are the assembly line numbers, n_l is the number of orders allocated to production line l, and π_l^i is the i-th prefabricated part in the production scheduling sequence on line l, 1 ≤ i ≤ n_l.

When the optimization target is to minimize the maximum completion time f_min:

f_min = min C_max

wherein C_max = max_{1≤l≤F} C_{π_l^{n_l},6}.
further, the production data comprises an order number, a production line number, a processing machine number, the processing starting time of the prefabricated part and the processing finishing time of the prefabricated part.
Further, converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model comprises:
establishing a state set: n features strongly correlated with the optimization target are selected, including but not limited to the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, and the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part;
establishing an action set, including but not limited to selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time;
establishing a reward mechanism, taking the machine idle time or the per-unit-time waiting time of the prefabricated part as the reward function:
r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).
Further, parameters in the deep reinforcement learning model are updated iteratively by adopting a DQN algorithm in the deep reinforcement learning.
Further, the training of the deep reinforcement learning model comprises the following steps:
S1: initializing the neural network parameters;
S2: initializing the state of each machine;
S3: selecting an action based on the ε-greedy strategy, executing it, obtaining the immediate reward, and updating the machine state;
S4: storing the state transition data in the experience replay pool, with new experience data overwriting the oldest;
S5: randomly extracting batches of data from the experience replay pool to update the parameters of the estimation neural network, and judging whether the current round is finished according to the set number of target iterations per round; if so, performing S6, otherwise performing S3;
S6: updating the target neural network parameters and judging whether the termination condition is reached; if not, executing S2; if so, the training is finished.
Furthermore, a prioritized experience replay strategy is adopted in training the deep reinforcement learning model.
The second aspect of the present invention provides a prefabricated part production scheduling optimization system based on reinforcement learning, comprising:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and randomly extracting batches of data from the experience replay pool to iteratively update the deep reinforcement learning model, obtaining a trained deep reinforcement learning model;
a policy output module: inputting the order information of the prefabricated parts into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
A third aspect of the invention provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the above-described method.
A fourth aspect of the invention provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the method.
The above one or more technical solutions have the following beneficial effects:
the scheduling scheme based on reinforcement learning does not depend on a model, and can dynamically adapt to system disturbance caused by external factors of an actual system, such as design change, emergency insertion and the like.
The invention adopts a prioritized experience replay strategy, so the iterative solution is faster and can meet the requirements of actual production.
Compared with traditional heuristic scheduling methods, the method enjoys the excellent linearity, parallelism and reentrancy of deep reinforcement learning.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and do not limit the invention.
FIG. 1 is a block diagram illustrating the production scheduling optimization of prefabricated parts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a state sensing module according to an embodiment of the present invention;
FIG. 3 is a flow chart of a deep reinforcement learning model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the embodiment discloses a prefabricated part production scheduling optimization method based on reinforcement learning, which includes the following steps:
Step 1: acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
Step 2: determining an optimization target of the prefabricated part scheduling model;
Step 3: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
Step 4: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model with data randomly extracted from the experience replay pool to obtain a trained deep reinforcement learning model;
Step 5: inputting the order information of the prefabricated parts into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
In the method, a production sensing module obtains prior and real-time production experience of the prefabricated parts and estimates the time required by each process for different parts; a concrete prefabricated part production scheduling optimization model is then established from the acquired data, with the optimization target of minimizing the tardiness penalty of the prefabricated parts or, when there is no tardiness, minimizing the maximum completion time; finally, a reinforcement learning mechanism and simply designed heuristic scheduling rules are introduced, and the production scheduling problem is optimized and solved in combination with the production characteristics of the prefabricated parts.
In this embodiment, real-time production data and historical production data are obtained by establishing a state sensing module; specifically, the data comprise the position of the prefabricated part on the production line (including the line number and the process number) and the times at which processing starts and finishes.
As shown in fig. 2, an RFID reader/writer with read/write and wireless communication functions is installed at the side of the mold table, and passive RFID tags are installed on the conveying wheels and ferry carts responsible for transporting the mold tables at each process of the assembly line. The wireless communication function can be realized with modules such as LoRa, Wi-Fi and NB-IoT. After the order system issues an order production task, the order number and the number of the prefabricated part to be produced are sent wirelessly to the RFID reader/writer on the mold table, binding the mold table to the specific prefabricated part in the order.
The mold table flows through the production stages on the assembly line in a fixed sequence. When it reaches a process stage, the reader/writer on the mold table approaches the passive RFID tag installed on the conveying device of the mold table circulation system at that stage, reads the data in the tag, and uploads the data and the reading time to the order system through the wireless network.
In this embodiment, features are extracted from the acquired real-time and historical production data, and each record is stored in the form {"order number", "line number", "machine number", "processing start time", "processing completion time"} for state feature extraction and data management.
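For illustration, one such record could be sketched as follows (the field names and values are assumptions; the embodiment only specifies the five categories):

```python
# A minimal sketch of one production record; field names are illustrative.
record = {
    "order_number": "J3",       # order the prefabricated part belongs to
    "line_number": "l1",        # assembly line the part is routed on
    "machine_number": 4,        # process stage / machine that handled it
    "start_time": "2022-07-19 08:30:00",
    "finish_time": "2022-07-19 09:10:00",
}
```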
In steps 1 and 2 of this embodiment, the prefabricated part production scheduling model is established, specifically:
(1) Symbol definitions
Order numbers of the concrete prefabricated parts J_{1~n}: {J_1, J_2, J_3, ..., J_n};
Assembly line numbers l_{1~F}: {l_1, l_2, l_3, ..., l_F};
Production process number k of the concrete prefabricated part, k = 1~6;
n_l: the number of orders allocated to production line l;
d_j: the due date of prefabricated part j;
w_j: the tardiness penalty cost of prefabricated part j;
B_{j,k}: the start time of process k of prefabricated part j;
C_{j,k}: the completion time of process k of prefabricated part j;
p_{j,k}: the processing time of process k of prefabricated part j;
π_l^i: the i-th prefabricated part in the production scheduling sequence on line l, where 1 ≤ i ≤ n_l.
(2) Completion time of each prefabricated part
Ignoring the transfer time of prefabricated parts between machines, machine failures and emergencies, the completion time of each process of each prefabricated part can be expressed as follows.
The completion time of the first process of each prefabricated part can be expressed as:

C_{π_l^i,1} = C_{π_l^{i-1},1} + p_{π_l^i,1}   (1)

Completion times of production processes 2-6 of each prefabricated part:

C_{π_l^i,k} = max(C_{π_l^i,k-1}, C_{π_l^{i-1},k}) + p_{π_l^i,k}, k = 2, 3, 5, 6   (2)

C_{π_l^i,4} = B_{π_l^i,4} + T   (3)

wherein the curing kiln has capacity X_l and the steam-curing time is a fixed duration T; limited by the steam-curing capacity, when the kiln is full, the time at which curing of the i-th prefabricated part on line l starts is not earlier than the steam-curing completion time of the y-th (y < i) prefabricated part.
The maximum completion time C_max of the orders is:

C_max = max_{1≤l≤F} C_{π_l^{n_l},6}   (4)
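To make the recursion concrete, the following sketch computes the per-process completion times and the line's maximum completion time under the stated assumptions; the kiln-capacity constraint of eq. (3) is omitted for brevity, and the function and variable names are illustrative:

```python
def completion_times(p, T, curing=3):
    """Completion times C[i][k] on one flow line with 6 processes (k = 0..5).

    p[i][k] is the processing time of the i-th scheduled part at process k;
    the curing process (index `curing`) uses the fixed duration T instead.
    The kiln-capacity constraint is omitted here.
    """
    n, K = len(p), 6
    C = [[0.0] * K for _ in range(n)]
    for i in range(n):
        for k in range(K):
            dur = T if k == curing else p[i][k]
            free = C[i - 1][k] if i > 0 else 0.0    # machine k released, eq. (1)/(2)
            ready = C[i][k - 1] if k > 0 else 0.0   # part done with process k-1
            C[i][k] = max(free, ready) + dur
    return C

# Example: 3 parts, 6 processes, fixed curing duration T = 8
p = [[2, 3, 1, 0, 2, 1], [1, 2, 2, 0, 1, 2], [3, 1, 2, 0, 2, 1]]
C = completion_times(p, T=8)
C_max = C[-1][-1]   # maximum completion time of this line, eq. (4)
```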
in the present embodiment, the optimization goal of the prefabricated part scheduling model is determined to minimize the stall penalty of the prefabricated parts or minimize the maximum completion time without stall.
The optimization objective is to minimize the stall penalty P min The optimization function can be expressed as:
Figure BDA0003753040110000091
when the prefabricated part is produced without pull-out, the optimization objective is to minimize the maximum completion time f min
f min =min C max (6)
In this embodiment, reinforcement learning is applied to the field of production scheduling optimization: the problem to be optimized is converted into a Markov decision process, and a reinforcement learning state set, action set, reward mechanism and experience replay mechanism are established around the optimization target. State features of the machines are extracted to build the feature set; a set of selectable machine actions is formed from scheduling rules; a reward mechanism associated with the optimization target is established; and experience data are sampled with a prioritized replay strategy to update the parameters.
In this embodiment, the building of the deep reinforcement learning model includes:
(1) Establishing the state feature set
Data features are extracted from the data collected by the state sensing module according to the selected feature categories. The state of the i-th machine on the l-th line can be expressed by its state features s_i^l, and n common features strongly correlated with the optimization target are selected to establish the state set:

s_i^l = (f_1, f_2, ..., f_n)   (7)

The features may include, but are not limited to, the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part, and so on.
(2) Establishing the action set
A reinforcement learning action set is established for each machine according to simple heuristic scheduling rules, which may include, but are not limited to, m rules such as selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time. The action set is established as:

A_i^l = {a_1^{l,i}, a_2^{l,i}, ..., a_m^{l,i}}   (8)

wherein a_m^{l,i} denotes action number m of the i-th machine on the l-th production line, for example selecting the workpiece with the longest processing time. In each state the machine selects one action from the action set and transitions to the next state, and the optimal action for each state is obtained through multiple rounds of iteration.
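Each action can be implemented as a dispatching rule that picks the next workpiece from the machine's queue. A sketch covering the rules named above (the job fields are assumptions):

```python
# Each action indexes a heuristic dispatching rule over the machine's queue.
# `jobs` is a list of dicts with illustrative keys.
ACTIONS = [
    lambda jobs: max(jobs, key=lambda j: j["proc_time"]),       # longest processing time
    lambda jobs: min(jobs, key=lambda j: j["proc_time"]),       # shortest processing time
    lambda jobs: min(jobs, key=lambda j: j["remaining_time"]),  # shortest remaining time
    lambda jobs: max(jobs, key=lambda j: j["next_proc_time"]),  # longest next process
    lambda jobs: min(jobs, key=lambda j: j["next_proc_time"]),  # shortest next process
]

def apply_action(a, queue):
    """Select the next workpiece from `queue` using rule number `a`."""
    return ACTIONS[a](queue)
```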
(3) Establishing the reward function
The machine idle time or the per-unit-time waiting time of the prefabricated part is taken as the reward function, expressed as:

r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).   (9)

The total reward of each iteration round is:

R = Σ_t r_t   (10)
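Under the two cases of eq. (9), the per-step reward can be sketched as follows (argument names are illustrative):

```python
def step_reward(next_machine_busy, wait_time, idle_time):
    """Reward of one scheduling decision, eq. (9) (sketch).

    next_machine_busy -- True if the machine for process k+1 is occupied
                         when process k of the current part completes
    wait_time         -- queuing wait time of the workpiece per unit time
    idle_time         -- idle time of machine k waiting for the next part
    """
    return -wait_time if next_machine_busy else -idle_time

def total_reward(step_rewards):
    """Total reward of one iteration round, eq. (10)."""
    return sum(step_rewards)
```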
(4) Determining the parameter update mode
In this embodiment, the DQN algorithm in deep reinforcement learning is employed.
Updating the action-value function:

Q(s_i, a_i) ← Q(s_i, a_i) + α[ r_i + γ max_{a'} Q(s_{i+1}, a'; θ⁻) − Q(s_i, a_i; θ) ]   (11)

To stabilize the training process, two neural networks are introduced: a target network and an estimation network. The estimation network is updated immediately, and its parameters are copied to the target network every n steps. Here θ denotes the estimation network parameters and θ⁻ the target network parameters; γ is the discount factor, taking a value from 0 to 1; α is the learning rate, taking a value from 0 to 1; Q_i^l(s, a) denotes the action-value function, fitted by the neural network, of the i-th machine on the l-th production line selecting a given action in a given state; max denotes taking the maximum; r_i denotes the reward.
Updating the neural network parameters:
The neural network parameters are updated by gradient descent, with the formula:

θ ← θ + α[ r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) ] ∇_θ Q(s, a; θ)   (12)

θ⁻ is updated after each round finishes, in the manner θ⁻ ← θ; ∇_θ Q(s, a; θ) denotes the gradient of the action-value function.
As shown in fig. 4, the neural network used in this embodiment is a BP neural network, a multi-layer feedforward network trained by the error back-propagation algorithm; its topology comprises an input layer, a hidden layer and an output layer. The input of the BP neural network (target network and estimation network) is the state s_i^l and the action a_i^l, and the output is the action-value function Q_i^l(s, a). By continuously updating the BP network parameters, the action values fitted by the network become more accurate.
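A minimal sketch of the estimation/target network pair, written here in PyTorch with a single hidden layer matching the described topology (layer width, activation and framework choice are assumptions, not specified by the embodiment):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """BP-style feedforward net: input = (state, action), output = Q value."""
    def __init__(self, state_dim=3, action_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.Sigmoid(),               # classic BP networks use sigmoid units
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

estimate_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(estimate_net.state_dict())   # theta- <- theta
```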
(5) Prioritized experience replay mechanism
To improve the training efficiency of the neural network, a prioritized experience replay strategy weights the experience data (s_i, a_i, r_{i+1}, s_{i+1}) generated in each round of the agent's exploration and exploitation, raising the sampling rate of useful experience and reducing the sampling of useless experience data. Each state can be converted into a vector whose spatial position in a Cartesian coordinate system is expressed as (x_t, y_t, z_t). The target the agent is expected to reach after multiple rounds of training can be expressed as (x_e, y_e, z_e), and the distance between the two can be expressed as:

d_t = √((x_t − x_e)² + (y_t − y_e)² + (z_t − z_e)²)   (13)

The deviation of the experience data generated in a round from the target is:

D(τ) = argmax(d_t)   (14)

The smaller this deviation, the higher the data quality and the larger the priority value, which can be expressed as:

p_i = |−k · D(τ_i) + b|   (15)

wherein k is a proportionality coefficient and b is an integer whose size is adjusted so that the result falls within a reasonable interval.
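The priority computation of eqs. (13)-(15) can be sketched as follows (the state-vector encoding and the constants k and b are application-specific, as the text notes):

```python
import math

def priority(states, target, k=1.0, b=10):
    """Priority of one round's experience, eqs. (13)-(15) (sketch).

    states -- list of (x, y, z) state vectors visited during the round
    target -- (x_e, y_e, z_e), the state the agent should reach
    """
    dists = [math.dist(s, target) for s in states]   # eq. (13)
    D = max(dists)                                   # eq. (14)
    return abs(-k * D + b)                           # eq. (15)
```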
As shown in fig. 3, in this embodiment, solving the constructed deep reinforcement learning model for a strategy specifically comprises:
Step 3-1: initializing the neural network parameters;
The two neural networks, the estimation network and the target network, are randomly initialized with Gaussian distributions.
Step 3-2: initializing the state of each machine;
Before scheduling, all machines are in an idle, available state, and machine failures are not considered.
Step 3-3: selecting an action based on the ε-greedy strategy, executing the action to obtain the immediate reward, and updating the machine state;
An action is selected randomly with probability ε (0-1), and the action with the maximum action value is selected with probability 1−ε. As the number of iterations increases, ε gradually decreases, reducing the exploration rate.
Step 3-4: storing the state transition data in the experience replay pool, with new experience data overwriting the old.
Step 3-5: extracting historical experience data from the experience pool to update the parameters θ of the estimation network, and judging whether the target number of iterations has been reached; if not, executing step 3-3; if so, executing step 3-6;
Step 3-6: updating the target network parameters θ⁻; after each round finishes, the parameters θ of the estimation network are copied to the target network as θ⁻. The number of rounds is set according to pre-training convergence, and training terminates when it is reached: if the termination condition is not met, step 3-2 is executed; if it is, the training is finished.
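Putting steps 3-1 through 3-6 together, a skeleton of the training loop might look as follows; the scheduling environment `env` (with reset/step methods), the QNetwork from the earlier sketch, and all hyperparameter values are assumptions:

```python
import random
from collections import deque
import torch

N_ACTIONS = 5   # size m of the action set (illustrative)

def max_q(net, state):
    """max over the discrete action set; actions are one-hot encoded."""
    qs = [net(state, torch.eye(N_ACTIONS)[j]) for j in range(N_ACTIONS)]
    return torch.stack(qs).max()

def train_dqn(env, estimate_net, target_net, episodes=500, steps=200,
              gamma=0.95, alpha=1e-3, eps=1.0, eps_min=0.05, eps_decay=0.995,
              batch_size=32, buffer_size=10_000):
    """Sketch of steps 3-1..3-6; `env.reset()` returns a state tensor and
    `env.step(a)` returns (next_state, reward, done)."""
    optimizer = torch.optim.SGD(estimate_net.parameters(), lr=alpha)
    replay = deque(maxlen=buffer_size)       # new data overwrites the oldest
    for _ in range(episodes):                # step 3-2: reset machine states
        state = env.reset()
        for _ in range(steps):               # step 3-3: epsilon-greedy action
            if random.random() < eps:
                a = random.randrange(N_ACTIONS)
            else:
                with torch.no_grad():
                    a = max(range(N_ACTIONS), key=lambda j: estimate_net(
                        state, torch.eye(N_ACTIONS)[j]).item())
            next_state, reward, done = env.step(a)
            replay.append((state, a, reward, next_state))       # step 3-4
            if len(replay) >= batch_size:                       # step 3-5
                for s, ai, r, s2 in random.sample(replay, batch_size):
                    with torch.no_grad():
                        td_target = r + gamma * max_q(target_net, s2)
                    q = estimate_net(s, torch.eye(N_ACTIONS)[ai]).squeeze()
                    loss = (td_target - q) ** 2                 # eq. (11)/(12)
                    optimizer.zero_grad(); loss.backward(); optimizer.step()
            state = next_state
            if done:
                break
        eps = max(eps_min, eps * eps_decay)  # exploration gradually decreases
        target_net.load_state_dict(estimate_net.state_dict())   # step 3-6
```

In practice, the uniform random.sample above would be replaced by sampling proportional to the priorities p_i of eq. (15) to realize prioritized replay.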
In step 4 of this embodiment, the optimal strategy is found from the prefabricated part order information (including serial number, due date, penalty cost, etc.) through the deep reinforcement learning model. For example, with 9 prefabricated part orders numbered J_1-J_9 and 3 production lines, several near-optimal solutions within a certain range are output, such as Scheme 1: l_1: J_3, J_5, J_2; l_2: J_7, J_1, J_4; l_3: J_8, J_9, J_6; and Scheme 2: l_1: J_2, J_4, J_1; l_2: J_7, J_9, J_8; l_3: J_3, J_6, J_5. A suitable scheme is then selected according to production preferences and other non-essential factors.
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
The embodiment aims to provide a prefabricated part production scheduling optimization system based on reinforcement learning, which comprises:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model using the experience replay pool to obtain a trained deep reinforcement learning model;
a policy output module: outputting an optimal scheduling strategy from the trained deep reinforcement learning model according to the order information of the prefabricated parts.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description section of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A prefabricated part production scheduling optimization method based on reinforcement learning is characterized by comprising the following steps:
acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
determining an optimization target of the prefabricated part scheduling model;
converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model using the experience replay pool to obtain a trained deep reinforcement learning model;
and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
2. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the optimization target is to minimize the tardiness penalty or, when there is no tardiness, to minimize the maximum completion time; when the optimization target is to minimize the tardiness penalty P_min:

P_min = min Σ_{l=1}^{F} Σ_{i=1}^{n_l} w_{π_l^i} · max(0, C_{π_l^i,6} − d_{π_l^i})

wherein l_{1~F}: {l_1, l_2, l_3, ..., l_F} are the assembly line numbers, n_l is the number of orders allocated to production line l, and π_l^i is the i-th prefabricated part in the production scheduling sequence on line l, 1 ≤ i ≤ n_l;
when the optimization target is to minimize the maximum completion time f_min:

f_min = min C_max

wherein C_max = max_{1≤l≤F} C_{π_l^{n_l},6}.
3. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the production data comprise an order number, a line number, a machine number, the processing start time of the prefabricated part, and the processing completion time of the prefabricated part.
4. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model comprises:
establishing a state set: n features strongly correlated with the optimization target are selected, including but not limited to the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, and the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part;
establishing an action set, including but not limited to selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time;
establishing a reward mechanism, taking the machine idle time or the per-unit-time waiting time of the prefabricated part as the reward function:
r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).
5. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the parameters in the deep reinforcement learning model are iteratively updated by the DQN algorithm in deep reinforcement learning.
6. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the training of the deep reinforcement learning model comprises:
S1: initializing the neural network parameters;
S2: initializing the state of each machine;
S3: selecting an action based on the ε-greedy strategy, executing it, obtaining the immediate reward, and updating the machine state;
S4: storing the state transition data in the experience replay pool, with new experience data overwriting the oldest;
S5: randomly extracting batches of data from the experience replay pool to update the parameters of the estimation neural network, and judging whether the current round is finished according to the set number of target iterations per round; if so, performing S6, otherwise performing S3;
S6: updating the target neural network parameters and judging whether the termination condition is reached; if not, executing S2; if so, the training is finished.
7. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the deep reinforcement learning model is trained with a prioritized experience replay strategy.
8. A reinforcement learning-based prefabricated part production scheduling optimization system is characterized by comprising:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and randomly extracting batches of data from the experience replay pool to iteratively update the deep reinforcement learning model, obtaining a trained deep reinforcement learning model;
a policy output module: and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the reinforcement learning-based prefabricated part production scheduling optimization method according to any one of claims 1 to 7.
10. A processing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the reinforcement learning-based prefabricated part production scheduling optimization method according to any one of claims 1 to 7.
CN202210846471.9A 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning Pending CN115204497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210846471.9A CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210846471.9A CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115204497A (en) 2022-10-18

Family

ID=83582977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846471.9A Pending CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115204497A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116307440A (en) * 2022-11-21 2023-06-23 暨南大学 Workshop scheduling method based on reinforcement learning and multi-objective weight learning, device and application thereof
CN116307440B (en) * 2022-11-21 2023-11-17 暨南大学 Workshop scheduling method based on reinforcement learning and multi-objective weight learning, device and application thereof
CN115793583A (en) * 2022-12-02 2023-03-14 福州大学 Flow shop new order insertion optimization method based on deep reinforcement learning
CN115600774A (en) * 2022-12-14 2023-01-13 安徽大学绿色产业创新研究院(Cn) Multi-target production scheduling optimization method for assembly type building component production line
CN115600774B (en) * 2022-12-14 2023-03-10 安徽大学绿色产业创新研究院 Multi-target production scheduling optimization method for assembly type building component production line
CN116307047A (en) * 2022-12-16 2023-06-23 中建八局第二建设有限公司 Multi-raw-material one-dimensional blanking optimization method based on tabu search and half tensor product
CN116307047B (en) * 2022-12-16 2023-10-17 中建八局第二建设有限公司 Multi-raw-material one-dimensional blanking optimization method based on tabu search and half tensor product
CN116542498A (en) * 2023-07-06 2023-08-04 杭州宇谷科技股份有限公司 Battery scheduling method, system, device and medium based on deep reinforcement learning
CN116542498B (en) * 2023-07-06 2023-11-24 杭州宇谷科技股份有限公司 Battery scheduling method, system, device and medium based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN115204497A (en) Prefabricated part production scheduling optimization method and system based on reinforcement learning
CN110691422B (en) Multi-channel intelligent access method based on deep reinforcement learning
CN112734172A (en) Hybrid flow shop scheduling method based on time sequence difference
CN109359888B (en) Comprehensive scheduling method for tight connection constraint among multiple equipment processes
CN101216710A (en) Self-adapting selection dynamic production scheduling control system accomplished through computer
CN113255216B (en) Steelmaking production scheduling method, system, medium and electronic terminal
CN112508398B (en) Dynamic production scheduling method and device based on deep reinforcement learning and electronic equipment
CN113822525B (en) Flexible job shop multi-target scheduling method and system based on improved genetic algorithm
CN109858780A (en) A kind of Steelmaking-Continuous Casting Production Scheduling optimization method
CN113075886B (en) Steelmaking continuous casting scheduling method and device based on distributed robust opportunity constraint model
CN112836974A (en) DQN and MCTS based box-to-box inter-zone multi-field bridge dynamic scheduling method
CN115319742A (en) Flexible manufacturing unit operation scheduling method with robot material handling
CN113283013A (en) Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN117196169A (en) Machine position scheduling method based on deep reinforcement learning
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN115345306A (en) Deep neural network scheduling method and scheduler
CN115437321A (en) Micro-service-multi-agent factory scheduling model based on deep reinforcement learning network
CN115826530A (en) Job shop batch scheduling method based on D3QN and genetic algorithm
CN114675647A (en) AGV trolley scheduling and path planning method
CN114580728A (en) Elevator dispatching method and device, storage medium and electronic equipment
CN114862170B (en) Learning type intelligent scheduling method and system for manufacturing process of communication equipment
CN110826909B (en) Workflow execution method based on rule set
CN116307241B (en) Distributed job shop scheduling method based on reinforcement learning with constraint multiple agents
CN117634859B (en) Resource balance construction scheduling method, device and equipment based on deep reinforcement learning
CN116882660A (en) Self-adaptive production scheduling method, device and medium based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination