CN115204497A - Prefabricated part production scheduling optimization method and system based on reinforcement learning - Google Patents

Prefabricated part production scheduling optimization method and system based on reinforcement learning

Info

Publication number
CN115204497A
Authority
CN
China
Prior art keywords
reinforcement learning
prefabricated part
scheduling
model
production
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846471.9A
Other languages
Chinese (zh)
Inventor
邓晓平
刘福磊
李成栋
房海波
侯和涛
彭伟
刘洪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd
Shandong Jianzhu University
Original Assignee
Shandong University
Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University, Second Construction Co Ltd of China Construction Eighth Engineering Division Co Ltd, Shandong Jianzhu University filed Critical Shandong University
Priority to CN202210846471.9A
Publication of CN115204497A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a prefabricated part production scheduling optimization method and system based on reinforcement learning. Real-time production data and historical production data are acquired to establish a prefabricated part scheduling model; an optimization target of the prefabricated part scheduling model is determined; the solution of the optimization target of the scheduling model is converted into a solution based on a deep reinforcement learning model; an experience replay pool is established based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and data are randomly extracted in batches from the experience replay pool to iteratively update the deep reinforcement learning model, yielding a trained deep reinforcement learning model; finally, the order information of the prefabricated parts is input into the trained deep reinforcement learning model, which outputs an optimal scheduling strategy. The method does not depend on a model and can dynamically adapt to system disturbances caused by external factors in the actual system, such as design changes and rush orders.

Description

Prefabricated part production scheduling optimization method and system based on reinforcement learning
Technical Field
The invention belongs to the technical field of prefabricated part production, and particularly relates to a prefabricated part production scheduling optimization method and system based on reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Prefabricated (assembled) buildings separate the production of prefabricated parts from their installation, which allows industrialized production and management to be fully applied. They save energy, protect the environment, build quickly and deliver high project quality, can effectively promote the intensive transformation of the construction industry, and are the future development direction of China's construction industry. With the increasing popularity of prefabricated structures, the demand for prefabricated components keeps growing. At present, however, prefabricated part production is usually arranged according to manual experience and the preferences of production managers. Unreasonable production plans often cause the production schedule to mismatch the construction schedule, equipment resources are under-utilized, and supply delays occur. In addition, stockpiling prefabricated components sometimes wastes inventory resources. It is therefore necessary to develop a production scheduling optimization method suited to prefabricated parts, to improve production efficiency and the ability to cope with emergencies, and to reduce production costs.
At the present stage, most domestic prefabricated parts are produced in batches against orders, according to the on-site assembly requirements of prefabricated buildings. The production modes are mainly the flow-line mode and the fixed mold table mode. In flow-line production, prefabricated parts move in sequence through the workstations of each process, and each process has dedicated workers performing the corresponding operations; this mode mainly produces slab components such as laminated slabs, inner walls and outer walls. In fixed mold table production, all processes are carried out on a prefabricated part at a fixed position, and the workers in charge of each process may be the same or different; this mode mainly produces special-shaped parts such as stairs. The flow line has a higher degree of automation, the specialized division of labor makes production more efficient and flexible, and the overall operating efficiency is higher; the fixed mode is easy to manage and simple to schedule, but utilizes production resources very inefficiently. The flow-line mode is currently the mainstream production mode for assembled concrete prefabricated parts.
In the flow-line mode, the production process of a prefabricated part mainly comprises six stages: mold assembly, embedding, pouring, curing, demolding and repairing. The main idea of existing scheduling optimization methods for flow-line prefabricated part production is to first establish a constrained production scheduling mathematical model combining the process characteristics and the indicators of interest, define an objective function, then solve the model with heuristic rules derived from experience, dynamic programming, or swarm intelligence algorithms such as genetic algorithms and particle swarm optimization, and finally present the schedule as a bar chart or Gantt chart. Such methods have the following problems:
(1) The final scheduling scheme depends entirely on the design of the model and cannot dynamically adapt to system disturbances caused by external factors in the actual system, such as design changes and rush orders.
(2) Dynamic programming suffers from the curse of dimensionality and cannot effectively handle large-scale problems; swarm intelligence algorithms optimize slowly and cannot meet the system's real-time control requirements.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a prefabricated part production scheduling optimization method and system based on reinforcement learning.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions: a prefabricated part production scheduling optimization method based on reinforcement learning comprises the following steps:
acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
determining an optimization target of the prefabricated part scheduling model;
converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model with data randomly extracted from the experience replay pool to obtain a trained deep reinforcement learning model;
and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
Further, the optimization target is to minimize the tardiness penalty or, when there is no tardiness, to minimize the maximum completion time. When the optimization target is to minimize the tardiness penalty P_min:

P_min = min Σ_{l=1}^{F} Σ_{i=1}^{n_l} w_{π_l^i} · max(0, C_{π_l^i,6} − d_{π_l^i})

wherein l_{1~F}: {l_1, l_2, l_3, ..., l_F} are the assembly line numbers, n_l is the number of orders allocated to production line l, and π_l^i is the i-th prefabricated part in the production scheduling sequence on line l, 1 ≤ i ≤ n_l.

When the optimization target is to minimize the maximum completion time f_min:

f_min = min C_max

wherein C_max = max_{1≤l≤F} C_{π_l^{n_l},6}.
further, the production data comprises an order number, a production line number, a processing machine number, the processing starting time of the prefabricated part and the processing finishing time of the prefabricated part.
Further, converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model comprises:
establishing a state set: n features strongly correlated with the optimization target are selected, including but not limited to the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, and the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part;
establishing an action set, including but not limited to selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time;
establishing a reward mechanism, taking the machine idle time or the per-unit-time waiting time of the prefabricated part as the reward function:
r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).
Further, parameters in the deep reinforcement learning model are updated iteratively by adopting a DQN algorithm in the deep reinforcement learning.
Further, the training of the deep reinforcement learning model comprises the following steps:
S1: initializing the neural network parameters;
S2: initializing the state of each machine;
S3: selecting an action based on the ε-greedy strategy, executing it, obtaining the immediate reward, and updating the machine state;
S4: storing the state transition data in the experience replay pool, with new experience data overwriting the oldest;
S5: randomly extracting batches of data from the experience replay pool to update the parameters of the estimation neural network, and judging whether the current round is finished according to the set number of target iterations per round; if so, performing S6, otherwise performing S3;
S6: updating the target neural network parameters and judging whether the termination condition is reached; if not, executing S2; if so, the training is finished.
Furthermore, a prioritized experience replay strategy is adopted in training the deep reinforcement learning model.
The second aspect of the present invention provides a prefabricated part production scheduling optimization system based on reinforcement learning, comprising:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and randomly extracting batches of data from the experience replay pool to iteratively update the deep reinforcement learning model, obtaining a trained deep reinforcement learning model;
a policy output module: inputting the order information of the prefabricated parts into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
A third aspect of the invention provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the above-described method.
A fourth aspect of the invention provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the method.
The above one or more technical solutions have the following beneficial effects:
the scheduling scheme based on reinforcement learning does not depend on a model, and can dynamically adapt to system disturbance caused by external factors of an actual system, such as design change, emergency insertion and the like.
The invention adopts a prioritized experience replay strategy, so the iterative solution is faster and can meet the requirements of actual production.
Compared with traditional heuristic scheduling methods, the method enjoys the excellent linearity, parallelism and reentrancy of deep reinforcement learning.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and do not limit the invention.
FIG. 1 is a block diagram illustrating the production scheduling optimization of prefabricated parts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a state sensing module according to an embodiment of the present invention;
FIG. 3 is a flow chart of a deep reinforcement learning model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the embodiment discloses a prefabricated part production scheduling optimization method based on reinforcement learning, which includes the following steps:
Step 1: acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
Step 2: determining an optimization target of the prefabricated part scheduling model;
Step 3: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
Step 4: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model with data randomly extracted from the experience replay pool to obtain a trained deep reinforcement learning model;
Step 5: inputting the order information of the prefabricated parts into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
In the method, a production sensing module obtains prior and real-time production experience of the prefabricated parts and estimates the time required by each process for different parts; a concrete prefabricated part production scheduling optimization model is then established from the acquired data, with the optimization target of minimizing the tardiness penalty of the prefabricated parts or, when there is no tardiness, minimizing the maximum completion time; finally, a reinforcement learning mechanism and simply designed heuristic scheduling rules are introduced, and the production scheduling problem is optimized and solved in combination with the production characteristics of the prefabricated parts.
In this embodiment, real-time production data and historical production data are obtained by establishing a state sensing module; specifically, the data comprise the position of the prefabricated part on the production line (including the line number and the process number) and the times at which processing starts and finishes.
As shown in fig. 2, an RFID reader/writer with read/write and wireless communication functions is installed at the side of the mold table, and passive RFID tags are installed on the conveying wheels and ferry carts responsible for transporting the mold tables at each process of the assembly line. The wireless communication function can be realized with modules such as LoRa, Wi-Fi and NB-IoT. After the order system issues an order production task, the order number and the number of the prefabricated part to be produced are sent wirelessly to the RFID reader/writer on the mold table, binding the mold table to the specific prefabricated part in the order.
The mold table flows through the production stages on the assembly line in a fixed sequence. When it reaches a process stage, the reader/writer on the mold table approaches the passive RFID tag installed on the conveying device of the mold table circulation system at that stage, reads the data in the tag, and uploads the data and the reading time to the order system through the wireless network.
In this embodiment, features are extracted from the acquired real-time and historical production data, and each record is stored in the form {"order number", "line number", "machine number", "processing start time", "processing completion time"} for state feature extraction and data management.
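For illustration, one such record could be sketched as follows (the field names and values are assumptions; the embodiment only specifies the five categories):

```python
# A minimal sketch of one production record; field names are illustrative.
record = {
    "order_number": "J3",       # order the prefabricated part belongs to
    "line_number": "l1",        # assembly line the part is routed on
    "machine_number": 4,        # process stage / machine that handled it
    "start_time": "2022-07-19 08:30:00",
    "finish_time": "2022-07-19 09:10:00",
}
```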
In steps 1 and 2 of this embodiment, the prefabricated part production scheduling model is established, specifically:
(1) Symbol definitions
Order numbers of the concrete prefabricated parts J_{1~n}: {J_1, J_2, J_3, ..., J_n};
Assembly line numbers l_{1~F}: {l_1, l_2, l_3, ..., l_F};
Production process number k of the concrete prefabricated part, k = 1~6;
n_l: the number of orders allocated to production line l;
d_j: the due date of prefabricated part j;
w_j: the tardiness penalty cost of prefabricated part j;
B_{j,k}: the start time of process k of prefabricated part j;
C_{j,k}: the completion time of process k of prefabricated part j;
p_{j,k}: the processing time of process k of prefabricated part j;
π_l^i: the i-th prefabricated part in the production scheduling sequence on line l, where 1 ≤ i ≤ n_l.
(2) Completion time of each prefabricated part
Ignoring the transfer time of prefabricated parts between machines, machine failures and emergencies, the completion time of each process of each prefabricated part can be expressed as follows.
The completion time of the first process of each prefabricated part can be expressed as:

C_{π_l^i,1} = C_{π_l^{i-1},1} + p_{π_l^i,1}   (1)

Completion times of production processes 2-6 of each prefabricated part:

C_{π_l^i,k} = max(C_{π_l^i,k-1}, C_{π_l^{i-1},k}) + p_{π_l^i,k}, k = 2, 3, 5, 6   (2)

C_{π_l^i,4} = B_{π_l^i,4} + T   (3)

wherein the curing kiln has capacity X_l and the steam-curing time is a fixed duration T; limited by the steam-curing capacity, when the kiln is full, the time at which curing of the i-th prefabricated part on line l starts is not earlier than the steam-curing completion time of the y-th (y < i) prefabricated part.
The maximum completion time C_max of the orders is:

C_max = max_{1≤l≤F} C_{π_l^{n_l},6}   (4)
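To make the recursion concrete, the following sketch computes the per-process completion times and the line's maximum completion time under the stated assumptions; the kiln-capacity constraint of eq. (3) is omitted for brevity, and the function and variable names are illustrative:

```python
def completion_times(p, T, curing=3):
    """Completion times C[i][k] on one flow line with 6 processes (k = 0..5).

    p[i][k] is the processing time of the i-th scheduled part at process k;
    the curing process (index `curing`) uses the fixed duration T instead.
    The kiln-capacity constraint is omitted here.
    """
    n, K = len(p), 6
    C = [[0.0] * K for _ in range(n)]
    for i in range(n):
        for k in range(K):
            dur = T if k == curing else p[i][k]
            free = C[i - 1][k] if i > 0 else 0.0    # machine k released, eq. (1)/(2)
            ready = C[i][k - 1] if k > 0 else 0.0   # part done with process k-1
            C[i][k] = max(free, ready) + dur
    return C

# Example: 3 parts, 6 processes, fixed curing duration T = 8
p = [[2, 3, 1, 0, 2, 1], [1, 2, 2, 0, 1, 2], [3, 1, 2, 0, 2, 1]]
C = completion_times(p, T=8)
C_max = C[-1][-1]   # maximum completion time of this line, eq. (4)
```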
in the present embodiment, the optimization goal of the prefabricated part scheduling model is determined to minimize the stall penalty of the prefabricated parts or minimize the maximum completion time without stall.
The optimization objective is to minimize the stall penalty P min The optimization function can be expressed as:
Figure BDA0003753040110000091
when the prefabricated part is produced without pull-out, the optimization objective is to minimize the maximum completion time f min
f min =min C max (6)
In this embodiment, reinforcement learning is applied to the field of production scheduling optimization: the problem to be optimized is converted into a Markov decision process, and a reinforcement learning state set, action set, reward mechanism and experience replay mechanism are established around the optimization target. State features of the machines are extracted to build the feature set; a set of selectable machine actions is formed from scheduling rules; a reward mechanism associated with the optimization target is established; and experience data are sampled with a prioritized replay strategy to update the parameters.
In this embodiment, the building of the deep reinforcement learning model includes:
(1) Establishing the state feature set
Data features are extracted from the data collected by the state sensing module according to the selected feature categories. The state of the i-th machine on the l-th line can be expressed by its state features s_i^l, and n common features strongly correlated with the optimization target are selected to establish the state set:

s_i^l = (f_1, f_2, ..., f_n)   (7)

The features may include, but are not limited to, the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part, and so on.
(2) Establishing the action set
A reinforcement learning action set is established for each machine according to simple heuristic scheduling rules, which may include, but are not limited to, m rules such as selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time. The action set is established as:

A_i^l = {a_1^{l,i}, a_2^{l,i}, ..., a_m^{l,i}}   (8)

wherein a_m^{l,i} denotes action number m of the i-th machine on the l-th production line, for example selecting the workpiece with the longest processing time. In each state the machine selects one action from the action set and transitions to the next state, and the optimal action for each state is obtained through multiple rounds of iteration.
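Each action can be implemented as a dispatching rule that picks the next workpiece from the machine's queue. A sketch covering the rules named above (the job fields are assumptions):

```python
# Each action indexes a heuristic dispatching rule over the machine's queue.
# `jobs` is a list of dicts with illustrative keys.
ACTIONS = [
    lambda jobs: max(jobs, key=lambda j: j["proc_time"]),       # longest processing time
    lambda jobs: min(jobs, key=lambda j: j["proc_time"]),       # shortest processing time
    lambda jobs: min(jobs, key=lambda j: j["remaining_time"]),  # shortest remaining time
    lambda jobs: max(jobs, key=lambda j: j["next_proc_time"]),  # longest next process
    lambda jobs: min(jobs, key=lambda j: j["next_proc_time"]),  # shortest next process
]

def apply_action(a, queue):
    """Select the next workpiece from `queue` using rule number `a`."""
    return ACTIONS[a](queue)
```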
(3) Establishing the reward function
The machine idle time or the per-unit-time waiting time of the prefabricated part is taken as the reward function, expressed as:

r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).   (9)

The total reward of each iteration round is:

R = Σ_t r_t   (10)
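Under the two cases of eq. (9), the per-step reward can be sketched as follows (argument names are illustrative):

```python
def step_reward(next_machine_busy, wait_time, idle_time):
    """Reward of one scheduling decision, eq. (9) (sketch).

    next_machine_busy -- True if the machine for process k+1 is occupied
                         when process k of the current part completes
    wait_time         -- queuing wait time of the workpiece per unit time
    idle_time         -- idle time of machine k waiting for the next part
    """
    return -wait_time if next_machine_busy else -idle_time

def total_reward(step_rewards):
    """Total reward of one iteration round, eq. (10)."""
    return sum(step_rewards)
```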
(4) Determining the parameter update mode
In this embodiment, the DQN algorithm in deep reinforcement learning is employed.
Updating the action-value function:

Q(s_i, a_i) ← Q(s_i, a_i) + α[ r_i + γ max_{a'} Q(s_{i+1}, a'; θ⁻) − Q(s_i, a_i; θ) ]   (11)

To stabilize the training process, two neural networks are introduced: a target network and an estimation network. The estimation network is updated immediately, and its parameters are copied to the target network every n steps. Here θ denotes the estimation network parameters and θ⁻ the target network parameters; γ is the discount factor, taking a value from 0 to 1; α is the learning rate, taking a value from 0 to 1; Q_i^l(s, a) denotes the action-value function, fitted by the neural network, of the i-th machine on the l-th production line selecting a given action in a given state; max denotes taking the maximum; r_i denotes the reward.
Updating the neural network parameters:
The neural network parameters are updated by gradient descent, with the formula:

θ ← θ + α[ r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) ] ∇_θ Q(s, a; θ)   (12)

θ⁻ is updated after each round finishes, in the manner θ⁻ ← θ; ∇_θ Q(s, a; θ) denotes the gradient of the action-value function.
As shown in fig. 4, the neural network used in this embodiment is a BP neural network, a multi-layer feedforward network trained by the error back-propagation algorithm; its topology comprises an input layer, a hidden layer and an output layer. The input of the BP neural network (target network and estimation network) is the state s_i^l and the action a_i^l, and the output is the action-value function Q_i^l(s, a). By continuously updating the BP network parameters, the action values fitted by the network become more accurate.
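A minimal sketch of the estimation/target network pair, written here in PyTorch with a single hidden layer matching the described topology (layer width, activation and framework choice are assumptions, not specified by the embodiment):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """BP-style feedforward net: input = (state, action), output = Q value."""
    def __init__(self, state_dim=3, action_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.Sigmoid(),               # classic BP networks use sigmoid units
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

estimate_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(estimate_net.state_dict())   # theta- <- theta
```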
(5) Prioritized experience replay mechanism
To improve the training efficiency of the neural network, a prioritized experience replay strategy weights the experience data (s_i, a_i, r_{i+1}, s_{i+1}) generated in each round of the agent's exploration and exploitation, raising the sampling rate of useful experience and reducing the sampling of useless experience data. Each state can be converted into a vector whose spatial position in a Cartesian coordinate system is expressed as (x_t, y_t, z_t). The target the agent is expected to reach after multiple rounds of training can be expressed as (x_e, y_e, z_e), and the distance between the two can be expressed as:

d_t = √((x_t − x_e)² + (y_t − y_e)² + (z_t − z_e)²)   (13)

The deviation of the experience data generated in a round from the target is:

D(τ) = argmax(d_t)   (14)

The smaller this deviation, the higher the data quality and the larger the priority value, which can be expressed as:

p_i = |−k · D(τ_i) + b|   (15)

wherein k is a proportionality coefficient and b is an integer whose size is adjusted so that the result falls within a reasonable interval.
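The priority computation of eqs. (13)-(15) can be sketched as follows (the state-vector encoding and the constants k and b are application-specific, as the text notes):

```python
import math

def priority(states, target, k=1.0, b=10):
    """Priority of one round's experience, eqs. (13)-(15) (sketch).

    states -- list of (x, y, z) state vectors visited during the round
    target -- (x_e, y_e, z_e), the state the agent should reach
    """
    dists = [math.dist(s, target) for s in states]   # eq. (13)
    D = max(dists)                                   # eq. (14)
    return abs(-k * D + b)                           # eq. (15)
```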
As shown in fig. 3, in this embodiment, solving the constructed deep reinforcement learning model for a strategy specifically comprises:
Step 3-1: initializing the neural network parameters;
The two neural networks, the estimation network and the target network, are randomly initialized with Gaussian distributions.
Step 3-2: initializing the state of each machine;
Before scheduling, all machines are in an idle, available state, and machine failures are not considered.
Step 3-3: selecting an action based on the ε-greedy strategy, executing the action to obtain the immediate reward, and updating the machine state;
An action is selected randomly with probability ε (0-1), and the action with the maximum action value is selected with probability 1−ε. As the number of iterations increases, ε gradually decreases, reducing the exploration rate.
Step 3-4: storing the state transition data in the experience replay pool, with new experience data overwriting the old.
Step 3-5: extracting historical experience data from the experience pool to update the parameters θ of the estimation network, and judging whether the target number of iterations has been reached; if not, executing step 3-3; if so, executing step 3-6;
Step 3-6: updating the target network parameters θ⁻; after each round finishes, the parameters θ of the estimation network are copied to the target network as θ⁻. The number of rounds is set according to pre-training convergence, and training terminates when it is reached: if the termination condition is not met, step 3-2 is executed; if it is, the training is finished.
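Putting steps 3-1 through 3-6 together, a skeleton of the training loop might look as follows; the scheduling environment `env` (with reset/step methods), the QNetwork from the earlier sketch, and all hyperparameter values are assumptions:

```python
import random
from collections import deque
import torch

N_ACTIONS = 5   # size m of the action set (illustrative)

def max_q(net, state):
    """max over the discrete action set; actions are one-hot encoded."""
    qs = [net(state, torch.eye(N_ACTIONS)[j]) for j in range(N_ACTIONS)]
    return torch.stack(qs).max()

def train_dqn(env, estimate_net, target_net, episodes=500, steps=200,
              gamma=0.95, alpha=1e-3, eps=1.0, eps_min=0.05, eps_decay=0.995,
              batch_size=32, buffer_size=10_000):
    """Sketch of steps 3-1..3-6; `env.reset()` returns a state tensor and
    `env.step(a)` returns (next_state, reward, done)."""
    optimizer = torch.optim.SGD(estimate_net.parameters(), lr=alpha)
    replay = deque(maxlen=buffer_size)       # new data overwrites the oldest
    for _ in range(episodes):                # step 3-2: reset machine states
        state = env.reset()
        for _ in range(steps):               # step 3-3: epsilon-greedy action
            if random.random() < eps:
                a = random.randrange(N_ACTIONS)
            else:
                with torch.no_grad():
                    a = max(range(N_ACTIONS), key=lambda j: estimate_net(
                        state, torch.eye(N_ACTIONS)[j]).item())
            next_state, reward, done = env.step(a)
            replay.append((state, a, reward, next_state))       # step 3-4
            if len(replay) >= batch_size:                       # step 3-5
                for s, ai, r, s2 in random.sample(replay, batch_size):
                    with torch.no_grad():
                        td_target = r + gamma * max_q(target_net, s2)
                    q = estimate_net(s, torch.eye(N_ACTIONS)[ai]).squeeze()
                    loss = (td_target - q) ** 2                 # eq. (11)/(12)
                    optimizer.zero_grad(); loss.backward(); optimizer.step()
            state = next_state
            if done:
                break
        eps = max(eps_min, eps * eps_decay)  # exploration gradually decreases
        target_net.load_state_dict(estimate_net.state_dict())   # step 3-6
```

In practice, the uniform random.sample above would be replaced by sampling proportional to the priorities p_i of eq. (15) to realize prioritized replay.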
In step 4 of this embodiment, the optimal strategy is found from the prefabricated part order information (including serial number, due date, penalty cost, etc.) through the deep reinforcement learning model. For example, with 9 prefabricated part orders numbered J_1-J_9 and 3 production lines, several near-optimal solutions within a certain range are output, such as Scheme 1: l_1: J_3, J_5, J_2; l_2: J_7, J_1, J_4; l_3: J_8, J_9, J_6; and Scheme 2: l_1: J_2, J_4, J_1; l_2: J_7, J_9, J_8; l_3: J_3, J_6, J_5. A suitable scheme is then selected according to production preferences and other non-essential factors.
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
The embodiment aims to provide a prefabricated part production scheduling optimization system based on reinforcement learning, which comprises:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model using the experience replay pool to obtain a trained deep reinforcement learning model;
a policy output module: outputting an optimal scheduling strategy from the trained deep reinforcement learning model according to the order information of the prefabricated parts.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description section of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A prefabricated part production scheduling optimization method based on reinforcement learning is characterized by comprising the following steps:
acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
determining an optimization target of the prefabricated part scheduling model;
converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and iteratively updating the deep reinforcement learning model using the experience replay pool to obtain a trained deep reinforcement learning model;
and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
2. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the optimization target is to minimize the tardiness penalty or, when there is no tardiness, to minimize the maximum completion time; when the optimization target is to minimize the tardiness penalty P_min:

P_min = min Σ_{l=1}^{F} Σ_{i=1}^{n_l} w_{π_l^i} · max(0, C_{π_l^i,6} − d_{π_l^i})

wherein l_{1~F}: {l_1, l_2, l_3, ..., l_F} are the assembly line numbers, n_l is the number of orders allocated to production line l, and π_l^i is the i-th prefabricated part in the production scheduling sequence on line l, 1 ≤ i ≤ n_l;
when the optimization target is to minimize the maximum completion time f_min:

f_min = min C_max

wherein C_max = max_{1≤l≤F} C_{π_l^{n_l},6}.
3. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the production data comprise an order number, a line number, a machine number, the processing start time of the prefabricated part, and the processing completion time of the prefabricated part.
4. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model comprises:
establishing a state set: n features strongly correlated with the optimization target are selected, including but not limited to the ratio of the number of workpieces in the queue to the number of orders, the ratio of the average processing time of all prefabricated parts in the queue to the processing time of the current prefabricated part, and the ratio of the average processing time on the i-th machine in the queue to the processing time of the current prefabricated part;
establishing an action set, including but not limited to selecting the workpiece with the longest/shortest processing time, selecting the workpiece with the shortest remaining processing time, and selecting the workpiece whose subsequent process has the longest/shortest processing time;
establishing a reward mechanism, taking the machine idle time or the per-unit-time waiting time of the prefabricated part as the reward function:
r_t = −t_wait, if when process k of the i-th prefabricated part completes, the machine for process k+1 is not idle (the reward is the negative queuing wait time of the workpiece per unit time);
r_t = −t_idle, if when process k of the i-th prefabricated part completes, process k−1 of the (i+1)-th prefabricated part is not finished (the reward is the negative idle time of machine k).
5. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the parameters in the deep reinforcement learning model are iteratively updated by the DQN algorithm in deep reinforcement learning.
6. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the training of the deep reinforcement learning model comprises:
S1: initializing the neural network parameters;
S2: initializing the state of each machine;
S3: selecting an action based on the ε-greedy strategy, executing it, obtaining the immediate reward, and updating the machine state;
S4: storing the state transition data in the experience replay pool, with new experience data overwriting the oldest;
S5: randomly extracting batches of data from the experience replay pool to update the parameters of the estimation neural network, and judging whether the current round is finished according to the set number of target iterations per round; if so, performing S6, otherwise performing S3;
S6: updating the target neural network parameters and judging whether the termination condition is reached; if not, executing S2; if so, the training is finished.
7. The reinforcement learning-based prefabricated part production scheduling optimization method according to claim 1, wherein the deep reinforcement learning model is trained with a prioritized experience replay strategy.
8. A reinforcement learning-based prefabricated part production scheduling optimization system is characterized by comprising:
the scheduling model establishing module is used for acquiring real-time production data and historical production data to establish a prefabricated part scheduling model;
the optimization target determining module is used for determining an optimization target of the prefabricated part scheduling model;
a solution conversion module: converting the solution of the optimization target of the scheduling model into a solution based on a deep reinforcement learning model;
a model training module: establishing an experience replay pool based on the state of the current machine, the current action, the reward corresponding to the current action and the state of the machine at the next moment, and randomly extracting batches of data from the experience replay pool to iteratively update the deep reinforcement learning model, obtaining a trained deep reinforcement learning model;
a policy output module: and inputting the order information of the prefabricated part into the trained deep reinforcement learning model and outputting an optimal scheduling strategy.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the reinforcement learning-based prefabricated part production scheduling optimization method according to any one of claims 1 to 7.
10. A processing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the reinforcement learning-based prefabricated part production scheduling optimization method according to any one of claims 1 to 7.
CN202210846471.9A 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning Pending CN115204497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210846471.9A CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210846471.9A CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115204497A (en) 2022-10-18

Family

ID=83582977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846471.9A Pending CN115204497A (en) 2022-07-19 2022-07-19 Prefabricated part production scheduling optimization method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115204497A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116307440A (en) * 2022-11-21 2023-06-23 暨南大学 Workshop scheduling method based on reinforcement learning and multi-objective weight learning, device and application thereof
CN116307440B (en) * 2022-11-21 2023-11-17 暨南大学 Workshop scheduling method based on reinforcement learning and multi-objective weight learning, device and application thereof
CN115793583A (en) * 2022-12-02 2023-03-14 福州大学 Flow shop new order insertion optimization method based on deep reinforcement learning
CN115600774A (en) * 2022-12-14 2023-01-13 安徽大学绿色产业创新研究院(Cn) Multi-target production scheduling optimization method for assembly type building component production line
CN115600774B (en) * 2022-12-14 2023-03-10 安徽大学绿色产业创新研究院 Multi-target production scheduling optimization method for assembly type building component production line
CN116307047A (en) * 2022-12-16 2023-06-23 中建八局第二建设有限公司 Multi-raw-material one-dimensional blanking optimization method based on tabu search and half tensor product
CN116307047B (en) * 2022-12-16 2023-10-17 中建八局第二建设有限公司 Multi-raw-material one-dimensional blanking optimization method based on tabu search and half tensor product
CN116542498A (en) * 2023-07-06 2023-08-04 杭州宇谷科技股份有限公司 Battery scheduling method, system, device and medium based on deep reinforcement learning
CN116542498B (en) * 2023-07-06 2023-11-24 杭州宇谷科技股份有限公司 Battery scheduling method, system, device and medium based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN115204497A (en) Prefabricated part production scheduling optimization method and system based on reinforcement learning
CN110691422B (en) Multi-channel intelligent access method based on deep reinforcement learning
CN112734172A (en) Hybrid flow shop scheduling method based on time sequence difference
CN109359888B (en) Comprehensive scheduling method for tight connection constraint among multiple equipment processes
CN101216710A (en) Self-adapting selection dynamic production scheduling control system accomplished through computer
CN113255216B (en) Steelmaking production scheduling method, system, medium and electronic terminal
CN112508398B (en) Dynamic production scheduling method and device based on deep reinforcement learning and electronic equipment
CN113822525B (en) Flexible job shop multi-target scheduling method and system based on improved genetic algorithm
CN109858780A (en) A kind of Steelmaking-Continuous Casting Production Scheduling optimization method
CN113075886B (en) Steelmaking continuous casting scheduling method and device based on distributed robust opportunity constraint model
CN112836974A (en) DQN and MCTS based box-to-box inter-zone multi-field bridge dynamic scheduling method
CN115319742A (en) Flexible manufacturing unit operation scheduling method with robot material handling
CN113283013A (en) Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN117196169A (en) Machine position scheduling method based on deep reinforcement learning
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN115345306A (en) Deep neural network scheduling method and scheduler
CN115437321A (en) Micro-service-multi-agent factory scheduling model based on deep reinforcement learning network
CN115826530A (en) Job shop batch scheduling method based on D3QN and genetic algorithm
CN114675647A (en) AGV trolley scheduling and path planning method
CN114580728A (en) Elevator dispatching method and device, storage medium and electronic equipment
CN114862170B (en) Learning type intelligent scheduling method and system for manufacturing process of communication equipment
CN110826909B (en) Workflow execution method based on rule set
CN116307241B (en) Distributed job shop scheduling method based on reinforcement learning with constraint multiple agents
CN117634859B (en) Resource balance construction scheduling method, device and equipment based on deep reinforcement learning
CN116882660A (en) Self-adaptive production scheduling method, device and medium based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination