CN115016405A - Process route multi-objective optimization method based on deep reinforcement learning - Google Patents

Process route multi-objective optimization method based on deep reinforcement learning

Info

Publication number: CN115016405A
Application number: CN202210582122.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 袁伟 (Yuan Wei), 张冠伟 (Zhang Guanwei), 郭伟 (Guo Wei), 王磊 (Wang Lei)
Assignee (original and current): Tianjin University
Priority/filing date: 2022-05-26
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00: Programme-control systems
    • G05B19/02: Programme-control systems, electric
    • G05B19/418: Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS], computer integrated manufacturing [CIM]
    • G05B19/41865: Total factory control characterised by job scheduling, process planning, material flow
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G05B2219/00: Program-control systems
    • G05B2219/30: Nc systems
    • G05B2219/32: Operator till task planning
    • G05B2219/32252: Scheduling production, machining, job shop
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing


Abstract

The invention relates to a process route multi-objective optimization method based on deep reinforcement learning, which comprises the following steps. Firstly, four mandatory priority relations among procedures are set, a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets is established, and a process route optimization model is built. Secondly, the process route optimization problem is converted into a Markov decision process so as to model the stochastic policy and return realizable by the Agent: the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined. Thirdly, the model is solved with a deep reinforcement learning method based on the Actor-Critic structure.

Description

Process route multi-objective optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of process route optimization, and particularly relates to a process route multi-objective optimization method based on deep reinforcement learning.
Background
In the information era, product requirements tend increasingly toward personalization, diversification and dynamism, and the market share of small-batch customized products driven by user requirements is rising steadily, yet the existing manufacturing systems struggle to meet the production requirements of such products. The reason is that customized products undergo frequent structural design updates, and even small process changes caused by changing requirements affect the planning of the whole process route, posing a new challenge to process planning efficiency. At the present stage, enterprises organize process resources weakly: most process routes are planned manually, rely heavily on experience, and exhibit a low level of intelligence in the process design stage.
Process planning can be regarded as the bridge between product design and manufacturing: it is the activity of combining manufacturing process knowledge with a specific design to prepare its specific operating specifications under the limits of the manufacturing resources of a workshop or factory. Process route optimization is the core link of process planning. It mainly refers to maximizing economic benefit by reasonably allocating machining resources and arranging the machining sequence while satisfying the constraint relations among the procedures, and, because multi-objective optimization is involved, it has been proved to be a combinatorial optimization problem of NP-hard nature. Moreover, in process route optimization the machining sequence of the part's machining features is not unique, the machining methods available for a feature are not unique, and the machining equipment realizing a given machining method is not unique, all of which make process route optimization flexible and changeable. In the production of many-variety, small-batch products, the traditional process route optimization methods cannot meet dynamic machining requirements, because working conditions are numerous, dynamic changes are large, and the uncertainty factors of route optimization are many. Existing solutions generally solve the multi-objective problem with heuristic algorithms, but such methods require special treatment of the constraints of each class of problem and optimize all procedures as a whole during solving; when resources change dynamically, the solution range must be adjusted and planning restarted, so they lack dynamic response capability.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a process route multi-objective optimization method based on deep reinforcement learning. After analyzing the economic and low-carbon indices of the process, a multi-objective optimization model is established in combination with the priority relations among the procedures; exploiting the clear advantages of deep reinforcement learning on complex modeling and decision problems, the optimization problem is converted into a Markov decision process with a defined state space, action space and reward function; and, using the hypervolume as the multi-objective evaluation index, a deep reinforcement learning method based on the Actor-Critic structure is designed to solve the optimization model. The method can make flexible decisions when machining resources change dynamically and achieves an efficient, stable and highly adaptable multi-objective optimization decision capability, thereby addressing the flexible and changeable nature of process route optimization for small-batch customized products.
The purpose of the invention is realized by the following technical scheme:
a process route multi-objective optimization method based on deep reinforcement learning comprises the following steps:
firstly, setting four mandatory priority relations among procedures, establishing a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and building a process route optimization model;
wherein the four mandatory priority constraint relations set among the procedures are:
rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures;
datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first;
surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane;
primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features;
secondly, converting the process route optimization problem into a Markov decision process so as to model the stochastic policy and return realizable by the Agent:
the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined;
in the state space definition, static variables and dynamic variables are introduced; a static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step; specifically:
the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination, all procedures of the workpiece to be machined yielding L such combinations in total; the static variables comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m); a procedure may have several candidate machine-tool/tool combinations, and M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively; the dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t;
the dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition;
the dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0;
the dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t; when ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t; the initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected;
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data;
in the action space definition, an action a_t represents an action the Agent can take in the current state s_t, which is related to the decision policy; in the process route optimization problem, the action set A(s) is the set of all selectable serial numbers ID_x in state s;
in the reward function definition, the reward of the Agent is determined from the three optimization targets of total process cost, total process time and total process carbon emission, a larger reward being fed back to the Agent when the three targets are lower;
and thirdly, solving the model with a deep reinforcement learning method based on the Actor-Critic structure.
Preferably, in the second step, the established objective functions for total process cost, total process time and total carbon emission are:
process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, n is the total number of procedures, TID_i and TID_{i+1} are the tool codes used in procedures i and i+1 respectively, and MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1 respectively;
wherein γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1};
and γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise;
process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index;
process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
Preferably, in the third step, during training of the Actor-Critic algorithm the Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return, while the Actor, which represents the policy function, is updated according to the temporal-difference method.
The technical scheme provided by the invention has the following beneficial effects: exploiting the advantages of reinforcement learning in complex modeling and dynamic optimization, the invention provides a new way of defining the state space and action space so as to convert the process route optimization problem into a Markov decision process; on this basis, the optimization indices and influencing factors of the process are considered jointly, a multi-objective optimization model is established, and the obtained solutions are analyzed and selected with a multi-objective evaluation index, so that the solution set has good distribution and the adaptability and generalization of the optimization model are improved.
Drawings
FIG. 1 is the state space definition of the process route decision problem;
FIG. 2 is the training process of the Actor-Critic algorithm;
FIG. 3 is the constraint matrix of the procedures;
FIG. 4 is the off-line training convergence process under different optimization targets;
FIG. 5 is a box plot of the solutions of the algorithms;
FIG. 6 is the solving speed of the different algorithms.
Detailed Description
The essence of process route optimization is a multi-objective optimization problem: according to the demand data and process information of the part, and while satisfying the mandatory constraints among the procedures, the procedure sequence and the manufacturing resource candidate set are arranged reasonably so as to meet requirements such as economic and time indices. Different workpieces, however, face flexible and changeable machining modes and complex, diversified process resources, and the growing demand for many-variety, small-batch customized products poses a further challenge to process route optimization. A traditional process route generally depends on the accumulated experience of experts or enterprises and is determined after analyzing the processability of the workpiece and evaluating it comprehensively. Such practice is oriented to large-batch standardized products; it therefore has a long implementation cycle, poor flexibility and a low level of intelligence, and cannot satisfy the diversified market demand for small-batch customized products or increasingly complex enterprise production plans.
On this basis, the invention first analyzes the relevant optimization indices of the process and establishes a multi-objective optimization model; the process optimization problem is then converted into a Markov decision process with defined state and action spaces, the obtained solutions are analyzed and selected with a multi-objective evaluation index to ensure a well-distributed solution set, and finally a deep reinforcement learning method based on the Actor-Critic structure is used to solve for the final process route. The resulting solution set has good distribution, the adaptability and generalization of the model are ensured, and the intelligence level of process route optimization decisions is raised as a whole.
Specifically, the method comprises the following steps:
the first step is as follows: and establishing a process route optimization model.
Based on an analysis of the process route optimization process, the invention identifies four mandatory priority relations among procedures and establishes three optimization objective functions, from which the process route optimization model is built.
The four mandatory constraint relations are:
(1) Rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures.
(2) Datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first.
(3) Surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane.
(4) Primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features.
As for the optimization targets, from inside the enterprise, process route optimization is mainly reflected in economic benefit and machining time: on the one hand, the manufacturing cost of equipment such as machine tools and the cost of changing equipment must be considered during machining; on the other hand, machining time must be controlled in the coordinated manufacturing process to meet the demand side's schedule requirements. Furthermore, in the severe context of global warming caused by excessive carbon dioxide emissions, low-carbon production is also required.
Therefore, the invention establishes a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and builds the process route optimization model in combination with the mandatory priority relations among the procedures. The three objective functions are:
(1) Process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]   (1)
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, and n is the total number of procedures.
γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1}   (2)
where MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1, respectively.
γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise   (3)
where TID_i and TID_{i+1} are the tool codes used in procedures i and i+1, respectively.
(2) Process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]   (4)
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index; γ_1 and γ_2 are the machine tool change function and tool change function, calculated as in formulas (2) and (3), respectively.
(3) Process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele   (5)
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
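To make the three objective functions concrete, the following Python sketch evaluates formulas (1) to (5) for a candidate route. It is an illustration only, not the patent's disclosed implementation: the data structure, field names and the numeric index values are our own assumptions, and, as in the reconstruction of formula (3), a machine change is assumed to also require a tool change.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    machine_id: int        # MID_i: machine tool code used by procedure i
    tool_id: int           # TID_i: tool code used by procedure i
    machine_cost: float    # C_m(MID_i)
    tool_cost: float       # C_t(TID_i)
    machining_time: float  # T_m^i
    cut_energy: float      # E_cut^i, kWh
    aux_energy: float      # E_aux^i, kWh

C_MC, C_TC = 120.0, 15.0   # machine/tool change cost indices (assumed values)
TM_MC, TM_TC = 10.0, 2.0   # machine/tool change time indices (assumed values)
F_ELE = 0.8                # carbon emission factor of electricity (assumed value)

def objectives(route: list[Operation]) -> tuple[float, float, float]:
    """Total process cost, total process time and total carbon emission of a route."""
    c_all = sum(op.machine_cost + op.tool_cost for op in route)
    t_all = sum(op.machining_time for op in route)
    e_all = sum((op.cut_energy + op.aux_energy) * F_ELE for op in route)
    for prev, nxt in zip(route, route[1:]):
        g1 = 1 if prev.machine_id != nxt.machine_id else 0       # gamma_1, formula (2)
        g2 = 1 if g1 or prev.tool_id != nxt.tool_id else 0       # gamma_2, formula (3)
        c_all += g1 * C_MC + g2 * C_TC                           # change costs
        t_all += g1 * TM_MC + g2 * TM_TC                         # change times
    return c_all, t_all, e_all
```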
The second step: converting the process route optimization problem into a Markov decision process.
Reinforcement learning, one of the paradigms and methodologies of machine learning, describes and solves problems in which an agent learns, through interaction with the environment, to maximize return or achieve a predetermined goal. Reinforcement learning focuses on sequential decision problems; unlike ordinary supervised learning, it has no correct sample labels for guidance but learns by trial and error, completing the learning process through the feedback obtained from continual interaction between the Agent and the environment.
Before reinforcement learning can be applied, the problem must be converted into a Markov decision process (MDP), which can be expressed as the five-tuple (S, A(s), P, γ, R), where S is the set of all states; A(s) is the set of actions executable in state s; P is the state transition probability, i.e. the probability that the Agent, executing action a in the current state s, transitions to a new state s'; γ ∈ [0, 1] is a discount factor applying a certain discount to future rewards; and R is the reward function.
In the process route optimization problem, determining the process route of the part to be machined can be regarded as one complete decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, from which the state space, action space and reward function of the process can be defined.
In the definition of the state space, considering the number of procedures of each part to be machined and the diversity of the candidate resource sets, static variables and dynamic variables are introduced into the state space. A static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step. As shown in FIG. 1, the state s_t can be regarded as an L × 7 array in which the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination; all procedures of the workpiece to be machined yield L such combinations in total. The static variables in the figure comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m). A procedure may have several candidate machine-tool/tool combinations, where M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively. The dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t.
The dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n   (6)
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition.
The dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0   (7)
The dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t. When ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i   (8)
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t. The initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected.
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data.
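The state construction and the update rules (6) to (8) can be rendered as a Gym-style environment. The sketch below is a minimal illustration under our own naming assumptions (class, attribute and parameter names are not from the patent); reset() returns the initial L × 7 state array and step() applies one candidate-combination choice:

```python
import numpy as np

class ProcessRouteEnv:
    """Gym-style sketch of the process-route MDP (illustrative, assumed names)."""

    def __init__(self, combos, precedence, reward_fn):
        # combos: list of (procedure, machine_id, tool_id) tuples, the L candidate rows
        # precedence: dict mapping a procedure to the set of procedures that must precede it
        # reward_fn: maps a finished route (list of row indices) to a scalar reward
        self.combos = combos
        self.precedence = precedence
        self.reward_fn = reward_fn
        self.n_proc = len({c[0] for c in combos})
        self.reset()

    def reset(self):
        self.selected = set()   # procedures already placed in the route
        self.route = []         # chosen combination indices, in order
        return self._state()

    def _state(self):
        # One row per candidate combination: static columns (ID_x, ME_i, M_i^k, T_i^(k,m))
        # followed by the dynamic variables R_t^all, S_t^i, R_t^i defined above.
        r_all = (self.n_proc - len(self.selected)) / self.n_proc   # formula (6)
        rows = []
        for idx, (proc, mach, tool) in enumerate(self.combos):
            s_i = 1.0 if proc in self.selected else 0.0            # formula (7)
            preds = self.precedence.get(proc, set())
            r_i = len(preds - self.selected) / len(preds) if preds else 0.0  # formula (8)
            rows.append([idx, proc, mach, tool, r_all, s_i, r_i])
        return np.array(rows, dtype=np.float32)

    def step(self, action):
        proc = self.combos[action][0]
        self.route.append(action)
        self.selected.add(proc)
        done = len(self.selected) == self.n_proc    # m_t == n: termination condition
        reward = self.reward_fn(self.route) if done else 0.0
        return self._state(), reward, done
```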
In the definition of the action space, an action a_t represents an action the Agent can take in the current state s_t, which is related to the decision policy. In the process route optimization problem, the action set A(s) can be understood concretely as the set of all selectable serial numbers ID_x in state s.
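Since A(s) contains only combinations whose procedure is still unselected (S_t^i = 0) and whose preceding procedures are complete (R_t^i = 0), a boolean mask over the L rows can restrict the policy's choice. The sketch below is our own illustration, consistent with the column layout assumed in the environment sketch above:

```python
import numpy as np

def action_mask(state: np.ndarray) -> np.ndarray:
    """A(s): row x is selectable iff its procedure is not yet chosen (S_t^i = 0)
    and all of its preceding procedures are finished (R_t^i = 0)."""
    chosen, pred_remaining = state[:, 5], state[:, 6]
    return (chosen == 0) & (pred_remaining == 0)

def sample_action(probs: np.ndarray, mask: np.ndarray) -> int:
    """Sample an ID_x only among legal actions of the policy distribution."""
    p = probs * mask          # zero out illegal rows
    p = p / p.sum()           # renormalize over the legal actions
    return int(np.random.choice(len(p), p=p))
```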
In the definition of the reward function, a proper reward gives the Agent correct feedback and guides it to learn a policy for selecting actions. The reward of the Agent is therefore determined from the three optimization targets of total process cost, total process time and total process carbon emission: the lower the three target values, the larger the reward fed back to the Agent. A multi-objective optimization problem, however, generally has no single globally optimal solution; it yields a Pareto set, i.e. a set of non-dominated solutions, and solution sets are difficult to compare directly. Existing reinforcement-learning approaches to multi-objective problems usually adopt a weighted sum, collapsing the multiple targets into a single one, but this is strongly subjective and the resulting solutions are poorly distributed. The invention therefore takes the hypervolume (HV) as the multi-objective evaluation index. HV denotes the volume of the hypercube enclosed in objective space by the individuals of the solution set and a reference point; for a given reference point and a bounded search space, maximizing the HV measure is equivalent to finding the Pareto set, so the HV index is an evaluation method consistent with Pareto optimality.
For the process route optimization problem considered here, let f_1, f_2 and f_3 denote the three objective functions; the three-dimensional objective space of n Pareto points is:
Prt = {(f_1(x_1), f_2(x_1), f_3(x_1)), (f_1(x_2), f_2(x_2), f_3(x_2)), …, (f_1(x_n), f_2(x_n), f_3(x_n))}   (9)
Assuming the reference point is R = (r_1, r_2, r_3), the HV can be expressed as:
HV(Prt, R) = λ( ∪_{x ∈ Prt} [r_1, f_1(x)] × [r_2, f_2(x)] × [r_3, f_3(x)] )   (10)
where λ(·) is the Lebesgue measure; the hypervolume is enclosed by all points of Prt together with the reference point R.
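The HV of formula (10) can be estimated numerically. The Monte Carlo sketch below is our own illustration for a three-objective minimization problem (the sampling scheme is an assumption, not the patent's method); it assumes the reference point is dominated by every Pareto point:

```python
import numpy as np

def hypervolume_mc(pareto: np.ndarray, ref: np.ndarray,
                   n_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the HV indicator for a minimization problem.

    pareto: (n, 3) array of objective vectors (f1, f2, f3) of the Pareto set
    ref:    (3,) reference point, worse than every Pareto point in all objectives
    """
    rng = np.random.default_rng(seed)
    low = pareto.min(axis=0)                 # ideal corner of the sampling box
    samples = rng.uniform(low, ref, size=(n_samples, 3))
    # a sample contributes iff it is dominated by at least one Pareto point
    dominated = (pareto[None, :, :] <= samples[:, None, :]).all(axis=2).any(axis=1)
    box_volume = float(np.prod(ref - low))
    return dominated.mean() * box_volume
```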
The third step: solving with the deep reinforcement learning method based on the Actor-Critic structure.
The Actor-Critic algorithm is a deep reinforcement learning method that iteratively strengthens the policy with reinforcement signals produced by a learned value function, combining the value-based and policy-based families of methods. The algorithm stores the policy function and the value function separately in an Actor-Critic structure: the structure representing the policy function is called the Actor, which determines the action the Agent executes according to the environment state; the structure representing the value function is called the Critic, which evaluates the action chosen by the Actor by computing the value function.
FIG. 2 illustrates the training process of the Actor-Critic algorithm used by the invention. The Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return; its loss function is:
L(ω) = (1/T_k) Σ_{t=1}^{T_k} ( R_t − V̂(s_t; ω) )²   (11)
where T_k is the number of sampling steps, V̂(s_t; ω) is the value approximation function, and R_t is the actual return.
The Actor represents the policy function and is updated according to the temporal-difference method; its loss function is:
L(θ) = −(1/T_k) Σ_{t=1}^{T_k} log π_θ(a_t | s_t) · ( R_t − V̂(s_t; ω) )   (12)
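A minimal PyTorch rendering of one update following formulas (11) and (12) is sketched below; the network sizes, tensor shapes and the single shared optimizer are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

def update(net: ActorCritic, optimizer: torch.optim.Optimizer,
           states: torch.Tensor, actions: torch.Tensor, returns: torch.Tensor):
    """One Actor-Critic step: critic MSE loss (11) + TD-advantage policy loss (12).

    states: (T, state_dim) float tensor; actions: (T,) long tensor; returns: (T,) floats.
    """
    values = net.critic(states).squeeze(-1)          # V_hat(s_t; omega)
    advantage = returns - values                     # R_t - V_hat(s_t; omega)
    log_probs = torch.log_softmax(net.actor(states), dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    critic_loss = advantage.pow(2).mean()            # formula (11)
    actor_loss = -(log_pi * advantage.detach()).mean()  # formula (12)
    optimizer.zero_grad()
    (critic_loss + actor_loss).backward()
    optimizer.step()
```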
the present invention will be described with reference to examples. In this example, there are 12 features in total, and each feature has 27 processes, and the machining method, the available machine tool, the available tool, and the dimensional information of the feature for each process are shown in table 1.
Table 1 Example-related information [table reproduced as an image in the original publication]
The constraint relations of the 27 procedures are shown in FIG. 3, where a value of 1 means that, under the four types of mandatory constraint relations, procedure ME_i must precede ME_j; a value of −1 means that ME_i must be preceded by ME_j; and 0 means there is no mandatory constraint relation between the two procedures.
According to the above procedure information and constraint relations, the total process cost C_all, total process time T_all and total carbon emission E_all under different sequence arrangements and resource combinations can be calculated and the optimization solution carried out.
In this case, a pre-trained model is first obtained through an off-line training process; the customized part's data are then used to load the pre-trained model for on-line training, yielding the converged solution. In application, the deep reinforcement learning is split into off-line and on-line training: off-line training trains the deep reinforcement learning model on a batch of training instances, optimizing the model parameters through exploration and learning, and the resulting pre-trained model is stored; on-line training loads the pre-trained model and trains on the input data of the workpiece to be machined until a converged result is obtained. The advantage is that the neural network in the deep reinforcement learning can be trained in advance to good weight parameters, so that on-line application can further train on new data, accelerating convergence and outputting results quickly.
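This off-line/on-line split amounts to saving the pre-trained weights and warm-starting on-line training from them. A sketch under assumed function and file names (the training loop itself is abstracted away):

```python
import torch

PRETRAINED = "actor_critic_pretrained.pt"   # assumed checkpoint path

def offline_pretrain(net, train_instances, train_fn):
    # explore and learn on a batch of training instances, then store the model
    for instance in train_instances:
        train_fn(net, instance)
    torch.save(net.state_dict(), PRETRAINED)

def online_solve(net, new_workpiece, train_fn):
    # warm-start from the pre-trained weights and fine-tune on the new data,
    # which converges faster than optimizing from scratch
    net.load_state_dict(torch.load(PRETRAINED))
    train_fn(net, new_workpiece)
    return net
```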
The main hyper-parameters set by the invention are shown in Table 2.
Table 2 Main hyper-parameters of the model [table reproduced as an image in the original publication]
In the off-line training process, FIG. 4(a) shows the evolution of the HV evaluation index during multi-objective optimization, which tends to converge after about 600 training iterations. FIGS. 4(b), (c) and (d) show the iterative computation under single-objective optimization, namely the evolution of the total process cost C_all, the total process time T_all and the total carbon emission E_all, respectively. Single-objective optimization of total process cost fluctuates within a small interval and tends to converge after about 450 training iterations; single-objective optimization of total process time converges after about 500 iterations; single-objective optimization of total carbon emission likewise converges after about 500 iterations. Overall, the multi-objective solution process with HV as the evaluation index is slightly slower: between the 400th and 600th iterations it shows a convergence trend but still fluctuates considerably, likely because it is exploring and balancing the three optimization targets; optimizing three targets simultaneously also increases the computational complexity, so its convergence speed is slightly below that of the three single-objective runs.
Table 3 gives the on-line training optimization results of the case under multiple objectives, and Table 4 summarizes the results of the case under multi-objective optimization and the three kinds of single-objective optimization. Table 3 shows that the proposed algorithm concentrates similar procedures together as much as possible when solving. The data in Table 4 show that when total process cost C_all, total process time T_all or total carbon emission E_all is optimized as a single target, that target reaches its minimum but the other targets fare poorly, whereas multi-objective optimization balances the targets well and yields an overall optimum, so multi-objective optimization is clearly necessary.
Table 3 Multi-objective optimization results [table reproduced as images in the original publication]
Table 4 Target values under multi-objective and single-objective optimization [table reproduced as an image in the original publication]
To compare the overall effect of the deep reinforcement learning (DRL) based process route optimization method, the NSGA-III and MOPSO algorithms are used to solve the same multi-objective problem; the Pareto solution sets obtained by the three algorithms are collected and the distribution of the solutions is analyzed. FIG. 5 shows box plots drawn from the values of the three algorithms' solution sets under each target; the distributions differ little across algorithms. Overall, the solution sets of DRL and NSGA-III are widely distributed with close medians, while the MOPSO solutions are slightly worse. This demonstrates that the DRL-based method proposed here has global search capability similar to heuristic optimization algorithms.
Further, to examine how the three algorithms differ in solving speed, the case was solved 10 times under multi-objective optimization. FIG. 6 shows the solving speed of the three algorithms over the 10 experiments, the timed interval running from algorithm start to convergence. The DRL method of the invention has the best solving speed in all 10 experiments, NSGA-III is the slowest overall, and MOPSO is second. This is because in practical application DRL invokes the pre-trained model and only fine-tunes through on-line training, so it converges quickly on-line, while the other algorithms must be reinitialized and re-optimized.
In conclusion, the process route optimization method based on deep reinforcement learning provided by the invention has efficient, stable and highly adaptable multi-objective optimization decision capability and can address the flexible and changeable nature of process route optimization for small-batch customized products.

Claims (3)

1. A process route multi-objective optimization method based on deep reinforcement learning, comprising the following steps:
firstly, setting four mandatory priority relations among procedures, establishing a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and building a process route optimization model;
wherein the four mandatory priority constraint relations set among the procedures are:
rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures;
datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first;
surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane;
primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features;
secondly, converting the process route optimization problem into a Markov decision process so as to model the stochastic policy and return realizable by the Agent, wherein:
the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined;
in the state space definition, static variables and dynamic variables are introduced; a static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step; specifically:
the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination, all procedures of the workpiece to be machined yielding L such combinations in total; the static variables comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m); a procedure may have several candidate machine-tool/tool combinations, and M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively; the dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t;
the dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition;
the dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0;
the dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t; when ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t; the initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected;
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data;
in the action space definition, an action a_t represents an action the Agent can take in the current state s_t, related to the decision policy; in the process route optimization problem, the action set A(s) is the set of all selectable serial numbers ID_x in state s;
in the reward function definition, the reward of the Agent is determined from the three optimization targets of total process cost, total process time and total process carbon emission, a larger reward being fed back to the Agent when the three targets are lower;
and thirdly, solving the model with a deep reinforcement learning method based on the Actor-Critic structure.
2. The method of claim 1, wherein in the second step the established objective functions for total process cost, total process time and total carbon emission are:
process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, n is the total number of procedures, TID_i and TID_{i+1} are the tool codes used in procedures i and i+1 respectively, and MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1 respectively;
wherein γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1};
and γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise;
process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index;
process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
3. The method of claim 1, wherein in the third step, during training of the Actor-Critic algorithm, the Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return, and the Actor, which represents the policy function, is updated according to the temporal-difference method.
CN202210582122.0A (filed 2022-05-26, priority 2022-05-26): Process route multi-objective optimization method based on deep reinforcement learning. Status: Pending. Publication: CN115016405A (en)

Priority Applications (1)

Application Number	Priority Date	Filing Date	Title
CN202210582122.0A	2022-05-26	2022-05-26	Process route multi-objective optimization method based on deep reinforcement learning

Publications (1)

Publication Number	Publication Date
CN115016405A	2022-09-06

Family ID: 83071875

Country Status (1)

Country	Link
CN	CN115016405A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956750A (en) * 2023-09-19 2023-10-27 山东山大华天软件有限公司 Knowledge graph-based part process design method, system, medium and equipment
CN116956750B (en) * 2023-09-19 2023-12-19 山东山大华天软件有限公司 Knowledge graph-based part process design method, system, medium and equipment


Legal Events

Date Code Title Description
PB01	Publication
SE01	Entry into force of request for substantive examination