CN115016405A - Process route multi-objective optimization method based on deep reinforcement learning - Google Patents

Process route multi-objective optimization method based on deep reinforcement learning

Info

Publication number: CN115016405A
Application number: CN202210582122.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 袁伟 (Yuan Wei), 张冠伟 (Zhang Guanwei), 郭伟 (Guo Wei), 王磊 (Wang Lei)
Assignee (original and current): Tianjin University
Priority/filing date: 2022-05-26
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00: Programme-control systems
    • G05B19/02: Programme-control systems, electric
    • G05B19/418: Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS], computer integrated manufacturing [CIM]
    • G05B19/41865: Total factory control characterised by job scheduling, process planning, material flow
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G05B2219/00: Program-control systems
    • G05B2219/30: Nc systems
    • G05B2219/32: Operator till task planning
    • G05B2219/32252: Scheduling production, machining, job shop
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing


Abstract

The invention relates to a process route multi-objective optimization method based on deep reinforcement learning, which comprises the following steps. Firstly, four mandatory priority relations among procedures are set, a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets is established, and a process route optimization model is built. Secondly, the process route optimization problem is converted into a Markov decision process so as to model the stochastic policy and return realizable by the Agent: the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined. Thirdly, the model is solved with a deep reinforcement learning method based on the Actor-Critic structure.

Description

Process route multi-objective optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of process route optimization, and particularly relates to a process route multi-objective optimization method based on deep reinforcement learning.
Background
In the information era, product requirements tend increasingly toward personalization, diversification and dynamism, and the market share of small-batch customized products driven by user requirements is rising steadily, yet the existing manufacturing systems struggle to meet the production requirements of such products. The reason is that customized products undergo frequent structural design updates, and even small process changes caused by changing requirements affect the planning of the whole process route, posing a new challenge to process planning efficiency. At the present stage, enterprises organize process resources weakly: most process routes are planned manually, rely heavily on experience, and exhibit a low level of intelligence in the process design stage.
Process planning can be regarded as the bridge between product design and manufacturing: it is the activity of combining manufacturing process knowledge with a specific design to prepare its specific operating specifications under the limits of the manufacturing resources of a workshop or factory. Process route optimization is the core link of process planning. It mainly refers to maximizing economic benefit by reasonably allocating machining resources and arranging the machining sequence while satisfying the constraint relations among the procedures, and, because multi-objective optimization is involved, it has been proved to be a combinatorial optimization problem of NP-hard nature. Moreover, in process route optimization the machining sequence of the part's machining features is not unique, the machining methods available for a feature are not unique, and the machining equipment realizing a given machining method is not unique, all of which make process route optimization flexible and changeable. In the production of many-variety, small-batch products, the traditional process route optimization methods cannot meet dynamic machining requirements, because working conditions are numerous, dynamic changes are large, and the uncertainty factors of route optimization are many. Existing solutions generally solve the multi-objective problem with heuristic algorithms, but such methods require special treatment of the constraints of each class of problem and optimize all procedures as a whole during solving; when resources change dynamically, the solution range must be adjusted and planning restarted, so they lack dynamic response capability.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a process route multi-objective optimization method based on deep reinforcement learning. After analyzing the economic and low-carbon indices of the process, a multi-objective optimization model is established in combination with the priority relations among the procedures; exploiting the clear advantages of deep reinforcement learning on complex modeling and decision problems, the optimization problem is converted into a Markov decision process with a defined state space, action space and reward function; and, using the hypervolume as the multi-objective evaluation index, a deep reinforcement learning method based on the Actor-Critic structure is designed to solve the optimization model. The method can make flexible decisions when machining resources change dynamically and achieves an efficient, stable and highly adaptable multi-objective optimization decision capability, thereby addressing the flexible and changeable nature of process route optimization for small-batch customized products.
The purpose of the invention is realized by the following technical scheme:
a process route multi-objective optimization method based on deep reinforcement learning comprises the following steps:
firstly, setting four mandatory priority relations among procedures, establishing a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and building a process route optimization model;
wherein the four mandatory priority constraint relations set among the procedures are:
rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures;
datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first;
surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane;
primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features;
secondly, converting the process route optimization problem into a Markov decision process so as to model the stochastic policy and return realizable by the Agent:
the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined;
in the state space definition, static variables and dynamic variables are introduced; a static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step; specifically:
the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination, all procedures of the workpiece to be machined yielding L such combinations in total; the static variables comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m); a procedure may have several candidate machine-tool/tool combinations, and M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively; the dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t;
the dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition;
the dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0;
the dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t; when ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t; the initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected;
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data;
in the action space definition, an action a_t represents an action the Agent can take in the current state s_t, which is related to the decision policy; in the process route optimization problem, the action set A(s) is the set of all selectable serial numbers ID_x in state s;
in the reward function definition, the reward of the Agent is determined from the three optimization targets of total process cost, total process time and total process carbon emission, a larger reward being fed back to the Agent when the three targets are lower;
and thirdly, solving the model with a deep reinforcement learning method based on the Actor-Critic structure.
Preferably, in the second step, the established objective functions for total process cost, total process time and total carbon emission are:
process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, n is the total number of procedures, TID_i and TID_{i+1} are the tool codes used in procedures i and i+1 respectively, and MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1 respectively;
wherein γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1};
and γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise;
process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index;
process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
Preferably, in the third step, during training of the Actor-Critic algorithm the Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return, while the Actor, which represents the policy function, is updated according to the temporal-difference method.
The technical scheme provided by the invention has the following beneficial effects: exploiting the advantages of reinforcement learning in complex modeling and dynamic optimization, the invention provides a new way of defining the state space and action space so as to convert the process route optimization problem into a Markov decision process; on this basis, the optimization indices and influencing factors of the process are considered jointly, a multi-objective optimization model is established, and the obtained solutions are analyzed and selected with a multi-objective evaluation index, so that the solution set has good distribution and the adaptability and generalization of the optimization model are improved.
Drawings
FIG. 1 is the state space definition of the process route decision problem;
FIG. 2 is the training process of the Actor-Critic algorithm;
FIG. 3 is the constraint matrix of the procedures;
FIG. 4 is the off-line training convergence process under different optimization targets;
FIG. 5 is a box plot of the solutions of the algorithms;
FIG. 6 is the solving speed of the different algorithms.
Detailed Description
The essence of process route optimization is a multi-objective optimization problem: according to the demand data and process information of the part, and while satisfying the mandatory constraints among the procedures, the procedure sequence and the manufacturing resource candidate set are arranged reasonably so as to meet requirements such as economic and time indices. Different workpieces, however, face flexible and changeable machining modes and complex, diversified process resources, and the growing demand for many-variety, small-batch customized products poses a further challenge to process route optimization. A traditional process route generally depends on the accumulated experience of experts or enterprises and is determined after analyzing the processability of the workpiece and evaluating it comprehensively. Such practice is oriented to large-batch standardized products; it therefore has a long implementation cycle, poor flexibility and a low level of intelligence, and cannot satisfy the diversified market demand for small-batch customized products or increasingly complex enterprise production plans.
On this basis, the invention first analyzes the relevant optimization indices of the process and establishes a multi-objective optimization model; the process optimization problem is then converted into a Markov decision process with defined state and action spaces, the obtained solutions are analyzed and selected with a multi-objective evaluation index to ensure a well-distributed solution set, and finally a deep reinforcement learning method based on the Actor-Critic structure is used to solve for the final process route. The resulting solution set has good distribution, the adaptability and generalization of the model are ensured, and the intelligence level of process route optimization decisions is raised as a whole.
Specifically, the method comprises the following steps:
the first step is as follows: and establishing a process route optimization model.
Based on an analysis of the process route optimization process, the invention identifies four mandatory priority relations among procedures and establishes three optimization objective functions, from which the process route optimization model is built.
The four mandatory constraint relations are:
(1) Rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures.
(2) Datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first.
(3) Surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane.
(4) Primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features.
As for the optimization targets, from inside the enterprise, process route optimization is mainly reflected in economic benefit and machining time: on the one hand, the manufacturing cost of equipment such as machine tools and the cost of changing equipment must be considered during machining; on the other hand, machining time must be controlled in the coordinated manufacturing process to meet the demand side's schedule requirements. Furthermore, in the severe context of global warming caused by excessive carbon dioxide emissions, low-carbon production is also required.
Therefore, the invention establishes a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and builds the process route optimization model in combination with the mandatory priority relations among the procedures. The three objective functions are:
(1) Process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]   (1)
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, and n is the total number of procedures.
γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1}   (2)
where MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1, respectively.
γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise   (3)
where TID_i and TID_{i+1} are the tool codes used in procedures i and i+1, respectively.
(2) Process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]   (4)
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index; γ_1 and γ_2 are the machine tool change function and tool change function, calculated as in formulas (2) and (3), respectively.
(3) Process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele   (5)
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
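To make the three objective functions concrete, the following Python sketch evaluates formulas (1) to (5) for a candidate route. It is an illustration only, not the patent's disclosed implementation: the data structure, field names and the numeric index values are our own assumptions, and, as in the reconstruction of formula (3), a machine change is assumed to also require a tool change.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    machine_id: int        # MID_i: machine tool code used by procedure i
    tool_id: int           # TID_i: tool code used by procedure i
    machine_cost: float    # C_m(MID_i)
    tool_cost: float       # C_t(TID_i)
    machining_time: float  # T_m^i
    cut_energy: float      # E_cut^i, kWh
    aux_energy: float      # E_aux^i, kWh

C_MC, C_TC = 120.0, 15.0   # machine/tool change cost indices (assumed values)
TM_MC, TM_TC = 10.0, 2.0   # machine/tool change time indices (assumed values)
F_ELE = 0.8                # carbon emission factor of electricity (assumed value)

def objectives(route: list[Operation]) -> tuple[float, float, float]:
    """Total process cost, total process time and total carbon emission of a route."""
    c_all = sum(op.machine_cost + op.tool_cost for op in route)
    t_all = sum(op.machining_time for op in route)
    e_all = sum((op.cut_energy + op.aux_energy) * F_ELE for op in route)
    for prev, nxt in zip(route, route[1:]):
        g1 = 1 if prev.machine_id != nxt.machine_id else 0       # gamma_1, formula (2)
        g2 = 1 if g1 or prev.tool_id != nxt.tool_id else 0       # gamma_2, formula (3)
        c_all += g1 * C_MC + g2 * C_TC                           # change costs
        t_all += g1 * TM_MC + g2 * TM_TC                         # change times
    return c_all, t_all, e_all
```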
The second step: converting the process route optimization problem into a Markov decision process.
Reinforcement learning, one of the paradigms and methodologies of machine learning, describes and solves problems in which an agent learns, through interaction with the environment, to maximize return or achieve a predetermined goal. Reinforcement learning focuses on sequential decision problems; unlike ordinary supervised learning, it has no correct sample labels for guidance but learns by trial and error, completing the learning process through the feedback obtained from continual interaction between the Agent and the environment.
Before reinforcement learning can be applied, the problem must be converted into a Markov decision process (MDP), which can be expressed as the five-tuple (S, A(s), P, γ, R), where S is the set of all states; A(s) is the set of actions executable in state s; P is the state transition probability, i.e. the probability that the Agent, executing action a in the current state s, transitions to a new state s'; γ ∈ [0, 1] is a discount factor applying a certain discount to future rewards; and R is the reward function.
In the process route optimization problem, determining the process route of the part to be machined can be regarded as one complete decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, from which the state space, action space and reward function of the process can be defined.
In the definition of the state space, considering the number of procedures of each part to be machined and the diversity of the candidate resource sets, static variables and dynamic variables are introduced into the state space. A static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step. As shown in FIG. 1, the state s_t can be regarded as an L × 7 array in which the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination; all procedures of the workpiece to be machined yield L such combinations in total. The static variables in the figure comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m). A procedure may have several candidate machine-tool/tool combinations, where M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively. The dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t.
The dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n   (6)
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition.
The dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0   (7)
The dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t. When ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i   (8)
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t. The initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected.
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data.
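The state construction and the update rules (6) to (8) can be rendered as a Gym-style environment. The sketch below is a minimal illustration under our own naming assumptions (class, attribute and parameter names are not from the patent); reset() returns the initial L × 7 state array and step() applies one candidate-combination choice:

```python
import numpy as np

class ProcessRouteEnv:
    """Gym-style sketch of the process-route MDP (illustrative, assumed names)."""

    def __init__(self, combos, precedence, reward_fn):
        # combos: list of (procedure, machine_id, tool_id) tuples, the L candidate rows
        # precedence: dict mapping a procedure to the set of procedures that must precede it
        # reward_fn: maps a finished route (list of row indices) to a scalar reward
        self.combos = combos
        self.precedence = precedence
        self.reward_fn = reward_fn
        self.n_proc = len({c[0] for c in combos})
        self.reset()

    def reset(self):
        self.selected = set()   # procedures already placed in the route
        self.route = []         # chosen combination indices, in order
        return self._state()

    def _state(self):
        # One row per candidate combination: static columns (ID_x, ME_i, M_i^k, T_i^(k,m))
        # followed by the dynamic variables R_t^all, S_t^i, R_t^i defined above.
        r_all = (self.n_proc - len(self.selected)) / self.n_proc   # formula (6)
        rows = []
        for idx, (proc, mach, tool) in enumerate(self.combos):
            s_i = 1.0 if proc in self.selected else 0.0            # formula (7)
            preds = self.precedence.get(proc, set())
            r_i = len(preds - self.selected) / len(preds) if preds else 0.0  # formula (8)
            rows.append([idx, proc, mach, tool, r_all, s_i, r_i])
        return np.array(rows, dtype=np.float32)

    def step(self, action):
        proc = self.combos[action][0]
        self.route.append(action)
        self.selected.add(proc)
        done = len(self.selected) == self.n_proc    # m_t == n: termination condition
        reward = self.reward_fn(self.route) if done else 0.0
        return self._state(), reward, done
```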
In the definition of the action space, an action a_t represents an action the Agent can take in the current state s_t, which is related to the decision policy. In the process route optimization problem, the action set A(s) can be understood concretely as the set of all selectable serial numbers ID_x in state s.
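Since A(s) contains only combinations whose procedure is still unselected (S_t^i = 0) and whose preceding procedures are complete (R_t^i = 0), a boolean mask over the L rows can restrict the policy's choice. The sketch below is our own illustration, consistent with the column layout assumed in the environment sketch above:

```python
import numpy as np

def action_mask(state: np.ndarray) -> np.ndarray:
    """A(s): row x is selectable iff its procedure is not yet chosen (S_t^i = 0)
    and all of its preceding procedures are finished (R_t^i = 0)."""
    chosen, pred_remaining = state[:, 5], state[:, 6]
    return (chosen == 0) & (pred_remaining == 0)

def sample_action(probs: np.ndarray, mask: np.ndarray) -> int:
    """Sample an ID_x only among legal actions of the policy distribution."""
    p = probs * mask          # zero out illegal rows
    p = p / p.sum()           # renormalize over the legal actions
    return int(np.random.choice(len(p), p=p))
```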
In the definition of the reward function, a proper reward gives the Agent correct feedback and guides it to learn a policy for selecting actions. The reward of the Agent is therefore determined from the three optimization targets of total process cost, total process time and total process carbon emission: the lower the three target values, the larger the reward fed back to the Agent. A multi-objective optimization problem, however, generally has no single globally optimal solution; it yields a Pareto set, i.e. a set of non-dominated solutions, and solution sets are difficult to compare directly. Existing reinforcement-learning approaches to multi-objective problems usually adopt a weighted sum, collapsing the multiple targets into a single one, but this is strongly subjective and the resulting solutions are poorly distributed. The invention therefore takes the hypervolume (HV) as the multi-objective evaluation index. HV denotes the volume of the hypercube enclosed in objective space by the individuals of the solution set and a reference point; for a given reference point and a bounded search space, maximizing the HV measure is equivalent to finding the Pareto set, so the HV index is an evaluation method consistent with Pareto optimality.
For the process route optimization problem considered here, let f_1, f_2 and f_3 denote the three objective functions; the three-dimensional objective space of n Pareto points is:
Prt = {(f_1(x_1), f_2(x_1), f_3(x_1)), (f_1(x_2), f_2(x_2), f_3(x_2)), …, (f_1(x_n), f_2(x_n), f_3(x_n))}   (9)
Assuming the reference point is R = (r_1, r_2, r_3), the HV can be expressed as:
HV(Prt, R) = λ( ∪_{x ∈ Prt} [r_1, f_1(x)] × [r_2, f_2(x)] × [r_3, f_3(x)] )   (10)
where λ(·) is the Lebesgue measure; the hypervolume is enclosed by all points of Prt together with the reference point R.
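The HV of formula (10) can be estimated numerically. The Monte Carlo sketch below is our own illustration for a three-objective minimization problem (the sampling scheme is an assumption, not the patent's method); it assumes the reference point is dominated by every Pareto point:

```python
import numpy as np

def hypervolume_mc(pareto: np.ndarray, ref: np.ndarray,
                   n_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the HV indicator for a minimization problem.

    pareto: (n, 3) array of objective vectors (f1, f2, f3) of the Pareto set
    ref:    (3,) reference point, worse than every Pareto point in all objectives
    """
    rng = np.random.default_rng(seed)
    low = pareto.min(axis=0)                 # ideal corner of the sampling box
    samples = rng.uniform(low, ref, size=(n_samples, 3))
    # a sample contributes iff it is dominated by at least one Pareto point
    dominated = (pareto[None, :, :] <= samples[:, None, :]).all(axis=2).any(axis=1)
    box_volume = float(np.prod(ref - low))
    return dominated.mean() * box_volume
```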
The third step: solving with the deep reinforcement learning method based on the Actor-Critic structure.
The Actor-Critic algorithm is a deep reinforcement learning method that iteratively strengthens the policy with reinforcement signals produced by a learned value function, combining the value-based and policy-based families of methods. The algorithm stores the policy function and the value function separately in an Actor-Critic structure: the structure representing the policy function is called the Actor, which determines the action the Agent executes according to the environment state; the structure representing the value function is called the Critic, which evaluates the action chosen by the Actor by computing the value function.
FIG. 2 illustrates the training process of the Actor-Critic algorithm used by the invention. The Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return; its loss function is:
L(ω) = (1/T_k) Σ_{t=1}^{T_k} ( R_t − V̂(s_t; ω) )²   (11)
where T_k is the number of sampling steps, V̂(s_t; ω) is the value approximation function, and R_t is the actual return.
The Actor represents the policy function and is updated according to the temporal-difference method; its loss function is:
L(θ) = −(1/T_k) Σ_{t=1}^{T_k} log π_θ(a_t | s_t) · ( R_t − V̂(s_t; ω) )   (12)
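A minimal PyTorch rendering of one update following formulas (11) and (12) is sketched below; the network sizes, tensor shapes and the single shared optimizer are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

def update(net: ActorCritic, optimizer: torch.optim.Optimizer,
           states: torch.Tensor, actions: torch.Tensor, returns: torch.Tensor):
    """One Actor-Critic step: critic MSE loss (11) + TD-advantage policy loss (12).

    states: (T, state_dim) float tensor; actions: (T,) long tensor; returns: (T,) floats.
    """
    values = net.critic(states).squeeze(-1)          # V_hat(s_t; omega)
    advantage = returns - values                     # R_t - V_hat(s_t; omega)
    log_probs = torch.log_softmax(net.actor(states), dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    critic_loss = advantage.pow(2).mean()            # formula (11)
    actor_loss = -(log_pi * advantage.detach()).mean()  # formula (12)
    optimizer.zero_grad()
    (critic_loss + actor_loss).backward()
    optimizer.step()
```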
the present invention will be described with reference to examples. In this example, there are 12 features in total, and each feature has 27 processes, and the machining method, the available machine tool, the available tool, and the dimensional information of the feature for each process are shown in table 1.
Table 1 Example-related information [table reproduced as an image in the original publication]
The constraint relations of the 27 procedures are shown in FIG. 3, where a value of 1 means that, under the four types of mandatory constraint relations, procedure ME_i must precede ME_j; a value of −1 means that ME_i must be preceded by ME_j; and 0 means there is no mandatory constraint relation between the two procedures.
According to the above procedure information and constraint relations, the total process cost C_all, total process time T_all and total carbon emission E_all under different sequence arrangements and resource combinations can be calculated and the optimization solution carried out.
In this case, a pre-trained model is first obtained through an off-line training process; the customized part's data are then used to load the pre-trained model for on-line training, yielding the converged solution. In application, the deep reinforcement learning is split into off-line and on-line training: off-line training trains the deep reinforcement learning model on a batch of training instances, optimizing the model parameters through exploration and learning, and the resulting pre-trained model is stored; on-line training loads the pre-trained model and trains on the input data of the workpiece to be machined until a converged result is obtained. The advantage is that the neural network in the deep reinforcement learning can be trained in advance to good weight parameters, so that on-line application can further train on new data, accelerating convergence and outputting results quickly.
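This off-line/on-line split amounts to saving the pre-trained weights and warm-starting on-line training from them. A sketch under assumed function and file names (the training loop itself is abstracted away):

```python
import torch

PRETRAINED = "actor_critic_pretrained.pt"   # assumed checkpoint path

def offline_pretrain(net, train_instances, train_fn):
    # explore and learn on a batch of training instances, then store the model
    for instance in train_instances:
        train_fn(net, instance)
    torch.save(net.state_dict(), PRETRAINED)

def online_solve(net, new_workpiece, train_fn):
    # warm-start from the pre-trained weights and fine-tune on the new data,
    # which converges faster than optimizing from scratch
    net.load_state_dict(torch.load(PRETRAINED))
    train_fn(net, new_workpiece)
    return net
```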
The main hyper-parameters set by the invention are shown in Table 2.
Table 2 Main hyper-parameters of the model [table reproduced as an image in the original publication]
In the off-line training process, FIG. 4(a) shows the evolution of the HV evaluation index during multi-objective optimization, which tends to converge after about 600 training iterations. FIGS. 4(b), (c) and (d) show the iterative computation under single-objective optimization, namely the evolution of the total process cost C_all, the total process time T_all and the total carbon emission E_all, respectively. Single-objective optimization of total process cost fluctuates within a small interval and tends to converge after about 450 training iterations; single-objective optimization of total process time converges after about 500 iterations; single-objective optimization of total carbon emission likewise converges after about 500 iterations. Overall, the multi-objective solution process with HV as the evaluation index is slightly slower: between the 400th and 600th iterations it shows a convergence trend but still fluctuates considerably, likely because it is exploring and balancing the three optimization targets; optimizing three targets simultaneously also increases the computational complexity, so its convergence speed is slightly below that of the three single-objective runs.
Table 3 gives the on-line training optimization results of the case under multiple objectives, and Table 4 summarizes the results of the case under multi-objective optimization and the three kinds of single-objective optimization. Table 3 shows that the proposed algorithm concentrates similar procedures together as much as possible when solving. The data in Table 4 show that when total process cost C_all, total process time T_all or total carbon emission E_all is optimized as a single target, that target reaches its minimum but the other targets fare poorly, whereas multi-objective optimization balances the targets well and yields an overall optimum, so multi-objective optimization is clearly necessary.
Table 3 Multi-objective optimization results [table reproduced as images in the original publication]
Table 4 Target values under multi-objective and single-objective optimization [table reproduced as an image in the original publication]
To compare the overall effect of the deep reinforcement learning (DRL) based process route optimization method, the NSGA-III and MOPSO algorithms are used to solve the same multi-objective problem; the Pareto solution sets obtained by the three algorithms are collected and the distribution of the solutions is analyzed. FIG. 5 shows box plots drawn from the values of the three algorithms' solution sets under each target; the distributions differ little across algorithms. Overall, the solution sets of DRL and NSGA-III are widely distributed with close medians, while the MOPSO solutions are slightly worse. This demonstrates that the DRL-based method proposed here has global search capability similar to heuristic optimization algorithms.
Further, to examine how the three algorithms differ in solving speed, the case was solved 10 times under multi-objective optimization. FIG. 6 shows the solving speed of the three algorithms over the 10 experiments, the timed interval running from algorithm start to convergence. The DRL method of the invention has the best solving speed in all 10 experiments, NSGA-III is the slowest overall, and MOPSO is second. This is because in practical application DRL invokes the pre-trained model and only fine-tunes through on-line training, so it converges quickly on-line, while the other algorithms must be reinitialized and re-optimized.
In conclusion, the process route optimization method based on deep reinforcement learning provided by the invention has efficient, stable and highly adaptable multi-objective optimization decision capability and can address the flexible and changeable nature of process route optimization for small-batch customized products.

Claims (3)

1. A process route multi-objective optimization method based on deep reinforcement learning, comprising the following steps:
firstly, setting four mandatory priority relations among procedures, establishing a multi-objective optimization function taking total process cost, total process time and total carbon emission as optimization targets, and building a process route optimization model;
wherein the four mandatory priority constraint relations set among the procedures are:
rough-before-finish constraint relation: rough machining procedures are arranged before finish machining procedures;
datum priority constraint relation: when a datum feature and a feature dependent on it are machined, the datum feature is machined first;
surface-before-hole priority relation: when a surface feature and the hole features located on it are machined, the surface feature is machined first and the hole features afterwards, in order to meet the positional accuracy requirement between hole and plane;
primary-secondary priority constraint relation: primary and secondary features are determined by the part's intended functions; since the part structure is designed to user requirements at the design stage, the primary-secondary priority relation can be obtained from the actual functional features;
secondly, converting the process route optimization problem into a Markov decision process so as to model the stochastic policy and return realizable by the Agent, wherein:
the determination of the process route of the part to be machined is regarded as one complete Markov decision process, from the selection of the first procedure and its machining resources to the determination of the last procedure and its machining resources; the complete route comprises the ordering of the procedures and the machining equipment selected for each procedure, on which basis the state space, action space and reward function of the Markov decision process are defined;
in the state space definition, static variables and dynamic variables are introduced; a static variable is a value that does not change during the decision process, while a dynamic variable is updated before each decision step; specifically:
the available machine tool and cutting tool resources under each procedure are paired and regarded as one candidate combination, all procedures of the workpiece to be machined yielding L such combinations in total; the static variables comprise the serial number ID_x of each candidate combination, its procedure ME_i, its machine tool number M_i^k and its tool number T_i^(k,m); a procedure may have several candidate machine-tool/tool combinations, and M_i^k and T_i^(k,m) denote the k-th candidate machine tool of procedure ME_i and the m-th candidate tool under that machine tool, respectively; the dynamic variables comprise the proportion R_t^all of procedures remaining in the current state s_t, the indicator S_t^i of whether procedure ME_i has been selected in state s_t, and the remaining proportion R_t^i of the preceding procedures of ME_i in state s_t;
the dynamic variable R_t^all characterizes the proportion of procedures remaining in the current state s_t; its initial value is 1, and its update rule is:
R_t^all = (n − m_t) / n
where m_t is the number of procedures already selected in state s_t; when m_t equals the total number of procedures n, the sequential decision reaches its termination condition;
the dynamic variable S_t^i characterizes whether procedure ME_i has been selected in the current state s_t; its update rule is:
S_t^i = 1 if ME_i has been selected in state s_t, otherwise S_t^i = 0;
the dynamic variable R_t^i characterizes the remaining proportion of the preceding procedures of ME_i in the current state s_t; when ME_i has no preceding procedure, R_t^i is always 0; when ME_i has preceding procedures, the update rule of R_t^i is:
R_t^i = (P_i − p_t^i) / P_i
where P_i is the total number of preceding procedures of ME_i and p_t^i is the number of them already selected in state s_t; the initial value of R_t^i is 0 or 1, and R_t^i = 0 indicates that ME_i can currently be selected;
R_t^all, S_t^i and R_t^i all take values in [0, 1], which makes it possible to handle workpieces with different numbers of procedures effectively and gives the model a better perception of the data;
in the action space definition, an action a_t represents an action the Agent can take in the current state s_t, related to the decision policy; in the process route optimization problem, the action set A(s) is the set of all selectable serial numbers ID_x in state s;
in the reward function definition, the reward of the Agent is determined from the three optimization targets of total process cost, total process time and total process carbon emission, a larger reward being fed back to the Agent when the three targets are lower;
and thirdly, solving the model with a deep reinforcement learning method based on the Actor-Critic structure.
2. The method of claim 1, wherein in the second step the established objective functions for total process cost, total process time and total carbon emission are:
process total cost objective function:
C_all = Σ_{i=1}^{n} [C_m(MID_i) + C_t(TID_i)] + Σ_{i=1}^{n−1} [γ_1·C_mc + γ_2·C_tc]
where C_all is the total cost, C_mc is the machine tool change cost index, C_tc is the tool change cost index, C_m(MID_i) is the cost index of machine tool MID_i, C_t(TID_i) is the cost index of tool TID_i, γ_1 is the machine tool change function, γ_2 is the tool change function, n is the total number of procedures, TID_i and TID_{i+1} are the tool codes used in procedures i and i+1 respectively, and MID_i and MID_{i+1} are the machine tool codes used in procedures i and i+1 respectively;
wherein γ_1 is calculated as:
γ_1 = 0 if MID_i = MID_{i+1}; γ_1 = 1 if MID_i ≠ MID_{i+1};
and γ_2 is calculated as:
γ_2 = 0 if MID_i = MID_{i+1} and TID_i = TID_{i+1}; γ_2 = 1 otherwise;
process total time objective function:
T_all = Σ_{i=1}^{n} T_m^i + Σ_{i=1}^{n−1} [γ_1·TM_mc + γ_2·TM_tc]
where T_m^i is the machining time on the machine tool in procedure i, TM_mc is the machine tool change time index, and TM_tc is the tool change time index;
process total carbon emission objective function:
E_all = Σ_{i=1}^{n} (E_cut^i + E_aux^i) · F_ele
where E_cut^i and E_aux^i are the cutting energy consumption and auxiliary energy consumption of the i-th procedure, and F_ele is the carbon emission factor of electrical energy.
3. The method of claim 1, wherein in the third step, during training of the Actor-Critic algorithm, the Critic updates its parameters by minimizing the mean square error (MSE) between the value estimate and the actual return, and the Actor, which represents the policy function, is updated according to the temporal-difference method.
CN202210582122.0A (filed 2022-05-26, priority 2022-05-26): Process route multi-objective optimization method based on deep reinforcement learning. Status: Pending. Publication: CN115016405A (en)

Priority Applications (1)

Application Number	Priority Date	Filing Date	Title
CN202210582122.0A	2022-05-26	2022-05-26	Process route multi-objective optimization method based on deep reinforcement learning

Publications (1)

Publication Number	Publication Date
CN115016405A	2022-09-06

Family ID: 83071875

Country Status (1)

Country	Link
CN	CN115016405A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956750A (en) * 2023-09-19 2023-10-27 山东山大华天软件有限公司 Knowledge graph-based part process design method, system, medium and equipment
CN116956750B (en) * 2023-09-19 2023-12-19 山东山大华天软件有限公司 Knowledge graph-based part process design method, system, medium and equipment


Legal Events

Date Code Title Description
PB01	Publication
SE01	Entry into force of request for substantive examination