CN116384969A - Manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning

Info

Publication number: CN116384969A
Application number: CN202310333773.0A
Applicant and assignee: Zhengzhou University
Inventors: 叶正梗, 蔡志强, 司书宾, 王鑫, 柯勇伟, 李丁林, 周福礼
Original language: Chinese (zh)
Legal status: Pending

Classifications

    • G06Q10/20 Administration of product repair or maintenance
    • G06N3/02 Neural networks
    • G06N3/04 Neural network architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/06395 Quality analysis or management


Abstract

The invention provides a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning, which comprises the following steps: first, at the machine level, considering the dynamic production speed caused by machine failure shutdown, a machine reliability model accounting for feed-quality influence and a processing-quality model accounting for machine-reliability influence are constructed; second, the state and performance of the manufacturing network are systematically evaluated based on the reliability and quality models, and a combined optimization model for manufacturing-network maintenance and quality detection is constructed; finally, at the system level, taking the economic operation of the manufacturing network as the criterion of policy evaluation, the optimal quality-detection and maintenance policy for a given manufacturing-network state is learned through a designed deep deterministic policy gradient algorithm. The invention can well balance the contradiction between the economic benefit and the operational risk of the manufacturing network, and adapts well to dynamic and diversified manufacturing scenarios.

Description

Manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of quality and reliability, in particular to a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning.
Background
Mass customization brings higher production flexibility but increases the complexity of the manufacturing system. In this context, flexible high-tech manufacturing equipment continues to emerge, which greatly enriches the process routes of products, so that the manufacturing system, with machines as nodes and product flows as edges, exhibits the characteristics of a complex network. The increase in machine flexibility and structural complexity strengthens the nonlinear characteristics of the manufacturing system, which raises the difficulty of operating and managing a network-structured manufacturing system and weakens the profit growth brought by flexible machines.
Operation control of a manufacturing system refers to optimizing system performance through production management methods, among which machine maintenance and quality inspection of work-in-process are two important management measures. Integrated optimization through joint control of production scheduling, product quality, machine reliability and the like has become the first choice for improving manufacturing-system performance. Existing research has attempted different forms of integration, such as preventive maintenance (PM) alone, integration of production planning with maintenance, or integration of production planning, maintenance and quality inspection. Currently, research on joint optimization of manufacturing systems focuses mainly on simulation-based methods. In addition, traditional dynamic programming, integer programming and heuristic algorithms are also important approaches to this problem. These traditional optimization methods have great potential for optimizing small-scale systems, such as stand-alone manufacturing systems, simple serial manufacturing systems, or unitized manufacturing systems. However, they remain significantly inadequate for optimizing large-scale manufacturing systems with complex structures. While heuristic algorithms, such as genetic algorithms, can effectively optimize multi-stage serial or parallel manufacturing systems, their effectiveness in large-scale manufacturing systems with complex structures has not been verified.
In many manufacturing systems, the diversification of process routes gives them the features of a complex network. In addition to the large system scale and structural complexity, interaction is another key factor that makes joint optimization of manufacturing networks difficult, especially the interaction between work-in-process quality and machine reliability. Currently, the interplay between reliability and quality in manufacturing systems draws the attention of many researchers. For example, in joint optimization of production and maintenance, the literature [Hajej Z, Rezg N, Gharbi A. Quality issue in forecasting problem of production and maintenance policy for production unit. International Journal of Production Research. 2018;56:6147-63] proposes a cumulative reject-rate function affected by the machine failure rate; in maintenance optimization of serial multi-station manufacturing systems, the literature [Methou X, Lu B. Predictive maintenance scheduling for serial multi-station manufacturing systems with interaction between station reliability and product quality. Computers & Industrial Engineering. 2018;122:283-91] models the interplay between machine reliability and product quality; in performance evaluation of automated production lines and series-parallel manufacturing systems, the literature [Ye Z, Cai Z, Si S, Zhang S, Yang H. Competing Failure Modeling for Performance Analysis of Automated Manufacturing Systems with Serial Structures and Imperfect Quality Inspection. IEEE Transactions on Industrial Informatics. 2020;16:6476-86] constructs a decision-diagram model and a stochastic model to characterize the interplay between reliability and quality. Furthermore, the literature [Wang L, Bai Y, Huang N, Wang Q. Fractal-based Reliability Measure for Heterogeneous Manufacturing Networks. IEEE Transactions on Industrial Informatics. 2019;15:6407-14] proposes a reliability metric for manufacturing networks through a fractal-based approach. These studies attempt to solve the above problems of interactive behavior or large scale, but each studies only one aspect of interactive behavior or structural complexity in isolation and does not fully grasp the characteristics of the manufacturing network.
As mentioned above, joint optimization of manufacturing networks with complex structures and interaction behavior is a tricky problem, especially when we try to solve it with traditional methods. Traditional methods control or optimize specific manufacturing scenes well but perform poorly in dynamic and diversified manufacturing scenes. The development of artificial intelligence (AI), however, brings promise for effectively addressing this real-time control and optimization problem. AI-based methods can realize higher-value-added manufacturing and create greater flexibility for intelligent customization factories. They are therefore widely used in related fields, such as risk assessment, intelligent maintenance, quality control, and dynamic loading strategies for repairable systems. Among the many AI-based approaches, reinforcement learning has achieved dramatic success in various control tasks for nonlinear, high-dimensional and dynamic systems. Because of its effectiveness, many reinforcement-learning-based control studies are applied to the optimization of dynamic systems, such as maintenance optimization of serial production lines and multi-state engineering systems, production-maintenance joint optimization for degrading manufacturing systems, and coverage and connectivity maintenance of wireless sensor networks.
In studies of manufacturing-system operation control, structural complexity has not received sufficient attention. In addition, simplifying the system state to a single continuous or discrete type is incompatible with actual production states. Meanwhile, discrete maintenance activities have become the mainstream choice of action settings, for example using discrete maintenance activities to build joint control problems combined with production scheduling. However, current research does not take into account the continuity of quality-detection activities. The choice of reinforcement-learning algorithm explains this gap: existing operation-control algorithms for manufacturing systems perform well in finite-horizon Markov decision processes but fall short for large-scale manufacturing systems with complex structures and diverse state-action behavior. With more advanced reinforcement-learning algorithms, research on the control of non-manufacturing systems provides good examples of large-scale problems with state-action diversity, for example the application of the actor-critic algorithm to a 13-component system and of the deep deterministic policy gradient algorithm to a 14-component series-parallel system. In addition, some scholars consider inspection activities in their action settings, but such maintenance-oriented inspection is quite different from the quality-inspection activities of manufacturing systems: in those studies, inspection is treated as just one discrete action. In actual production, however, quality inspection in a manufacturing system has a continuous action space, and products can be sample-inspected at any ratio in [0,1]. Most importantly, research on non-manufacturing systems does not consider the interaction behavior between components. Therefore, whether these advanced algorithms suit large-scale manufacturing networks with reliability-quality interaction remains to be further explored.
Disclosure of Invention
In order to balance the contradiction between the economic benefit and the operational risk of the manufacturing network, the invention provides a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning (DRL), realizing maintenance and quality-detection optimization of a large-scale manufacturing network with reliability-quality interaction behavior through machine-learning-based reliability-quality joint control.
The technical scheme of the invention is realized as follows:
a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning comprises the following steps:
step one: for a machine level, under the condition of considering dynamic production speed caused by machine fault shutdown, constructing a machine reliability model considering feeding quality influence and a processing quality model considering machine reliability influence;
step two: performing a systematic evaluation of manufacturing network status and performance based on the reliability model and the quality model; constructing a manufacturing network maintenance and quality detection combined optimization model;
step three: at the system level, taking the economic operation of the manufacturing network as the criterion of policy evaluation, the optimal quality-detection and maintenance policy for a given manufacturing-network state is learned through a designed deep deterministic policy gradient algorithm.
At the machine level, the construction method of the reliability model and the quality model comprises the following steps:
calculating the dynamic production speed:
considering each machine as a node, an acyclic manufacturing network of n nodes is modeled as a directed acyclic graph G(V,E), where $V=\{v_1,v_2,\dots,v_n\}$ is the set of nodes of the manufacturing network and $E \subseteq V \times V$ is the set of directed edges in the manufacturing network; i, j denote nodes;
when no machine in the manufacturing network is down, the production speed of each machine is defined as its maximum production speed, denoted $P_{rm}=[P_{rm1},P_{rm2},\dots,P_{rmn}]$; the actual production speed of the machines in the manufacturing network is denoted $P_{ra}(t)=[P_{ra1}(t),P_{ra2}(t),\dots,P_{ran}(t)]$ and satisfies $P_{ra} \le P_{rm}$, where $P_{rmn}$ is the maximum production speed of the n-th machine and $P_{ran}(t)$ its actual production speed;
when the production speed of node i changes by $\Delta P_{rai}(t)$, the production-speed changes of its immediately upstream node k and immediately downstream node j are respectively expressed as

$$\Delta P_{rak}(t)=w^{-}_{ki}\cdot\Delta P_{rai}(t),\quad k\in N^{-}_{i},$$

$$\Delta P_{raj}(t)=w^{+}_{ij}\cdot\Delta P_{rai}(t),\quad j\in N^{+}_{i},$$

where $N^{-}_{i}$ is the set of immediately upstream nodes connected to node i and $N^{+}_{i}$ the set of immediately downstream nodes connected to node i; $w^{-}_{ki}$ is the probability that work-in-process entering node i flows from upstream node k, and $w^{+}_{ij}$ the probability that work-in-process leaving node i flows into downstream node j;
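As an illustration, the following Python sketch propagates a production-speed change one hop through such a network; the 4-machine topology, weight values and function names are hypothetical assumptions for illustration, not part of the patent.

```python
import numpy as np

# Hypothetical 4-machine network: w_plus[i, j] is the probability that
# work-in-process leaving node i flows to downstream node j.
w_plus = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])

# Backward weights w_minus[k, i]: probability that work-in-process
# entering node i came from upstream node k (columns normalised).
col_sums = w_plus.sum(axis=0)
w_minus = np.divide(w_plus, col_sums, out=np.zeros_like(w_plus),
                    where=col_sums > 0)

def propagate_speed_change(i, delta_p):
    """One-hop effect of a speed change delta_p at node i:
    Delta P_rak = w_minus[k, i] * delta_p for immediate upstream nodes k,
    Delta P_raj = w_plus[i, j] * delta_p for immediate downstream nodes j."""
    n = w_plus.shape[0]
    upstream = {k: w_minus[k, i] * delta_p for k in range(n) if w_plus[k, i] > 0}
    downstream = {j: w_plus[i, j] * delta_p for j in range(n) if w_plus[i, j] > 0}
    return upstream, downstream

print(propagate_speed_change(1, -2.0))  # machine 1 slows by 2 parts per unit time
```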
Calculating dynamic maintenance cost:
when the dynamics of the production speed are considered, the failure rate while processing qualified feed is defined as the base failure rate $r_b(t)$:

$$r_b(t)=(\beta/\alpha)\cdot(t_r/\alpha)^{\beta-1},$$

where $\alpha$ is the scale parameter and $\beta$ the shape parameter;

$$t_r=\int_{0}^{t}\frac{P_{ra}(\tau)}{P_{rm}}\,d\tau$$

is the relative run time of the machine calculated with the maximum production speed as the standard; t is the actual operating time of the machine;
considering the impact that may be caused by off-grade feed, the actual failure rate r(t) is defined as

$$r(t)=r_b(t)+\sum_{i'=1}^{N(t)}\Delta r_{i'},$$

where $\Delta r_{i'}$ is the failure-rate increment caused by the i'-th reject feed processed by the machine, and N(t) is the number of rejects processed by the machine in [0, t); the probability distribution function F(t) of machine failure is derived as

$$F(t)=1-\exp\!\left(-(t_r/\alpha)^{\beta}-\int\Delta r(t)\,dt\right),$$

where $\Delta r(t)=\sum_{i'=1}^{N(t)}\Delta r_{i'}$ is the accumulated failure-rate increase from machining reject feed up to time t; the integral of $\Delta r(t)$ is calculated as

$$\int_{0}^{t}\Delta r(\tau)\,d\tau=\sum_{i'=1}^{N(t)}\Delta r_{i'}\cdot(t-t_{i'}),$$

where $t_{i'}$ is the actual occurrence time of the i'-th failure-rate increment;
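A minimal Python sketch of this failure model follows; the Weibull parameters, the Beta-distributed increments and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 100.0, 2.0        # hypothetical Weibull scale and shape parameters

def base_failure_rate(t_r):
    """r_b(t) = (beta/alpha) * (t_r/alpha)**(beta-1), t_r = relative run time."""
    return (beta / alpha) * (t_r / alpha) ** (beta - 1)

def actual_failure_rate(t_r, delta_rs):
    """r(t) = r_b(t) plus the accumulated increments caused by reject feed."""
    return base_failure_rate(t_r) + sum(delta_rs)

def failure_probability(t_r, t, delta_rs, t_events):
    """F(t) = 1 - exp(-(t_r/alpha)**beta - integral of delta r); the
    integral is sum_i delta_r_i * (t - t_i) for increments at times t_i."""
    integral = sum(dr * (t - ti) for dr, ti in zip(delta_rs, t_events))
    return 1.0 - np.exp(-(t_r / alpha) ** beta - integral)

# Two reject feeds processed at hypothetical times 10 and 25, each adding
# a small Beta(2, 8)-distributed increment to the failure rate.
delta_rs = rng.beta(2, 8, size=2) * 0.001
print(failure_probability(t_r=40.0, t=40.0, delta_rs=delta_rs,
                          t_events=[10.0, 25.0]))
```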
suppose the repair times of corrective and preventive maintenance of the machine respectively follow the normal distributions $N(\mu_{cm},\sigma^{2}_{cm})$ and $N(\mu_{pm},\sigma^{2}_{pm})$, with $\mu_{cm}\ge\mu_{pm}$ and $\sigma_{cm}\ge\sigma_{pm}$; let the maintenance costs per unit time of corrective and preventive maintenance be $c_{cm}$ and $c_{pm}$ respectively; then the total maintenance cost $c_m(t)$ of the machine in [0, t) can be expressed as

$$c_m(t)=c_{cm}\sum_{i1=1}^{N_{cm}(t)}t_{cm\_i1}+c_{pm}\sum_{j1=1}^{N_{pm}(t)}t_{pm\_j1},$$

where $N_{cm}(t)$ is the number of corrective maintenance actions in [0, t), $N_{pm}(t)$ the number of preventive maintenance actions in [0, t), $t_{cm\_i1}$ the time spent on the i1-th corrective maintenance of the machine, and $t_{pm\_j1}$ the time spent on the j1-th preventive maintenance;
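The maintenance-cost bookkeeping can be sketched as follows in Python; all cost rates and repair-time parameters are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cost rates and repair-time distributions.
c_cm, c_pm = 50.0, 20.0            # maintenance cost per unit repair time
mu_cm, sigma_cm = 4.0, 1.0         # corrective repair time  ~ N(4, 1)
mu_pm, sigma_pm = 2.0, 0.5         # preventive repair time ~ N(2, 0.25)

def total_maintenance_cost(n_cm, n_pm):
    """c_m(t) = c_cm * sum of corrective repair times
              + c_pm * sum of preventive repair times in [0, t)."""
    t_cm = np.clip(rng.normal(mu_cm, sigma_cm, n_cm), 0.0, None)
    t_pm = np.clip(rng.normal(mu_pm, sigma_pm, n_pm), 0.0, None)
    return c_cm * t_cm.sum() + c_pm * t_pm.sum()

print(total_maintenance_cost(n_cm=2, n_pm=3))
```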
constructing a dynamic processing quality and detection activity model:
defining M(t) as the number of rejects produced in [0, t), M(t) follows a non-homogeneous Poisson process with intensity function $\lambda(t)$:

$$\lambda(t)=\omega-\varepsilon\cdot e^{-\delta\cdot r(t)},$$

where $\omega>0$ is the maximum reject-generation intensity, $\varepsilon>0$ and $\delta>0$ are the influence coefficients of the failure rate on the intensity function, and $\omega-\varepsilon<\lambda(t)<\omega<P_{ra}(t)$; defining $n_d$ as the number of rejects produced by the machine in [t, t+Δt), its probability is

$$P\{M(t+\Delta t)-M(t)=n_d\}=\frac{[m(t+\Delta t)-m(t)]^{n_d}}{n_d!}\,e^{-[m(t+\Delta t)-m(t)]},$$

where $m(t)=\int_{0}^{t}\lambda(\tau)\,d\tau$ is the expected number of rejects produced in [0, t) and Δt is the quality-statistics period; defining $n_q$ as the total number of qualified products processed by the machine in [t, t+Δt), it is expressed as

$$n_q=\int_{t}^{t+\Delta t}P_{ra}(\tau)\,d\tau-n_d;$$
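A short Python sketch of sampling reject counts from this NHPP over one statistics period follows; the intensity parameters are hypothetical, and the intensity is approximated as constant over the short period Δt.

```python
import numpy as np

rng = np.random.default_rng(2)
omega, eps, delta = 1.0, 0.8, 5.0      # hypothetical intensity parameters

def reject_intensity(r_t):
    """lambda(t) = omega - eps * exp(-delta * r(t))."""
    return omega - eps * np.exp(-delta * r_t)

def counts_in_interval(r_t, p_ra, dt):
    """Sample the reject count n_d in [t, t+dt) from a Poisson with mean
    lambda(t)*dt (intensity taken as constant over the short statistics
    period), and derive the qualified output n_q from the total output."""
    n_total = int(round(p_ra * dt))           # work-in-process processed
    mean_d = reject_intensity(r_t) * dt
    n_d = min(rng.poisson(mean_d), n_total)
    return n_d, n_total - n_d                 # (n_d, n_q)

print(counts_in_interval(r_t=0.05, p_ra=1.5, dt=50.0))
```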
in the detection activity, the misjudgments of rejects and qualified products are: type I error, denoting false rejection, with probability $p_I$; type II error, denoting false acceptance, with probability $p_{II}$; assume the sampling ratio of the detection activity is $s_a\in[0,1]$; then the joint probability that any qualified product suffers a type I error is $s_a\cdot p_I$, and the joint probability that any reject suffers a type II error is $s_a\cdot p_{II}$; the counts of type I and type II errors, $n_{fr}$ and $n_{fa}$, respectively follow the binomial distributions $B(n_q, s_a\cdot p_I)$ and $B(n_d, s_a\cdot p_{II})$; thus the number of rejects M'(t) leaving the machine is the compound of the non-homogeneous Poisson process M(t) and the binomial distribution $B(n_d, s_a\cdot p_{II})$, and the probability that $n_{fa}$ reject work-in-process flow out of the machine in [t, t+Δt) is

$$P\{M'(t+\Delta t)-M'(t)=n_{fa}\}=\frac{[m'(t+\Delta t)-m'(t)]^{n_{fa}}}{n_{fa}!}\,e^{-[m'(t+\Delta t)-m'(t)]},$$

where $m'(t)=\int_{0}^{t}s_a\,p_{II}\,\lambda(\tau)\,d\tau$ is the average number of reject work-in-process flowing out of the machine in [0, t); in addition, the probability that qualified products are rejected in [t, t+Δt) is

$$P\{D(t+\Delta t)-D(t)=n_{fr}\}=\binom{n_q}{n_{fr}}(s_a p_I)^{n_{fr}}(1-s_a p_I)^{n_q-n_{fr}},$$

where D(t) is the number of rejected qualified products accumulated up to time t and $\binom{n_q}{n_{fr}}$ is the number of possible combinations of drawing $n_{fr}$ samples from the $n_q$ qualified products;
the number of products correctly judged unqualified, $n_{cr}$, and the number correctly judged qualified, $n_{ca}$, among the products processed by the machine in [t, t+Δt) are expressed as

$$n_{cr}=n_d-n_{fa},\qquad n_{ca}=n_q-n_{fr};$$

on this basis, the proportion of work-in-process judged unqualified by the machine through the detection activity in [t, t+Δt) is obtained as

$$q=\frac{n_{cr}+n_{fr}}{n_d+n_q};$$
assume the cost for the machine to inspect a single product is $c_{ins}$; the total detection cost of the machine in [t, t+Δt) is then

$$c_I=c_{ins}\cdot s_a\cdot(n_d+n_q);$$
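The detection activity can be simulated directly from the two binomial distributions; in the following Python sketch the error probabilities, unit inspection cost and counts are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
p_I, p_II, c_ins = 0.02, 0.05, 0.3     # hypothetical error rates and unit cost

def inspect(n_q, n_d, s_a):
    """Sample type-I / type-II error counts and the inspection cost
    for one quality-statistics period [t, t+dt)."""
    n_fr = rng.binomial(n_q, s_a * p_I)    # good items falsely rejected
    n_fa = rng.binomial(n_d, s_a * p_II)   # rejects falsely accepted
    n_ca = n_q - n_fr                      # correctly accepted
    n_cr = n_d - n_fa                      # correctly rejected
    cost = c_ins * s_a * (n_q + n_d)       # cost of the sampled inspections
    return n_ca, n_fa, n_cr, n_fr, cost

print(inspect(n_q=70, n_d=5, s_a=0.4))
```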
assume the accumulated value increment brought to a single work-in-process by processing from the input node of the manufacturing network to the current node i is $v_{si}$; the average process value increment $v_{ai}$ added to a product by machining a single piece at node i can be defined as

$$v_{ai}=v_{si}-\sum_{k\in N^{-}_{i}}w^{-}_{ki}\cdot v_{sk},$$

where $v_{sk}$ is the accumulated value increment at node k upstream of node i;

the value loss caused by a reject escaping through a type II error during detection is defined as $v_{II\_i}$, with $v_{II\_i}>v_{si}$;

thus the net value increment $v_{net\_i}$ brought by all work-in-process of machine i in [0, t) is the value gain of all correctly accepted work-in-process minus the value loss of the falsely accepted work-in-process and the value loss of all rejected work-in-process:

$$v_{net\_i}=n_{ca\_i}\cdot v_{ai}-n_{fa\_i}\cdot v_{II\_i}-(n_{cr\_i}+n_{fr\_i})\cdot v_{si},$$

where $n_{ca\_i}$ is the number of work-in-process correctly accepted by the machine at node i, $n_{fa\_i}$ the number falsely accepted, $n_{cr\_i}$ the number correctly rejected, and $n_{fr\_i}$ the number falsely rejected.
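Under the net-value expression reconstructed above (an interpretation of the garbled original equation, not a verbatim formula from the patent), a one-function Python sketch with hypothetical per-item values reads:

```python
def net_value_increment(n_ca, n_fa, n_cr, n_fr, v_a, v_s, v_ii):
    """v_net = gain of correctly accepted items minus the losses of
    falsely accepted items and of all rejected items."""
    return n_ca * v_a - n_fa * v_ii - (n_cr + n_fr) * v_s

# Hypothetical per-item values: v_a = process value added at this node,
# v_s = value accumulated up to this node, v_ii > v_s = loss of an escape.
print(net_value_increment(n_ca=68, n_fa=1, n_cr=4, n_fr=2,
                          v_a=3.0, v_s=12.0, v_ii=18.0))
```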
The method for systematically evaluating the manufacturing-network state and performance is as follows:

for a manufacturing network with n machines, a state matrix $S_K$ is constructed for the state at time t:

$$S_K=[Q_K;D_K;H_K;O_K],$$

where t = KΔt; $Q_K=[q_1,q_2,\dots,q_n]$ is the quality state of each machine in [t−Δt, t); $D_K=[t_{r1},t_{r2},\dots,t_{rn}]$ represents the degradation state of the machines at time t; $H_K=[h_1,h_2,\dots,h_n]$ is the health state of the machines at time t, where $h_i\in\{0,1\}$, 1 denoting a failed machine and 0 a machine without failure; $O_K=[o_1,o_2,\dots,o_n]$ is the idle state of the machines at time t, where $o_i\in\{0,1\}$, 1 denoting an idle state and 0 an operating state;
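A minimal Python sketch of assembling this 4 x n state matrix follows; the machine count and snapshot values are hypothetical.

```python
import numpy as np

def build_state(q, t_r, h, o):
    """Stack the four machine-level vectors into the 4 x n state
    matrix S_K = [Q_K; D_K; H_K; O_K] used as the DRL observation."""
    return np.vstack([q, t_r, h, o]).astype(np.float32)

# Hypothetical 4-machine snapshot at t = K * dt.
S_K = build_state(q=[0.03, 0.05, 0.01, 0.08],   # defect ratios in [t-dt, t)
                  t_r=[12.0, 40.5, 7.2, 55.0],  # relative run times
                  h=[0, 0, 1, 0],               # 1 = failed
                  o=[0, 1, 0, 0])               # 1 = idle
print(S_K.shape)                                # (4, 4)
```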
the reward $r_K$ is defined as the net benefit obtained by the manufacturing network in the process of transitioning from state $S_K$ to $S_{K+1}$ in [t, t+Δt):

$$r_K=\sum_{i=1}^{n}v_{net\_i}-\sum_{i=1}^{n}c_{I\_i}-\sum_{i=1}^{n}c_{m\_i}-c_D,$$

where $c_{I\_i}$ is the total detection cost of the machine at node i, $c_{m\_i}$ its total maintenance cost, and $c_D$ the decision cost of maintenance and detection actions;

to evaluate the cumulative performance from state $S_K$ to $S_{K'}$ in [KΔt, K'Δt), the benefit $G_K$ is defined as the long-term return of the manufacturing network, obtained as the cumulative reward

$$G_K=\sum_{k=K}^{K'-1}\gamma^{\,k-K}\,r_k,$$

where K' > K and $\gamma\in[0,1]$ is the discount factor.
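The reward and the long-term return can be sketched as follows in Python; the discount factor gamma and all cost values are assumptions of this sketch, since the source text describes the return only as a cumulative reward.

```python
def step_reward(v_net, c_i, c_m, c_d):
    """r_K = total net value increment minus inspection, maintenance
    and decision costs for one step [t, t+dt)."""
    return sum(v_net) - sum(c_i) - sum(c_m) - c_d

def long_term_return(rewards, gamma=0.99):
    """G_K = (discounted) cumulative reward over [K*dt, K'*dt)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [step_reward([200.0, 150.0], [10.0, 8.0], [25.0, 0.0], 5.0)
           for _ in range(10)]
print(long_term_return(rewards))
```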
The construction method of the manufacturing network maintenance and quality detection combined optimization model comprises the following steps:
quality detection and preventive maintenance are regarded as actions, denoted

$$A_K=[A^{ins}_K, A^{pm}_K],$$

where $A^{ins}_K=[s_{a1},s_{a2},\dots,s_{an}]$ is the quality-detection action and $A^{pm}_K=[a^{pm}_1,a^{pm}_2,\dots,a^{pm}_n]$ the preventive-maintenance action;

the actions of all machines in the manufacturing network at t = KΔt depend on the state $S_K$ and are therefore denoted $A_K=\pi(S_K)$, where $\pi(\cdot)$ is the policy function:

$$\pi^{*}(S_K)=\arg\max_{A_K}Q(S_K,A_K),$$

where $\pi^{*}(\cdot)$ is the optimal policy function and $Q(\cdot)$ the long-term-return function of taking action $A_K$ in state $S_K$;

under the optimal policy, the value function and the Q function satisfy

$$V(S_K)=\max_{A_K}Q(S_K,A_K),$$

where $V(\cdot)$ is the maximum long-term return in state $S_K$.
The designed method of learning the optimal quality-detection and maintenance policy for a given manufacturing-network state with the deep deterministic policy gradient (DDPG) algorithm is as follows:

Step 1: execute the current actions and simulate the operation of the manufacturing network in [KΔt, K'Δt), where t = KΔt;

Step 1.1: evaluate the state of the manufacturing network at t = KΔt: in the learning environment, based on the proposed dynamic reliability and quality models, evaluate the machine states at t = KΔt and then the manufacturing-network state $S_K$; provide $S_K$ to the Agent as the first observation of the manufacturing network;

Step 1.2: generate an action based on the current policy function π(S): an action $A_K$ is obtained by feeding the state $S_K$ into the Actor network μ(S); the DDPG algorithm adds normally distributed random noise $N_r$ to explore better policies within the allowable actions, i.e. $A_K=\pi(S_K)+N_r$; the preventive-maintenance action is then converted into a discrete executable action {0,1} according to the preventive-maintenance criterion $c_d$, where 0 means preventive maintenance is not performed and 1 means it is performed;

Step 1.3: execute the actions and simulate the operation of the manufacturing network in [KΔt, KΔt+Δt): after the action $A_K$ at t = KΔt is obtained, the quality detection of each machine immediately adopts the new sampling ratio $A^{ins}_K$ during subsequent operation; at the same time, preventive maintenance actions are performed on the corresponding machines, the preventive-maintenance time $t_{pm}$ being drawn from the normal distribution $N(\mu_{pm},\sigma^{2}_{pm})$; if a machine fails during [KΔt, KΔt+Δt), corrective maintenance is performed immediately upon the failure, the corrective-maintenance time $t_{cm}$ being drawn from $N(\mu_{cm},\sigma^{2}_{cm})$;

Step 1.4: evaluate the reward $r_K$: calculate the reward $r_{i2}$ and update i2 = i2 + 1; if i2 < K', return to Step 1.1; otherwise execute Step 2;

Step 2: acquire the state-transition record: after operation over [KΔt, K'Δt), the long-term return $G_K$ is obtained; following the method of Step 1.1, the state $S_{K'}$ at t = K'Δt is obtained; the state-transition record $\{S_K,A_K,G_K,S_{K'}\}$ is then stored in the experience buffer;

Step 3: update the Actor and Critic networks: randomly extract a minibatch of M transition records from the experience buffer to update the Actor network μ(S) and the Critic network Q(S,A); the maximum capacity of the experience buffer is L, and when the number of transition records reaches L the earliest record is discarded;

Step 4: judge the termination condition: if the number of training episodes reaches the predetermined maximum, or a stable long-term return $G_K$ is obtained, stop training; otherwise update the simulation period: [KΔt, K'Δt) ← [K'Δt, 2K'Δt−KΔt), and return to Step 1.
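A skeleton of this training loop is sketched below in Python; the environment step and Actor network are stubs standing in for the manufacturing-network simulator and the neural networks, and all sizes and constants are hypothetical.

```python
import random
from collections import deque

import numpy as np

L_MAX, M_BATCH = 10_000, 64        # buffer capacity L and minibatch size M
buffer = deque(maxlen=L_MAX)       # oldest records are discarded at capacity

def env_step(state, action):
    """Stub for the manufacturing-network simulator: one epoch of
    operation returning (G_K, next_state). Replace with the dynamic
    reliability/quality models described above."""
    return float(np.random.rand()), np.random.rand(*state.shape)

def actor(state):
    """Stub policy network mu(S); returns sampling ratios and PM scores."""
    return np.random.rand(2, state.shape[1])

c_d = 0.5                          # preventive-maintenance criterion
state = np.zeros((4, 4), dtype=np.float32)

for episode in range(100):
    raw = actor(state) + np.random.normal(0.0, 0.1, size=(2, 4))  # explore
    action = np.clip(raw, 0.0, 1.0)
    action[1] = (action[1] >= c_d).astype(float)   # discretise PM actions
    g_k, next_state = env_step(state, action)
    buffer.append((state, action, g_k, next_state))
    if len(buffer) >= M_BATCH:
        batch = random.sample(list(buffer), M_BATCH)
        # ... update the Actor/Critic networks here (see the update sketch)
    state = next_state
```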
The Actor and Critic networks are updated as follows:

a): randomly extract a batch of M transition records from the experience buffer: $\{(S_{K,i3},A_{K,i3},G_{K,i3},S_{K',i3})\}_{i3=1}^{M}$;

b): for each transition record i3 = 1,2,…,M, calculate the future target action $A_{K',i3}=\mu'(S_{K',i3})$ and the target future long-term return $Q'(S_{K',i3},A_{K',i3})$, and set the target

$$y_{i3}=G_{K,i3}+\gamma\,Q'\bigl(S_{K',i3},\mu'(S_{K',i3})\bigr);$$

c): update the parameters $\theta_Q$ of the Critic network Q(S,A) by minimizing the loss function

$$L(\theta_Q)=\frac{1}{M}\sum_{i3=1}^{M}\bigl(y_{i3}-Q(S_{K,i3},A_{K,i3})\bigr)^{2};$$

d): update the parameters $\theta_\mu$ of the Actor network μ(S) by maximizing the expected cumulative long-term return, using the policy gradient

$$\nabla_{\theta_\mu}J\approx\frac{1}{M}\sum_{i3=1}^{M}\nabla_{A}Q(S,A)\big|_{S=S_{K,i3},\,A=\mu(S_{K,i3})}\cdot\nabla_{\theta_\mu}\mu(S)\big|_{S=S_{K,i3}};$$

e): update the parameters of the target Actor network and the target Critic network:

$$\theta_{\mu'}=\tau\theta_{\mu}+(1-\tau)\theta_{\mu'},$$

$$\theta_{Q'}=\tau\theta_{Q}+(1-\tau)\theta_{Q'},$$

where τ is the smoothing factor.
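Steps a) to e) can be sketched with PyTorch roughly as follows; the network sizes, learning rates and the use of $G_K$ as the observed return in the target are assumptions of this sketch, not a verbatim implementation from the patent.

```python
import torch
import torch.nn as nn

n, gamma, tau = 4, 0.99, 0.005          # machines, discount, smoothing factor

def mlp(sizes, out_act=None):
    """Fully connected network with ReLU hidden layers."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    layers = layers[:-1] + ([out_act] if out_act else [])
    return nn.Sequential(*layers)

actor = mlp([4 * n, 64, 64, 2 * n], nn.Sigmoid())     # mu(S)
critic = mlp([4 * n + 2 * n, 64, 64, 1])              # Q(S, A)
actor_t = mlp([4 * n, 64, 64, 2 * n], nn.Sigmoid())   # target mu'
critic_t = mlp([4 * n + 2 * n, 64, 64, 1])            # target Q'
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(S, A, G, S2):
    """One DDPG update from a minibatch of flattened (S, A, G, S') records."""
    with torch.no_grad():                 # b) targets from target networks
        y = G + gamma * critic_t(torch.cat([S2, actor_t(S2)], dim=1))
    loss_q = nn.functional.mse_loss(critic(torch.cat([S, A], dim=1)), y)
    opt_c.zero_grad(); loss_q.backward(); opt_c.step()      # c) critic step

    loss_pi = -critic(torch.cat([S, actor(S)], dim=1)).mean()
    opt_a.zero_grad(); loss_pi.backward(); opt_a.step()     # d) actor step

    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)        # e) soft update

M = 64
update(torch.rand(M, 4 * n), torch.rand(M, 2 * n),
       torch.rand(M, 1), torch.rand(M, 4 * n))
```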
Compared with the prior art, the invention has the beneficial effects that:
1) The invention provides a mathematical model of the nonlinear, high-dimensional and dynamic environment of a manufacturing network, offering an effective state-transition model for manufacturing-network control.
2) On the basis of considering reliability-quality interaction behavior, an effective DRL model suitable for reliability-quality joint control of manufacturing networks is constructed, capable of modeling a physical system that contains both discrete-continuous mixed states and mixed actions; the DRL model adapts well to dynamic and diversified manufacturing scenarios.
3) To realize reliability-quality joint control of a dynamic manufacturing network, the invention constructs a machine-learning model driven by deep neural networks based on the DDPG algorithm under the established mixed maintenance and quality-detection action space; according to the learning results in different manufacturing scenarios, the method can realize an optimal manufacturing-system control strategy.
4) The model provided by the invention can well balance the contradiction between the economic benefit and the operational risk of the manufacturing network.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a fabrication network formed by multiple process paths.
FIG. 2 is a cascade effect of machine faults in a network structure manufacturing system node.
Fig. 3 is a flow chart of the present invention.
Fig. 4 is the neural network structure of the DDPG algorithm of the present invention; (a) Actor network; (b) Critic network.
Fig. 5 is a training flow chart of the DRL model based on the DDPG algorithm of the present invention.
Fig. 6 is a flowchart of updating the Actor and Critic networks in the DDPG algorithm of the present invention.
Fig. 7 is a directed acyclic manufacturing network of an example of the invention.
FIG. 8 is a training trajectory diagram of the highest benefits of the DRL model and the genetic algorithm under different manufacturing scenarios; (a) step size Δt=50; (b) step size Δt=100; (c) step size Δt=500.
FIG. 9 is a scatter plot of rewards per unit time obtained by three trained agents.
FIG. 10 is a graph of manufacturing network connectivity under the control of differently trained DRL Agents; (a) step size Δt=50; (b) step size Δt=100; (c) step size Δt=500.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
In a flexible manufacturing environment, the variety of process routes causes random flows of work-in-process, which gives the manufacturing system the character of a network. As shown in Fig. 1, each machine can be regarded as a node of the manufacturing network and the process flows as edges connecting the nodes. In this case, for a given machine, its upstream machines are all machines preceding it on the relevant process routes, and its downstream machines are all machines following it. In a multi-stage manufacturing network, the flow of products brings feed to each machine at different stages, producing interaction between machine reliability and feed quality: off-grade feed increases the risk of machine failure; conversely, a degraded machine is more likely to produce out-of-spec work-in-process, which becomes off-grade feed for its immediate downstream machines (NDM). Without intervention, the interaction between machine reliability and feed quality propagates along the process routes, and machines further downstream face a higher risk of failure.
In a manufacturing system, a machine has two failure modes: a failure that shuts the machine down immediately is defined as a hard failure; a failure that is usually not directly observable and is corrected only upon inspection is defined as a soft failure. A hard failure causes machine downtime, bringing its production speed to zero. A soft failure is caused by component degradation and is triggered when the cumulative degradation level first exceeds a threshold. Before a soft failure occurs, machine degradation affects only the processing quality, not the production speed. In real industrial production, machine maintenance and quality inspection are both important activities for improving manufacturing-system performance: the former lets the machine recover from failure states, and the latter reduces losses in time by cutting off the flow of rejects between machines. When a hard failure occurs or a soft failure is detected, corrective maintenance (CM) is implemented to repair the failed machine. Furthermore, when degradation has not yet caused a soft failure, preventive maintenance can be performed to restore the degraded machine. Preventive maintenance also reduces the probability of hard failure and improves the processing quality of the machine.
In this situation, controlling the operation of the manufacturing network becomes a complex problem. When a machine fails and is not repaired in time, the effect of the failure propagates to upstream and downstream machines, whose production speeds then decrease. Furthermore, machines may become idle due to blockage or starvation (referred to as idle machines). As shown in Fig. 2, because M5 is the only immediately downstream machine of M3, machine M3 becomes idle when machine M5 at node 5 fails. Likewise, because machine M6 can continue operating at a reduced production speed, the production speeds of machines M4 and M7 decrease accordingly. Machines further downstream and upstream may also be affected before the failed machine is repaired. Once the failed machine is repaired, its own production speed and those of the affected machines recover. However, variations in production speed introduce more uncertainty into machine degradation, impairing the effectiveness of pre-established maintenance and quality-inspection plans.
In this context, the question is how to achieve optimal control of reliability and quality in a manufacturing network through dynamic machine maintenance and quality inspection. Accordingly, the invention proposes a deep-reinforcement-learning-based joint optimization method for manufacturing-network maintenance and inspection, under the following assumptions: (1) a discrete part-manufacturing process is considered; (2) each machine has a quality-inspection activity and can sample-inspect work-in-process at any ratio in [0%, 100%]; (3) work-in-process (WIP) has two quality states, good (defect-free) and bad (reject), which the quality-inspection process can distinguish; (4) a machine has two failure modes, hard failure and soft failure; (5) two maintenance activities, corrective and preventive maintenance, are performed, and both restore the machine to an as-new state. On this basis, the method explores the optimal reliability and quality control policy under economic operation of the manufacturing network, as shown in Fig. 3. First, at the machine level, a reliability model considering feed-quality influence (Q-R effect) and a quality model considering machine-reliability influence (R-Q effect) are constructed under the dynamic production speeds caused by machine downtime, and the machine states and the manufacturing-network state are systematically evaluated through the proposed models. Second, at the system level, taking the economic operation of the manufacturing network as the criterion of policy evaluation, an optimization model based on deep reinforcement learning is proposed to learn the optimal quality-detection and maintenance policy for a given manufacturing-network state.
Dynamic production speed: considering each machine as a node, an acyclic manufacturing network of n nodes is modeled as a directed acyclic graph G(V,E), where $V=\{v_1,v_2,\dots,v_n\}$ is the set of nodes, i.e. the set of machines, and $E\subseteq V\times V$ is the set of directed edges in the manufacturing network, i.e. the flow directions of work-in-process; i, j denote nodes. The forward weight matrix of the network edges is

$$W^{+}=[w^{+}_{ij}]_{n\times n},\qquad(1)$$

where the forward weight $w^{+}_{ij}$ of a directed edge is the probability that work-in-process leaving node i flows into downstream node j. In addition, the backward weight matrix $W^{-}$ of the directed edges can be calculated from $W^{+}$, where $w^{-}_{ki}$ is the probability that work-in-process entering node i comes from upstream node k. The set of immediately downstream nodes connected to node i is defined as $N^{+}_{i}$, and the set of immediately upstream nodes (NUM) connected to node i as $N^{-}_{i}$.
When no machine in the manufacturing network is down, the production speed of each machine is defined as its maximum production speed, denoted $P_{rm}=[P_{rm1},P_{rm2},\dots,P_{rmn}]$; the actual production speed is denoted $P_{ra}(t)=[P_{ra1}(t),P_{ra2}(t),\dots,P_{ran}(t)]$ and satisfies $P_{ra}\le P_{rm}$, where $P_{rmn}$ is the maximum production speed of the n-th machine and $P_{ran}(t)$ its actual production speed. During production, the actual production speed of a machine is closely related to its immediately upstream and downstream nodes so as to maintain balanced production. When the production speed of node i changes by $\Delta P_{rai}(t)$, the production-speed changes of its immediately upstream node k and immediately downstream node j are respectively

$$\Delta P_{rak}(t)=w^{-}_{ki}\cdot\Delta P_{rai}(t),\quad k\in N^{-}_{i},\qquad(2)$$

$$\Delta P_{raj}(t)=w^{+}_{ij}\cdot\Delta P_{rai}(t),\quad j\in N^{+}_{i}.\qquad(3)$$

Thereafter, the effect of $\Delta P_{rai}(t)$ propagates along its process routes to the source and sink nodes and causes corresponding changes on each machine along those routes.
Dynamic machine reliability and maintenance activities: in a manufacturing network, machines face dynamic failure probabilities during operation. Such dynamics can be attributed to three aspects: uncertainty in machine failure modes, dynamics of the production speed, and instability of the feed quality. Fortunately, the Weibull distribution can fit different failure modes of a machine, such as decreasing, constant, and increasing failure rates. Thus a failure rate following the Weibull distribution is suitable for modeling the failure risk of a machine. The failure rate when only qualified feed is processed is defined as the base failure rate $r_b(t)$; when the dynamics of the production speed are considered, the base failure rate is expressed as

$$r_b(t)=(\beta/\alpha)\cdot(t_r/\alpha)^{\beta-1},\qquad(4)$$

where $\alpha$ is the scale parameter and $\beta$ the shape parameter; $t_r$ is the relative run time of the machine measured against the maximum production speed and t the actual operating time:

$$t_r=\int_{0}^{t}\frac{P_{ra}(\tau)}{P_{rm}}\,d\tau.\qquad(5)$$

The relative time $t_r$ can also serve as an index of machine degradation.

Considering the impact that off-grade feed may cause, the actual failure rate r(t) is defined as

$$r(t)=r_b(t)+\sum_{i'=1}^{N(t)}\Delta r_{i'},\qquad(6)$$

where $\Delta r_{i'}$, the failure-rate increment caused by machine reject feed, follows the Beta distribution Beta(a,b), and N(t) is the number of rejects processed by the machine in [0, t); the probability distribution function F(t) of machine failure is derived as

$$F(t)=1-\exp\!\left(-(t_r/\alpha)^{\beta}-\int\Delta r(t)\,dt\right),\qquad(7)$$

where $\Delta r(t)=\sum_{i'=1}^{N(t)}\Delta r_{i'}$ is the accumulated failure-rate increase from machining reject feed up to time t; the integral of $\Delta r(t)$ is calculated as

$$\int_{0}^{t}\Delta r(\tau)\,d\tau=\sum_{i'=1}^{N(t)}\Delta r_{i'}\cdot(t-t_{i'}),\qquad(8)$$

where $t_{i'}$ is the actual occurrence time of the i'-th failure-rate increment.
Corrective and preventive maintenance are effective means of handling hard and soft machine failures. The invention assumes that both corrective and preventive maintenance restore the failure rate of the machine to its level at t = 0. It is also assumed that the repair times of corrective and preventive maintenance follow the normal distributions $N(\mu_{cm},\sigma^{2}_{cm})$ and $N(\mu_{pm},\sigma^{2}_{pm})$ respectively. Corrective maintenance usually has to recover the machine from an unexpected failure shutdown, which is more sudden than the situations faced by preventive maintenance; it is therefore assumed to take more time, so $\mu_{cm}\ge\mu_{pm}$ and $\sigma_{cm}\ge\sigma_{pm}$. Assuming the maintenance costs per unit time of corrective and preventive maintenance are $c_{cm}$ and $c_{pm}$ respectively, the total maintenance cost $c_m(t)$ of the machine in [0, t) can be expressed as

$$c_m(t)=c_{cm}\sum_{i1=1}^{N_{cm}(t)}t_{cm\_i1}+c_{pm}\sum_{j1=1}^{N_{pm}(t)}t_{pm\_j1},\qquad(9)$$

where $N_{cm}(t)$ is the number of corrective maintenance actions in [0, t), $N_{pm}(t)$ the number of preventive maintenance actions in [0, t), $t_{cm\_i1}$ the time spent on the i1-th corrective maintenance, and $t_{pm\_j1}$ the time spent on the j1-th preventive maintenance.
Constructing the dynamic processing quality and detection activity model: processing quality is another important manifestation of machine reliability and can be described by the number M(t) of rejects produced in a given time [0, t). A product failing to meet the quality specification is called a reject. Because machine reliability is unstable, the processing quality also has time-varying characteristics. The random variable M(t) is therefore modeled by a non-homogeneous Poisson process (NHPP) with intensity function λ(t):

$$\lambda(t)=\omega-\varepsilon\cdot e^{-\delta\cdot r(t)},\qquad(10)$$

where $\omega>0$ is the maximum reject-generation intensity, $\varepsilon>0$ and $\delta>0$ are the influence coefficients of the failure rate on the intensity function, and $\omega-\varepsilon<\lambda(t)<\omega<P_{ra}(t)$; when $\omega=P_{ra}(t)$ is defined, it can be shown that $\varepsilon=g\times\omega$, where $g\in[0,1]$ is the initial percentage of defect-free product produced by the machine. Defining $n_d$ as the number of rejects produced by the machine in [t, t+Δt), its probability is

$$P\{M(t+\Delta t)-M(t)=n_d\}=\frac{[m(t+\Delta t)-m(t)]^{n_d}}{n_d!}\,e^{-[m(t+\Delta t)-m(t)]},\qquad(11)$$

where $m(t)=\int_{0}^{t}\lambda(\tau)\,d\tau$ is the expected number of rejects produced in [0, t) and Δt is the quality-statistics period. Defining $n_q$ as the total number of qualified products processed by the machine in [t, t+Δt), it is expressed as

$$n_q=\int_{t}^{t+\Delta t}P_{ra}(\tau)\,d\tau-n_d.\qquad(12)$$
in a manufacturing network, the inspection activities are performed after the machining activities to ensure that defective articles can be found in time. In verification activities, typically a class I error (false reject) and a class II error (false accept) occur, with a probability of p I The method comprises the steps of carrying out a first treatment on the surface of the Class II error, probability p II . Accordingly, the correct judgment of defective products and qualified products, called correct acceptance and correct rejection, is performed by 1-p I And 1-p II To represent the corresponding probabilities. Can be used to p if the processing machine is not detecting activity I =0 and p II =1.
Assume the sampling ratio of the detection activity is $s_a\in[0,1]$ and that sampling and detection are independent; then the joint probability that any qualified product suffers a type I error is $s_a\cdot p_I$, and the joint probability that any reject suffers a type II error is $s_a\cdot p_{II}$. The counts of type I and type II errors, $n_{fr}$ and $n_{fa}$, respectively follow the binomial distributions $B(n_q,s_a\cdot p_I)$ and $B(n_d,s_a\cdot p_{II})$. Thus the number of rejects M'(t) leaving the machine is regarded as the compound of the non-homogeneous Poisson process M(t) and the binomial distribution $B(n_d,s_a\cdot p_{II})$; the probability that $n_{fa}$ reject work-in-process flow out of the machine in [t, t+Δt) is

$$P\{M'(t+\Delta t)-M'(t)=n_{fa}\}=\frac{[m'(t+\Delta t)-m'(t)]^{n_{fa}}}{n_{fa}!}\,e^{-[m'(t+\Delta t)-m'(t)]},\qquad(13)$$

where $m'(t)=\int_{0}^{t}s_a\,p_{II}\,\lambda(\tau)\,d\tau$ is the average number of rejects flowing out of the machine in [0, t). In addition, the probability that qualified products are rejected in [t, t+Δt) is

$$P\{D(t+\Delta t)-D(t)=n_{fr}\}=\binom{n_q}{n_{fr}}(s_a p_I)^{n_{fr}}(1-s_a p_I)^{n_q-n_{fr}},\qquad(14)$$

where D(t) is the number of rejected qualified products accumulated up to time t and $\binom{n_q}{n_{fr}}$ is the number of possible combinations of drawing $n_{fr}$ samples from the $n_q$ qualified products. The number of products correctly judged unqualified, $n_{cr}$, and correctly judged qualified, $n_{ca}$, in [t, t+Δt) are expressed as

$$n_{cr}=n_d-n_{fa},\qquad n_{ca}=n_q-n_{fr}.\qquad(15)$$

On this basis, the proportion of work-in-process judged unqualified by the machine through the detection activity in [t, t+Δt) is obtained as

$$q=\frac{n_{cr}+n_{fr}}{n_d+n_q}.\qquad(16)$$

Assume the cost for the machine to inspect a single product is $c_{ins}$; the total detection cost of the machine in [t, t+Δt) is then

$$c_I=c_{ins}\cdot s_a\cdot(n_d+n_q).\qquad(17)$$
Assume the accumulated value increment brought to a single work-in-process by processing from the input node of the manufacturing network to the current node i is $v_{si}$; the value loss caused by a single product at node i being judged defective is likewise $v_{si}$. The average process value increment $v_{ai}$ added to a product by machining a single piece at node i can be defined as

$$v_{ai}=v_{si}-\sum_{k\in N^{-}_{i}}w^{-}_{ki}\cdot v_{sk},\qquad(18)$$

where $v_{sk}$ is the accumulated value increment at node k upstream of node i. In addition, when a type II error occurs while detecting a defective work-in-process, the defective work-in-process flows into downstream machines and consumes more production resources, so the resulting value loss is greater than the loss incurred when the current machine correctly judges it defective. The value loss caused by a reject escaping through a type II error during detection is defined as $v_{II\_i}$, with $v_{II\_i}>v_{si}$. Thus the net value increment $v_{net\_i}$ brought by all work-in-process of machine i in [0, t) is the value gain of all correctly accepted work-in-process minus the value loss of the falsely accepted work-in-process and the value loss of all rejected work-in-process:

$$v_{net\_i}=n_{ca\_i}\cdot v_{ai}-n_{fa\_i}\cdot v_{II\_i}-(n_{cr\_i}+n_{fr\_i})\cdot v_{si},\qquad(19)$$

where $n_{ca\_i}$ is the number of work-in-process correctly accepted by the machine at node i, $n_{fa\_i}$ the number falsely accepted, $n_{cr\_i}$ the number correctly rejected, and $n_{fr\_i}$ the number falsely rejected.
Evaluating the manufacturing-network state: machine performance has different manifestations, such as failures caused by hard or soft faults, idleness caused by starvation or blockage, variation of processing quality, and degradation, all of which affect the performance of the manufacturing network. For a manufacturing network with n machines, a state matrix $S_K$ is constructed for the state at time t:

$$S_K=[Q_K;D_K;H_K;O_K],\qquad(20)$$

where t = KΔt; $Q_K=[q_1,q_2,\dots,q_n]$ is the defect ratio of each machine over [t−Δt, t), representing the quality state of the machine and calculated from equation (16); $D_K=[t_{r1},t_{r2},\dots,t_{rn}]$ represents the degradation state of the machines at time t, measured by the relative time of each machine from its last maintenance to the present, as in equation (5); $H_K=[h_1,h_2,\dots,h_n]$ is the health state of the machines at time t, where $h_i\in\{0,1\}$, 1 denoting a failed machine and 0 a machine without failure; $O_K=[o_1,o_2,\dots,o_n]$ is the idle state of the machines at time t, where $o_i\in\{0,1\}$, 1 denoting an idle state and 0 an operating state.

Evaluating the cumulative performance of the manufacturing network: the cumulative performance evaluation is used to judge whether the maintenance and quality-detection schemes of all machines are cost-effective. From an economic standpoint, the cumulative performance of a manufacturing network can be expressed as net revenue. The reward $r_K$ is defined as the net benefit obtained by the manufacturing network in the process of transitioning from $S_K$ to $S_{K+1}$ in [t, t+Δt), where Δt is the quality-statistics period and also the step size of the state transition. The reward per unit step thus equals the cumulative net value increment after deducting maintenance, detection and decision costs:

$$r_K=\sum_{i=1}^{n}v_{net\_i}-\sum_{i=1}^{n}c_{I\_i}-\sum_{i=1}^{n}c_{m\_i}-c_D,\qquad(21)$$

where $c_{I\_i}$ is the total detection cost of the machine at node i, $c_{m\_i}$ its total maintenance cost, and $c_D$ the decision cost of maintenance and detection activities. To evaluate the cumulative performance from state $S_K$ to $S_{K'}$ in [KΔt, K'Δt), the benefit $G_K$ is defined as the long-term return of the manufacturing network, obtained as the cumulative reward

$$G_K=\sum_{k=K}^{K'-1}\gamma^{\,k-K}\,r_k,\qquad(22)$$

where K' > K and $\gamma\in[0,1]$ is the discount factor.
Joint optimization model based on the Markov decision process: the dynamic reliability and quality models constructed above provide a state-transition model of the manufacturing network. When quality detection and maintenance are regarded as actions, a typical Markov-decision-process control model can be constructed in which the state set, the action set, the reward function and the state-transition model are known. The model aims to find the optimal preventive-maintenance and quality-detection policy function so that the manufacturing network obtains the best long-term return. In practice, corrective maintenance starts automatically when a machine fails and needs no policy support. Thus the policy refers only to the preventive-maintenance and quality-detection behavior at time t = KΔt, denoted $A_K=[A^{ins}_K, A^{pm}_K]$, where $A^{ins}_K$ is the quality-detection action and $A^{pm}_K$ the preventive-maintenance action. Furthermore, the actions of all machines in the manufacturing network at t = KΔt depend on the state $S_K$ and are therefore denoted $A_K=\pi(S_K)$, where $\pi(\cdot)$ is the policy function:

$$\pi^{*}(S_K)=\arg\max_{A_K}Q(S_K,A_K),\qquad(23)$$

where $\pi^{*}(\cdot)$ is the optimal policy function and $Q(\cdot)$ the long-term return of taking action $A_K$ in state $S_K$.

Under the optimal policy, the value function and the Q function satisfy

$$V(S_K)=\max_{A_K}Q(S_K,A_K),\qquad(24)$$

where $V(\cdot)$ is the maximum long-term return in state $S_K$.
Traditional dynamic programming or heuristic algorithms can accomplish optimization tasks in a finite-horizon Markov decision process with enumerable state and action spaces. In the present study, however, the defect ratio $Q_K$ and the degradation state $D_K$ form a continuous state space, and quality detection has a continuous action space: the sampling ratio can take any value in [0,1]. Thus the Markov decision process of the manufacturing network has an immense state space and action space. Moreover, the state and action spaces of the constructed Markov decision process grow exponentially with the number of machines, leading to the "curse of dimensionality". Traditional dynamic programming therefore cannot solve such an infinite-horizon sequential decision problem. Although heuristic algorithms have strong search capability and can provide optimal solutions, their weak transfer-learning ability means that optimal manufacturing-system performance cannot be continuously guaranteed as the manufacturing scene changes.
In current algorithmic research, the learning capability of DRL algorithms has proven effective for infinite-horizon Markov decision processes. Among them, the DQN, Actor-Critic, and DDPG algorithms have been shown to solve different Markov decision processes in maintenance applications. All three apply to Markov decision processes with continuous or discrete state spaces; however, DQN applies only to discrete action spaces, DDPG only to continuous action spaces, and Actor-Critic to both. In addition, DDPG combines a neural-network Q function with the Actor-Critic framework and exhibits better stability than the other two algorithms. The present invention therefore adopts the DDPG algorithm.
The DDPG algorithm is built on an Actor-Critic framework in which two neural networks, with parameters θ^μ and θ^Q, approximate the policy function and the value function. The policy and value functions are realized by an Actor network μ(S) and a Critic network Q(S, A) in the Actor-Critic framework, as shown in FIG. 4. The activation function is the ReLU function, and the hidden layers are fully connected layers. In the Actor network, the state matrix S_K is the input and the corresponding action A_K is the output, so as to maximize the long-term return. Because the final output layer uses the same normalized neurons, the quality detection action A_K^I and the preventive maintenance action A_K^M both take continuous output values in [0,1]. To reconcile the continuous quality inspection action with the discrete preventive maintenance action, this study introduces a preventive maintenance criterion c_d ∈ [0,1] to discretize the maintenance action: preventive maintenance is not performed when the maintenance output is below c_d and is performed when it reaches or exceeds c_d. The state matrix S_K and the action vector A_K are the inputs of the Critic network, which returns the corresponding expected long-term Q value Q(S_K, A_K) as output.
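A minimal Python/PyTorch sketch of such an Actor-Critic pair is given below. The layer widths reuse the hidden-layer sizes listed in the example analysis, but their exact assignment to layers, together with all class and variable names, is our assumption:

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the flattened state S_K to an action in [0,1]^(2n):
    n sampling rates and n pre-discretization maintenance outputs."""
    def __init__(self, state_dim, n_machines):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2 * n_machines), nn.Sigmoid(),  # normalized outputs in [0,1]
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Maps the pair (S_K, A_K) to the expected long-term Q value Q(S_K, A_K)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))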
Based on the constructed neural networks, the Agent interacts iteratively with the manufacturing-network environment. In this interaction, the transition of the manufacturing network from S_K to S_(K+1) during [t, t+Δt) (where t = KΔt) is defined as one step of the DRL algorithm. An Epoch refers to one evaluation phase of the long-term return over [KΔt, K′Δt) and consists of several steps. An Episode refers to one complete execution of the task by the Agent and consists of several Epochs. During training, the Agent iteratively simulates the DDPG algorithm up to a maximum number of steps. At time t = 0 (K = 0), the initial action A_0 is assumed to perform neither quality inspection nor preventive maintenance. Similarly, the initial state S_0 is assumed to be free of defects, machine degradation, machine failures, and machine idleness. The detailed training process is shown in FIG. 5.
Step1: performing actions simulating the operation of the manufacturing network during a [ kΔt, kΔt'), wherein t=kΔt;
step1.1: evaluating the state of the manufacturing network at time t=kΔt: in a learning environment, based on the proposed dynamic reliability and quality model, the machine state at time t=kΔt is evaluated, and then the state S of the manufacturing network at time t=kΔt is evaluated K Make state S K Providing the first observation to the Agent as a manufacturing network;
step1.2: generating an action based on the current policy function pi (S): one action
Figure SMS_84
By inputting state S to an Actor network μ (S) K The DDPG algorithm is obtained by adding a random noise N which is compliant with normal distribution r Attempts to explore better strategies by allowable actions, i.e. A K =π(S K )+N r The method comprises the steps of carrying out a first treatment on the surface of the Then according to the preventive maintenance criterion c d Converting the preventative maintenance action into discrete executable actions {0,1}, wherein 0 means that preventative maintenance is not performed and 1 means that preventative maintenance is performed; to avoidCriterion c d Preferences caused, therefore, define c d =0.5, representing the output range of the Actor network [0,1 ]]Is a median value of (c).
Step1.3: performing an action simulating the operation of the manufacturing network during [ kΔt, kΔt+Δt): action a at time t=kΔt is obtained K The quality detection of the corresponding machine during subsequent operations will immediately employ the new sampling rate
Figure SMS_85
At the same time, preventive maintenance actions are performed on the respective machine, time t of preventive maintenance pm From a normal distribution N (mu) pm2 pm ) Obtaining; if the machine fails in [ K delta t, K delta t + delta t ] period, corrective maintenance is performed immediately upon the failure occurrence for a corrective maintenance time t cm From a normal distribution N (mu) cm2 cm ) Obtaining;
step1.4: assessment of rewards r K : calculating a prize r i2 And i2=i2+1 is updated; if i2<K', returning to the step Step1.1; otherwise, executing Step2;
step2: obtaining a transfer record: after operation at [ K.DELTA.t, K' DELTA.t) time, a long-term return G is obtained K Is capable of; according to the method of step step1.1, the state S at t=k' Δt is obtained K′ The method comprises the steps of carrying out a first treatment on the surface of the Then, the state transition record { S K ,A K ,G K ,S K′ Store in experience buffers;
step3: updating an Actor network and a Critic network: randomly extracting state transition records with batch number M from the experience buffer zone, thereby updating an Actor network mu (S) and a Critic network Q (S, A); at this point, the empirical buffer has the maximum storage, i.e., the buffer length L. When the number of transfer records reaches L, the earliest record will be discarded in order to store a new transfer record.
Step4: judging an ending condition; if Epinode reaches a predetermined maximum training number or a stable long-term return G is obtained K Stopping training; otherwise the simulation period will be updated: [ K.DELTA.t, K ' DELTA.t) ≡ [ K ' DELTA.t, 2K ' DELTA.t-K DELTA.t), and returns to Step1.
Prior to the training steps, a target Actor network μ′(S) and a target Critic network Q′(S, A) are constructed with the same structures as the Actor and Critic networks. Random parameters θ^μ and θ^Q initialize the Actor network μ(S) and the Critic network Q(S, A), and the target networks are initialized with θ^μ′ = θ^μ and θ^Q′ = θ^Q. To improve the stability of the optimization, the target Actor and target Critic networks are updated periodically from the latest Actor and Critic parameters. Based on the state transition records in the experience buffer, the neural networks are updated at every training Epoch; the flow is shown in FIG. 6, and the update algorithm is as follows.
a): randomly draw a batch of M state transition records from the experience buffer:

{S_i3, A_i3, G_i3, S′_i3}, i3 = 1, 2, …, M;
b): for each transition record i3 = 1, 2, …, M, compute the future target action A′_i3 = μ′(S′_i3 | θ^μ′) and the target future long-term return Q′(S′_i3, A′_i3 | θ^Q′), and set the target

y_i3 = G_i3 + γ·Q′(S′_i3, A′_i3 | θ^Q′)

where γ is the discount factor;
c): update the parameters θ^Q of the Critic network Q(S, A) by minimizing the loss function:

Loss = (1/M)·Σ_(i3=1)^M (y_i3 − Q(S_i3, A_i3 | θ^Q))²;
d): update the parameters θ^μ of the Actor network μ(S) by maximizing the expected cumulative long-term return, via the sampled policy gradient:

∇_(θ^μ)J ≈ (1/M)·Σ_(i3=1)^M ∇_A Q(S, A | θ^Q)|_(S=S_i3, A=μ(S_i3)) · ∇_(θ^μ)μ(S | θ^μ)|_(S=S_i3);
e): update the parameters of the target Actor network and the target Critic network:
θ^μ′ = τ·θ^μ + (1 − τ)·θ^μ′ (27)
θ^Q′ = τ·θ^Q + (1 − τ)·θ^Q′ (28)
where τ is a smoothing factor.
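A compact Python/PyTorch rendering of update steps a) through e) is sketched below, reusing the Actor and Critic classes from the earlier sketch; the discount factor gamma and the optimizer handling are our assumptions:

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t, opt_a, opt_c,
                gamma=0.99, tau=1e-3):
    """One DDPG update: critic regression toward the target y, actor ascent
    on Q, then soft (Polyak) update of both target networks."""
    s, a, g, s2 = batch                               # tensors of shape (M, ...)
    with torch.no_grad():
        y = g + gamma * critic_t(s2, actor_t(s2))     # step b): targets y_i3
    critic_loss = F.mse_loss(critic(s, a), y)         # step c): minimize loss
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(s, actor(s)).mean()          # step d): maximize Q
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    for p_t, p in zip(actor_t.parameters(), actor.parameters()):   # step e)
        p_t.data.mul_(1 - tau).add_(tau * p.data)
    for p_t, p in zip(critic_t.parameters(), critic.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)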
Example analysis: manufacturing network and Agent parameters
In this example, a manufacturing network with multiple process routes is modeled as a directed acyclic network of 30 nodes. As shown in FIG. 7, the machines are represented as nodes with different reliability parameters, and the flow of work-in-process between machines is represented by edges. Products flow randomly between the machines, and the quantity in process is limited by machine capacity to keep production balanced. The manufacturing network has 4 source nodes, 4 end nodes, and 51 directed edges, defining a total of 724 different process routes.
This example was trained using MATLAB R2021a. Following learning rates used in related studies, the learning rates of the Actor and Critic networks are 2×10^-3 and 1×10^-3, respectively. Since the number of hidden-layer neurons depends strongly on the dimension of the problem of interest, and with reference to similar DRL studies, the hidden-layer sizes are set to L_s1 = 128, L_s2 = 256, L_s3 = 128, L_a1 = 64, L_a2 = 128, L_c1 = 256, L_c2 = 128, L_c3 = 64, L_1 = 128, L_2 = 256, L_3 = 128. In addition, based on existing research, the other parameters are: experience buffer length L = 1×10^6, minibatch size M = 1280, smoothing factor τ = 1×10^-3, and decision cost c_D = 100.
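Collected in one place, these hyperparameters can be held in a plain configuration mapping, as in the following illustrative Python dictionary (the key names are ours):

config = {
    "lr_actor": 2e-3,
    "lr_critic": 1e-3,
    "hidden_sizes": {"s": (128, 256, 128), "a": (64, 128), "c": (256, 128, 64)},
    "buffer_length": int(1e6),   # experience buffer L
    "batch_size": 1280,          # minibatch M
    "tau": 1e-3,                 # smoothing factor
    "decision_cost": 100,        # c_D
}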
Training the DRL Agent: the DRL Agent is trained to obtain the maximum benefit over 5000 units of time, i.e., (K′ − K)·Δt = 5000. Considering the sensitivity of Agent performance to the step size, three manufacturing scenarios were constructed with different step sizes Δt and decision frequencies K′ − K: {Δt = 50, K′ − K = 100}, {Δt = 100, K′ − K = 50}, and {Δt = 500, K′ − K = 10}, meaning that in each Epoch the Agent generates actions A_K and interacts with the manufacturing network 100, 50, and 10 times, respectively. Based on the constructed manufacturing-network environment and DRL Agent, three training Episodes were run for the three manufacturing scenarios; the benefits obtained are shown in Table 1.
In addition, an optimization model based on a genetic algorithm is used as a benchmark for the proposed method. The algorithm maintains a population of 70 individuals, each with a 1×60 matrix as its chromosome (solution) representing the preventive maintenance and quality inspection actions of the 30 nodes in the constructed manufacturing network. The benefit G_K over 5000 units of time serves as the fitness function for evaluating individuals. For the three manufacturing scenarios, the maximum evolution generation (MEG) is set so that the manufacturing network runs for 7×10^5 steps, equal to the number of training steps under the DRL algorithm: {Δt = 50, MEG = 100}, {Δt = 100, MEG = 200}, and {Δt = 500, MEG = 1000}. The resulting benefits are shown in Table 1.
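The genetic-algorithm baseline can be sketched as follows; this is a generic skeleton under the stated population size and chromosome length, with selection, crossover, and mutation settings that are our assumptions, since the patent does not specify them:

import numpy as np

def ga_baseline(fitness, meg, pop_size=70, chrom_len=60, p_mut=0.05,
                rng=np.random.default_rng(0)):
    """Evolve 1x60 action chromosomes (genes in [0,1] for the 30 nodes'
    maintenance and sampling actions) with fitness = benefit G_K."""
    pop = rng.random((pop_size, chrom_len))
    for _ in range(meg):
        fit = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(fit)[-(pop_size // 2):]]   # truncation selection
        kids = []
        while len(kids) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, chrom_len))
            child = np.concatenate([a[:cut], b[cut:]])      # one-point crossover
            mask = rng.random(chrom_len) < p_mut
            child[mask] = rng.random(int(mask.sum()))       # uniform mutation
            kids.append(child)
        pop = np.vstack([parents, np.array(kids)])
    return max(pop, key=fitness)                            # best individual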
TABLE 1: Benefits obtained under DRL and genetic algorithm training
When the benefits tended to stabilize, their mean and standard deviation (SD) were calculated, as shown in Table 1. The coefficient of variation (CV) in equation (29) is used to analyze the relative dispersion of the benefits, taking scale differences into account.
CV=SD/mean (29)
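For example, the three statistics can be computed from the stabilized tail of a benefit trace as in the following snippet (the window of 100 epochs matches the evaluation described below):

import numpy as np

def dispersion_stats(benefits, window=100):
    """Mean, SD, and CV = SD/mean over the last `window` stabilized epochs."""
    tail = np.asarray(benefits[-window:], dtype=float)
    mean, sd = tail.mean(), tail.std(ddof=1)
    return mean, sd, sd / mean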
The mean, SD, and CV under the DRL algorithm are calculated from the benefits of the last 100 Epochs, where the benefits have stabilized; those of the genetic algorithm are calculated from the benefits of the last 50 generations after stabilization. For the DRL algorithm, the highest benefits at Δt = 50, 100, and 500 are 4.83×10^4, 3.63×10^4, and 6.47×10^4, respectively; for the genetic algorithm, they are 2.63×10^4, 4.25×10^4, and 6.39×10^4. Only at Δt = 100 does the genetic algorithm help the manufacturing network obtain a better average benefit than the DRL algorithm. Compared with the genetic algorithm, the proposed DRL algorithm therefore adapts better to the various manufacturing scenarios of a complex manufacturing network. Moreover, in most of the training runs the coefficient of variation under the genetic algorithm exceeds that under the DRL algorithm, meaning the genetic algorithm achieves less stable benefits. The training trajectories of the highest benefits of the DRL algorithm and the genetic algorithm in the different manufacturing scenarios are shown in FIG. 8.
The benefit traces under the DRL algorithm show that the constructed DRL Agent can improve the long-term return of the manufacturing network through interaction with it, demonstrating the validity of the model. The different patterns of the traces also reveal differences in the training results across scenarios. First, at the highest interaction frequency (step size Δt = 50), the learning process not only converges slowly but also suffers significant losses during training, i.e., the negative returns shown in FIG. 8. Second, the trace for Δt = 100 shows that lower-frequency interaction achieves fast convergence, but the benefit obtained may not be optimal. In summary, owing to the nonlinearity, high dimensionality, and dynamics of the manufacturing network, the learning performance of the DRL algorithm is sensitive to the step size of the manufacturing scenario. Furthermore, the benefit traces indicate that DRL training can incur losses; the Agent should therefore first be optimized in a digital manufacturing-network environment, so as to avoid such losses when it interacts with the real manufacturing system.
Experiments with the trained Agents: using the optimal DRL Agents trained in the three manufacturing scenarios, the invention carried out experiments controlling the maintenance and quality inspection of the manufacturing network over 50000 units of time. The coefficients of variation of the rewards and the cumulative benefits for the different scenarios are shown in FIG. 9. With the help of the DRL Agent, the manufacturing network obtains continuously increasing cumulative benefits in all scenarios. As in training, with interaction step Δt = 500 the manufacturing network obtains the highest cumulative benefit under the control of the proposed DRL Agent. Meanwhile, the coefficients of variation show the relative dispersion of the rewards under the different step sizes: CV_500 < CV_100 < CV_50. The Agent performance is thus consistent with the training results, and the experimental process is stable and effective.
The reward per unit time of each step size is calculated from the experimental results. For the K-th step, the reward per unit time r_u(K) is obtained from equation (30):
r_u(K) = r_K/Δt (30)
FIG. 9 also shows scatter plots of the rewards per unit time obtained by the three trained Agents. When the step size is Δt = 100 or 500, the unit-time rewards are stable and concentrated, and those at Δt = 500 exceed those at Δt = 100. At Δt = 50, however, the unit-time rewards become dispersed and unstable. This phenomenon indicates that, owing to the high nonlinearity, high dimensionality, and dynamics of the manufacturing network, smaller step sizes make it difficult for the Agent to give consistently optimal decisions. It also explains why the cumulative returns of the Agents trained with step sizes 50 and 100 are lower.
Finally, this example analyzes the connectivity of the manufacturing network over 50000 units of time under the intervention of the three trained DRL Agents. Connectivity refers to the probability that at least one process route remains connected between the source nodes and the end nodes of the manufacturing network; the connectivity curves are shown in FIG. 10. At step size Δt = 500, connectivity is worst and fluctuates most, i.e., it is the least stable; at Δt = 100, connectivity is best and fluctuates least. This verifies the notion that high benefit is accompanied by high risk: when the manufacturing network maintains a high long-term return (Δt = 500), it also runs a high risk of outage (worst connectivity).
The invention studies the joint optimization of manufacturing-network maintenance and quality detection based on a DRL algorithm under interacting machine reliability and product quality. First, a mathematical model of the nonlinear, high-dimensional, and dynamic manufacturing-network environment is constructed, providing an effective state transition model for manufacturing-network control. Second, an effective DRL model suited to joint reliability-quality control of the manufacturing network is constructed, capable of simultaneously modeling discrete-continuous mixed states and mixed actions; training and experimental results verify the validity of the proposed DRL model. Compared with the genetic algorithm, the DRL algorithm adapts better to dynamic and diverse manufacturing scenarios. Meanwhile, the proposed model balances well the trade-off between the economic benefit and the operational risk of the manufacturing network.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (6)

1. A manufacturing network maintenance-detection joint optimization method based on deep reinforcement learning, characterized by comprising the following steps:
Step one: for a machine level, under the condition of considering dynamic production speed caused by machine fault shutdown, constructing a machine reliability model considering feeding quality influence and a processing quality model considering machine reliability influence;
Step two: based on the reliability model and the quality model, systematically evaluate the state and performance of the manufacturing network, and construct a joint optimization model of manufacturing network maintenance and quality detection;
Step three: at the system level, taking the economic operation of the manufacturing network as the criterion of policy evaluation, learn the optimal quality detection and maintenance policy for a given manufacturing-network state through the designed deep deterministic policy gradient algorithm.
2. The deep-reinforcement-learning-based manufacturing network maintenance-detection joint optimization method according to claim 1, wherein the reliability model and the quality model are constructed as follows:
calculating the dynamic production speed:
treating each machine as a node, the loop-free manufacturing network of n nodes is modeled as a directed acyclic graph G(V, E), where V = {v_1, v_2, …, v_n} is the set of nodes of the manufacturing network and E ⊆ V×V is the set of directed edges; i and j denote nodes;
when no machine in the manufacturing network is down, the production speed of the machines is defined as the maximum production speed, denoted P_rm = [P_rm1, P_rm2, …, P_rmn]; the actual production speeds of the machines in the manufacturing network are denoted P_ra(t) = [P_ra1(t), P_ra2(t), …, P_ran(t)], satisfying P_ra ≤ P_rm, where P_rmn is the maximum production speed of the n-th machine and P_ran(t) its actual production speed;
when the production speed of node i changes by ΔP_rai(t), the resulting change ΔP_rak(t) in the production speed of its immediate upstream nodes and the change ΔP_raj(t) in the production speed of its immediate downstream nodes are respectively expressed as:

ΔP_rak(t) = p_ki·ΔP_rai(t), k ∈ U_i;

ΔP_raj(t) = p_ij·ΔP_rai(t), j ∈ D_i;

wherein U_i denotes the set of immediate upstream nodes connected to node i, D_i denotes the set of immediate downstream nodes connected to node i, p_ki denotes the probability that work-in-process flows into node i from upstream node k, and p_ij denotes the probability that work-in-process flows out of node i into downstream node j;
calculating dynamic maintenance cost:
when the dynamics of the production speed are considered, the failure rate while processing qualified feed is defined as the base failure rate r_b(t):

r_b(t) = (β/α)·(t_r/α)^(β−1);
wherein α is a scale parameter and β is a shape parameter; t_r = ∫_0^t (P_ra(τ)/P_rm)dτ is the relative running time of the machine, calculated with the maximum production speed as the standard; t is the actual running time of the machine;
considering the possible impact of unqualified feed, the actual failure rate r(t) is defined as:

r(t) = r_b(t) + Σ_(i′=1)^(N(t)) Δr_i′;

wherein Δr_i′ is the failure-rate increment caused by the i′-th unqualified feed unit processed by the machine, and N(t) is the number of unqualified units processed by the machine in [0, t); the probability distribution function F(t) of machine failure is then derived as:
F(t) = 1 − exp(−(t_r/α)^β − ∫_0^t Δr(τ)dτ);
wherein Δr(t) = Σ_(i′=1)^(N(t)) Δr_i′ is the accumulated failure-rate increase from machining unqualified feed up to time t; the integral of Δr(t) is calculated as:

∫_0^t Δr(τ)dτ = Σ_(i′=1)^(N(t)) Δr_i′·(t − t_i′);
wherein t_i′ is the actual occurrence time of the i′-th failure-rate increment;
assuming that the repair times of corrective maintenance and preventive maintenance of a machine follow the normal distributions t_cm ~ N(μ_cm, σ²_cm) and t_pm ~ N(μ_pm, σ²_pm), with μ_cm ≥ μ_pm, and that the maintenance costs per unit time of corrective and preventive maintenance are c_cm and c_pm respectively, the total maintenance cost c_m(t) of the machine in [0, t) can be expressed as:

c_m(t) = c_cm·Σ_(i1=1)^(N_cm(t)) t_cm_i1 + c_pm·Σ_(j1=1)^(N_pm(t)) t_pm_j1;
wherein N_cm(t) is the number of corrective maintenance operations of the machine in [0, t), N_pm(t) is the number of preventive maintenance operations of the machine in [0, t), t_cm_i1 is the time spent by the i1-th corrective maintenance, and t_pm_j1 is the time spent by the j1-th preventive maintenance;
constructing a dynamic processing quality and detection activity model:
defining M(t) as the number of unqualified products produced in [0, t), M(t) follows a non-homogeneous Poisson process with intensity function λ(t):

λ(t) = ω − ε·e^(−δ·r(t));
wherein ω > 0 represents the maximum generation intensity of unqualified products, ε > 0 and δ > 0 are the influence coefficients of the failure rate on the intensity function, and ω − ε < λ(t) < ω < P_ra(t); defining n_d as the number of unqualified products generated by the machine in [t, t+Δt), its probability is:

P{M(t+Δt) − M(t) = n_d} = ([m(t+Δt) − m(t)]^(n_d)/n_d!)·exp(−[m(t+Δt) − m(t)]);
wherein m(t) = ∫_0^t λ(τ)dτ is the expected number of unqualified products produced in [0, t), and Δt is the quality-statistics period; defining n_q as the total number of qualified products processed by the machine in [t, t+Δt), it is expressed as:

n_q = ∫_t^(t+Δt) P_ra(τ)dτ − n_d;
in the detection activity, the misjudgments of unqualified and qualified products are: type I errors, representing false rejection, with probability p_I; and type II errors, representing false acceptance, with probability p_II; assuming the sampling rate of the detection activity is s_a ∈ [0,1], the joint probability that any qualified product incurs a type I error is s_a·p_I, and the joint probability that any unqualified product incurs a type II error is s_a·p_II; the numbers of type I and type II errors are defined as n_fr and n_fa respectively and obey the binomial distributions B(n_q, s_a·p_I) and B(n_d, s_a·p_II); thus the number M′(t) of unqualified products flowing out of the machine is the composition of the non-homogeneous Poisson process M(t) and the binomial distribution B(n_d, s_a·p_II), and the probability that n_fa unqualified work-in-process units flow out of the machine during [t, t+Δt) is:

P{M′(t+Δt) − M′(t) = n_fa} = ([m′(t+Δt) − m′(t)]^(n_fa)/n_fa!)·exp(−[m′(t+Δt) − m′(t)]);
wherein m′(t) is the average number of unqualified work-in-process units flowing out of the machine in [0, t); in addition, the probability that qualified products are rejected during [t, t+Δt) is:

P{D(t+Δt) − D(t) = n_fr} = C(n_q, n_fr)·(s_a·p_I)^(n_fr)·(1 − s_a·p_I)^(n_q−n_fr);
wherein D(t) is the accumulated number of qualified products rejected up to time t, and C(n_q, n_fr) denotes the number of possible combinations of randomly selecting n_fr samples from the n_q qualified products;
the numbers of products processed by the machine in [t, t+Δt) that are correctly judged unqualified, n_cr, and correctly judged qualified, n_ca, are expressed as:

n_cr = n_d − n_fa, n_ca = n_q − n_fr;
on this basis, the proportion of unqualified products distinguished by the machine through the detection activity in [t, t+Δt) is obtained as:

q = n_cr/(n_cr + n_ca);
assuming the cost of the machine inspecting a single product is c_ins, the total detection cost of the machine during [t, t+Δt) is:

c_I = c_ins·s_a·(n_q + n_d);
assuming the accumulated value increment brought by processing a single work-in-process unit from the input node of the manufacturing network to the current processing node i is v_si, the average process value increment v_ai brought by node i machining a single unit can be defined as:

v_ai = v_si − Σ_(k∈U_i) p_ki·v_sk;

wherein v_sk denotes the accumulated value increment at the upstream node k of node i;
the value loss caused by an unqualified unit that is falsely accepted through a type II error during the detection process is defined as v_loss_i; thus, the net value increment v_net_i brought by all work-in-process of machine i in [0, t) is the sum of the value gained by all correctly accepted work-in-process, the value lost through falsely accepted work-in-process, and the value lost through all rejected work-in-process:

v_net_i = n_ca_i·v_ai − n_fa_i·v_loss_i − (n_cr_i + n_fr_i)·v_si;

wherein n_ca_i denotes the number of work-in-process units correctly accepted by the machine corresponding to node i, n_fa_i the number falsely accepted, n_cr_i the number correctly rejected, and n_fr_i the number falsely rejected.
3. The deep-reinforcement-learning-based manufacturing network maintenance-detection joint optimization method according to claim 2, wherein the state and performance of the manufacturing network are systematically evaluated as follows:
for a manufacturing network with n machines, a state matrix S_K is constructed as the state at time t:

S_K = [Q_K; D_K; H_K; O_K];
wherein t = KΔt; Q_K = [q_1, q_2, …, q_n] is the quality state of each machine in [t−Δt, t); D_K = [t_r1, t_r2, …, t_rn] represents the degradation state of the machines at time t; H_K = [h_1, h_2, …, h_n] is the health state of the machines at time t, where h_i ∈ {0,1}, with 1 representing a failed machine and 0 a non-failed machine; O_K = [o_1, o_2, …, o_n] indicates the idle state of the machines at time t, where o_i ∈ {0,1}, with 1 representing the idle state and 0 the operating state;
defining the reward r_K as the net benefit obtained by the manufacturing network during [t, t+Δt) in transitioning from state S_K to S_(K+1):

r_K = Σ_(i=1)^n Δv_net_i − Σ_(i=1)^n c_I_i − Σ_(i=1)^n c_m_i − c_D;

wherein Δv_net_i is the net value increment of the machine corresponding to node i during [t, t+Δt), c_I_i is the total detection cost of the machine corresponding to node i, c_m_i is its total maintenance cost, and c_D is the decision cost of the maintenance and detection actions;
to evaluate the cumulative performance from state S_K to S_K′ over [KΔt, K′Δt), the benefit G_K is defined as the long-term return of the manufacturing network, obtained by accumulating the rewards:

G_K = Σ_(k=K)^(K′−1) r_k;

wherein K′ > K.
4. The deep-reinforcement-learning-based manufacturing network maintenance-detection joint optimization method according to claim 3, wherein the joint optimization model of manufacturing network maintenance and quality detection is constructed as follows:
regarding quality inspection and preventive maintenance as actions, denoted A_K = [A_K^I; A_K^M], wherein A_K^I is the quality detection action and A_K^M is the preventive maintenance action;

the actions of all machines in the manufacturing network at t = KΔt depend on the state S_K and are therefore denoted A_K = π(S_K), where π(·) represents the policy function:

π*(S_K) = argmax_(A_K) Q(S_K, A_K);

wherein π*(·) represents the optimal policy function and Q(·) represents the long-term return function of taking action A_K in state S_K;
under the optimal policy, the value function and the Q function satisfy:

V(S_K) = max_(A_K) Q(S_K, A_K) = Q(S_K, π*(S_K));

wherein V(·) represents the maximum long-term return in state S_K.
5. The deep-reinforcement-learning-based manufacturing network maintenance-detection joint optimization method according to claim 4, wherein the designed deep deterministic policy gradient (DDPG) algorithm learns the optimal quality detection and maintenance policy for a given manufacturing-network state as follows:
step1: performing current actions, simulating operation of the manufacturing network in [ kΔt, kΔt') time, wherein t=kΔt;
step1.1: evaluating the state of the manufacturing network at time t=kΔt: in a learning environment, based on the proposed dynamic reliability and quality model, the machine state at time t=kΔt is evaluated, and then the state S of the manufacturing network at time t=kΔt is evaluated K Make state S K Providing the first observation to the Agent as a manufacturing network;
step1.2: generating an action based on the current policy function pi (S): one action
Figure FDA0004155778540000051
By inputting state S to an Actor network μ (S) K The DDPG algorithm is obtained by adding a random noise N which is compliant with normal distribution r Attempts to explore better strategies by allowable actions, i.e. A K =π(S K )+N r The method comprises the steps of carrying out a first treatment on the surface of the Then according to the preventive maintenance criterion c d Converting the preventative maintenance action into discrete executable actions {0,1}, wherein 0 means that preventative maintenance is not performed and 1 means that preventative maintenance is performed;
step1.3: performing an action simulating the operation of the manufacturing network during [ kΔt, kΔt+Δt): action a at time t=kΔt is obtained K The quality detection of the corresponding machine during subsequent operations will immediately employ the new sampling rate
Figure FDA0004155778540000052
At the same time, preventive maintenance actions are performed on the respective machine, time t of preventive maintenance pm From a normal distribution N (mu) pm2 pm ) Obtaining; if the machine fails in [ K delta t, K delta t + delta t ] period, corrective maintenance is performed immediately upon the failure occurrence for a corrective maintenance time t cm From a normal distribution N (mu) cm2 cm ) Obtaining;
step1.4: assessment of rewards r K : calculating a prize r i2 And i2=i2+1 is updated; such asFruit i2<K', returning to the step Step1.1;
otherwise, executing Step2;
step2: acquiring a state transition record: after operation at [ K.DELTA.t, K' DELTA.t) time, a long-term return G is obtained K The method comprises the steps of carrying out a first treatment on the surface of the According to the method of step step1.1, the state S at t=k' Δt is obtained K′ The method comprises the steps of carrying out a first treatment on the surface of the Then, a state transition record { S } is obtained K ,A K ,G K ,S K′ Store in experience buffers;
step3: updating an Actor network and a Critic network: randomly extracting a small batch of M transfer records from the experience buffer, thereby updating an Actor network mu (S) and a Critic network Q (S, A); at this time, the maximum memory capacity of the experience buffer is L, and when the number of transfer records reaches L, the earliest record is discarded;
Step4: judging the ending condition: if the training number Episode reaches a predetermined maximum training number or a stable long-term return G is obtained K Stopping training; otherwise the simulation period will be updated: [ K.DELTA.t, K ' DELTA.t) ≡ [ K ' DELTA.t, 2K ' DELTA.t-K DELTA.t), and returns to Step1.
6. The deep-reinforcement-learning-based manufacturing network maintenance-detection joint optimization method according to claim 5, wherein the Actor network and the Critic network are updated as follows:
a): randomly draw a batch of M transition records from the experience buffer:

{S_i3, A_i3, G_i3, S′_i3}, i3 = 1, 2, …, M;

b): for each transition record i3 = 1, 2, …, M, compute the future target action A′_i3 = μ′(S′_i3 | θ^μ′) and the target future long-term return Q′(S′_i3, A′_i3 | θ^Q′), and set the target

y_i3 = G_i3 + γ·Q′(S′_i3, A′_i3 | θ^Q′);
c): update the parameters θ^Q of the Critic network Q(S, A) by minimizing the loss function:

Loss = (1/M)·Σ_(i3=1)^M (y_i3 − Q(S_i3, A_i3 | θ^Q))²;
d): update the parameters θ^μ of the Actor network μ(S) by maximizing the expected cumulative long-term return, via the sampled policy gradient:

∇_(θ^μ)J ≈ (1/M)·Σ_(i3=1)^M ∇_A Q(S, A | θ^Q)|_(S=S_i3, A=μ(S_i3)) · ∇_(θ^μ)μ(S | θ^μ)|_(S=S_i3);
e): update the parameters of the target Actor network and the target Critic network:
θ^μ′ = τ·θ^μ + (1 − τ)·θ^μ′;
θ^Q′ = τ·θ^Q + (1 − τ)·θ^Q′;
where τ is a smoothing factor.
CN202310333773.0A 2023-03-30 2023-03-30 Manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning Pending CN116384969A (en)

Publications (1)

Publication Number Publication Date
CN116384969A true CN116384969A (en) 2023-07-04


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078236A (en) * 2023-10-18 2023-11-17 广东工业大学 Intelligent maintenance method and device for complex equipment, electronic equipment and storage medium
CN117078236B (en) * 2023-10-18 2024-02-02 广东工业大学 Intelligent maintenance method and device for complex equipment, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination