CN116384969A

CN116384969A - Manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning

Info

Publication number: CN116384969A
Application number: CN202310333773.0A
Authority: CN
Inventors: 叶正梗; 蔡志强; 司书宾; 王鑫; 柯勇伟; 李丁林; 周福礼
Original assignee: Zhengzhou University
Current assignee: Zhengzhou University
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-07-04

Abstract

The invention provides a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning, which comprises the following steps: firstly, for a machine level, under the condition of considering dynamic production speed caused by machine fault shutdown, constructing a machine reliability model considering feeding quality influence and a processing quality model considering machine reliability influence; secondly, performing system evaluation on the manufacturing network state and performance based on the reliability model and the quality model; constructing a manufacturing network maintenance and quality detection combined optimization model; finally, at the system level, the economic operation of the manufacturing network is used as the standard of strategy evaluation, and the optimal strategy of quality detection and maintenance under the given state of the manufacturing network is learned through a designed depth deterministic strategy gradient algorithm. The invention can well balance the contradiction between the economic benefit and the operation risk of the manufacturing network, and has better adaptability to dynamic and diversified manufacturing scenes.

Description

Manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of quality and reliability, in particular to a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning.

Background

Mass-customized production gives rise to higher production flexibility but increases the complexity of the manufacturing system. In this context, more high-tech manufacturing equipment with flexibility is continuously presented, which greatly enriches the process routes of products, so that the manufacturing system takes machines as nodes and takes product flows as edges, and the characteristics of a complex network are presented. The improvement of machine flexibility and structural complexity strengthens the nonlinear characteristics of the manufacturing system, which increases the operation management difficulty of the network structure manufacturing system and weakens the profit growth brought by the flexible machine.

The operation control of the manufacturing system refers to optimizing the system performance by a production management method, wherein machine maintenance and quality inspection of work in process are two important management measures. The integrated optimization is performed through joint control of production scheduling, product quality, machine reliability and the like, and the integrated optimization becomes the first choice for improving the performance of a manufacturing system. Existing research has attempted different forms of integration, such as single Preventative Maintenance (PM), integration of production planning with maintenance, or integration of production planning, maintenance and quality inspection. Currently, research on joint optimization of manufacturing systems is mainly focused on simulation-based methods. In addition, traditional dynamic planning, integer planning and heuristic algorithms are also important approaches to solve this problem. These traditional optimization methods have great potential in optimizing small-scale systems, such as stand-alone manufacturing systems, simple serial manufacturing systems, or unitized manufacturing systems. However, these methods remain significantly inadequate for optimization of large-scale manufacturing systems with complex system structures. While heuristic algorithms, such as genetic algorithms, can effectively optimize multi-stage serial or parallel fabrication systems, their effectiveness in large-scale fabrication systems with complex system structures has not been verified.

In many manufacturing systems, the diversification of process routes makes them feature a complex network. In addition to the large system scale and structural complexity, interaction is another key factor that makes manufacturing network joint optimization difficult, especially between work-in-process quality and machine reliability. Currently, the interplay between reliability and quality in manufacturing systems is of interest to many researchers. For example, in joint optimization of production and maintenance, literature [ Hajej Z, rezg N, gharbi a.quality issue in forecasting problem of production and maintenance policy for production unit.international Journal of Production research.2018;56:6147-63 ] proposes a cumulative reject rate function affected by machine failure rate; in maintenance optimization of serial multi-station manufacturing systems, literature [ methou X, lu b.predictive maintenance scheduling for serial multi-station manufacturing systems with interaction between station reliability and product quality. Computers & Industrial engineering 2018;122:283-91 model the interplay between machine reliability and product quality; in performance evaluation of automated production lines and series-parallel manufacturing systems, literature [ Ye Z, cai Z, si S, zhang S, yang h.computing Failure Modeling for Performance Analysis of Automated Manufacturing Systems with Serial Structures and Imperfect Quality instrumentation.ieee Transactions on Industrial information.2020; 16:6476-86] a decision graph model and a stochastic model were constructed to characterize the interplay between reliability and quality. Furthermore, literature [ Wang L, bai Y, huang N, wang Q.Fractal-based Reliability Measure for Heterogeneous Manufacturing networks.IEEE Transactions on Industrial information.2019; 15:6407-14] reliability metrics for manufacturing networks are proposed by a fractal-based approach. These studies have attempted to solve the above-mentioned problems of interactive behavior or large scale, but they have studied only one aspect of the problems of interactive behavior and structural complexity independently, and have not fully grasped the characteristics of the manufacturing network.

As mentioned above, joint optimization of manufacturing networks with complex structures and interaction behavior is a tricky problem, especially when we try to solve it with traditional methods. The conventional method has better control or optimization effect on specific manufacturing scenes, but has poorer control or optimization effect in dynamic and diversified manufacturing scenes. However, the development of Artificial Intelligence (AI) has brought promise to effectively address this real-time control and optimization problem. The method based on artificial intelligence can realize the manufacture of higher added value and create higher flexibility for intelligent customization factories. Therefore, it is widely used in the related art: such as risk assessment, intelligent maintenance, quality control and dynamic loading strategies for repairable systems. Reinforcement learning has achieved dramatic achievements in various control tasks for non-linear, high-and dynamic systems in many artificial intelligence-based approaches. Because of its effectiveness, many reinforcement learning-based control studies are used in optimization of dynamic systems, such as maintenance optimization of serial production lines, multi-state engineering systems, production-maintenance joint optimization for degenerate manufacturing systems, and coverage and connectivity maintenance of wireless sensor networks.

In the study of manufacturing system operation control, the complexity of the structure is not sufficiently emphasized. In addition, simplifying the system state to a single continuous type or discrete type is not compatible with the actual production state. At the same time, discrete maintenance activities have become a mainstream choice of operational settings, such as utilizing discrete maintenance activities to build joint control problems in connection with production scheduling. However, current research does not take into account the continuity characteristics of the quality detection activity. The selection of the reinforcement learning algorithm reveals the reason for this gap. The existing manufacturing system operation control algorithm is better in a limited-range Markov decision process, but has the defects of a large-scale manufacturing system with complex structure and various state behaviors. In addition, with more advanced reinforcement learning algorithms, research into non-manufacturing system control provides a good example of a large-scale problem with state-behavior diversity. For example: application of actor-critic algorithm and depth deterministic strategy gradient algorithm of 13-component system in 14-component parallel-series system. In addition, there are students who consider inspection activities on some action settings, but such repair inspection is quite different from quality inspection activities of manufacturing systems. In these studies, detection is considered to be just one discrete action. However, in actual production, the quality inspection in the manufacturing system has continuous action space, and the sampling inspection can be performed on the product according to any ratio of [0,1 ]. Most importantly, research into non-manufacturing systems does not take into account the interaction behavior between components. Therefore, whether these advanced algorithms are suitable for large-scale manufacturing networks with reliability-quality interactions remains to be further explored.

Disclosure of Invention

In order to balance the contradiction between the economic benefit and the operation risk of the manufacturing network, the invention provides a manufacturing network maintenance-detection combined optimization method based on Deep Reinforcement Learning (DRL), and the maintenance and quality detection optimization of a large-scale manufacturing network with reliability-quality interaction behavior is realized through the reliability-quality combined control based on machine learning.

The technical scheme of the invention is realized as follows:

a manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning comprises the following steps:

step one: for a machine level, under the condition of considering dynamic production speed caused by machine fault shutdown, constructing a machine reliability model considering feeding quality influence and a processing quality model considering machine reliability influence;

step two: performing a systematic evaluation of manufacturing network status and performance based on the reliability model and the quality model; constructing a manufacturing network maintenance and quality detection combined optimization model;

step three: at the system level, the economic operation of the manufacturing network is used as the standard of strategy evaluation, and the optimal strategy of quality detection and maintenance under the given state of the manufacturing network is learned through a designed depth deterministic strategy gradient algorithm.

At the machine level, the construction method of the reliability model and the quality model comprises the following steps:

calculating the dynamic production speed:

considering each machine as a node, modeling a loop-free manufacturing network of n nodes with a directed loop-free graph G (V, E), where v= { V ₁ ,v ₂ ,…,v _n Is a set of nodes of a manufacturing network,

to make a set of directed edges in a network; i. j are nodes;

when there is no machine downtime in the manufacturing network, the production speed of the machine is defined as the maximum production speed, denoted as P _rm ＝[P _rm1 ,P _rm2 ,…,P _rmn ]The method comprises the steps of carrying out a first treatment on the surface of the The actual production speed of the machine in the manufacturing network is denoted as P _ra (t)＝[P _ra1 (t),P _ra2 (t),…,P _ran (t)]And meet P _ra ≤P _rm The method comprises the steps of carrying out a first treatment on the surface of the Wherein P is _rmn Represents the maximum production speed of the nth machine, P _ran (t) represents the actual production speed of the nth machine;

when the production speed of the node i changes by delta P _rai (t) the amount of change DeltaP in the production speed of the node immediately upstream thereof _rak (t) and the variation DeltaP of the production speed of the immediately downstream node _raj (t) are respectively expressed as:

wherein, the liquid crystal display device comprises a liquid crystal display device,

representing the set of immediately upstream nodes connected to node i,/->

Representing a set of immediately downstream nodes connected to node i; />

Representing the probability of all in-process flows into node i from upstream node k, +.>

Representing the probability of node i flowing out into downstream node j in the product;

Calculating dynamic maintenance cost:

when considering the dynamics of the production speed, the failure rate in processing a qualified feed is defined as the base failure rate r _b (t)：

r _b (t)＝(β/α)·(t _r /α) ^β-1 ；

Wherein α is a scale parameter and β is a shape parameter;

is the relative run time of the machine calculated with the maximum production speed as the standard; t is the actual time of machine operation;

considering the impact that may be caused by an off-grade feed, the actual failure rate r (t) is defined as:

wherein Deltar _i′ For the cumulative failure rate increase caused by machine reject feed, N (t) is the number of reject runs processed by the machine in [0, t); the probability distribution function F (t) of the machine failure is obtained through deduction:

F(t)＝1-exp(-(t _r /α) ^β -∫Δr(t)dt)；

an accumulated failure rate increase for machining reject feed material up to time t; the integral of Δr (t) is calculated as:

wherein t is _i′ The actual occurrence time of the i' th failure rate increment;

suppose that the repair times for corrective and preventive repair of a machine are respectively subject to the following normal distributions:

and->

And mu _cm ≥μ _pm ，/>

The maintenance cost per unit time of the corrective maintenance and the preventive maintenance is c respectively _cm And c _pm Then the total maintenance cost c of the machine in time [0, t ] _m (t) can be expressed as:

wherein N is _cm (t) is the number of corrective repairs to the machine in [0, t ], N _pm (t) is the number of preventive maintenance operations of the machine in a time of [0, t ], t _{cm_i1} Time spent for machine ith 1 st corrective maintenance, t _{pm_j1} Time spent for the j1 st preventive maintenance of the machine;

constructing a dynamic processing quality and detection activity model:

defining M (t) as the number of rejects produced in a [0, t ] time, M (t) satisfying the non-homogeneous poisson process of the intensity function λ (t):

λ(t)＝ω-ε·e ^-δ·r(t) ；

wherein omega>0 represents the maximum reject generation strength ε>0 and delta>0 are the influence coefficients of failure rate on the intensity function, and omega-epsilon<λ(t)<ω<P _ra (t); definition n _d Is the number of unqualified products generated by the machine in the time of [ t, t+delta t ]The probability of the quantity is:

is an expected value of the output of the unqualified product in the [0, t ] time period, and delta t is a quality statistics period; definition n _q The total number of acceptable products processed by the machine during the [ t, t+Δt ] period is expressed as:

in the detection activity, the error judgment of the defective product and the qualified product is as follows: type I error, representing error rejection, probability p _I The method comprises the steps of carrying out a first treatment on the surface of the Class II errors, indicating acceptance of errors, probability p _II The method comprises the steps of carrying out a first treatment on the surface of the Assume that the sampling rate in the detection activity is s _a ，s _a ∈[0,1]The joint probability of I-type error of any qualified product is s _a ·p _I The joint probability of any disqualified product to generate II-type error is s _a ·p _II The method comprises the steps of carrying out a first treatment on the surface of the Defining the times of the I type error and the II type error as n respectively _fr And n _fa And obeys the binomial distribution B (n) _q ,s _a ·p _I ) And B (n) _d ,s _a ·p _II ) The method comprises the steps of carrying out a first treatment on the surface of the Thus, the number of unacceptable products M' (t) exiting the machine is the non-homogeneous poisson process M (t) and the binomial distribution B (n _d ,s _a ·p _II ) And n is _fa The probability of a failed work-in-process flowing out of the machine during the [ t, t+Δt ] time period is:

wherein m' (t) is the average reject work-in-process number flowing out of the machine in the [0, t ] time period; in addition, the probability that the pass product is rejected during the [ t, t+Δt ] period is:

wherein D (t) is the rejected good product accumulated to time t,

represents a value represented by n _q Randomly extracting n from the unqualified products _fr The number of possible combinations of the individual samples;

the product processed by the machine in the time period of [ t, t+delta t ] is correctly judged as the number n of unqualified products _cr And the number n of the qualified products correctly determined _ca Expressed as:

on the basis, the proportion of unqualified products, which are distinguished by the machine through detecting the activity in the time of [ t, t+delta t ], is obtained as follows:

assume that the machine detects a single product at a detection cost c _ins The total cost of the machine detection during the time [ t, t+Δt):

assume that the cumulative value increment incurred by the process of a single work-in-process from a manufacturing network input node to a current process node i is v _si The average process value increment v brought by machining single piece corresponding to node i in product _ai Can be defined as:

wherein v is _sk Representing the accumulated value of node k upstream of node iAn increment;

the value loss caused by disqualification of class II errors in the detection process during the production process is defined as

Thus, all work in process for machine i in time [0, t ] brings a net value increment v _{net_i} The sum of the value loss for all correctly accepted work-in-process, the value loss for the incorrectly accepted work-in-process, and the value loss for all rejected work-in-process:

wherein n is _{ca_i} Indicating the number of work-in-process correctly accepted by the machine corresponding to node i, n _{fa_i} Indicating the number of work-in-process (n) of machine error acceptance corresponding to node i _{cr_i} Indicating the number of work-in-process (n) of correct rejection of the machine corresponding to node i _{fr_i} And respectively representing the number of work-in-process of machine error rejection corresponding to the node i.

The method for performing system evaluation on the manufacturing network state and performance comprises the following steps:

for a manufacturing network with n machines, a state matrix S is constructed ^K State at time t:

S ^K ＝[Q ^K ；D ^K ；H ^K ；O ^K ]；

wherein, t=kΔt,

mass state for each machine in [ t- Δt, t); d (D) ^K ＝[t _r1 ,t _r2 ,…,t _rn ]Representing the degradation state of the machine at time t, H ^K ＝[h ₁ ,h ₂ ,…,h _n ]Is the health state of the machine at the time t, h _i 1 in e {0,1} represents a failure state of the machine, 0 represents a no failure state of the machine; o (O) ^K ＝[o ₁ ,o ₂ ,…,o _n ]Indicating the idle state of the machine at time t, o _i A1 in E {0,1} represents idleState, 0 represents an operating state;

definition of rewards r _K From state S in [ t, t+Δt ] time for manufacturing the network ^K Transition to S ^K+1 Net benefit from the process of (2):

wherein c _{I_i} Is the total detection cost of the machine corresponding to node i, c _{m_i} Is the total maintenance cost of the machine corresponding to node i, c _D Is the decision cost for maintenance and detection actions;

to evaluate the slave state S in [ K.DELTA.t, K' DELTA.t ] time ^K To S ^K′ Cumulative performance of (C) will benefit G _K Defined as the long-term return of the manufacturing network, obtained by jackpot calculation:

wherein K' > K.

The construction method of the manufacturing network maintenance and quality detection combined optimization model comprises the following steps:

regarding quality inspection and preventive maintenance as actions, noted as

Wherein (1)>

For quality detection action, < >>

Is a preventive maintenance action;

the actions of all machines in the manufacturing network at t=kΔt depend on the state S ^K Therefore, it is denoted as A ^k ＝π(S ^K ) Where pi (·) represents the policy function:

wherein pi ^* (. Cndot.) represents the optimal policy function, Q (. Cndot.) represents the state S ^K Take A ^K A long term return on motion function;

under the optimal strategy, the cost function and the Q function satisfy:

wherein V (·) represents the state S ^K Maximum long-term return at that time.

The designed method for learning the optimal strategy of quality detection and maintenance under the given manufacturing network state by using the depth deterministic strategy gradient algorithm (DDPG) comprises the following steps:

step1: performing current actions, simulating operation of the manufacturing network in [ kΔt, kΔt') time, wherein t=kΔt;

step1.1: evaluating the state of the manufacturing network at time t=kΔt: in a learning environment, based on the proposed dynamic reliability and quality model, the machine state at time t=kΔt is evaluated, and then the state S of the manufacturing network at time t=kΔt is evaluated ^K Make state S ^K Providing the first observation to the Agent as a manufacturing network;

step1.2: generating an action based on the current policy function pi (S): one action

By inputting state S to an Actor network μ (S) ^K The DDPG algorithm is obtained by adding a random noise N which is compliant with normal distribution _r Attempts to explore better strategies by allowable actions, i.e. A ^K ＝π(S ^K )+N _r The method comprises the steps of carrying out a first treatment on the surface of the Then according to the preventive maintenance criterion c _d Converting the preventative maintenance action into discrete executable actions {0,1}, wherein 0 means that preventative maintenance is not performed and 1 means that preventative maintenance is performed;

Step1.3: performing an action simulating the operation of the manufacturing network during [ kΔt, kΔt+Δt): action a at time t=kΔt is obtained ^K Later, during subsequent operationsThe quality detection of the machine will immediately adopt the new sampling proportion

At the same time, preventive maintenance actions are performed on the respective machine, time t of preventive maintenance _pm From a normal distribution N (mu) _pm ,σ ² _pm ) Obtaining; if the machine fails in [ K delta t, K delta t + delta t ] period, corrective maintenance is performed immediately upon the failure occurrence for a corrective maintenance time t _cm From a normal distribution N (mu) _cm ,σ ² _cm ) Obtaining;

step1.4: assessment of rewards r _K : calculating a prize r _i2 And i2=i2+1 is updated; if i2<K', returning to the step Step1.1; otherwise, executing Step2;

step2: acquiring a state transition record: after operation at [ K.DELTA.t, K' DELTA.t) time, a long-term return G is obtained _K The method comprises the steps of carrying out a first treatment on the surface of the According to the method of step step1.1, the state S at t=k' Δt is obtained ^K′ The method comprises the steps of carrying out a first treatment on the surface of the Then, a state transition record { S } is obtained ^K ,A ^K ,G _K ,S ^K′ Store in experience buffers;

step3: updating an Actor network and a Critic network: randomly extracting a small batch of M transfer records from the experience buffer, thereby updating an Actor network mu (S) and a Critic network Q (S, A); at this time, the maximum memory capacity of the experience buffer is L, and when the number of transfer records reaches L, the earliest record is discarded;

Step4: judging the ending condition: if the training number Episode reaches a predetermined maximum training number or a stable long-term return G is obtained _K Stopping training; otherwise the simulation period will be updated: [ K.DELTA.t, K ' DELTA.t) ≡ [ K ' DELTA.t, 2K ' DELTA.t-K DELTA.t), and returns to Step1.

The updating method of the Actor network and the Critic network comprises the following steps:

a) The method comprises the following steps Randomly extracting a batch of M transfer records from the experience buffer:

s3.3.2: for turning aroundShift records i3=1, 2, …, M, calculate future target actions

And target future long-term return->

And sets the objective +.>

b) The method comprises the following steps Updating parameter θ of Critic network Q (S, A) by minimizing loss function _Q ：

c) The method comprises the following steps Updating the parameter θ of the Actor network μ (S) by maximizing the expected cumulative long-term return _μ ：

d) The method comprises the following steps Updating parameters of a target actor network and a target Critic network:

θ _μ′ ＝τθ _μ +(1-τ)θ _μ′ ；

θ _Q′ ＝τθ _Q +(1-τ)θ _Q′ ；

where τ is a smoothing factor.

Compared with the prior art, the invention has the beneficial effects that:

1) The invention provides a mathematical model for constructing nonlinear, high-dimensional and dynamic environments of a manufacturing network, and provides an effective state transition model for controlling the manufacturing network.

2) On the basis of considering reliability-quality interaction behavior, an effective DRL model suitable for manufacturing network reliability-quality joint control is constructed, and a physical system simultaneously comprising a discrete-continuous mixed state and a mixed action can be modeled; the DRL model has better adaptability to dynamic and diversified manufacturing scenes.

3) In order to realize the reliability-quality joint control of the dynamic manufacturing network, the invention constructs a machine learning model driven by the deep neural network based on a DDPG algorithm under the established maintenance and quality detection mixed action space model; according to learning results in different manufacturing scenes, the method can realize an optimal manufacturing system control strategy.

4) The model provided by the invention can well balance the contradiction between the economic benefit and the operation risk of the manufacturing network.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a fabrication network formed by multiple process paths.

FIG. 2 is a cascade effect of machine faults in a network structure manufacturing system node.

Fig. 3 is a flow chart of the present invention.

Fig. 4 is a neural network structure of the DDPG algorithm of the present invention; wherein, (a) an Actor network (b) a Critic network.

Fig. 5 is a training flow chart of the DRL model based on the DDPG algorithm of the present invention.

Fig. 6 is a flowchart of updating the Actor and Critic networks in the DDPG algorithm of the present invention.

Fig. 7 is a directed acyclic manufacturing network of an example of the invention.

FIG. 8 is a training trajectory diagram of the highest benefits of the DRL model and genetic algorithm of the present invention under different manufacturing scenarios; wherein (a) step Δt=50 (b) step Δt=100 (c) step Δt=500.

FIG. 9 is a scatter plot of rewards per unit time obtained by three trained agents.

FIG. 10 is a graph of manufacturing network connectivity under different training DRL Agent controls; wherein (a) step Δt=50 (b) step Δt=100 (c) step Δt=500.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

In a flexible manufacturing environment, the variety of process routes can result in random flow of the article, which can render the manufacturing system networked in character. As shown in fig. 1, each machine may be considered a node of the manufacturing network and the process flow may be considered an edge connecting the nodes. In this case, for a given machine, its upstream machine refers to all machines preceding it in the relevant process route, and its downstream machine refers to all machines following it in the relevant process route. In a multi-stage manufacturing network, the flow of the product brings the stock to each machine at different stages, resulting in interactions between machine reliability and feed quality. Wherein an off-grade feed increases the risk of machine failure; conversely, degraded machines will be more likely to process out of order and become off-grade feed to the immediate downstream machine (NDM). When no intervention is performed, the interaction between machine reliability and feed quality will propagate along the process route, with a higher risk of failure of the machine downstream of the process route.

In a manufacturing system, there are two failure modes for a machine: failure to immediately shut down the machine upon failure, defined as a hard failure; failure that is usually not directly found and corrected only upon inspection is defined as soft failure. Hard failure can lead to machine downtime, bringing its production speed to zero. Soft failure is caused by degradation of components, which is induced when the cumulative degradation level first exceeds a threshold. Before soft failure occurs, the machine degradation only affects the machine's processing quality and not its production speed. In real world industrial production, both machine maintenance and quality inspection are important activities that improve the performance of manufacturing systems, the former allowing the machine to recover from failure conditions, and the latter allowing the loss to be timely reduced by shutting off the flow of reject between different machines. When a hard failure occurs or a soft failure is detected, corrective Maintenance (CM) can be implemented to repair the machine that is failing. Furthermore, when degradation does not cause soft failure, preventative maintenance may be performed to recover the degraded machine. On the other hand, preventative maintenance may reduce the probability of hard failure occurring. And, preventive maintenance can also improve the processingquality of lathe.

In the above case, the control of the operation of the manufacturing network would be a complex problem. When a machine fails and repair is not completed in time, the effect of the machine failure will propagate to the machines upstream and downstream. The production speed of these machines is then reduced. Furthermore, due to congestion or starvation, machines may be idle (referred to as idle machines). As shown in fig. 2, because M5 is the only immediately downstream machine of M3, machine M3 will be idle when machine M5 fails in node 5. Also, because the M6 machine can continue to operate with a reduced production speed, the production speeds of the M4 and M7 machines may be reduced accordingly. Further, further downstream and upstream machines may be impacted prior to repair of the failed machine. When the failed machine is repaired, its own and the production speed of the affected machine will also resume. However, variations in production speed can cause more uncertainty in machine degradation, impairing the effectiveness of pre-established maintenance and quality inspection plans.

In this context, it is necessary how to achieve optimal control of reliability and quality in a manufacturing network based on dynamic machine maintenance and quality inspection. Accordingly, the present invention proposes a deep reinforcement learning based joint optimization method for manufacturing network repair-inspection based on the assumption that (1. Discrete part manufacturing process is considered. 2. Each machine has quality inspection activity, sampling inspection of work in process at any ratio of 0%,100% ]. 3. Work In Process (WIP) has two quality states, quality good (no defects) and quality bad (rejects), which can be distinguished by the quality inspection process. 4. The machine has two failure modes, hard failure and soft failure. 5. Two repair activities, corrective and preventive maintenance, will be performed, and both can be brought back to a new state.) based on the research method to explore the optimal reliability and quality control strategies under economical operation of the manufacturing network, as shown in FIG. 3. First, for the machine level, a reliability model (Q-R effect) taking into account feed quality effects and a quality model (R-Q effect) taking into account machine reliability effects are constructed taking into account dynamic production speeds caused by machine downtime, and the machine state and the manufacturing network state are systematically evaluated by the proposed model. Secondly, at the system level, taking the economic operation of the manufacturing network as a standard of strategy evaluation, an optimal strategy for learning quality detection and maintenance under a given manufacturing network state based on an optimization model of deep reinforcement learning is proposed.

Dynamic production speed: considering each machine as a node, modeling a loop-free manufacturing network of n nodes with a directed loop-free graph G (V, E), where v= { V ₁ ,v ₂ ,…,v _n A set of nodes, i.e. a set of machines,

to make a collection of directed edges in the network, i.e., the flow direction of the article; i. j are nodes. W (W) ⁺ Representing the forward weight matrix of the network edge.

the forward weight of the directed edge is the probability of the node i flowing out of the product into the downstream node j. In addition, the inverse weight matrix W of the directed edge ^- May pass through matrix W ⁺ Calculated, wherein->

Representing the probability that all of the incoming node i are in process from the upstream node k. Defining a set of immediately downstream nodes connected to node i as

The set of immediately upstream Nodes (NUM) connected to node i is

When there is no machine downtime in the manufacturing network, the production speed of the machine is defined as the maximum production speed, denoted as P _rm ＝[P _rm1 ,P _rm2 ,…,P _rmn ]The method comprises the steps of carrying out a first treatment on the surface of the The actual production speed of the machine in the manufacturing network is denoted as P _ra (t)＝[P _ra1 (t),P _ra2 (t),…,P _ran (t)]And meet P _ra ≤P _rm The method comprises the steps of carrying out a first treatment on the surface of the Wherein P is _rmn Represents the maximum production speed of the nth machine, P _ran (t) represents the actual production speed of the nth machine; during production, the actual production speed of the machine is closely related to the immediate upstream node and the immediate downstream node to maintain balanced production. When the production speed of the node i changes by delta P _rai (t) the amount of change DeltaP in the production speed of the node immediately upstream thereof _rak (t) and the variation DeltaP of the production speed of the immediately downstream node _raj (t) are respectively expressed as:

after that, deltaP _rai The effect of (t) will propagate along its process route to the source node and the destination node and cause corresponding changes in the corresponding process route on each machine.

Dynamic machine reliability and maintenance activities: in a manufacturing network, machines are subject to dynamic failure probabilities during operation. Such dynamics can be attributed toThe junction is three aspects: uncertainty in machine failure mode, dynamics in production speed, and instability in feed quality. Fortunately, the weibull distribution can fit different failure modes of the machine, such as decreasing, constant, and increasing failure rates. Thus, failure rates that follow the weibull distribution are suitable for modeling the failure risk of a machine. Defining the failure rate when only qualified feed materials are processed as the basic failure rate r _b (t) when considering the dynamics of the production speed, the base failure rate is expressed as:

r _b (t)＝(β/α)·(t _r /α) ^β-1 (4)

wherein α is a scale parameter and β is a shape parameter; t is t _r The relative time of the operation of the opportunity machine at the maximum production speed is considered; t is the actual time of machine operation; and, relative time t _r And can also be used as an index for measuring the degradation of the machine.

wherein Deltar _i′ The delta of cumulative failure rate due to machine reject feed is subject to Beta distribution Beta (a, b). N (t) is the number of unqualified products processed by the machine in the time of [0, t); the probability distribution function F (t) of the machine failure is obtained through deduction:

F(t)＝1-exp(-(t _r /α) ^β -∫Δr(t)dt) (7)

wherein t is _i′ The actual time of occurrence of the i' th failure rate increment.

Corrective maintenance and preventative maintenance are effective methods of handling machine hard and soft failures. The present invention assumes that corrective maintenance and preventive maintenance both restore the failure rate of the machine to a level at t=0. Also, it is assumed that the maintenance times for corrective maintenance and preventive maintenance of the machine are respectively subject to the following normal distributions:

and

often, corrective repairs require recovery from a failed shutdown of the chaotic machine, which is more abrupt than the failure faced by preventative repairs. It is assumed that corrective maintenance will take more time to complete and therefore there is μ _cm ≥μ _pm ，

Assuming that the maintenance costs per unit time for corrective maintenance and preventive maintenance are c, respectively _cm And c _pm Then the total maintenance cost c of the machine in time [0, t ] _m (t) can be expressed as:

wherein N is _cm (t) is the number of corrective repairs to the machine in [0, t ], N _pm (t) is the number of preventive maintenance operations of the machine in a time of [0, t ], t _{cm_i1} Time spent for machine ith 1 st corrective maintenance, t _{pm_j1} Time spent for preventive maintenance of the machine j1 st.

Constructing a dynamic processing quality and detection activity model: the quality of the process is another important criterion of the reliability of the machine, which can be described by the number M (t) of rejects produced in a specific time [0, t ]. Failure of the product to meet the quality specifications is referred to as reject. The quality of the process also has time-varying characteristics due to instability in machine reliability. Thus, the random variable M (t) is modeled by a non-homogeneous Poisson process (NHPP) with an intensity function of λ (t). The expression of λ (t) is:

λ(t)＝ω-ε·e ^-δ·r(t) (10)

wherein omega>0 represents the maximum reject generation strength ε>0 and delta>0 is the influence coefficient of failure rate on the intensity function, and omega-epsilon<λ(t)<ω<P _ra (t); when ω=p is defined _ra At (t), one can demonstrate ε=gxω, where g ε [0,1 ]]Is the initial percentage of machine-produced defect free product. Definition n _d The probability of the number of unqualified products generated by the machine in the time of [ t, t+delta t) is as follows:

is the expected value of the output of the unqualified products in the [0, t ] time period, and delta t is the quality statistical period. Definition n _q The total number of acceptable products processed by the machine during the [ t, t+Δt ] period is expressed as:

in a manufacturing network, the inspection activities are performed after the machining activities to ensure that defective articles can be found in time. In verification activities, typically a class I error (false reject) and a class II error (false accept) occur, with a probability of p _I The method comprises the steps of carrying out a first treatment on the surface of the Class II error, probability p _II . Accordingly, the correct judgment of defective products and qualified products, called correct acceptance and correct rejection, is performed by 1-p _I And 1-p _II To represent the corresponding probabilities. Can be used to p if the processing machine is not detecting activity _I =0 and p _II =1.

Assume that in detecting activityIs s _a ，s _a ∈[0,1]And the sampling and detecting activities are independent, the joint probability of any qualified product in I type error is s _a ·p _I The joint probability of any disqualified product to generate II-type error is s _a ·p _II The method comprises the steps of carrying out a first treatment on the surface of the Defining the times of the I type error and the II type error as n respectively _fr And n _fa And obeys the binomial distribution B (n) _q ,s _a ·p _I ) And B (n) _d ,s _a ·p _II ) The method comprises the steps of carrying out a first treatment on the surface of the Thus, the number of unacceptable products M' (t) exiting the machine is considered to be the non-homogeneous poisson process M (t) and the binomial distribution B (n _d ,s _a ·p _II ) Is a complex of (a) and (b); n is n _fa The probability of a failed work-in-process flowing out of the machine during the [ t, t+Δt ] time period is:

wherein m' (t) is the average number of unacceptable products flowing out of the machine during the [0, t ] time period; in addition, the probability that the pass product is rejected during the [ t, t+Δt ] period is:

wherein D (t) is the rejected good product accumulated to time t,

represents a value represented by n _q Randomly extracting n from the unqualified products _fr The number of possible combinations of the individual samples; the product processed by the machine in the time period of [ t, t+delta t ] is correctly judged as the number n of unqualified products _cr And the number n of the qualified products correctly determined _ca Expressed as:

/>

assume that the cumulative value increment incurred by the process of a single work-in-process from a manufacturing network input node to a current process node i is v _si The value loss caused by the fact that a single product in the node i is judged to be a defective product is also v _si The method comprises the steps of carrying out a first treatment on the surface of the The method comprises the steps of carrying out a first treatment on the surface of the The average process value increment v of the machined single piece corresponding to the node i in the product _ai Can be defined as:

wherein v is _sk Representing the accumulated value increase of the upstream node k of the node i; in addition, when a class II error occurs in the process of detecting a defective work-in-process, the defective work-in-process will flow into the downstream machine and consume more production resources, resulting in a value loss that will be greater than the value loss caused by the current machine being properly determined as a defective work-in-process. The value loss caused by disqualification of class II errors in the detection process during the production process is defined as

Evaluating manufacturing network status: machine performance has different manifestations, such as faults caused by hard or soft faults, idleness caused by starvation or blockage, variations in process quality, and degradation, all of which affect the performance of the manufacturing network. For a manufacturing network with n machines, a state matrix S is constructed ^K State at time t:

S ^K ＝[Q ^K ；D ^K ；H ^K ；O ^K ] (20)

wherein, t=kΔt,

the defect ratio for each machine over the time t- Δt, t, representing the mass state of the machine, is calculated from equation (16). D (D) ^K ＝[t _r1 ,t _r2 ,…,t _rn ]The degradation state of the machine at time t is represented by the relative time from the last maintenance to the present of each machine, as shown in equation (5). H ^K ＝[h ₁ ,h ₂ ,…,h _n ]Is the health state of the machine at the time t, h _i 1 in e {0,1} represents a failure state of the machine, 0 represents a no failure state of the machine; o (O) ^K ＝[o ₁ ,o ₂ ,…,o _n ]Indicating the idle state of the machine at time t, o _i A 1 in e {0,1} represents an idle state, and a 0 represents an active state.

Evaluating cumulative performance of the manufacturing network: the cumulative performance assessment is used to determine whether maintenance and quality inspection schemes for all machines are cost effective. From an economic standpoint, the cumulative performance of a manufacturing network may be expressed in terms of net revenue. Definition of rewards r _K At [ t, t+Δt ] time for manufacturing the networkInner slave state S ^K Transition to S ^K+1 Net benefit from the process of (2); where Δt is the period of the quality statistic and is also the step size of the state transition. Thus, the rewards per unit step are equal to the cumulative net value increment after deduction of maintenance, detection and decision costs;

wherein c _{I_i} Is the total detection cost of the machine corresponding to node i, c _{m_i} Is the total maintenance cost of the machine corresponding to node i, c _D Is the decision cost for repair and inspection activities. To evaluate the slave state S in [ K.DELTA.t, K' DELTA.t ] time ^K To S ^K′ Defining the benefit GK as the long-term return of the manufacturing network, obtained by jackpot calculation:

wherein K' > K.

Joint optimization model based on markov decision process: the dynamic reliability and quality model constructed previously provides a state transition model of the manufacturing network. When quality inspection and maintenance are considered actions, a typical control model based on a Markov decision process can be constructed, where state sets and action sets, rewards functions, and state transition models are known. At the same time, the model aims at searching the optimal preventive maintenance and quality detection strategy function, so that the manufacturing network obtains the optimal long-term return. In practice, corrective maintenance will automatically begin when the machine fails, without policy support. Thus, this strategy refers only to the behaviour of preventive maintenance and quality detection at the moment t=kΔt, noted as

Wherein (1)>

For quality detection action, < >>

Is a preventive maintenance action. Furthermore, the actions of all machines in the manufacturing network at t=kΔt depend on the state S ^K Therefore, it is denoted as A ^K ＝π(S ^K ) Where pi (·) represents the policy function:

wherein pi ^* (. Cndot.) represents the optimal policy function, Q (. Cndot.) represents the state S ^K Take A ^K Long term return on motion.

Under the optimal strategy, the cost function and the Q function satisfy:

wherein V (·) represents the state S ^K Maximum long-term return at that time.

Traditional dynamic programming or heuristic algorithms are able to accomplish optimization tasks in a finite time domain markov decision process with an enumeratable state space and action space. In the present study, however, the defect ratio Q ^K And state of degradation D ^K Is a continuous state space, the quality detection has a continuous action space, and the sampling ratio can be 0,1]Any value of (3). Thus, the Markov decision process of the manufacturing network has an immense state space and action space. In addition, the state space and action space of the constructed Markov decision process also grows exponentially as the number of machines increases, resulting in a "dimension disaster". Therefore, the conventional dynamic programming method cannot solve such an infinite time domain sequence decision problem. Although heuristic algorithms have strong searching capabilities and can provide optimal solutions, due to their weak transfer learning capabilities, optimal manufacturing system performance cannot be continuously guaranteed as manufacturing scenes change.

In current algorithmic research, the learning capabilities of the DRL algorithm have proven to be effective in handling an infinite time-domain markov decision process. Among them, the DQN algorithm, the Actor-Critic algorithm, and the DDPG algorithm have been demonstrated to be effective in solving different Markov decision processes in a maintenance solution. First, all three algorithms are applicable to Markov decision processes with continuous or discrete state spaces. Whereas the DQN algorithm is only applicable to discrete motion spaces, the DDPG algorithm is only applicable to continuous motion spaces, and the Actor-Critic algorithm is applicable to both continuous and discrete motion spaces. In addition, DDPG takes advantage of the neural network Q function and Actor-Critic framework, and shows more excellent stability than the other two algorithms. Therefore, the present invention adopts the DDPG algorithm.

The DDPG algorithm is built based on an Actor-Critic framework, two of which have θ _μ And theta _Q The neural network of parameters is used to approximate the policy function and the cost function. The policy and cost functions are constructed using an Actor network μ (S) and a Critic network Q (S, a) in an Actor-Critic framework, as shown in fig. 4. In addition, the activation function is a Relu function, and the hidden layer is a fully connected layer. In an Actor network, a state matrix S ^K Is an input, corresponding action A ^K Is the output to achieve long term return maximization. Quality detection actions due to the possession of the same normalized neurons at the final output layer

And preventive maintenance actions->

At [0,1]With successive output values. In order to cope with the inconsistency between the continuous quality inspection action and the discrete preventive maintenance action, in the present study, the preventive maintenance criterion c is given _d ∈[0,1]To discretize preventive maintenance actions. In particular, preventive maintenance is->

Will not perform at all, but at +.>

Is executed. State matrix S ^K And motion vector A ^K As input to the Critic network and return the corresponding long term expected Q value Q (S ^K ,A ^K ) As an output. />

Based on the constructed neural network, the Agent iteratively interacts with the manufacturing network environment. In the interactive process, the state of the manufacturing network in [ t, t+Δt) (where t=kΔt) time is taken from S ^K Conversion to S ^K+1 Is defined as one step of the DRL algorithm. An Epoch refers to an evaluation phase of long-term return over [ kΔt, kΔt') time, consisting of a plurality of steps. An epoode refers to a process in which an Agent performs a task, and is composed of a plurality of epochs. During the training process, the Agent will iteratively simulate the DDPG algorithm up to a maximum number of steps. At time t=0 (k=0), it is assumed that initial action a ⁰ Quality inspection and preventive maintenance are not performed. Similarly, assume an initial state S ⁰ Is free of defects, machine degradation, machine failure, and machine idleness. The detailed training process is shown in fig. 5 below.

Step1: performing actions simulating the operation of the manufacturing network during a [ kΔt, kΔt'), wherein t=kΔt;

By inputting state S to an Actor network μ (S) ^K The DDPG algorithm is obtained by adding a random noise N which is compliant with normal distribution _r Attempts to explore better strategies by allowable actions, i.e. A ^K ＝π(S ^K )+N _r The method comprises the steps of carrying out a first treatment on the surface of the Then according to the preventive maintenance criterion c _d Converting the preventative maintenance action into discrete executable actions {0,1}, wherein 0 means that preventative maintenance is not performed and 1 means that preventative maintenance is performed; to avoidCriterion c _d Preferences caused, therefore, define c _d =0.5, representing the output range of the Actor network [0,1 ]]Is a median value of (c).

Step1.3: performing an action simulating the operation of the manufacturing network during [ kΔt, kΔt+Δt): action a at time t=kΔt is obtained ^K The quality detection of the corresponding machine during subsequent operations will immediately employ the new sampling rate

step2: obtaining a transfer record: after operation at [ K.DELTA.t, K' DELTA.t) time, a long-term return G is obtained _K Is capable of; according to the method of step step1.1, the state S at t=k' Δt is obtained ^K′ The method comprises the steps of carrying out a first treatment on the surface of the Then, the state transition record { S ^K ,A ^K ,G _K ,S ^K′ Store in experience buffers;

step3: updating an Actor network and a Critic network: randomly extracting state transition records with batch number M from the experience buffer zone, thereby updating an Actor network mu (S) and a Critic network Q (S, A); at this point, the empirical buffer has the maximum storage, i.e., the buffer length L. When the number of transfer records reaches L, the earliest record will be discarded in order to store a new transfer record.

Step4: judging an ending condition; if Epinode reaches a predetermined maximum training number or a stable long-term return G is obtained _K Stopping training; otherwise the simulation period will be updated: [ K.DELTA.t, K ' DELTA.t) ≡ [ K ' DELTA.t, 2K ' DELTA.t-K DELTA.t), and returns to Step1.

Prior to the training step, a target Actor network μ '(S) and a target Critic network Q' (S, a) are respectively constructed using an Actor network and Critic neural network having the same structure, and a random parameter θ is used _μ And theta _Q Initializing Actor network μ (S) and Critic network Q (S, A), using θ _μ' ＝θ _μ And theta _Q' ＝θ _Q The target actor network μ '(S) and the target critic network Q' (S, a) are initialized. In order to improve the stability of the optimization, the target Actor network and the target Critic network are updated periodically based on the latest Actor and Critic parameters. Based on the state transition record in the experience buffer, the neural network will update at each training Epoch, the flow of which is shown in FIG. 6, and the update algorithm is as follows.

a) The method comprises the following steps Randomly extracting batches of M state transition records from the experience buffer:

b) The method comprises the following steps For transfer records i3=1, 2, …, M, calculate future target actions

And target future long-term return

And sets the objective +.>

c) The method comprises the following steps Updating parameter θ of Critic network Q (S, A) by minimizing loss function _Q ：

d) The method comprises the following steps Updating the parameter θ of the Actor network μ (S) by maximizing the expected cumulative long-term return _μ ：

e) The method comprises the following steps Updating parameters of a target actor network and a target Critic network:

θ _μ′ ＝τθ _μ +(1-τ)θ _μ′ (27)

θ _Q′ ＝τθ _Q +(1-τ)θ _Q′ (28)

where τ is a smoothing factor.

Example analysis: manufacturing network and Agent parameters

In this example, a manufacturing network with multiple process routes was modeled using a directed acyclic network consisting of 30 nodes. As shown in fig. 7, the machines are expressed as nodes with different reliability parameters, while the flow of work-in-process between machines is represented by edges. The product flows randomly between the machines, the number of which is limited by the capacity of the machine to ensure a balance of production. The manufacturing network has 4 source nodes, 4 end nodes and 51 directed edges, defining a total of 724 different process routes.

This example was trained using MATLAB R2021a software. According to the learning rate used in the related study, the learning rates of the Actor network and the Critic network in the training are respectively 2 multiplied by 10 ^-3 And 1X 10 ^-3 . Since the number of hidden layer neurons has a great relationship with the dimension of the problem of interest, the hidden layer neurons are respectively set to L in consideration of similar research of DRL _s1 ＝128、L _s2 ＝256、L _s3 ＝128、L _a1 ＝64、L _a2 ＝128、L _c1 ＝256、L _c2 ＝128、L _c3 ＝64、L ₁ ＝128、L ₂ ＝256、L ₃ =128. In addition, based on the existing research, other parameters are respectively: empirical buffer length l=1×10 ⁶ Small sampled batches m=1280, smoothing factor τ=1×10 ^-3 Decision cost c _D ＝100。

Training DRL Agent: the DRL Agent will be trained to obtain the maximum benefit in 5000 units of time, i.e. (K' -K) ×Δt=5000. Step by step in consideration of Agent performanceLong sensitivity, three different manufacturing scenarios were constructed using different step sizes Δt and decision frequencies K' -K: { Δt=50, K ' -k=100 }, { Δt=100, K ' -k=50) }, and { Δt=500, K ' -k=10 }, which means that Agent will generate action a in each Epoch ^K And interacts with the manufacturing network 100 times, 50 times, and 10 times. Based on the constructed manufacturing network environment and DRL agents, three training Episodes were developed for three manufacturing scenarios, and the rewards obtained are shown in Table 1.

In addition, an optimization model based on a genetic algorithm is used as a comparison standard of the method. The algorithm constructs a population of 70 individuals, each individual having a 1 x 60 matrix as its chromosome (solution), representing preventive maintenance and quality inspection actions for 30 nodes in the constructed manufacturing network. Gain G in 5000 units of time _K Individuals are evaluated as a function of fitness. For three manufacturing scenarios, the maximum evolution algebra (MEG) is set to 7×10 manufacturing network runs ⁵ The time of the step, equal to the number of training steps under the DRL algorithm, { Δt=50, meg=100 }, { Δt=100, meg=200) }, and { Δt=500, meg=1000 }, the resulting benefits are shown in table 1.

TABLE 1 benefit form trained by DRL and genetic algorithm

When the yields tended to stabilize, their mean (mean) and Standard Deviation (SD) were calculated in table 1. The Coefficient of Variation (CV) in equation (29) is used to analyze the relative dispersion of the benefits taking into account the scale differences.

CV＝SD/mean (29)

The mean, standard deviation, and coefficient of variation under the DRL algorithm are calculated based on the yields of the last 100 epochs whose yields tend to be stable. The mean, standard deviation, and coefficient of variation of the genetic algorithm are calculated from the benefits of the last 50 epochs when the benefits tend to stabilize. For the DRL algorithm, when Δt=50, 100, 500, the highest gains are 4.83×10, respectively ⁴ 、3.63×10 ⁴ 、6.47×10 ⁴ The method comprises the steps of carrying out a first treatment on the surface of the For genetic algorithms, when Δt=50,100. At 500, the highest yields were 2.63×10 respectively ⁴ 、4.25×10 ⁴ 、6.39×10 ⁴ . Only when Δt=100, the genetic algorithm can help the manufacturing network to get better average yields than the DRL algorithm. Therefore, compared with a genetic algorithm, the DRL algorithm provided by the invention has better adaptability to various manufacturing scenes in a complex manufacturing network. Moreover, in most training, the coefficient of variation under the genetic algorithm is greater than that under the DRL algorithm, meaning that the genetic algorithm achieves less stability of the benefit than the DRL algorithm. The training trajectories of the highest benefit of the DRL algorithm and the genetic algorithm in different manufacturing scenarios are shown in FIG. 8.

The profit trace under the DRL algorithm shows that the constructed DRL Agent can improve the long-term return of the manufacturing network through interaction with the manufacturing network, and the effectiveness of the model is proved. Furthermore, the different patterns of the revenue trace illustrate the differences in training results in different manufacturing scenarios. First, the higher the interaction frequency between the manufacturing environment and the DRL agent (when the step Δt=50), not only will the learning process converge slowly, but also a significant loss will occur in the training process, i.e. a negative return as shown in fig. 8. Second, the trace of Δt=100 shows that under lower frequency interactions, fast convergence can be achieved, but the gain obtained may not be optimal. In summary, the learning performance of the DRL algorithm is sensitive to unsynchronized long manufacturing scenarios due to the nonlinearity, high dimensionality, and dynamics of the manufacturing network. Furthermore, the revenue trace indicates that DRL training may cause possible loss. Therefore, the digital manufacturing network environment must be optimized based on machine learning to avoid possible losses when agents interact with the real manufacturing system.

Experiments based on trained agents: through the optimal DRL Agent trained by the three manufacturing scenes, the invention implements experiments for controlling the maintenance and quality detection of the manufacturing network within 50000 units of time. The coefficient of variation and cumulative benefit of rewards for different manufacturing scenarios are shown in figure 9. With the help of the DRL Agent, the manufacturing network can obtain continuously increasing accumulated benefits in all manufacturing scenarios. Likewise, when the interaction step Δt=500, at the proposed D Under the control of the RL Agent, the manufacturing network can obtain the highest accumulated income. Meanwhile, the variation coefficient represents the relation of the dispersion degree of rewards under different step sizes: CV (CV) ₅₀₀ <CV ₁₀₀ <CV ₅₀ . Therefore, the Agent performance is consistent with the training result, and the experimental process is stable and effective.

And calculating the unit time rewards of each step length according to the experimental results. For the Kth step, it awards r per unit time _u(K) Can be obtained from formula (30):

r _u(K) ＝r _K /Δt (30)

FIG. 9 is a scatter plot of rewards per unit time obtained by three trained agents. In the scatter diagram, when the step Δt=100 or 500, the unit time rewards are stable and concentrated, and the unit time rewards at Δt=500 are larger than those at Δt=100. However, the rewards per unit time become dispersed and unstable when Δt=50. This phenomenon suggests that smaller step sizes make it difficult for an Agent to continuously give optimal decisions due to the high nonlinearity, high dimensionality, and dynamics of the manufacturing network. On the other hand, the unit time rewards also explain why the Agent cumulative returns trained at

steps

50 and 100 are lower.

Finally, this example analyzes connectivity of a manufacturing network in 50000 units of time with intervention of 3 trained DRL agents. Connectivity refers to the probability that at least one process path remains connected between the source node and the destination node of the manufacturing network, and the connectivity curve is shown in fig. 10. When the step Δt=500, the connectivity of the manufacturing network is the worst and the connectivity variation amplitude is the largest, indicating that the connectivity is the worst and the least stable. And when the step length delta t=100, the connectivity is the best, the variation amplitude is the smallest, and the concept of 'high benefit accompanied by high risk' is verified. In particular, when the manufacturing network maintains a high long-term return (Δt=500), there will be a high risk of outage (worst connectivity).

The invention researches the problem of joint optimization of manufacturing network maintenance and quality detection based on a DRL algorithm under the condition of machine reliability and product quality interaction behavior. Firstly, a mathematical model for constructing a nonlinear, high-dimensional and dynamic environment of a manufacturing network is provided, and an effective state transition model is provided for control of the manufacturing network. Secondly, an effective DRL model suitable for manufacturing network reliability-quality joint control is constructed, and modeling of discrete-continuous mixed states and mixed actions can be simultaneously realized; in addition, training and experimental results verify the validity of the proposed DRL model. Compared with genetic algorithm, the DRL algorithm has better adaptability to dynamic and diversified manufacturing scenes. Meanwhile, the model provided by the invention can well balance the contradiction between the economic benefit and the operation risk of the manufacturing network.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. The manufacturing network maintenance-detection combined optimization method based on deep reinforcement learning is characterized by comprising the following steps of:

2. The method for manufacturing network maintenance-detection joint optimization based on deep reinforcement learning according to claim 1, wherein the method for constructing the reliability model and the quality model is as follows:

calculating the dynamic production speed:

to make a set of directed edges in a network; i. j are nodes;

representing the set of immediately upstream nodes connected to node i,

representation and nodei a set of connected immediate downstream nodes; />

calculating dynamic maintenance cost:

r _b (t)＝(β/α)·(t _r /α) ^β-1 ；

Wherein α is a scale parameter and β is a shape parameter;

F(t)＝1-exp(-(t _r /α) ^β -∫Δr(t)dt)；

an accumulated failure rate increase for machining reject feed material up to time t; Δrthe integral of t) is calculated as:

and->

And mu _cm ≥μ _pm ，/>

constructing a dynamic processing quality and detection activity model:

λ(t)＝ω-ε·e ^-δ·r(t) ；

Wherein omega>0 represents the maximum mismatchThe lattice product generates strength epsilon>0 and delta>0 are the influence coefficients of failure rate on the intensity function, and omega-epsilon<λ(t)<ω<P _ra (t); definition n _d The probability of the number of unqualified products generated by the machine in the time of [ t, t+delta t) is as follows:

wherein D (t) is the rejected good product accumulated to time t,

suppose that a single article is transported from a manufacturing networkThe accumulated value increment brought by the processing process from the input node to the current processing node i is v _si The average process value increment v brought by machining single piece corresponding to node i in product _ai Can be defined as:

wherein v is _sk Representing the accumulated value increase of the upstream node k of the node i;

3. The deep reinforcement learning-based manufacturing network repair-inspection joint optimization method according to claim 2, wherein the method for performing system evaluation on the manufacturing network state and performance is:

S ^K ＝[Q ^K ；D ^K ；H ^K ；O ^K ]；

wherein, t=kΔt,

mass state for each machine in [ t- Δt, t); d (D) ^K ＝[t _r1 ,t _r2 ,…,t _rn ]Representing the degradation state of the machine at time t, H ^K ＝[h ₁ ,h ₂ ,…,h _n ]Is the health state of the machine at the time t, h _i 1 in e {0,1} represents a failure state of the machine, 0 represents a no failure state of the machine; o (O) ^K ＝[o ₁ ,o ₂ ,…,o _n ]Indicating the idle state of the machine at time t, o _i 1 in e {0,1} represents an idle state, 0 represents an operating state;

wherein K' > K.

4. The manufacturing network maintenance-detection joint optimization method based on deep reinforcement learning according to claim 3, wherein the method for constructing the manufacturing network maintenance and quality detection joint optimization model is as follows:

regarding quality inspection and preventive maintenance as actions, noted as

Wherein (1)>

For quality detection action, < >>

Is a preventive maintenance action;

Under the optimal strategy, the cost function and the Q function satisfy:

wherein V (·) represents the state S ^K Maximum long-term return at that time.

5. The method for optimizing repair-inspection of a manufacturing network based on deep reinforcement learning according to claim 4, wherein the designed method for learning the optimal strategy for quality inspection and repair under a given manufacturing network state by using a depth deterministic strategy gradient algorithm (DDPG) is as follows:

step1.4: assessment of rewards r _K : calculating a prize r _i2 And i2=i2+1 is updated; such asFruit i2<K', returning to the step Step1.1;

otherwise, executing Step2;

6. The method for manufacturing network maintenance-detection joint optimization based on deep reinforcement learning according to claim 5, wherein the updating method of the Actor network and the Critic network is as follows:

s3.3.2: for the transfer record i3=1, 2, …, M, calculate future target action +.>

And target future long-term return

And sets the objective +.>

θ _μ′ ＝τθ _μ +(1-τ)θ _μ′ ；

θ _Q′ ＝τθ _Q +(1-τ)θ _Q′ ；

where τ is a smoothing factor.