CN111857081B - Chip packaging test production line performance control method based on Q-learning reinforcement learning

Info

Publication number: CN111857081B
Application number: CN202010797879.2A
Authority: CN
Inventors: 李波; 冯益铭; 钱鑫森
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-08-10
Filing date: 2020-08-10
Publication date: 2023-05-05
Anticipated expiration: 2040-08-10
Also published as: CN111857081A

Abstract

The invention relates to the control and optimization of the performance of semiconductor chip packaging test production lines, and in particular to a method for controlling the performance of a chip packaging test production line based on Q-learning reinforcement learning. A more accurate performance prediction model of the serial-parallel semiconductor packaging test production line is established, and the Morris screening method and Arena simulation are used together for global sensitivity analysis. This identifies the factors that most strongly affect production line performance and the rules governing their influence, and avoids the situation in which the equipment Markov state space is so large that traditional analytical models are not applicable. On the basis of performance prediction and sensitivity analysis, the invention controls the variability factors of the production line and improves the way the parameter ε is chosen, so that the algorithm converges faster and avoids local optima, while offering better flexibility and real-time performance.

Description

Chip packaging test production line performance control method based on Q-learning reinforcement learning

Technical Field

The invention relates to the control and optimization of the performance of semiconductor chip packaging test production lines, and in particular to a performance control method for such a production line that combines sensitivity analysis with a Q-learning reinforcement learning algorithm.

Background

The semiconductor manufacturing industry has great strategic value for the national economy. To sustain the development of the semiconductor manufacturing industry in China, it is necessary not only to expand production scale but also to focus on the production efficiency of the manufacturing system and to strengthen production management and control technology. A semiconductor manufacturing system is characterized by highly re-entrant process routes, a complex production process, long manufacturing cycles, very large system scale and high uncertainty, so controlling the performance of the production line is difficult. The performance of the manufacturing system is strongly affected by variability factors such as buffer capacity, sudden equipment failures, preventive maintenance and product rework, which reduce production efficiency, lengthen the production cycle and disrupt the normal execution of the production plan.

Research on intelligent, comprehensive and dynamic control of production line performance is still limited. Most existing work addresses a single aspect of production line variability and cannot consider the various variability factors on the line globally. The performance prediction models established so far for serial-parallel semiconductor production lines deviate from actual production conditions and lack accuracy. Traditional performance control and optimization methods have difficulty responding in real time to changes in the variability factors and are not flexible enough.

Disclosure of Invention

To address the shortcomings of existing performance control models and strategies for semiconductor chip packaging test production lines, the invention provides a chip packaging test production line performance control method based on Q-learning reinforcement learning. Existing approaches respond too slowly to variability factors, consider them incompletely, and produce conflicting control strategies. The method of the invention combines sensitivity analysis with a Q-learning reinforcement learning algorithm to control the manufacturing performance of the semiconductor chip packaging test production line intelligently.

A chip packaging test production line performance control method based on Q-learning reinforcement learning comprises the following steps:

Step 1: Construct an abstract model of the serial-parallel semiconductor chip packaging test production line.

Step 2: Based on the abstract production line model constructed in step 1, establish a prediction model of the performance of the serial-parallel semiconductor chip packaging test production line.

Step 3: Based on the abstract production line model constructed in step 1, obtain the mechanism by which the key variability factors influence production line performance, using qualitative analysis by the Morris screening method and quantitative analysis by Arena simulation.

Step 4: Based on the performance prediction model established in step 2 and the key variability analysis obtained in step 3, establish a performance control model based on the Q-learning reinforcement learning algorithm, and solve it iteratively with the benefit index of the production line as the performance control target to obtain a globally optimal performance control strategy.

The step 1 specifically comprises the following steps:

Abstraction of the semiconductor chip packaging test production line model: the back-end process of the semiconductor production line, namely the chip packaging test production line, is taken as the research object. Assuming that buffers of limited capacity exist between stations and that the queuing rule is first come, first served, the line is abstracted into a multi-station serial-parallel queuing production line model with re-entry (rework).

The step 2 specifically comprises the following steps:

Step 2.1: Variability calculation: calculate the arrival variability c_a and the processing time variability c_e.

Step 2.2: Determine the basic performance prediction indices.

From the average queuing time CT_q of workpieces and the effective processing time t_e, the average time CT spent at the workstation (the production cycle) is obtained, and the average work-in-process level WIP at the workstation is then calculated. The production rate TH, the production cycle CT and the work-in-process level WIP are used as the basic indices for predicting production line performance.

CT = CT_q + t_e

WIP = CT × TH
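The two relations above can be evaluated directly. The following is a minimal sketch; the function names and the sample numbers are illustrative assumptions, not values from the patent.

```python
def station_cycle_time(ct_q: float, t_e: float) -> float:
    """CT = CT_q + t_e: queuing time plus effective processing time."""
    return ct_q + t_e


def wip_level(ct: float, th: float) -> float:
    """WIP = CT x TH: work in process from cycle time and production rate."""
    return ct * th


# Illustrative numbers (assumed): 12 min in queue, 3 min processing,
# production rate of 0.5 pieces per minute.
ct = station_cycle_time(ct_q=12.0, t_e=3.0)
wip = wip_level(ct, th=0.5)
print(f"CT = {ct} min, WIP = {wip} pieces")
```

Keeping the time units of CT and TH consistent (here minutes and pieces per minute) is what makes the WIP result a count of pieces.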

Step 2.3: Establish the production line performance prediction model.

Step 2.3.1: Calculate the queuing time of product j at station i:

where c_a^ij and c_e^ij are the arrival variability and the processing time variability of product j at station i, u^ij is the utilization of station i, m^ij is the number of parallel machines at station i, and t_e^ij is the effective processing time of product j at station i.
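The patent's own queuing-time expression is not reproduced in this text. As a hedged illustration only, the sketch below assumes the widely used variability-utilization-time approximation for a station with m parallel machines from the Factory Physics literature, which takes exactly the five quantities defined above. The function name `queue_time_vut` is hypothetical.

```python
import math


def queue_time_vut(c_a: float, c_e: float, u: float, m: int, t_e: float) -> float:
    """Approximate queuing time at a station with m parallel machines.

    ASSUMPTION: standard approximation
        CT_q = ((c_a^2 + c_e^2) / 2) * (u^(sqrt(2(m+1)) - 1) / (m (1 - u))) * t_e
    It may differ from the expression used in the patent.
    """
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization u must satisfy 0 <= u < 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    variability = (c_a ** 2 + c_e ** 2) / 2.0
    utilization = u ** (math.sqrt(2.0 * (m + 1)) - 1.0) / (m * (1.0 - u))
    return variability * utilization * t_e


# Illustrative inputs (assumed): moderate variability, 3 parallel machines.
print(queue_time_vut(c_a=1.0, c_e=0.8, u=0.85, m=3, t_e=4.0))
```

Under this approximation the queuing time grows without bound as u approaches 1, which is why utilization tends to dominate sensitivity studies of such models.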

Step 2.3.2: Calculate the production rate TH of the workpiece.

Let station i have m^ij parallel machines (b > m > 1), let b be the capacity of the buffer in front of station i, and let k be the number of workpieces being processed at station i. If 0 ≤ k ≤ b, the probability p_0 that workpiece j is processed at station i without waiting (0 < j < r, where r is the number of product types processed together on the production line) is:

The blocking probability of workpiece j when the buffer capacity is b is:

Let q_hj be the defective rate of workpiece j at station h and Q_ij the defective rate detected at station i, with 0 < h < i ≤ s, where s is the number of stations in the serial-parallel production line. The probability Q_ij that a defective workpiece j is detected and removed at station i is:

The set appearing in this expression contains the numbers of all defect-detection stations in the production line.

The production rate TH_ij of workpiece j at station i is:

The station with the highest utilization is the bottleneck station I of product J, and its production rate is recorded as r_b^IJ = max(u^ij).
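The bottleneck rule in the last sentence amounts to picking the station with the largest utilization. A minimal sketch follows, assuming utilizations are held in a dictionary keyed by station number; the helper name and sample data are hypothetical.

```python
from typing import Dict, Tuple


def bottleneck_station(utilization: Dict[int, float]) -> Tuple[int, float]:
    """Return (station number, utilization) of the most heavily used station."""
    if not utilization:
        raise ValueError("utilization must not be empty")
    station = max(utilization, key=utilization.get)
    return station, utilization[station]


# Illustrative utilizations u^ij of one product at three stations (assumed).
u_by_station = {1: 0.62, 2: 0.91, 3: 0.78}
print(bottleneck_station(u_by_station))
```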

Step 2.3.3: Calculate the production cycle (logical production cycle) CT_j and the work-in-process level WIP_j of the production line.

Calculate the average wait-to-batch time WTBT of workpieces:

where r_a is the rate at which workpieces arrive at the workstation and k_ij is the processing batch size of product j at station i; at this point

Then

Rewrite the formula for CT_q^ij as:

Calculate the production cycle CT_j and the work-in-process level WIP_j of product j at station i:

This gives the production cycle (logical production cycle) CT_j and the work-in-process level WIP_j of product j over the whole serial-parallel production line:
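The wait-to-batch term used in this step can be illustrated with a short sketch. The patent's expression is not reproduced in this text, so the code assumes the common serial-batching result that a workpiece waits on average (k - 1) / (2 r_a) for its batch to fill: the first piece waits for k - 1 further arrivals, the last waits for none. The function name is hypothetical.

```python
def wait_to_batch_time(k: int, r_a: float) -> float:
    """Average time a workpiece waits for a batch of size k to form.

    ASSUMPTION: arrivals are one at a time at rate r_a, so the mean wait is
    (k - 1) / (2 * r_a). This may differ from the patent's WTBT expression.
    """
    if k < 1:
        raise ValueError("batch size k must be at least 1")
    if r_a <= 0:
        raise ValueError("arrival rate r_a must be positive")
    return (k - 1) / (2.0 * r_a)


# Illustrative values (assumed): batches of 5 pieces, 2 pieces arriving per minute.
print(wait_to_batch_time(k=5, r_a=2.0))
```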

Step 2.4: Evaluate the performance of the production line using the performance prediction model.

Step 2.4.1: Calculate the performance index F of the production line.

As shown in FIG. 3, the WIP-CT and WIP-TH curves of the production line in the best case, the worst case and the practical worst case are used as benchmarks to delimit a "good zone" and a "bad zone" in the performance quadrant, which together form the performance evaluation chart of the production line.

The position of the actual performance point, expressed as a ratio relative to the distance between the best-case and practical-worst-case benchmarks, is taken as the performance evaluation index and denoted F:

where w is the given actual work-in-process level, T is the actual production cycle, T_0 is the theoretical processing time of the production line, with T_0 = CT, and r_b is the bottleneck rate of the production line, with r_b = TH_ij if and only if u_ij = u_max.
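The three benchmark curves can be computed from w, r_b and T_0. The sketch below assumes the standard Factory Physics definitions of the best case, worst case and practical worst case, and treats the region between the best case and the practical worst case as the "good zone". It does not reproduce the patent's formula for F, and all function and field names are hypothetical.

```python
from typing import NamedTuple


class Benchmarks(NamedTuple):
    ct_best: float
    ct_pwc: float
    ct_worst: float
    th_best: float
    th_pwc: float
    th_worst: float


def benchmarks(w: float, r_b: float, t_0: float) -> Benchmarks:
    """Benchmark cycle times and production rates at WIP level w (assumed definitions)."""
    w_0 = r_b * t_0  # critical WIP
    return Benchmarks(
        ct_best=max(t_0, w / r_b),
        ct_pwc=t_0 + (w - 1) / r_b,
        ct_worst=w * t_0,
        th_best=min(w / t_0, r_b),
        th_pwc=w / (w_0 + w - 1) * r_b,
        th_worst=1.0 / t_0,
    )


def in_good_zone(w: float, t: float, r_b: float, t_0: float) -> bool:
    """True if the actual cycle time t lies between best case and practical worst case."""
    b = benchmarks(w, r_b, t_0)
    return b.ct_best <= t <= b.ct_pwc


# Illustrative line (assumed): r_b = 0.5 pieces/hour, T_0 = 8 hours, w = 4 pieces.
print(benchmarks(4, 0.5, 8.0))
print(in_good_zone(4, 10.0, 0.5, 8.0), in_good_zone(4, 20.0, 0.5, 8.0))
```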

Step 2.4.2: Calculate the benefit index Bf of the production line.

Taking production cost into account, the production line performance index F is rewritten as the benefit index Bf:

Bf = C × F

where C is the cost factor, c_1 is the unit equipment cost, c_2 is the cost per unit of buffer capacity, c_3 is the remaining fixed cost, m_1 and b_1 are the current number of parallel machines and the current buffer capacity, and m_0 and b_0 are the initial number of parallel machines and the initial buffer capacity.

The step 3 specifically comprises the following steps:

Step 3.1: Qualitative sensitivity analysis by the Morris screening method.

Select a parameter x in the production line performance prediction model, preset a fixed step length C and a maximum amplitude M, perturb the parameter x in steps of C, and take the average rate of change of the performance evaluation index F as the sensitivity coefficient S:

where Y_0 is the performance evaluation index F at the initial value of parameter x; Y_g and Y_g+1 are the values of F after the g-th and (g+1)-th perturbations of x; P_g and P_g+1 are the rates of change of the parameter value, relative to its initial value, after the g-th and (g+1)-th perturbations; and n is the number of runs.

According to the sensitivity grading criteria in Table 1, parameters graded "more sensitive" or "highly sensitive" are identified as the factors with a larger influence on the performance of the semiconductor packaging test production line.
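The sensitivity coefficient can be sketched in code from the quantities just defined. The patent's expression is not reproduced in this text, so the example assumes the commonly used modified Morris form, in which each term is the relative change of the output, (Y_g+1 - Y_g) / Y_0, divided by the change in the parameter's relative perturbation, P_g+1 - P_g, averaged over the runs. P values are written as fractions here (0.05 for 5 %), and the function name is hypothetical.

```python
from typing import Sequence


def morris_sensitivity(y: Sequence[float], p: Sequence[float]) -> float:
    """Average normalized rate of change of the output over successive perturbations.

    y[0] is Y_0 (output at the initial parameter value, where p[0] == 0.0);
    y[g] and p[g] are the output and the fractional parameter change after
    the g-th perturbation. ASSUMPTION: modified Morris screening form.
    """
    if len(y) != len(p) or len(y) < 2:
        raise ValueError("y and p must have equal length of at least 2")
    y_0 = y[0]
    terms = [
        ((y[g + 1] - y[g]) / y_0) / (p[g + 1] - p[g])
        for g in range(len(y) - 1)
    ]
    return sum(terms) / len(terms)


# Illustrative data (assumed): output rises 10 % for each 5 % step of the parameter.
print(morris_sensitivity(y=[1.0, 1.1, 1.2], p=[0.0, 0.05, 0.10]))
```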

TABLE 1 Sensitivity grading criteria

Absolute value of sensitivity coefficient	Sensitivity grade
0.00 ≤ |S| < 0.05	Insensitive
0.05 ≤ |S| < 0.20	Moderately sensitive
0.20 ≤ |S| < 1.00	More sensitive
|S| ≥ 1.00	Highly sensitive
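The grading criteria of Table 1 translate directly into a lookup on |S|. The following is a minimal sketch; the function name and the sample coefficients are illustrative.

```python
def sensitivity_grade(s: float) -> str:
    """Map a sensitivity coefficient S to its grade using the |S| thresholds of Table 1."""
    magnitude = abs(s)
    if magnitude < 0.05:
        return "insensitive"
    if magnitude < 0.20:
        return "moderately sensitive"
    if magnitude < 1.00:
        return "more sensitive"
    return "highly sensitive"


# Sample coefficients (illustrative); the sign of S does not affect the grade.
for s in (1.242, -0.163, 0.478, -0.029):
    print(s, sensitivity_grade(s))
```

Only parameters graded "more sensitive" or "highly sensitive" would then be carried forward as key factors.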

Step 3.2: Quantitative sensitivity analysis by Arena simulation.

A model of the serial-parallel semiconductor chip packaging test production line is built in the Arena software. Each machine has its own independent random processing time, failure time and maintenance time.

The workpiece arrival rate, the station processing rate, the mean time to failure m_f and the mean time to repair m_p on the production line follow negative exponential and normal distributions respectively. The processing batch size k, the buffer capacity b and the number of parallel machines m are fixed positive integers with b > m > 1. The warm-up time, the total run time and the number of replications of the simulation experiment are then set.

The experiments yield curves showing how the overall line performance, the production cycle CT, the production rate TH and the work-in-process level WIP vary with the key factors that affect line performance.

The step 4 specifically comprises the following steps:

Step 4.1: Taking the production line performance prediction model as the external environment for reinforcement learning and changes in production line variability as the trigger condition, establish the reinforcement-learning-based performance control model of the semiconductor chip packaging test production line shown in FIG. 5, using a dynamic control method that combines an event-triggered strategy with a periodically triggered strategy.

Step 4.2: Initialize Q(s, a) for all s ∈ S and a ∈ A(s), where the Q value reflects the long-term return, S is the set of system states, and A(s) is the set of action strategies for the key factors obtained in step 3. Given the learning rate α and the discount factor γ, determine the return function r.

Step 4.3: Given a starting state s, select action a in state s according to the ε-greedy strategy. The improved way of choosing ε is set as a function:

where p is the number of steps the algorithm has executed so far and M is the total number of iteration steps, so that ε decreases gradually from its initial value of 0.2 as the number of executed steps increases.
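The decaying ε can be illustrated as follows. The patent's function is not reproduced in this text; the sketch assumes a simple linear decay from 0.2 at p = 0 to 0 at p = M, which satisfies the stated behavior but is only one possible choice. The names `epsilon` and `epsilon_greedy` are hypothetical.

```python
import random
from typing import Sequence

EPSILON_START = 0.2  # initial value stated in the text


def epsilon(p: int, total_steps: int) -> float:
    """ASSUMED linear decay: 0.2 at step 0, falling to 0 at step M = total_steps."""
    return EPSILON_START * (1.0 - p / total_steps)


def epsilon_greedy(q_row: Sequence[float], eps: float, rng: random.Random) -> int:
    """Pick a random action with probability eps, otherwise the highest-valued action."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


rng = random.Random(0)
q_row = [0.1, 0.9, 0.3]  # illustrative Q values of three actions in one state
for step in (0, 50, 100):
    eps = epsilon(step, 100)
    print(step, eps, epsilon_greedy(q_row, eps, rng))
```

A schedule of this kind explores more at the start of learning and exploits the learned Q values more heavily towards the end, which is the motivation given for improving on a fixed ε.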

Step 4.4: Execute the action a selected in state s according to the ε-greedy strategy (b denotes the selection index of a), obtain the return r and the next state s_next, with a_next denoting the next action, and update the Q value:

s = s_next, a = a_next
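The update in step 4.4 can be sketched with the textbook tabular Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s_next, ·) − Q(s, a)]. The patent's own update expression is not reproduced in this text, so this standard form is an assumption; the default α = 0.1 and γ = 0.9 are the values given in the embodiment, and the table layout (a list of rows indexed by state) is illustrative.

```python
from typing import List


def q_update(
    q: List[List[float]],
    s: int,
    a: int,
    r: float,
    s_next: int,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    best_next = max(q[s_next])  # greedy value of the next state
    q[s][a] += alpha * (r + gamma * best_next - q[s][a])
    return q[s][a]


# Illustrative 2-state, 2-action table (assumed values).
q_table = [[0.0, 0.0], [1.0, 2.0]]
print(q_update(q_table, s=0, a=1, r=1.0, s_next=1))
```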

Step 4.5: Return to step 4.4 until the system tends to a steady state, i.e. converges.

Step 4.6: Repeat steps 4.2 to 4.5 until the learning period (the preset number of repetitions of steps 4.2 to 4.5) ends, then stop iterating.

Step 4.7: Output the final policy and obtain the optimized performance indices of the production line.

The invention establishes a more accurate performance prediction model of the serial-parallel semiconductor packaging test production line, and uses the Morris screening method together with Arena simulation for global sensitivity analysis. This identifies the factors that most strongly affect production line performance and the rules governing their influence, and avoids the situation in which the equipment Markov state space is so large that traditional analytical models are not applicable. The invention further provides a production line performance control model based on the Q-learning algorithm, which controls the variability factors of the production line on the basis of performance prediction and sensitivity analysis and improves the way the parameter ε is chosen, so that the algorithm converges faster and avoids local optima, while the control method offers better flexibility and real-time performance.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is an abstract model of a semiconductor chip package test line;

FIG. 3 is a schematic of the performance evaluation method based on the three Factory Physics benchmarks;

FIG. 4 is a schematic diagram of a simulation model logic structure of a production line;

FIG. 5 is a model of line performance control based on reinforcement learning according to an embodiment;

FIG. 6 is a graph of production line performance versus the variabilities c_a and c_e;

FIG. 7 shows the change in the production line performance indices before and after performance control at variability level CV1;

FIG. 8 shows the change in the production line performance indices before and after performance control at variability level CV2.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and an embodiment. The embodiment gives a detailed implementation and a specific operating procedure (FIG. 1) based on the technical scheme of the invention, but the scope of the invention is not limited to this embodiment.

The embodiment can be mainly divided into the following steps:

Step 1: Abstraction of the semiconductor chip packaging test production line model: a chip packaging test production line is taken as the research object. Assuming that buffers of limited size exist between stations and that the queuing rule is first come, first served, the line is abstracted into a multi-station serial-parallel queuing production line model with re-entry (rework) (FIG. 2).

Step 2:

Step 2.1: Variability calculation.

Calculate the arrival variability c_a and the processing time variability c_e.

Step 2.2: Determine the basic performance prediction indices.

CT = CT_q + t_e

WIP = CT × TH

Step 2.3: Establish the production line performance prediction model.

Step 2.3.1: Calculate the queuing time of product j at station i:

Step 2.3.2: Calculate the production rate TH of the workpiece.

The loss rate of workpiece j at station i is:

Let q_hj be the defective rate of workpiece j at station h and Q_ij the defective rate detected at station i, with 0 < h < i ≤ s, where s is the number of stations in the serial-parallel production line. The probability Q_ij that a defective workpiece j is detected and removed at station i is:

The set appearing in this expression contains the numbers of all defect-detection stations in the production line.

The production rate of the bottleneck station I of product J is recorded as r_b^IJ = max(u^ij).

Calculate the average wait-to-batch time WTBT of workpieces:

where r_a is the rate at which workpieces arrive at the workstation and k_ij is the processing batch size of product j at station i; at this point

Then

Rewrite the formula for CT_q^ij as:

This gives the production cycle (logical production cycle) CT_j and the work-in-process level WIP_j of product j over the whole serial-parallel production line:

Step 2.4.1: Calculate the performance index F of the production line.

Step 2.4.2: Calculate the benefit index Bf of the production line.

Bf = C × F

Step 3:

Select a parameter x in the production line performance prediction model, preset a fixed step length C and a maximum amplitude M, perturb the parameter x in steps of C, and take the average rate of change of the performance evaluation index F as the sensitivity coefficient S:

where Y_0 is the performance evaluation index F at the initial value of parameter x; Y_g and Y_g+1 are the values of F after the g-th and (g+1)-th perturbations of x; P_g and P_g+1 are the rates of change of the parameter value, relative to its initial value, after the g-th and (g+1)-th perturbations; and n is the number of runs.

Table 1 shows the sensitivity coefficients of the performance evaluation index F obtained by the Morris screening method for different parameters.

TABLE 1 Sensitivity coefficient S of index F

Parameter	Unit	Meaning	Sensitivity coefficient S
u	%	Utilization rate	1.242
r_0	Piece/min	Feed rate	-0.163
r_a	Piece/min	Production rate	0.622
k	Piece	Processing batch size	0.478
c_a	/	Workpiece arrival time variability	0.350
c_e	/	Processing time variability	0.457
m	Set	Number of parallel machines	-1.134
A	%	Equipment availability	-0.104
b	Piece	Buffer capacity	0.581
Q	%	Workpiece defective rate	-0.029

Based on the sensitivity grades in Table 2 and the relations between the parameters, the number of parallel machines m, the processing batch size k, the workpiece arrival time variability c_a, the processing time variability c_e and the buffer capacity b are identified as the factors with a larger influence on the performance of the semiconductor packaging test production line.

TABLE 2 sensitivity grading criteria

Step 3.2: Quantitative sensitivity analysis by Arena simulation.

A model of the serial-parallel semiconductor chip packaging test production line is built in the Arena software, as shown in FIG. 4. Each machine has its own independent random processing time, failure time and maintenance time.

The workpiece arrival rate, the station processing rate, the mean time to failure m_f and the mean time to repair m_p on the production line follow negative exponential and normal distributions respectively. The processing batch size k, the buffer capacity b and the number of parallel machines m are fixed positive integers with b > m > 1. The warm-up time of the simulation experiment is set to 600 minutes, the total run time to 1200 minutes, and the experiment is replicated 3 times.

The experiments yield curves showing how the overall line performance, the production cycle CT, the production rate TH and the work-in-process level WIP vary with the key factors that affect line performance. FIG. 6 shows how the production line performance varies with the arrival time variability c_a and the processing time variability c_e.

Step 4:

Step 4.1: Taking the production line performance prediction model as the external environment for reinforcement learning and changes in production line variability as the trigger condition, establish the reinforcement-learning-based performance control model of the semiconductor chip packaging test production line shown in FIG. 5, using a dynamic control method that combines an event-triggered strategy with a periodically triggered strategy.

Step 4.2: Initialize Q(s, a) for all s ∈ S and a ∈ A(s), where the Q value reflects the long-term return and S is the set of system states. The partition of S is shown in Table 3:

TABLE 3 Partition of the system state set S

System state	Division basis	System state	Division basis
s1	0 ≤ Bf ≤ 0.1	s2	0.1 < Bf ≤ 0.2
s3	0.2 < Bf ≤ 0.3	s4	0.3 < Bf ≤ 0.4
s5	0.4 < Bf ≤ 0.5	s6	0.5 < Bf ≤ 0.6
s7	0.6 < Bf ≤ 0.7	s8	0.7 < Bf ≤ 0.8
s9	0.8 < Bf ≤ 0.9	s10	0.9 < Bf ≤ 1.0
s11	Bf > 1.0
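The partition in Table 3 can be expressed as a small lookup that maps a benefit index Bf to its state label. This is a minimal sketch; the function name is hypothetical, and Bf = 1.0 is treated as belonging to s10.

```python
from bisect import bisect_left

# Upper bounds of states s1..s10 from Table 3; anything above 1.0 is s11.
UPPER_BOUNDS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def system_state(bf: float) -> str:
    """Return the state label (s1..s11) for a benefit index Bf >= 0."""
    if bf < 0:
        raise ValueError("Bf must be non-negative")
    # bisect_left finds the first upper bound >= bf, so each interval is
    # open on the left and closed on the right, as in the table.
    return f"s{bisect_left(UPPER_BOUNDS, bf) + 1}"


for bf in (0.0, 0.3, 0.31, 1.0, 1.2):
    print(bf, system_state(bf))
```

Comparing against the literal bounds, rather than computing an index from `bf * 10`, avoids floating-point rounding at interval edges such as 0.3.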

A(s) is the action strategy set, A(s): { a1+1, a2:1, a3:1+1, a4:1, a5:1+1, a6:1 }. The learning rate α is set to 0.1 and the discount factor γ to 0.9. The return function r is determined as follows, where Bf_pre denotes the benefit index after the previous optimization of the production line:

Step 4.3: Given a starting state s, select action a in state s according to the ε-greedy strategy.

s = s_next, a = a_next

Step 4.7: Output the final policy

and obtain the optimized performance indices of the production line. FIG. 7 and FIG. 8 show the change in the production line performance indices before and after performance control at variability levels CV1 and CV2, respectively.

In summary, the invention establishes a more accurate performance prediction model of the serial-parallel semiconductor packaging test production line and uses the Morris screening method together with Arena simulation for global sensitivity analysis. This identifies the factors that most strongly affect production line performance and the rules governing their influence, and avoids the situation in which the equipment Markov state space is so large that traditional analytical models are not applicable. The way the parameter ε is chosen is also improved, so that the algorithm converges faster, avoids local optima, and offers better flexibility and real-time performance.

Claims

1. The chip packaging test production line performance control method based on Q-learning reinforcement learning comprises the following steps:

step 4: based on the prediction model established in step 2 and the key variability analysis obtained in step 3, establishing a performance control model based on the Q-learning reinforcement learning algorithm, and solving it iteratively with the optimal benefit index of the production line as the performance control target to obtain a globally optimal performance control strategy;

step 1 specifically comprises: taking the back-end process of a semiconductor production line, namely the chip packaging test production line, as the research object, assuming that buffers of limited capacity exist between stations and that the queuing rule is first come, first served, and abstracting the line into a multi-station serial-parallel queuing production line model with re-entry;

step 2 specifically comprises:

step 2.1: variability calculation: calculating the arrival variability c_a and the processing time variability c_e;

step 2.2: determining the basic performance prediction indices;

obtaining, from the average queuing time CT_q of workpieces and the effective processing time t_e, the average time CT spent at the workstation, namely the production cycle; further calculating the average work-in-process level WIP at the workstation, and taking the work-in-process level WIP, the production rate TH and the production cycle CT as the basic indices for predicting production line performance;

CT = CT_q + t_e

WIP = CT × TH

step 2.3: establishing the production line performance prediction model;

step 2.3.1: calculating the queuing time of product j at station i:

where c_a^ij and c_e^ij are the arrival variability and the processing time variability of product j at station i, u^ij is the utilization of station i, m^ij is the number of parallel machines at station i, and t_e^ij is the effective processing time of product j at station i;

step 2.3.2: calculating the production rate TH of the workpiece;

station i has m^ij parallel machines, b is the capacity of the buffer in front of station i, k is the number of workpieces being processed at station i, and b > m > 1; if 0 ≤ k ≤ b, the probability p_0 that workpiece j is processed at station i without waiting, where 0 < j < r and r is the number of product types processed together on the production line, is:

the blocking probability of workpiece j when the buffer capacity is b is:

the set appearing in this expression contains the numbers of all defect-detection stations in the production line;

the station with the highest utilization is the bottleneck station I of product J, and its production rate is recorded as r_b^IJ = max(u^ij);

step 2.3.3: calculating the production cycle CT_j and the work-in-process level WIP_j of the production line;

calculating the average wait-to-batch time WTBT of workpieces:

then

rewriting the formula for CT_q^ij as:

thereby obtaining the production cycle CT_j and the work-in-process level WIP_j of product j over the whole serial-parallel production line:

step 2.4: evaluating the performance of the production line using the performance prediction model;

step 2.4.1: calculating the performance index F of the production line;

the WIP-CT and WIP-TH curves of the production line in the best case, the worst case and the practical worst case are used as benchmarks to delimit a good zone and a bad zone in the performance quadrant, which together form the performance evaluation chart of the production line;

where w is the given actual work-in-process level, T is the actual production cycle, T_0 is the theoretical processing time of the production line, with T_0 = CT, and r_b is the bottleneck rate of the production line, with r_b = TH_ij if and only if u_ij = u_max;

step 2.4.2: calculating the benefit index Bf of the production line;

Bf = C × F

where C is the cost factor, c_1 is the unit equipment cost, c_2 is the cost per unit of buffer capacity, c_3 is the remaining fixed cost, m_1 and b_1 are the current number of parallel machines and the current buffer capacity, and m_0 and b_0 are the initial number of parallel machines and the initial buffer capacity;

the step 3 specifically comprises the following steps:

step 3.1: qualitative sensitivity analysis by the Morris screening method;

where Y_0 is the performance evaluation index F at the initial value of parameter x; Y_g and Y_g+1 are the values of F after the g-th and (g+1)-th perturbations of x; P_g and P_g+1 are the rates of change of the parameter value, relative to its initial value, after the g-th and (g+1)-th perturbations; and n is the number of runs;

according to the sensitivity grading criteria, parameters graded more sensitive or highly sensitive are identified as the factors with a larger influence on the performance of the semiconductor packaging test production line; the grading by absolute value of the sensitivity coefficient is: insensitive, 0.00 ≤ |S| < 0.05; moderately sensitive, 0.05 ≤ |S| < 0.20; more sensitive, 0.20 ≤ |S| < 1.00; highly sensitive, |S| ≥ 1.00;

step 3.2: quantitative sensitivity analysis by Arena simulation;

building a model of the serial-parallel semiconductor chip packaging test production line in the Arena software, in which each machine has its own independent random processing time, failure time and maintenance time;

the workpiece arrival rate, the station processing rate, the mean time to failure m_f and the mean time to repair m_p on the production line follow negative exponential and normal distributions respectively; the processing batch size k, the buffer capacity b and the number of parallel machines m are fixed positive integers with b > m > 1; and the warm-up time, the total run time and the number of replications of the simulation experiment are set;

obtaining through the experiments the curves showing how the overall line performance, the production cycle CT, the production rate TH and the work-in-process level WIP vary with the key factors that affect line performance;

the step 4 specifically comprises the following steps:

step 4.1: taking the production line performance prediction model as the external environment for reinforcement learning and changes in production line variability as the trigger condition, establishing a reinforcement-learning-based performance control model of the semiconductor chip packaging test production line, using a dynamic control method that combines an event-triggered strategy with a periodically triggered strategy;

step 4.2: initializing Q(s, a) for all s ∈ S and a ∈ A(s), where the Q value reflects the long-term return, S is the set of system states, and A(s) is the set of action strategies for the key factors obtained in step 3; setting the learning rate α and the discount factor γ, and determining the return function r;

step 4.3: giving a starting state s and selecting action a in state s according to the ε-greedy strategy; the improved way of choosing ε is set as a function:

where p is the number of steps the algorithm has executed so far and M is the total number of iteration steps;

step 4.4: executing the action a selected in state s according to the ε-greedy strategy, where b denotes the selection index of a, obtaining the return r and the next state s_next, with a_next denoting the next action, and updating the Q value:

s = s_next, a = a_next

step 4.5: returning to step 4.4 until the system tends to a steady state, i.e. converges;

step 4.6: repeating steps 4.2 to 4.5 until the learning period, namely the preset number of repetitions of steps 4.2 to 4.5, ends, then stopping iteration;

step 4.7: outputting the final policy

and obtaining the optimized performance indices of the production line.