CN115061444A - Real-time optimization method for technological parameters integrating probability network and reinforcement learning - Google Patents

Real-time optimization method for technological parameters integrating probability network and reinforcement learning

Info

Publication number
CN115061444A
Authority
CN
China
Prior art keywords
value
network
reinforcement learning
model
process parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210989613.7A
Other languages
Chinese (zh)
Other versions
CN115061444B (en)
Inventor
毛旭初
张翔
谢天
陈松
汪江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luculent Smart Technologies Co ltd
Original Assignee
Luculent Smart Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luculent Smart Technologies Co ltd filed Critical Luculent Smart Technologies Co ltd
Priority to CN202210989613.7A priority Critical patent/CN115061444B/en
Publication of CN115061444A publication Critical patent/CN115061444A/en
Application granted granted Critical
Publication of CN115061444B publication Critical patent/CN115061444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/30 - Nc systems
    • G05B2219/32 - Operator till task planning
    • G05B2219/32252 - Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for real-time optimization of process parameters that fuses a probabilistic network with reinforcement learning, comprising the following steps: collecting process parameter data of a production system and preprocessing, processing and dividing the collected data into data sets; constructing a state transition model of adjacent time intervals in the production process from the preprocessed data; building, by reinforcement learning, an agent model that outputs the manually controllable parameter data of the production process; and fusing and applying the state transition model and the agent model to optimize and output the process parameters of the production process in real time. The invention divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, thereby ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency.

Description

Real-time optimization method for technological parameters integrating probability network and reinforcement learning
Technical Field
The invention relates to the technical field of process parameter optimization in production processes, and in particular to a method for real-time optimization of process parameters that integrates a probabilistic network with reinforcement learning.
Background
The rapid development of the Internet of Things and big data technology has promoted the development and application of a new generation of intelligent manufacturing and provides a new paradigm for optimizing process parameters in the production process. Process parameter optimization predicts in advance the parameters that should be fed into the production system in the next time period, ensuring continuous and efficient operation of the production process and promoting cost reduction and efficiency improvement of the production system.
Current parameter optimization methods are implemented with either optimization algorithms or artificial intelligence algorithms. Although both can solve for a set of optimal process parameters for different targets, they have shortcomings. Methods that build a parameter optimization model on an optimization algorithm depend heavily on the logical relationship between parameters and targets, so the constructed model is static, with insufficient disturbance resistance and transfer capability; when the parameter types or targets change, the algorithm of the original model is no longer applicable, and convergence of the solving process is slow and time-consuming. Most methods that build a parameter optimization model on an artificial intelligence algorithm ignore the temporal relationships in the data and cannot search for optimal process parameters along the time sequence, so the constructed model easily becomes detached from the real operation of the system.
To overcome the shortcomings of existing parameter optimization methods based on optimization algorithms and artificial intelligence algorithms, the invention exploits the fact that a probabilistic neural network can fit the data distribution with a high fault tolerance, and that a model trained by reinforcement learning adapts well to its environment and forms positive feedback with the target, while also taking the temporal relationships of the data into account during training, thereby ensuring continuous and efficient operation of the production process and reducing cost while improving efficiency.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems.
Therefore, the technical problem solved by the invention is as follows: existing methods for optimizing the process parameters of a production system rely excessively on experience, have low prediction efficiency, and are insufficiently integrated with the production targets.
In order to solve the above technical problems, the invention provides the following technical scheme: a method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning, comprising: collecting process parameter data of a production system and preprocessing, processing and dividing the process parameter data into data sets; constructing a state transition model of adjacent time intervals in the production process from the preprocessed process parameter data; using reinforcement learning to build an agent model capable of outputting the manually controllable parameter data of the production process; and fusing and applying the state transition model and the agent model to optimize and output the process parameters of the production process in real time.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the collection of the process parameter data comprises collecting the control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables are process parameters that can be adjusted directly and manually during production;
the influence variables are process parameters produced by the effect of the manually input control variables on the production system;
the actual production target value is the production target completed by the production system within a given time interval.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the preprocessing and processing of the process parameter data comprise:
the preprocessing of the process parameter data comprises handling abnormal samples, filling null values and standardizing the data;
and the processing of the process parameter data comprises taking the difference of the actual production target value between two adjacent time intervals as a new target value, and then aggregating the sample data of several time intervals along the time sequence by mean aggregation.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the division of the process parameter data comprises dividing the preprocessed and processed data set into a training set, a validation set and a test set in a certain proportion.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the state transition model comprises:
constructing a probabilistic neural network with the divided training set;
solving the influence variables and the actual production target value of the state of the immediately following time interval (the state of the next time interval given the current state);
and obtaining a state transition function and a reward function that express with high fidelity how the actual production target value changes with the state transition.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the probabilistic neural network comprises setting a loss function and training a probabilistic neural network model;

the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the trained probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the solving of the influence variables and target value of the state of the immediately following time interval comprises selecting a probabilistic neural network sub-model from the model library, solving the difference of the influence variables and target value between adjacent time intervals, and solving the influence variables of the immediately following time interval;

the selection of the probabilistic neural network sub-model comprises randomly selecting a sub-model from the learned model library and obtaining the mean vector and diagonal covariance matrix that it outputs;

the difference of the influence variables and target value between adjacent time intervals is solved as

$$\Delta s = \mu + \sigma \cdot z, \qquad z \sim \mathcal{N}(0,1)$$

where $\Delta s$ denotes the difference between the parameter values of the current state and of the immediately following state, $z$ denotes a random sample drawn from the standard normal distribution $\mathcal{N}(0,1)$, and $\sigma$ denotes the standard deviation;

the solving further comprises randomly sampling $z$ several times from the $\mathcal{N}(0,1)$ distribution to obtain several values of $\Delta s$ and taking their mean as the standardized parameter difference of the immediately following state;

the solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current time, de-standardizing the result according to the standardization applied to the collected process parameters, and verifying the performance of the constructed state transition model with the training set.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the agent model comprises the design of the actions that cause state transitions and the design of the rewards caused by the actions;

the action design comprises taking the difference of each control variable between adjacent time intervals and taking the median $d$ of all the values; at each moment each control variable can be independently adjusted by $d$, and such an adjustment is defined as an action $a$; the action space of a state transition then contains $2^{n}$ elements, where $n$ is the number of control variables;

the reward design comprises taking the difference of the target value produced when a given action $a$ causes a transition from the current state $s_t$ to the next state $s_{t+1}$ as the reward $r$; every time a control variable changes, the target value changes correspondingly, and that change is the reward obtained after the control variable is changed.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the learning process of the agent model comprises:

searching for the minimum of the TD error and setting the minimization of the TD error as the objective;

the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$;

the reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the real-time optimization and output of the process parameters of the production process comprise:
collecting the process parameters of the production process in real time at each fixed time interval, selecting the number of time intervals to aggregate according to the actual business requirements and the time sequence, and processing and aggregating the sample data consisting of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model and outputting the difference of the influence variables and the target value;
and inputting the influence variables of the current time interval into the trained policy network and outputting the control variables, thereby fusing and applying the state transition model and the agent model in the actual production process.
The invention has the following beneficial effects: the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency. Compared with traditional methods for building parameter optimization models, the method has stronger disturbance resistance and transferability, better matches real application scenarios, yields better parameter recommendations, and is applicable to most types of production systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a flowchart illustrating a method for real-time optimization of process parameters by fusion of a probabilistic network and reinforcement learning according to an embodiment of the present invention;
fig. 2 is a comparison of the predicted and actual coal consumption for 100 selected groups in the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning according to the second embodiment of the present invention;
fig. 3 is a diagram showing an implementation effect of control variables recommended by an agent in a process parameter real-time optimization method combining a probabilistic network and reinforcement learning according to a second embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1, for an embodiment of the present invention, a method for real-time optimization of process parameters by fusion of a probability network and reinforcement learning is provided, including:
s1: collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data. It should be noted that:
the acquisition of the process parameter data comprises the steps of acquiring control variables, influence variables and actual production target values of the production process at equal time intervals; the control variables comprise process parameters which can be directly adjusted manually in the production process; the influence variables comprise process parameters generated by the influence of manually input control variables on the production system; the actual target value of production includes a production target that the production system has completed at certain time intervals.
Further, the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
further, the processing of the process parameter data comprises that the difference between two adjacent time intervals for producing the actual target value is used as a new target value, then sample data of a plurality of time intervals in the time sequence are aggregated, and the aggregation mode is average value aggregation;
furthermore, the dividing of the process parameter data includes dividing the new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
S2: and constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data. It should be noted that:
the method comprises the steps of firstly building a probabilistic neural network by using a divided training set, then solving influence variables of a time interval state (the state of the next time interval in the current time interval state) and a production actual target value, and finally obtaining a state transfer function and a reward function which can express that the production actual target value changes along with state transfer in a high-fidelity mode.
Further, the construction of the probabilistic neural network comprises setting a loss function and training a probabilistic neural network model;

the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the trained probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix;

the input of the probabilistic neural network model is the attributes of the training data set obtained from the agent model, and the output is the mean and diagonal covariance matrix of the distribution obeyed by the differences of the influence variables; several well-performing probabilistic neural network sub-models are built to form a model library, and the smaller the value of $\mathcal{L}$ on the validation data set, the better the trained probabilistic neural network model.
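A minimal PyTorch sketch of a network that outputs a diagonal-Gaussian prediction, together with the corresponding negative log-likelihood loss with constant terms dropped, as in the simplified loss above. The architecture and all names are assumptions; the 14-input/8-output sizes in the usage comment merely echo the second embodiment.

```python
import torch
import torch.nn as nn

class DiagonalGaussianNet(nn.Module):
    """Toy probabilistic network: maps the aggregated state to the mean and
    log-variance of the change of the influence variables and target value."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 200):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, out_dim)
        self.log_var_head = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu_head(h), self.log_var_head(h)

def gaussian_nll(mu: torch.Tensor, log_var: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of y under N(mu, diag(exp(log_var))), with the
    constant (2*pi)^k term dropped, matching the simplified loss in the text."""
    inv_var = torch.exp(-log_var)
    return (((y - mu) ** 2) * inv_var + log_var).sum(dim=-1).mean()

# Training-step sketch (Adam with lr 0.001, as in the second embodiment):
# model = DiagonalGaussianNet(in_dim=14, out_dim=8)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# mu, log_var = model(x_batch); loss = gaussian_nll(mu, log_var, y_batch)
# loss.backward(); opt.step(); opt.zero_grad()
```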
further, the solving of the influence variables and the target values of the state of the time interval after the tightening comprises the selection of a probabilistic neural network sub-model in a model library, the solving of the difference value of the influence variables and the target values of the adjacent time intervals and the solving of the influence variables of the time interval after the tightening;
the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution of the difference between the adjacent time interval influencing variable and the target value comprises,
Figure 400926DEST_PATH_IMAGE012
wherein the content of the first and second substances,
Figure 650642DEST_PATH_IMAGE013
the difference between the parameter value representing the current state and the value of the immediately subsequent state parameter,
Figure 901495DEST_PATH_IMAGE014
representing compliance
Figure 476833DEST_PATH_IMAGE015
Distribution of (2)
Figure 813136DEST_PATH_IMAGE016
The result is solved to form a random data set,
Figure 132122DEST_PATH_IMAGE017
represents the standard deviation;
it should be noted that the solution of the difference between the influencing variable and the target value of the adjacent time interval is based on compliance
Figure 237481DEST_PATH_IMAGE015
Distribution of (2)
Figure 983720DEST_PATH_IMAGE032
Solving the result to form a random data set, the data set being defined as
Figure 541740DEST_PATH_IMAGE033
Then, then
Figure 664417DEST_PATH_IMAGE033
Is also obeyed
Figure 624283DEST_PATH_IMAGE034
Is distributed and
Figure 541423DEST_PATH_IMAGE035
setting up
Figure 852319DEST_PATH_IMAGE036
From distribution of obeys
Figure 513108DEST_PATH_IMAGE034
Is randomly paired in the data set
Figure 327480DEST_PATH_IMAGE014
Sampling to obtain a plurality of samples
Figure 681101DEST_PATH_IMAGE013
Taking the average value as the parameter value of the normalized state after tightening;
the parameter solving of the time interval influence variables comprises the steps of adding the solved difference value to the parameter value in the current time, carrying out inverse standardization according to a standardization mode of the collected process parameters, and verifying the performance of the constructed state transition model by utilizing a training set.
S3: by using
Figure 213713DEST_PATH_IMAGE030
And (3) building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by reinforcement learning. It should be noted that:
constructing an intelligent agent model, wherein the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design for causing state transition comprises making difference between adjacent time intervals of each control variable, and taking median of all values
Figure 678193DEST_PATH_IMAGE018
Can be increased for each control variable individually at each moment
Figure 347071DEST_PATH_IMAGE018
A value is defined as an action
Figure 871594DEST_PATH_IMAGE019
The action space during a state transition contains elements of
Figure 891502DEST_PATH_IMAGE020
Figure 894093DEST_PATH_IMAGE021
Is the number of control variables;
action-induced reward design includes a target value difference for a given action
Figure 683058DEST_PATH_IMAGE019
From the current state
Figure 112902DEST_PATH_IMAGE022
Transition to the next state
Figure 354528DEST_PATH_IMAGE037
Is awarded
Figure 160810DEST_PATH_IMAGE024
And the target value changes correspondingly every time the control variable is changed, and the changed value is the reward after the control variable is changed.
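The action and reward design can be illustrated as follows. Assigning either +d or -d to every control variable is an assumption that is consistent with a 2^n-element action space (64 elements for the 6 control variables of the second embodiment), and the sign convention of the reward is likewise an assumption, since for a cost-type target such as coal consumption a decrease would normally be rewarded.

```python
import itertools
import numpy as np

def build_action_space(control_history: np.ndarray) -> np.ndarray:
    """control_history: (T, n) array of the n control variables over T intervals.
    The step size d of each control variable is the median of its differences
    between adjacent intervals (taken in absolute value here, an assumption);
    an action assigns +d or -d to every control variable, giving 2**n actions."""
    d = np.median(np.abs(np.diff(control_history, axis=0)), axis=0)
    signs = itertools.product((-1.0, 1.0), repeat=control_history.shape[1])
    return np.array([np.asarray(s) * d for s in signs])

def reward(target_before: float, target_after: float) -> float:
    """Reward = change of the production target caused by the action; the sign
    is chosen so that a reduction of a cost-type target is rewarded."""
    return target_before - target_after
```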
The learning process of the agent model comprises searching for the minimum of the TD error and setting the minimization of the TD error as the objective;

the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$.

The reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
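The text does not name the specific reinforcement learning algorithm, so the following PyTorch fragment is only a generic sketch of minimising the TD error over the discrete action space defined above; every name and interface is assumed.

```python
import torch

def td_loss(q_net, target_q_net, s, a, r, s_next, gamma: float = 0.98) -> torch.Tensor:
    """Mean-squared TD error  delta = r + gamma * max_a' Q_target(s', a') - Q(s, a).
    q_net / target_q_net map a batch of states to one Q-value per discrete action;
    gamma = 0.98 is the discount factor used in the second embodiment."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        bootstrap = target_q_net(s_next).max(dim=1).values
    delta = r + gamma * bootstrap - q_sa
    return (delta ** 2).mean()

def epsilon_greedy(q_net, s, epsilon: float = 0.1) -> int:
    """Exploration as in the second embodiment (exploration probability 0.1)."""
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(q_net(s).shape[-1], (1,)).item())
    return int(q_net(s).argmax(dim=-1).item())
```

A full training loop would roll the state forward with the state transition model of step S2, select actions epsilon-greedily and take optimizer steps on this loss; since the text also mentions a state-value network and a policy network, an actor-critic variant would be equally consistent with the description.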
S4: and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process. It should be noted that:
the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target value;
inputting the processed and aggregated data into a constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting the control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
It should be noted that, in view of the slow convergence of the solving process, low prediction efficiency and insufficient integration with production targets in the prior art, the method adopts a probabilistic neural network that can quickly fit the data distribution with a high fault tolerance, together with a reinforcement-learning-trained model that adapts well to its environment and forms positive feedback with the target, while taking the temporal relationships of the data into account during training. This overcomes the shortcomings of most existing methods based on artificial intelligence algorithms: the model fits and transfers better, the dynamically recommended process parameters are more targeted at the optimization objective and closer to reality, and the production process runs continuously and efficiently, reducing cost and improving efficiency.
The method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency. Compared with traditional methods for building parameter optimization models, the method has stronger disturbance resistance and transferability, better matches real application scenarios, yields better parameter recommendations, and is applicable to most types of production systems.
Example 2
Referring to fig. 2 and 3, a second embodiment of the present invention differs from the first in that it provides a verification test of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning. To verify and explain the technical effects of the method, this embodiment compares a conventional technical scheme with the method of the present invention and compares the test results by scientific demonstration to verify the true effect of the method.
Taking the rotary kiln system of the Tai-Gai base in Taiyuan City as an example, data are collected with 1 minute as the unit time. The collected process parameters comprise: the control variables are head coal, tail coal, grate speed, the high-temperature fan high-voltage variable-frequency setting, the head-exhaust high-voltage variable-frequency setting and the tail-exhaust high-voltage variable-frequency setting; the influence variables are secondary air temperature, decomposition furnace temperature, clinker temperature 2, kiln head cover negative pressure, kiln tail negative pressure and decomposition furnace outlet temperature; the target value is coal consumption. The finally collected process parameter data of the rotary kiln production system are multi-dimensional time series data.
Then, carrying out operations of preprocessing, processing and dividing a data set on the collected process parameter data:
pretreatment:
1) processing an abnormal sample: to avoid abnormal value pairsInfluence of subsequent modeling process, utilization
Figure 886493DEST_PATH_IMAGE038
The principle modifies the exception sample to a null value, remaining in range for each parameter datum
Figure 999942DEST_PATH_IMAGE039
A value of (1), wherein
Figure 190752DEST_PATH_IMAGE040
Is shown as
Figure 680639DEST_PATH_IMAGE041
The mean value of the data of the individual parameters,
Figure 273295DEST_PATH_IMAGE042
is shown as
Figure 823225DEST_PATH_IMAGE041
For each standard deviation of the parameter data, the parameter values not in the range are replaced with null values.
2) Filling of null values: use of
Figure 501331DEST_PATH_IMAGE043
In this way, the average of 6 consecutive data is selected for filling, for example, if the value of the 10 th minute of the secondary air temperature is null, the average of the values of the parameters of the 7 th, 8, 9 th and 11 th, 12 th, 13 th minutes of the secondary air temperature is filled into the 10 th time interval.
3) Normalization of the data: obeying all historical parameter data
Figure 529330DEST_PATH_IMAGE044
The normalization process of (1).
Processing: and taking the difference between two adjacent time intervals of the target value as a new target value, and then aggregating the sample data of 10 adjacent minutes in the time sequence, wherein the aggregation mode is an average value.
Dividing the data set: the processed and processed data were processed as per 6: 2: the approach of 2 is divided into a training set, a prediction set, and a validation set.
A state transition model is then constructed. The probabilistic neural network model is learned and trained with the divided training set: the input layer has 14 nodes, the output layer 8, the pattern layer 200 and the summation layer 8; the batch size is 256, the Adam optimizer is used with a learning rate of 0.001, and training runs for 1000 epochs. The outputs are a diagonal covariance matrix $\Sigma$ formed from the standard deviations of the parameters and a row vector $\mu$ formed from the means of the parameters; the probabilistic neural network model with the smallest loss value $\mathcal{L}$ on the validation data set is selected.

According to the formula $\Delta s = \mu + \sigma \cdot z$, $z$ is sampled 5 times at random from the $\mathcal{N}(0,1)$ distribution to obtain 5 values of $\Delta s$, which are averaged to give the predicted difference $\overline{\Delta s}$ of the state parameter values. The prediction of the influence variables and the target value is the parameter value at the current time plus the solved difference $\overline{\Delta s}$, de-standardized according to the standardization applied to the collected process parameters, and the performance of the constructed state transition model is verified with the test set.
Table 1: Evaluation comparison of the influence variables and the target value. (The table is reproduced as an image in the original publication.)
Table 1 compares the evaluation of different algorithms on the influence variables and the target value (the evaluation index is the mean squared error, MSE; the smaller the MSE, the better the model). The probabilistic neural network prediction model constructed with the loss function of this method as the training index performs better, with a clear advantage in prediction effect.
The predicted coal consumption is the basis for recommending the control variables and represents the reward obtained when an action causing a state transition occurs. The prediction effect for 100 randomly selected groups of coal consumption is shown in fig. 2; the predicted values are very close to the actual values.
An agent model capable of outputting the manually controllable parameter data of the production process is then built with reinforcement learning. Since there are 6 control variables in this embodiment, the action space of a state transition contains 64 elements; the following table lists the parameter value corresponding to the action of each control variable.
Table 2: the action of each control variable.
(The table is reproduced as an image in the original publication.)
The reinforcement learning algorithm is then applied: the influence variables of the current time interval are taken as input, and the agent is trained to learn the reward $r$ produced by applying action $a$ in the current state $s$, from which the optimal policy network is obtained; the discount factor $\gamma$ is set to 0.98 and the exploration probability to 0.1.
Finally, the control variables of the previous state are input into the constructed agent model and the control variable values of the next state are output; the coal consumption of the rotary kiln production system caused by applying the control variables recommended by the agent is compared with the coal consumption of the next state predicted by the state transition model. Fig. 3 shows the implementation effect of the control variables recommended by the agent: for the 100 recommended groups of process parameter data, the optimized coal consumption per unit time is 0.2893; the control variables recommended by the agent conform to the process of the actual production system, and excellent control variables can be recommended to optimize the target coal consumption in real time.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. A real-time optimization method for process parameters fusing a probability network and reinforcement learning is characterized by comprising the following steps:
collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data;
constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data;
building, by reinforcement learning, an agent model capable of outputting the manually controllable parameter data of the production process;
and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
2. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 1, wherein: the collection of the process parameter data comprises collecting control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be manually and directly adjusted in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
3. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 2, wherein: the pre-processing and processing of the process parameter data includes,
the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
and the processing of the process parameter data comprises the steps of making a difference between two adjacent time intervals of the actual production target value, taking the difference value of the two adjacent time intervals as a new target value, and then aggregating the sample data of a plurality of time intervals in the time sequence.
4. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in any one of claims 1 to 3, wherein: the dividing of the process parameter data comprises dividing a new data set after pretreatment and processing into a training set, a verification set and a test set according to a certain proportion.
5. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 4, wherein: the construction of the state transition model includes,
constructing a probabilistic neural network by using the divided training set;
solving the influence variables and the actual production target values of the state of the immediately-after time interval;
obtaining a state transition function and a reward function which can express the change of the actual target value of the production along with the state transition with high fidelity.
6. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 5, wherein: the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;
the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance;

the density function of the probabilistic neural network model is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix.
7. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 6, wherein: the solving of the influence variables and target value of the state of the immediately following time interval comprises selecting a probabilistic neural network sub-model from the model library, solving the difference of the influence variables and target value between adjacent time intervals, and solving the influence variables of the immediately following time interval;

the selection of the probabilistic neural network sub-model comprises randomly selecting a sub-model from the learned model library and obtaining the mean vector and diagonal covariance matrix that it outputs;

the difference of the influence variables and target value between adjacent time intervals is solved as

$$\Delta s = \mu + \sigma \cdot z, \qquad z \sim \mathcal{N}(0,1)$$

where $\Delta s$ denotes the difference between the parameter values of the current state and of the immediately following state, $z$ denotes a random sample drawn from the standard normal distribution $\mathcal{N}(0,1)$, and $\sigma$ denotes the standard deviation;

the solving further comprises randomly sampling $z$ several times from the $\mathcal{N}(0,1)$ distribution to obtain several values of $\Delta s$ and taking their mean as the standardized parameter difference of the immediately following state;

the solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current time, de-standardizing the result according to the standardization applied to the collected process parameters, and verifying the performance of the constructed state transition model with the training set.
8. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 7, wherein: the construction of the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design comprises taking the difference of each control variable between adjacent time intervals and taking the median $d$ of all the values; at each moment each control variable can be independently adjusted by $d$, and such an adjustment is defined as an action $a$; the action space of a state transition then contains $2^{n}$ elements, where $n$ is the number of control variables;

the reward design comprises taking the difference of the target value produced when a given action $a$ causes a transition from the current state $s_t$ to the next state $s_{t+1}$ as the reward $r$; every time a control variable changes, the target value changes correspondingly, and that change is the reward obtained after the control variable is changed.
9. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 8, wherein: the learning process of the intelligent agent model comprises,
searching a minimum value of the TD error, and setting the minimum value of the TD error as a target;
the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$;

the reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
10. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 9, wherein: the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting a control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
CN202210989613.7A 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning Active CN115061444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN115061444A true CN115061444A (en) 2022-09-16
CN115061444B CN115061444B (en) 2022-12-09

Family

ID=83208015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989613.7A Active CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN115061444B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021A (en) * 2022-10-26 2022-12-06 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025813A1 (en) * 2016-05-09 2019-01-24 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US20220179026A1 (en) * 2020-12-04 2022-06-09 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025813A1 (en) * 2016-05-09 2019-01-24 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US20220179026A1 (en) * 2020-12-04 2022-06-09 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MILTON ROBERTO HEINEN AND PAULO MARTINS ENGEL: "an incremental probabilistic neural network for regression and reinforcement learning tasks", Docin (豆丁网) *
刘全 et al.: "Q-V value function collaborative approximation model based on adaptive normalized RBF networks", Chinese Journal of Computers (计算机学报) *
周世正: "Multi-robot collaborative navigation based on deep reinforcement learning", China Master's Theses Full-text Database, Information Science and Technology *
唐长成: "Parameterized circuit optimization algorithm based on reinforcement learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021A (en) * 2022-10-26 2022-12-06 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system

Also Published As

Publication number Publication date
CN115061444B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN111738512B (en) Short-term power load prediction method based on CNN-IPSO-GRU hybrid model
CN115688913B (en) Cloud edge end collaborative personalized federal learning method, system, equipment and medium
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN113361680A (en) Neural network architecture searching method, device, equipment and medium
CN110751318A (en) IPSO-LSTM-based ultra-short-term power load prediction method
CN116596044B (en) Power generation load prediction model training method and device based on multi-source data
CN112910690A (en) Network traffic prediction method, device and equipment based on neural network model
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN112287990A (en) Model optimization method of edge cloud collaborative support vector machine based on online learning
CN112270442A (en) IVMD-ACMPSO-CSLSTM-based combined power load prediction method
CN115061444B (en) Real-time optimization method for process parameters integrating probability network and reinforcement learning
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN114118567A (en) Power service bandwidth prediction method based on dual-channel fusion network
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN114777192A (en) Secondary network heat supply autonomous optimization regulation and control method based on data association and deep learning
CN117290721A (en) Digital twin modeling method, device, equipment and medium
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
CN117093885A (en) Federal learning multi-objective optimization method integrating hierarchical clustering and particle swarm
CN115357862B (en) Positioning method in long and narrow space
CN116562454A (en) Manufacturing cost prediction method applied to BIM long-short-time attention mechanism network
CN113763710B (en) Short-term traffic flow prediction method based on nonlinear adaptive system
CN115081609A (en) Acceleration method in intelligent decision, terminal equipment and storage medium
Chen Brain Tumor Prediction with LSTM Method
CN114841472B (en) GWO optimization Elman power load prediction method based on DNA hairpin variation
WO2023082045A1 (en) Neural network architecture search method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant