CN115061444B - Real-time optimization method for process parameters integrating probability network and reinforcement learning - Google Patents

Real-time optimization method for process parameters integrating probability network and reinforcement learning

Info

Publication number
CN115061444B
CN115061444B (application CN202210989613.7A)
Authority
CN
China
Prior art keywords
value
process parameters
network
model
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210989613.7A
Other languages
Chinese (zh)
Other versions
CN115061444A (en)
Inventor
毛旭初
张翔
谢天
陈松
汪江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luculent Smart Technologies Co ltd
Original Assignee
Luculent Smart Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luculent Smart Technologies Co ltd filed Critical Luculent Smart Technologies Co ltd
Priority to CN202210989613.7A priority Critical patent/CN115061444B/en
Publication of CN115061444A publication Critical patent/CN115061444A/en
Application granted granted Critical
Publication of CN115061444B publication Critical patent/CN115061444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865 - Total factory control characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/30 - Nc systems
    • G05B2219/32 - Operator till task planning
    • G05B2219/32252 - Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a real-time optimization method for process parameters that integrates a probability network and reinforcement learning, comprising the following steps: collecting process parameter data of a production system, and preprocessing, processing and dividing the collected data into data sets; constructing a state transition model of adjacent time intervals in the production process based on the preprocessed data; building, with reinforcement learning, an intelligent agent model capable of outputting the artificially controllable parameter data of the production process; and fusing and applying the state transition model and the intelligent agent model to realize real-time optimization and output of the process parameters during production. The invention divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency.

Description

Real-time optimization method for process parameters integrating probability network and reinforcement learning
Technical Field
The invention relates to the technical field of optimization of technological parameters in a production process, in particular to a real-time optimization method for technological parameters by fusing a probability network and reinforcement learning.
Background
With the rapid development of the Internet of Things and big data technology, a new generation of intelligent manufacturing has been developed and applied, providing a new paradigm for optimizing the process parameters of the production process. Optimizing the process parameters means predicting in advance the parameters that should be input into the production system in the next time period, so as to ensure the continuous and efficient operation of the production process and to promote cost reduction and efficiency improvement in the operation of the production system.
Current parameter optimization methods are implemented with optimization algorithms or artificial intelligence algorithms. Although they can solve for a set of optimal process parameters for different targets, these methods have shortcomings. Methods that construct the parameter optimization model with an optimization algorithm depend heavily on the logical relationship between the parameters and the targets, so the constructed model is static and lacks disturbance resistance and transferability; when the parameter types or the targets change, the algorithm of the originally constructed model is no longer applicable, and the convergence of the solving process is slow and time-consuming. Most methods that construct the parameter optimization model with an artificial intelligence algorithm ignore the time-series relationship in the data and cannot search for the optimal process parameters along the time-series process, so the constructed model easily departs from the real operation of the system.
To overcome the defects of the existing parameter optimization methods based on optimization algorithms and artificial intelligence algorithms, the present method combines a probabilistic neural network, which fits the data distribution and has a high fault tolerance, with a model trained by reinforcement learning, which adapts strongly to the environment and forms positive feedback with the target, while the time-series relationship of the data is taken into account during model training, thereby ensuring the continuous and efficient operation of the production process, reducing cost and improving efficiency.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems.
Therefore, the technical problem solved by the invention is as follows: in the prior art, methods for optimizing the process parameters of a production system suffer from over-reliance on experience, low prediction efficiency and insufficient integration with the production targets.
In order to solve the above technical problems, the invention provides the following technical scheme: a real-time optimization method for process parameters fusing a probability network and reinforcement learning, comprising the following steps: collecting process parameter data of a production system, and preprocessing, processing and dividing the process parameter data into data sets; constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data; building, by means of Q-Learning reinforcement learning, an intelligent agent model capable of outputting the artificially controllable parameter data of the production process; and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the collection of the process parameter data comprises collecting the control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be directly adjusted manually in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the preprocessing and processing of the process parameter data comprises,
the preprocessing of the process parameter data, which comprises the processing of abnormal samples, the filling of null values and the standardization of the data;
and the processing of the process parameter data, which comprises differencing the actual production target value between two adjacent time intervals and using the difference as the new target value, and then aggregating the sample data of several time intervals in the time sequence, the aggregation mode being mean aggregation.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the dividing of the process parameter data comprises dividing the preprocessed and processed data set into a training set, a validation set and a test set according to a certain proportion.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the state transition model comprises,
constructing a probabilistic neural network with the divided training set;
solving the influence variables and the actual production target value of the immediately following time interval state (the state of the next time interval given the current time interval state);
obtaining a state transition function and a reward function that can express with high fidelity how the actual production target value changes with the state transition.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;

the loss function loss_p is set as the logarithmic prediction probability, whose calculation comprises

loss_p = -log f(X)

wherein X represents a training data set and f(X) represents the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance;

the calculation of the density function of the probabilistic neural network model comprises

f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )

substituting the density function into the logarithmic prediction probability and simplifying, the loss function becomes

loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )

wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the solving of the influence variables and target values of the immediately following time interval state comprises the selection of a probabilistic neural network submodel from the model library, the solving of the difference of the influence variables and target values between adjacent time intervals, and the solving of the influence variables of the immediately following time interval;

the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from the learned model library to obtain the mean vector and the diagonal covariance matrix output by the submodel;

the calculation of the difference of the influence variables and target values between adjacent time intervals comprises

ΔX = μ + ε*σ

wherein ΔX represents the difference between the parameter values of the current state and of the immediately following state, ε represents a value drawn from the random data set formed by solving results obeying the N(0, 1) distribution, and σ represents the standard deviation;

the solving of the difference of the influence variables and target values between adjacent time intervals comprises randomly sampling ε from the N(0, 1)-distributed data set to obtain a plurality of ΔX values, whose average is taken as the standardized parameter value of the immediately following state;

the parameter solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current interval, performing inverse standardization according to the standardization scheme of the collected process parameters, and verifying the performance of the constructed state transition model by means of the training set.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the intelligent agent model comprises the design of the actions causing state transitions and the design of the rewards caused by the actions;

the design of the actions causing state transitions comprises differencing each control variable over adjacent time intervals and taking the median δ of all values; independently increasing each control variable by one δ value at each moment is defined as an action a, and the action space in the state transition process contains 2^n elements, wherein n is the number of control variables;

the design of the rewards caused by the actions comprises that the reward r(s, a) of a given action a transferring from the current state s to the next state s' is the target value difference; each time the control variables are changed, the target value changes correspondingly, and this change is the reward after the control variables change.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the learning process of the intelligent agent model comprises,

searching for the minimum value of the TD error, and setting the minimization of the TD error as the objective;

the TD error is calculated on the basis of

Q(s, a) = r(s, a) + γ·max(Q(s', a'))

wherein Q(s, a) represents the expectation of the return obtained by applying action a in state s, γ represents the discount coefficient, and Q(s', a') represents the expectation of the return obtained by applying action a' in state s';

using the Q-Learning reinforcement learning algorithm, the influence variables of the current time interval are input into the state value network and the strategy network, and the action a corresponding to the maximum reward value r(s, a) is generated through loop iteration; all states, actions and rewards of action a constitute a policy network.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the real-time optimization and output of the process parameters in the production process comprises,
collecting the process parameters of the production process in real time in units of fixed time intervals, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and performing data processing and aggregation on the sample data composed of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting the difference of the influence variables and the target value;
and inputting the influence variables of the current time interval into the trained strategy network and outputting the control variables, thereby realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
The invention has the following beneficial effects: the real-time optimization method for process parameters fusing a probability network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency. Compared with traditional methods of constructing a parameter optimization model, the method has stronger disturbance resistance and transferability, better matches real application scenarios, gives better parameter recommendations, and is suitable for most types of production systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a flowchart illustrating a method for real-time optimization of process parameters by fusion of a probabilistic network and reinforcement learning according to an embodiment of the present invention;
fig. 2 is a diagram showing and comparing the prediction effects of 100 groups of coal consumption selected in the process parameter real-time optimization method for integrating the probabilistic network and reinforcement learning according to the second embodiment of the present invention;
fig. 3 is a diagram showing an implementation effect of control variables recommended by an agent in a process parameter real-time optimization method combining a probabilistic network and reinforcement learning according to a second embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments accompanying figures of the present invention are described in detail below, and it is apparent that the described embodiments are a part, not all or all of the embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention and that the present invention is not limited by the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1, for an embodiment of the present invention, a method for real-time optimization of process parameters by fusion of a probability network and reinforcement learning is provided, including:
s1: collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data. It should be noted that:
collecting process parameter data comprises collecting control variables, influence variables and actual production target values of a production process at equal time intervals; the control variables comprise process parameters which can be directly adjusted manually in the production process; the influence variables comprise process parameters generated by the influence of manually input control variables on the production system; the actual target value of production includes a production target that the production system has completed at certain time intervals.
Further, the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
further, the processing of the process parameter data comprises that the difference between two adjacent time intervals for producing the actual target value is used as a new target value, then sample data of a plurality of time intervals in the time sequence are aggregated, and the aggregation mode is average value aggregation;
furthermore, the dividing of the process parameter data comprises dividing the new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
S2: and constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data. It should be noted that:
the method comprises the steps of firstly, building a probabilistic neural network by using a divided training set, then solving an influence variable of a time interval state (a state of a next time interval in the current time interval state) and a production actual target value, and finally obtaining a state transfer function and a reward function which can express that the production actual target value changes along with state transfer in a high-fidelity mode.
Further, the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;

the loss function loss_p is set as the logarithmic prediction probability, and its calculation comprises

loss_p = -log f(X)

wherein X represents a training data set and f(X) represents the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is calculated as

f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )

substituting the density function into the logarithmic prediction probability and simplifying, the loss function becomes

loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )

wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix;

the input values of the probabilistic neural network model are the attributes of the training data set obtained from the intelligent agent model, and the outputs are the mean and the diagonal covariance matrix of the distribution obeyed by the influence variable differences; a plurality of probabilistic neural network submodels with excellent performance are built to form a model library, and the smaller the loss_p value on the validation data set, the better the trained probabilistic neural network model;
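For illustration, a minimal PyTorch sketch of a network head that outputs the mean vector and the diagonal (log-)variances, together with the corresponding negative log-likelihood loss, is shown below; the hidden width and variable names are assumptions, and the conventional 1/2 factor is kept, which only rescales the loss relative to the formula above.

```python
import math
import torch
import torch.nn as nn

class DiagonalGaussianNet(nn.Module):
    """Predicts mean and diagonal covariance of the distribution obeyed by the
    influence-variable / target differences."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 200):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, out_dim)       # mean vector μ
        self.logvar_head = nn.Linear(hidden, out_dim)   # log of the diagonal of Σ

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)

def gaussian_nll_loss(mu, logvar, y):
    """0.5 * [(y-μ)ᵀ Σ⁻¹ (y-μ) + log|Σ| + k·log(2π)], averaged over the batch."""
    k = y.shape[-1]
    maha = ((y - mu) ** 2 * torch.exp(-logvar)).sum(dim=-1)   # (y-μ)ᵀ Σ⁻¹ (y-μ)
    logdet = logvar.sum(dim=-1)                               # log|Σ|
    return 0.5 * (maha + logdet + k * math.log(2 * math.pi)).mean()
```

A model of this form can then be trained with, for example, the Adam optimizer, and several trained submodels can be kept as the model library according to their loss on the validation set.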
further, the solving of the influence variables and the target values of the state of the time interval after the tightening comprises the selection of a probabilistic neural network sub-model in a model library, the solving of the difference value of the influence variables and the target values of the adjacent time intervals and the solving of the influence variables of the time interval after the tightening;
the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution for the difference between the adjacent time interval influencing variables and the target value includes,
Figure 400926DEST_PATH_IMAGE012
wherein,
Figure 650642DEST_PATH_IMAGE013
the difference between the parameter value representing the current state and the value of the immediately subsequent state parameter,
Figure 901495DEST_PATH_IMAGE014
representing compliance
Figure 476833DEST_PATH_IMAGE015
Distribution of (2)
Figure 813136DEST_PATH_IMAGE016
Solution result formationThe random data set of (a) is,
Figure 132122DEST_PATH_IMAGE017
represents the standard deviation;
it should be noted that the solution of the difference between the influencing variable and the target value of the adjacent time interval is based on compliance
Figure 237481DEST_PATH_IMAGE015
Distribution of (2)
Figure 983720DEST_PATH_IMAGE032
Solving a random data set formed by the result, the data set being defined as
Figure 541740DEST_PATH_IMAGE033
Then, then
Figure 664417DEST_PATH_IMAGE033
Is also obeyed
Figure 624283DEST_PATH_IMAGE034
Is distributed and
Figure 541423DEST_PATH_IMAGE035
setting up
Figure 852319DEST_PATH_IMAGE036
From the distribution of obeys
Figure 513108DEST_PATH_IMAGE034
Is randomly paired in the data set
Figure 327480DEST_PATH_IMAGE014
Sampling to obtain a plurality of samples
Figure 681101DEST_PATH_IMAGE013
Taking the average value as the parameter value of the normalized state after tightening;
the parameter solving of the time interval influence variables comprises the steps of adding the solved difference value to the parameter value in the current time, carrying out inverse standardization according to a standardization mode of the collected process parameters, and verifying the performance of the constructed state transition model by utilizing a training set.
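The sampling-and-averaging step can be sketched as follows; the number of samples and the scaler arguments are assumptions introduced for the example (the second embodiment draws 5 samples).

```python
import numpy as np

def predict_next_state(x_now_std, mu, sigma, scaler_mean, scaler_std, n_samples=5):
    """Predict the next-interval influence variables / target from one submodel.

    x_now_std   : standardized parameter values of the current interval
    mu, sigma   : mean vector and diagonal standard deviations from the chosen submodel
    scaler_*    : mean / std used when the raw process parameters were standardized
    """
    eps = np.random.randn(n_samples, mu.shape[0])   # ε drawn from N(0, 1)
    delta_x = (mu + eps * sigma).mean(axis=0)       # ΔX = μ + ε·σ, averaged over samples
    x_next_std = x_now_std + delta_x                # standardized next-state value
    return x_next_std * scaler_std + scaler_mean    # inverse standardization
```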
S3: by using
Figure 213713DEST_PATH_IMAGE030
And (3) building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by reinforcement learning. It should be noted that:
constructing an intelligent agent model, wherein the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design for causing state transition comprises making difference between adjacent time intervals of each control variable, and taking median of all values
Figure 678193DEST_PATH_IMAGE018
Can be increased for each control variable individually at each instant
Figure 347071DEST_PATH_IMAGE018
A value is defined as an action
Figure 871594DEST_PATH_IMAGE019
The action space during a state transition contains elements of
Figure 891502DEST_PATH_IMAGE020
Figure 894093DEST_PATH_IMAGE021
Is the number of control variables;
action-induced reward design includes a target value difference for a given action
Figure 683058DEST_PATH_IMAGE019
From the current state
Figure 112902DEST_PATH_IMAGE022
Transition to the next state
Figure 354528DEST_PATH_IMAGE037
Is awarded
Figure 160810DEST_PATH_IMAGE024
And the target value changes correspondingly every time the control variable is changed, and the changed value is the reward after the control variable is changed.
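A small sketch of one way to realize this action and reward design follows; it assumes that an action assigns a shift of +δ or -δ to every control variable, which is one reading of the 2^n action space, and the sign convention of the reward is left to the target being optimized.

```python
import itertools

def build_action_space(deltas):
    """Enumerate 2^n candidate actions: each control variable is shifted by ±δ_i,
    where δ_i is the median of its adjacent-interval differences."""
    return [tuple(s * d for s, d in zip(signs, deltas))
            for signs in itertools.product((-1.0, 1.0), repeat=len(deltas))]

def reward(target_now, target_next):
    """Reward of an action = the change of the production target after applying it;
    the sign convention depends on whether the target is to be raised or reduced."""
    return float(target_next - target_now)
```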
The learning process of the intelligent agent model comprises,

searching for the minimum value of the TD error, and setting the minimization of the TD error as the objective;

the TD error is calculated on the basis of

Q(s, a) = r(s, a) + γ·max(Q(s', a'))

wherein Q(s, a) represents the expectation of the return obtained by applying action a in state s, γ represents the discount coefficient, and Q(s', a') represents the expectation of the return obtained by applying action a' in state s';

using the Q-Learning reinforcement learning algorithm, the influence variables of the current time interval are input into the state value network and the strategy network, and the action a corresponding to the maximum reward value r(s, a) is generated through loop iteration; all states, actions and rewards of action a constitute a policy network.
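A compact tabular Q-Learning sketch consistent with the TD target above is given below; the episode structure, the learning rate alpha and the epsilon-greedy exploration are assumptions added for the example, and env_reset / env_step stand in for sampling a historical state and querying the learned state transition model.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=1000, steps=200,
               alpha=0.1, gamma=0.98, eps=0.1):
    """Tabular Q-Learning with TD target r(s, a) + γ·max_a' Q(s', a').

    env_reset() -> initial state; env_step(state, action) -> (next_state, reward).
    States and actions must be hashable (e.g. discretized tuples).
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env_reset()
        for _ in range(steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, r = env_step(state, action)
            # TD update towards r(s, a) + γ·max_a' Q(s', a')
            td_target = r + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```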
S4: and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process. It should be noted that:
the real-time optimization and output of the process parameters in the production process comprises,
collecting the process parameters of the production process in real time in units of fixed time intervals, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and performing data processing and aggregation on the sample data composed of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting the control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
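Putting the steps together, the online recommendation loop could look like the sketch below; collect_parameters, transition_model, policy and apply_controls are hypothetical stand-ins for the data interface, the trained state transition model, the trained strategy network and the plant-side interface, named here only to show the data flow.

```python
import time
import numpy as np

def online_optimization_loop(collect_parameters, transition_model, policy,
                             apply_controls, interval_s=60, agg_n=10):
    """Real-time fusion of the state transition model and the agent model."""
    buffer = []
    while True:
        # one sample of influence variables + target per fixed time interval
        buffer.append(np.asarray(collect_parameters(), dtype=float))
        if len(buffer) >= agg_n:
            sample = np.mean(buffer[-agg_n:], axis=0)   # mean-aggregate recent intervals
            predicted_delta = transition_model(sample)  # influence/target difference
            controls = policy(sample)                   # recommended control variables
            apply_controls(controls, predicted_delta)   # hand recommendation to operators
        time.sleep(interval_s)                          # wait one fixed interval
```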
It should be noted that, considering the problems of slow convergence of the solving process, low prediction efficiency and insufficient integration with the production target in the prior art, a probabilistic neural network that can quickly fit the data distribution and has a high fault tolerance is adopted together with a reinforcement-learning-trained model that adapts strongly to the environment and forms positive feedback with the target, and the time-series relationship of the data is taken into account during model training. This overcomes the defects of most existing methods based on artificial intelligence algorithms: the model fitting and transfer effects are better, the dynamically recommended process parameters are more strongly targeted at the optimization objective and better match reality, the continuous and efficient operation of the production process is ensured, and cost is reduced while efficiency is improved.
The real-time optimization method for process parameters fusing a probability network and reinforcement learning provided by the invention makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency. Compared with traditional methods of constructing a parameter optimization model, the method has stronger disturbance resistance and transferability, better matches real application scenarios, gives better parameter recommendations, and is suitable for most types of production systems.
Example 2
Referring to fig. 2 and 3, a second embodiment of the present invention differs from the first in that it provides a verification test of the real-time process parameter optimization method fusing a probability network and reinforcement learning. To verify and explain the technical effects adopted in the method, this embodiment compares a conventional technical scheme with the method of the invention and evaluates the test results by means of scientific demonstration.
Taking the rotary kiln system of the Tai-Gai base in Taiyuan City as an example, data are collected with 1 minute as the unit time. The collected process parameters comprise: the control variables, namely head coal, tail coal, grate speed, the high-temperature fan high-pressure frequency-conversion frequency setting, the head-exhaust high-pressure frequency-conversion frequency setting and the tail-exhaust high-pressure frequency-conversion frequency setting; the influence variables, namely secondary air temperature, decomposing furnace temperature, clinker temperature 2, kiln head cover negative pressure, kiln tail negative pressure and decomposing furnace outlet temperature; and the target value, namely coal consumption. The finally acquired process parameter data of the rotary kiln production system are multi-dimensional time-series data.
Then, carrying out operations of preprocessing, processing and dividing a data set on the collected process parameter data:
pretreatment:
1) Processing of abnormal samples: to avoid the influence of abnormal values on the subsequent modeling process, the 3σ principle is used to modify abnormal samples to null values; for each parameter, only values within the range [μ_i - 3σ_i, μ_i + 3σ_i] are retained, wherein μ_i represents the mean of the data of the i-th parameter and σ_i represents the standard deviation of the data of the i-th parameter, and parameter values outside this range are replaced with null values.
2) Filling of null values: a neighbouring-mean filling scheme is used, selecting the average of 6 consecutive data points for filling; for example, if the value of the secondary air temperature at the 10th minute is null, the average of the secondary air temperature values at the 7th, 8th, 9th, 11th, 12th and 13th minutes is filled into the 10th time interval.

3) Standardization of the data: all historical parameter data are standardized so as to obey the N(0, 1) distribution.
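A sketch of this cleaning and standardization procedure, assuming the raw 1-minute samples are held in a pandas DataFrame with one numeric column per parameter, could look as follows; the helper name and the rolling-window implementation of the neighbour mean are assumptions.

```python
import numpy as np
import pandas as pd

def clean_parameters(df: pd.DataFrame, neighbours: int = 3) -> pd.DataFrame:
    """3σ outlier removal, neighbour-mean null filling and N(0, 1) standardization."""
    out = df.copy()
    for col in out.columns:
        mu, sigma = out[col].mean(), out[col].std()
        # 1) values outside [μ - 3σ, μ + 3σ] are replaced with null values
        out.loc[(out[col] - mu).abs() > 3 * sigma, col] = np.nan
        # 2) fill nulls with the mean of the `neighbours` values before and after
        neighbour_mean = out[col].rolling(2 * neighbours + 1, center=True,
                                          min_periods=1).mean()
        out[col] = out[col].fillna(neighbour_mean)
    # 3) standardize every parameter to zero mean and unit variance (N(0, 1))
    return (out - out.mean()) / out.std()
```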
Processing: and taking the difference between two adjacent time intervals of the target value as a new target value, and then aggregating the sample data of 10 adjacent minutes in the time sequence, wherein the aggregation mode is an average value.
Dividing the data set: the preprocessed and processed data are divided in the ratio 6:2:2 into a training set, a test set and a validation set.
The state transition model is then constructed: the probabilistic neural network model is learned and trained with the divided training set, with the input layer set to 14, the output layer to 8, the pattern layer to 200, the summation layer to 8 and the batch size to 256, using the Adam optimizer with a learning rate of 0.001 and 1000 epochs; the outputs are the diagonal covariance matrix Σ formed by the standard deviations of the parameters and the row vector μ formed by the means of the parameters, and the probabilistic neural network model with the smallest loss_p value on the validation data set is retained.

According to the formula ΔX = μ + ε*σ, ε is randomly sampled 5 times from the N(0, 1) distribution, yielding 5 ΔX values that are averaged to obtain the predicted difference ΔX of the state parameter values. The prediction result for the influence variables and the target value is the parameter value of the current interval plus the solved difference ΔX, after which inverse standardization is carried out according to the standardization scheme of the collected process parameters, and the performance of the constructed state transition model is verified with the test set.
Table 1: and (4) evaluation comparison of the influence variable and the target value.
Figure 66490DEST_PATH_IMAGE052
Table 1 shows the evaluation comparison of different algorithms on the influence variables and the target value (the evaluation index is the mean square error MSE; the smaller the MSE value, the better the model). The probabilistic neural network prediction model constructed with the loss function of the method as the model training index performs better, with an obvious advantage in prediction effect.
The predicted coal consumption is the basis of the recommended control variables, since it represents the reward obtained when an action causing a state transition occurs. The prediction effect for 100 randomly selected groups of coal consumption, shown in fig. 2, indicates that the predicted values are very close to the actual values.
An intelligent agent model capable of outputting the artificially controllable parameter data of the production process is then constructed with reinforcement learning. Since the number of control variables in this embodiment is 6, the action space in the state transition process contains 64 elements, and the following table shows the parameter values corresponding to the action of each control variable.
Table 2: the action of each control variable.
Using the Q-Learning reinforcement learning algorithm with Q(s, a) = r(s, a) + γ·max(Q(s', a')), the influence variables of the current time interval are input, the reward r(s, a) produced by applying an action a in the current state s is trained, and the Q(s, a) table is formed to obtain the optimal policy network, wherein the discount coefficient γ is set to 0.98 and the exploration probability is set to 0.1.
Finally, the control variables of the previous state are input into the constructed intelligent agent model, the control variable values of the next state are output, and the coal consumption of the rotary kiln production system caused by applying the control variables recommended by the agent is compared with the coal consumption of the next state predicted by the state transition model. Fig. 3 shows the implementation effect of the control variables recommended by the agent: for 100 selected groups of recommended process parameter data, the optimized coal consumption per unit time is 0.2893; the control variables recommended by the agent conform to the process of the actual production system, and excellent control variables can be recommended to optimize the target coal consumption in real time.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (9)

1. A real-time optimization method for process parameters fusing a probability network and reinforcement learning is characterized by comprising the following steps:
collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data;
constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data;
the construction of the state transition model includes,
constructing a probability network by using the divided training set;
solving the influence variables and the actual production target value of the state of the immediately following time interval;
acquiring a state transition function and a reward function which can express that the actual target value of the production changes along with the state transition in a high fidelity way; building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by utilizing Q-Learning reinforcement Learning;
and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
2. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 1, wherein: the collection of the process parameter data comprises collecting control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be manually and directly adjusted in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
3. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 2, wherein: the pre-processing and processing of the process parameter data includes,
the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
and the processing of the process parameter data comprises the steps of differentiating two adjacent time intervals of the actual production target value, taking the difference value of the two adjacent time intervals as a new target value, and then aggregating the sample data of a plurality of time intervals in the time sequence.
4. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in any one of claims 1 to 3, wherein: the dividing of the process parameter data comprises dividing a new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
5. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 4, wherein: the establishment of the probability network comprises the setting of a loss function and the training of a probability network model;
setting the loss function loss_p as the logarithmic prediction probability;
the calculation of the logarithmic prediction probability comprises,
loss_p = -log f(X)
wherein X represents a training data set, and f(X) represents a density function of the probability network model;
the output of the training of the probability network model is a Gaussian distribution parameterized by a diagonal covariance;
the calculation of the density function of the probability network model comprises,
f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )
the calculation of the loss function after substituting the logarithmic prediction probability and simplifying comprises,
loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )
wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix.
6. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 5, wherein: the solving of the influence variables and target values of the state of the immediately following time interval comprises the selection of a probability network submodel from the model library, the solving of the difference of the influence variables and target values between adjacent time intervals, and the solving of the influence variables of the immediately following time interval;
the selection of the probability network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution of the adjacent time interval influencing variable and target value difference comprises,
ΔX=μ+ε*σ
where Δ X represents the difference between the current state parameter value and the immediate state parameter value, and ε represents the distribution obeying N (0, 1)
Figure FDA0003873587060000031
Solving a random data set formed by the result, wherein sigma represents a standard deviation;
the solving of the difference between the adjacent time interval influencing variable and the target value comprises randomly sampling epsilon from a data set obeying the distribution of N (0, 1) to obtain a plurality of delta X values, and taking the average number to obtain the parameter value of the normalized state after tightening;
the parameter solving of the closely-spaced time interval influence variable comprises the steps of carrying out inverse standardization on a parameter value in the current time plus a solving difference value according to a standardization mode of collected process parameters, and verifying the performance of the constructed state transition model by utilizing the training set.
7. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 6, wherein: the construction of the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the design of the actions causing state transitions comprises differencing each control variable over adjacent time intervals and taking the median δ of all values; independently increasing each control variable by one δ value at each moment is defined as an action a, and the action space in the state transition process contains 2^n elements, wherein n is the number of control variables;
the design of the rewards caused by the actions comprises that the reward value r(s, a) of a given action a transferring from the current state s to the next state s' is the target value difference; each time the control variables are changed, the target value changes correspondingly, and the corresponding change of the target value is the reward after the control variables change.
8. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 7, wherein: the learning process of the intelligent agent model comprises,
searching the minimum value of the TD error, and setting the minimum value of the TD error as a target;
the TD error is calculated as the difference between,
Q(s,a)=r(s,a)+γmax(Q(s',a'))
wherein Q (s, a) represents the revenue expectation obtained by applying action a in the s state, gamma represents the discount coefficient, and Q (s ', a') represents the revenue expectation obtained by applying action a 'in the s' state;
and inputting the influence variables of the current time interval into the state value network and the strategy network by using a Q-Learning reinforcement Learning algorithm, and generating an action a corresponding to the maximum reward value r (s, a) through loop iteration, wherein all states, actions and rewards of the action a form a strategy network.
9. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 8, wherein: the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target value;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting a control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
CN202210989613.7A 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning Active CN115061444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN115061444A CN115061444A (en) 2022-09-16
CN115061444B true CN115061444B (en) 2022-12-09

Family

ID=83208015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989613.7A Active CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN115061444B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021B (en) * 2022-10-26 2023-03-24 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system
CN118642375A (en) * 2024-08-14 2024-09-13 南通理工学院 Self-adaptive control method and system for pyrolysis temperature and oxygen concentration of rotary kiln

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11327475B2 (en) * 2016-05-09 2022-05-10 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US11965946B2 (en) * 2020-12-04 2024-04-23 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An incremental probabilistic neural network for regression and reinforcement learning tasks; Milton Roberto Heinen and Paulo Martins Engel; Docin (豆丁网); 2017-01-11; full text *
Parameterized circuit optimization algorithm based on reinforcement learning; Tang Changcheng (唐长成); China Masters' Theses Full-text Database, Information Science and Technology; 2020-04-30 (No. 4); full text *
Multi-robot collaborative navigation based on deep reinforcement learning; Zhou Shizheng (周世正); China Masters' Theses Full-text Database, Information Science and Technology; 2019-08-31 (No. 8); full text *
Collaborative approximation model of Q-V value functions based on adaptively normalized RBF networks; Liu Quan (刘全) et al.; Chinese Journal of Computers; 2015-07-31; Vol. 38, No. 7, pp. 1386-1396 *

Also Published As

Publication number Publication date
CN115061444A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN115061444B (en) Real-time optimization method for process parameters integrating probability network and reinforcement learning
CN109993270A (en) Lithium ion battery residual life prediction technique based on grey wolf pack optimization LSTM network
CN116596044B (en) Power generation load prediction model training method and device based on multi-source data
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN112967088A (en) Marketing activity prediction model structure and prediction method based on knowledge distillation
CN112910690A (en) Network traffic prediction method, device and equipment based on neural network model
CN115688913A (en) Cloud-side collaborative personalized federal learning method, system, equipment and medium
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN111860787A (en) Short-term prediction method and device for coupling directed graph structure flow data containing missing data
CN112270442A (en) IVMD-ACMPSO-CSLSTM-based combined power load prediction method
CN110929958A (en) Short-term traffic flow prediction method based on deep learning parameter optimization
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN115828990A (en) Time-space diagram node attribute prediction method for fused adaptive graph diffusion convolution network
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN113627594A (en) One-dimensional time sequence data amplification method based on WGAN
CN117668743A (en) Time sequence data prediction method of association time-space relation
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
CN116822722A (en) Water level prediction method, system, device, electronic equipment and medium
Wang et al. Time series prediction with incomplete dataset based on deep bidirectional echo state network
CN114781699B (en) Reservoir water level prediction and early warning method based on improved particle swarm Conv1D-Attention optimization model
CN114997464A (en) Popularity prediction method based on graph time sequence information learning
CN112667394B (en) Computer resource utilization rate optimization method
CN109117491B (en) Agent model construction method of high-dimensional small data fusing expert experience
CN114118567B (en) Power service bandwidth prediction method based on double-channel converged network
CN114841472B (en) GWO optimization Elman power load prediction method based on DNA hairpin variation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant