CN115061444B - Real-time optimization method for process parameters integrating probability network and reinforcement learning - Google Patents

Real-time optimization method for process parameters integrating probability network and reinforcement learning

Info

Publication number
CN115061444B
CN115061444B (application CN202210989613.7A)
Authority
CN
China
Prior art keywords
value
process parameters
network
model
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210989613.7A
Other languages
Chinese (zh)
Other versions
CN115061444A (en)
Inventor
毛旭初
张翔
谢天
陈松
汪江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luculent Smart Technologies Co ltd
Original Assignee
Luculent Smart Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luculent Smart Technologies Co ltd filed Critical Luculent Smart Technologies Co ltd
Priority to CN202210989613.7A priority Critical patent/CN115061444B/en
Publication of CN115061444A publication Critical patent/CN115061444A/en
Application granted granted Critical
Publication of CN115061444B publication Critical patent/CN115061444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865 - Total factory control characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/30 - Nc systems
    • G05B2219/32 - Operator till task planning
    • G05B2219/32252 - Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a real-time optimization method for process parameters that integrates a probability network and reinforcement learning, comprising the following steps: collecting process parameter data of a production system, and preprocessing, processing and dividing the collected data into data sets; constructing a state transition model of adjacent time intervals in the production process based on the preprocessed data; building, with reinforcement learning, an intelligent agent model capable of outputting the artificially controllable parameter data of the production process; and fusing and applying the state transition model and the intelligent agent model to realize real-time optimization and output of the process parameters during production. The invention divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency.

Description

Real-time optimization method for process parameters integrating probability network and reinforcement learning
Technical Field
The invention relates to the technical field of optimization of technological parameters in a production process, in particular to a real-time optimization method for technological parameters by fusing a probability network and reinforcement learning.
Background
With the rapid development of the Internet of Things and big data technology, a new generation of intelligent manufacturing has been developed and applied, providing a new paradigm for optimizing the process parameters of the production process. Optimizing the process parameters means predicting in advance the parameters that should be input into the production system in the next time period, so as to ensure the continuous and efficient operation of the production process and to promote cost reduction and efficiency improvement in the operation of the production system.
Current parameter optimization methods are implemented with optimization algorithms or artificial intelligence algorithms. Although they can solve for a set of optimal process parameters for different targets, these methods have shortcomings. Methods that construct the parameter optimization model with an optimization algorithm depend heavily on the logical relationship between the parameters and the targets, so the constructed model is static and lacks disturbance resistance and transferability; when the parameter types or the targets change, the algorithm of the originally constructed model is no longer applicable, and the convergence of the solving process is slow and time-consuming. Most methods that construct the parameter optimization model with an artificial intelligence algorithm ignore the time-series relationship in the data and cannot search for the optimal process parameters along the time-series process, so the constructed model easily departs from the real operation of the system.
To overcome the defects of the existing parameter optimization methods based on optimization algorithms and artificial intelligence algorithms, the present method combines a probabilistic neural network, which fits the data distribution and has a high fault tolerance, with a model trained by reinforcement learning, which adapts strongly to the environment and forms positive feedback with the target, while the time-series relationship of the data is taken into account during model training, thereby ensuring the continuous and efficient operation of the production process, reducing cost and improving efficiency.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems.
Therefore, the technical problem solved by the invention is as follows: in the prior art, methods for optimizing the process parameters of a production system suffer from over-reliance on experience, low prediction efficiency and insufficient integration with the production targets.
In order to solve the above technical problems, the invention provides the following technical scheme: a real-time optimization method for process parameters fusing a probability network and reinforcement learning, comprising the following steps: collecting process parameter data of a production system, and preprocessing, processing and dividing the process parameter data into data sets; constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data; building, by means of Q-Learning reinforcement learning, an intelligent agent model capable of outputting the artificially controllable parameter data of the production process; and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the collection of the process parameter data comprises collecting the control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be directly adjusted manually in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the preprocessing and processing of the process parameter data comprises,
the preprocessing of the process parameter data, which comprises the processing of abnormal samples, the filling of null values and the standardization of the data;
and the processing of the process parameter data, which comprises differencing the actual production target value between two adjacent time intervals and using the difference as the new target value, and then aggregating the sample data of several time intervals in the time sequence, the aggregation mode being mean aggregation.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the dividing of the process parameter data comprises dividing the preprocessed and processed data set into a training set, a validation set and a test set according to a certain proportion.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the state transition model comprises,
constructing a probabilistic neural network with the divided training set;
solving the influence variables and the actual production target value of the immediately following time interval state (the state of the next time interval given the current time interval state);
obtaining a state transition function and a reward function that can express with high fidelity how the actual production target value changes with the state transition.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;

the loss function loss_p is set as the logarithmic prediction probability, whose calculation comprises

loss_p = -log f(X)

wherein X represents a training data set and f(X) represents the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance;

the calculation of the density function of the probabilistic neural network model comprises

f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )

substituting the density function into the logarithmic prediction probability and simplifying, the loss function becomes

loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )

wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the solving of the influence variables and target values of the immediately following time interval state comprises the selection of a probabilistic neural network submodel from the model library, the solving of the difference of the influence variables and target values between adjacent time intervals, and the solving of the influence variables of the immediately following time interval;

the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from the learned model library to obtain the mean vector and the diagonal covariance matrix output by the submodel;

the calculation of the difference of the influence variables and target values between adjacent time intervals comprises

ΔX = μ + ε*σ

wherein ΔX represents the difference between the parameter values of the current state and of the immediately following state, ε represents a value drawn from the random data set formed by solving results obeying the N(0, 1) distribution, and σ represents the standard deviation;

the solving of the difference of the influence variables and target values between adjacent time intervals comprises randomly sampling ε from the N(0, 1)-distributed data set to obtain a plurality of ΔX values, whose average is taken as the standardized parameter value of the immediately following state;

the parameter solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current interval, performing inverse standardization according to the standardization scheme of the collected process parameters, and verifying the performance of the constructed state transition model by means of the training set.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the construction of the intelligent agent model comprises the design of the actions causing state transitions and the design of the rewards caused by the actions;

the design of the actions causing state transitions comprises differencing each control variable over adjacent time intervals and taking the median δ of all values; independently increasing each control variable by one δ value at each moment is defined as an action a, and the action space in the state transition process contains 2^n elements, wherein n is the number of control variables;

the design of the rewards caused by the actions comprises that the reward r(s, a) of a given action a transferring from the current state s to the next state s' is the target value difference; each time the control variables are changed, the target value changes correspondingly, and this change is the reward after the control variables change.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the learning process of the intelligent agent model comprises,

searching for the minimum value of the TD error, and setting the minimization of the TD error as the objective;

the TD error is calculated on the basis of

Q(s, a) = r(s, a) + γ·max(Q(s', a'))

wherein Q(s, a) represents the expectation of the return obtained by applying action a in state s, γ represents the discount coefficient, and Q(s', a') represents the expectation of the return obtained by applying action a' in state s';

using the Q-Learning reinforcement learning algorithm, the influence variables of the current time interval are input into the state value network and the strategy network, and the action a corresponding to the maximum reward value r(s, a) is generated through loop iteration; all states, actions and rewards of action a constitute a policy network.
As a preferred embodiment of the real-time optimization method for process parameters fusing a probability network and reinforcement learning: the real-time optimization and output of the process parameters in the production process comprises,
collecting the process parameters of the production process in real time in units of fixed time intervals, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and performing data processing and aggregation on the sample data composed of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting the difference of the influence variables and the target value;
and inputting the influence variables of the current time interval into the trained strategy network and outputting the control variables, thereby realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
The invention has the following beneficial effects: the real-time optimization method for process parameters fusing a probability network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency. Compared with traditional methods of constructing a parameter optimization model, the method has stronger disturbance resistance and transferability, better matches real application scenarios, gives better parameter recommendations, and is suitable for most types of production systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a flowchart illustrating a method for real-time optimization of process parameters by fusion of a probabilistic network and reinforcement learning according to an embodiment of the present invention;
fig. 2 is a diagram showing and comparing the prediction effects of 100 groups of coal consumption selected in the process parameter real-time optimization method for integrating the probabilistic network and reinforcement learning according to the second embodiment of the present invention;
fig. 3 is a diagram showing an implementation effect of control variables recommended by an agent in a process parameter real-time optimization method combining a probabilistic network and reinforcement learning according to a second embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments accompanying figures of the present invention are described in detail below, and it is apparent that the described embodiments are a part, not all or all of the embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention and that the present invention is not limited by the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1, for an embodiment of the present invention, a method for real-time optimization of process parameters by fusion of a probability network and reinforcement learning is provided, including:
s1: collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data. It should be noted that:
collecting process parameter data comprises collecting control variables, influence variables and actual production target values of a production process at equal time intervals; the control variables comprise process parameters which can be directly adjusted manually in the production process; the influence variables comprise process parameters generated by the influence of manually input control variables on the production system; the actual target value of production includes a production target that the production system has completed at certain time intervals.
Further, the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
further, the processing of the process parameter data comprises that the difference between two adjacent time intervals for producing the actual target value is used as a new target value, then sample data of a plurality of time intervals in the time sequence are aggregated, and the aggregation mode is average value aggregation;
furthermore, the dividing of the process parameter data comprises dividing the new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
S2: and constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data. It should be noted that:
the method comprises the steps of firstly, building a probabilistic neural network by using a divided training set, then solving an influence variable of a time interval state (a state of a next time interval in the current time interval state) and a production actual target value, and finally obtaining a state transfer function and a reward function which can express that the production actual target value changes along with state transfer in a high-fidelity mode.
Further, the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;

the loss function loss_p is set as the logarithmic prediction probability, and its calculation comprises

loss_p = -log f(X)

wherein X represents a training data set and f(X) represents the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is calculated as

f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )

substituting the density function into the logarithmic prediction probability and simplifying, the loss function becomes

loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )

wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix;

the input values of the probabilistic neural network model are the attributes of the training data set obtained from the intelligent agent model, and the outputs are the mean and the diagonal covariance matrix of the distribution obeyed by the influence variable differences; a plurality of probabilistic neural network submodels with excellent performance are built to form a model library, and the smaller the loss_p value on the validation data set, the better the trained probabilistic neural network model;
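For illustration, a minimal PyTorch sketch of a network head that outputs the mean vector and the diagonal (log-)variances, together with the corresponding negative log-likelihood loss, is shown below; the hidden width and variable names are assumptions, and the conventional 1/2 factor is kept, which only rescales the loss relative to the formula above.

```python
import math
import torch
import torch.nn as nn

class DiagonalGaussianNet(nn.Module):
    """Predicts mean and diagonal covariance of the distribution obeyed by the
    influence-variable / target differences."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 200):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, out_dim)       # mean vector μ
        self.logvar_head = nn.Linear(hidden, out_dim)   # log of the diagonal of Σ

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)

def gaussian_nll_loss(mu, logvar, y):
    """0.5 * [(y-μ)ᵀ Σ⁻¹ (y-μ) + log|Σ| + k·log(2π)], averaged over the batch."""
    k = y.shape[-1]
    maha = ((y - mu) ** 2 * torch.exp(-logvar)).sum(dim=-1)   # (y-μ)ᵀ Σ⁻¹ (y-μ)
    logdet = logvar.sum(dim=-1)                               # log|Σ|
    return 0.5 * (maha + logdet + k * math.log(2 * math.pi)).mean()
```

A model of this form can then be trained with, for example, the Adam optimizer, and several trained submodels can be kept as the model library according to their loss on the validation set.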
further, the solving of the influence variables and the target values of the state of the time interval after the tightening comprises the selection of a probabilistic neural network sub-model in a model library, the solving of the difference value of the influence variables and the target values of the adjacent time intervals and the solving of the influence variables of the time interval after the tightening;
the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution for the difference between the adjacent time interval influencing variables and the target value includes,
Figure 400926DEST_PATH_IMAGE012
wherein,
Figure 650642DEST_PATH_IMAGE013
the difference between the parameter value representing the current state and the value of the immediately subsequent state parameter,
Figure 901495DEST_PATH_IMAGE014
representing compliance
Figure 476833DEST_PATH_IMAGE015
Distribution of (2)
Figure 813136DEST_PATH_IMAGE016
Solution result formationThe random data set of (a) is,
Figure 132122DEST_PATH_IMAGE017
represents the standard deviation;
it should be noted that the solution of the difference between the influencing variable and the target value of the adjacent time interval is based on compliance
Figure 237481DEST_PATH_IMAGE015
Distribution of (2)
Figure 983720DEST_PATH_IMAGE032
Solving a random data set formed by the result, the data set being defined as
Figure 541740DEST_PATH_IMAGE033
Then, then
Figure 664417DEST_PATH_IMAGE033
Is also obeyed
Figure 624283DEST_PATH_IMAGE034
Is distributed and
Figure 541423DEST_PATH_IMAGE035
setting up
Figure 852319DEST_PATH_IMAGE036
From the distribution of obeys
Figure 513108DEST_PATH_IMAGE034
Is randomly paired in the data set
Figure 327480DEST_PATH_IMAGE014
Sampling to obtain a plurality of samples
Figure 681101DEST_PATH_IMAGE013
Taking the average value as the parameter value of the normalized state after tightening;
the parameter solving of the time interval influence variables comprises the steps of adding the solved difference value to the parameter value in the current time, carrying out inverse standardization according to a standardization mode of the collected process parameters, and verifying the performance of the constructed state transition model by utilizing a training set.
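The sampling-and-averaging step can be sketched as follows; the number of samples and the scaler arguments are assumptions introduced for the example (the second embodiment draws 5 samples).

```python
import numpy as np

def predict_next_state(x_now_std, mu, sigma, scaler_mean, scaler_std, n_samples=5):
    """Predict the next-interval influence variables / target from one submodel.

    x_now_std   : standardized parameter values of the current interval
    mu, sigma   : mean vector and diagonal standard deviations from the chosen submodel
    scaler_*    : mean / std used when the raw process parameters were standardized
    """
    eps = np.random.randn(n_samples, mu.shape[0])   # ε drawn from N(0, 1)
    delta_x = (mu + eps * sigma).mean(axis=0)       # ΔX = μ + ε·σ, averaged over samples
    x_next_std = x_now_std + delta_x                # standardized next-state value
    return x_next_std * scaler_std + scaler_mean    # inverse standardization
```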
S3: by using
Figure 213713DEST_PATH_IMAGE030
And (3) building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by reinforcement learning. It should be noted that:
constructing an intelligent agent model, wherein the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design for causing state transition comprises making difference between adjacent time intervals of each control variable, and taking median of all values
Figure 678193DEST_PATH_IMAGE018
Can be increased for each control variable individually at each instant
Figure 347071DEST_PATH_IMAGE018
A value is defined as an action
Figure 871594DEST_PATH_IMAGE019
The action space during a state transition contains elements of
Figure 891502DEST_PATH_IMAGE020
Figure 894093DEST_PATH_IMAGE021
Is the number of control variables;
action-induced reward design includes a target value difference for a given action
Figure 683058DEST_PATH_IMAGE019
From the current state
Figure 112902DEST_PATH_IMAGE022
Transition to the next state
Figure 354528DEST_PATH_IMAGE037
Is awarded
Figure 160810DEST_PATH_IMAGE024
And the target value changes correspondingly every time the control variable is changed, and the changed value is the reward after the control variable is changed.
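A small sketch of one way to realize this action and reward design follows; it assumes that an action assigns a shift of +δ or -δ to every control variable, which is one reading of the 2^n action space, and the sign convention of the reward is left to the target being optimized.

```python
import itertools

def build_action_space(deltas):
    """Enumerate 2^n candidate actions: each control variable is shifted by ±δ_i,
    where δ_i is the median of its adjacent-interval differences."""
    return [tuple(s * d for s, d in zip(signs, deltas))
            for signs in itertools.product((-1.0, 1.0), repeat=len(deltas))]

def reward(target_now, target_next):
    """Reward of an action = the change of the production target after applying it;
    the sign convention depends on whether the target is to be raised or reduced."""
    return float(target_next - target_now)
```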
The learning process of the intelligent agent model comprises,

searching for the minimum value of the TD error, and setting the minimization of the TD error as the objective;

the TD error is calculated on the basis of

Q(s, a) = r(s, a) + γ·max(Q(s', a'))

wherein Q(s, a) represents the expectation of the return obtained by applying action a in state s, γ represents the discount coefficient, and Q(s', a') represents the expectation of the return obtained by applying action a' in state s';

using the Q-Learning reinforcement learning algorithm, the influence variables of the current time interval are input into the state value network and the strategy network, and the action a corresponding to the maximum reward value r(s, a) is generated through loop iteration; all states, actions and rewards of action a constitute a policy network.
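A compact tabular Q-Learning sketch consistent with the TD target above is given below; the episode structure, the learning rate alpha and the epsilon-greedy exploration are assumptions added for the example, and env_reset / env_step stand in for sampling a historical state and querying the learned state transition model.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=1000, steps=200,
               alpha=0.1, gamma=0.98, eps=0.1):
    """Tabular Q-Learning with TD target r(s, a) + γ·max_a' Q(s', a').

    env_reset() -> initial state; env_step(state, action) -> (next_state, reward).
    States and actions must be hashable (e.g. discretized tuples).
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env_reset()
        for _ in range(steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, r = env_step(state, action)
            # TD update towards r(s, a) + γ·max_a' Q(s', a')
            td_target = r + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```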
S4: and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process. It should be noted that:
the real-time optimization and output of the process parameters in the production process comprises,
collecting the process parameters of the production process in real time in units of fixed time intervals, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and performing data processing and aggregation on the sample data composed of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting the control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
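Putting the steps together, the online recommendation loop could look like the sketch below; collect_parameters, transition_model, policy and apply_controls are hypothetical stand-ins for the data interface, the trained state transition model, the trained strategy network and the plant-side interface, named here only to show the data flow.

```python
import time
import numpy as np

def online_optimization_loop(collect_parameters, transition_model, policy,
                             apply_controls, interval_s=60, agg_n=10):
    """Real-time fusion of the state transition model and the agent model."""
    buffer = []
    while True:
        # one sample of influence variables + target per fixed time interval
        buffer.append(np.asarray(collect_parameters(), dtype=float))
        if len(buffer) >= agg_n:
            sample = np.mean(buffer[-agg_n:], axis=0)   # mean-aggregate recent intervals
            predicted_delta = transition_model(sample)  # influence/target difference
            controls = policy(sample)                   # recommended control variables
            apply_controls(controls, predicted_delta)   # hand recommendation to operators
        time.sleep(interval_s)                          # wait one fixed interval
```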
It should be noted that, considering the problems of slow convergence of the solving process, low prediction efficiency and insufficient integration with the production target in the prior art, a probabilistic neural network that can quickly fit the data distribution and has a high fault tolerance is adopted together with a reinforcement-learning-trained model that adapts strongly to the environment and forms positive feedback with the target, and the time-series relationship of the data is taken into account during model training. This overcomes the defects of most existing methods based on artificial intelligence algorithms: the model fitting and transfer effects are better, the dynamically recommended process parameters are more strongly targeted at the optimization objective and better match reality, the continuous and efficient operation of the production process is ensured, and cost is reduced while efficiency is improved.
The real-time optimization method for process parameters fusing a probability network and reinforcement learning provided by the invention makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process while reducing cost and improving efficiency. Compared with traditional methods of constructing a parameter optimization model, the method has stronger disturbance resistance and transferability, better matches real application scenarios, gives better parameter recommendations, and is suitable for most types of production systems.
Example 2
Referring to fig. 2 and 3, a second embodiment of the present invention differs from the first in that it provides a verification test of the real-time process parameter optimization method fusing a probability network and reinforcement learning. To verify and explain the technical effects adopted in the method, this embodiment compares a conventional technical scheme with the method of the invention and evaluates the test results by means of scientific demonstration.
Taking the rotary kiln system of the Tai-Gai base in Taiyuan City as an example, data are collected with 1 minute as the unit time. The collected process parameters comprise: the control variables, namely head coal, tail coal, grate speed, the high-temperature fan high-pressure frequency-conversion frequency setting, the head-exhaust high-pressure frequency-conversion frequency setting and the tail-exhaust high-pressure frequency-conversion frequency setting; the influence variables, namely secondary air temperature, decomposing furnace temperature, clinker temperature 2, kiln head cover negative pressure, kiln tail negative pressure and decomposing furnace outlet temperature; and the target value, namely coal consumption. The finally acquired process parameter data of the rotary kiln production system are multi-dimensional time-series data.
Then, carrying out operations of preprocessing, processing and dividing a data set on the collected process parameter data:
pretreatment:
1) Processing of abnormal samples: to avoid the influence of abnormal values on the subsequent modeling process, the 3σ principle is used to modify abnormal samples to null values; for each parameter, only values within the range [μ_i - 3σ_i, μ_i + 3σ_i] are retained, wherein μ_i represents the mean of the data of the i-th parameter and σ_i represents the standard deviation of the data of the i-th parameter, and parameter values outside this range are replaced with null values.
2) Filling of null values: a neighbouring-mean filling scheme is used, selecting the average of 6 consecutive data points for filling; for example, if the value of the secondary air temperature at the 10th minute is null, the average of the secondary air temperature values at the 7th, 8th, 9th, 11th, 12th and 13th minutes is filled into the 10th time interval.

3) Standardization of the data: all historical parameter data are standardized so as to obey the N(0, 1) distribution.
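A sketch of this cleaning and standardization procedure, assuming the raw 1-minute samples are held in a pandas DataFrame with one numeric column per parameter, could look as follows; the helper name and the rolling-window implementation of the neighbour mean are assumptions.

```python
import numpy as np
import pandas as pd

def clean_parameters(df: pd.DataFrame, neighbours: int = 3) -> pd.DataFrame:
    """3σ outlier removal, neighbour-mean null filling and N(0, 1) standardization."""
    out = df.copy()
    for col in out.columns:
        mu, sigma = out[col].mean(), out[col].std()
        # 1) values outside [μ - 3σ, μ + 3σ] are replaced with null values
        out.loc[(out[col] - mu).abs() > 3 * sigma, col] = np.nan
        # 2) fill nulls with the mean of the `neighbours` values before and after
        neighbour_mean = out[col].rolling(2 * neighbours + 1, center=True,
                                          min_periods=1).mean()
        out[col] = out[col].fillna(neighbour_mean)
    # 3) standardize every parameter to zero mean and unit variance (N(0, 1))
    return (out - out.mean()) / out.std()
```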
Processing: and taking the difference between two adjacent time intervals of the target value as a new target value, and then aggregating the sample data of 10 adjacent minutes in the time sequence, wherein the aggregation mode is an average value.
Dividing the data set: the preprocessed and processed data are divided in the ratio 6:2:2 into a training set, a test set and a validation set.
The state transition model is then constructed: the probabilistic neural network model is learned and trained with the divided training set, with the input layer set to 14, the output layer to 8, the pattern layer to 200, the summation layer to 8 and the batch size to 256, using the Adam optimizer with a learning rate of 0.001 and 1000 epochs; the outputs are the diagonal covariance matrix Σ formed by the standard deviations of the parameters and the row vector μ formed by the means of the parameters, and the probabilistic neural network model with the smallest loss_p value on the validation data set is retained.

According to the formula ΔX = μ + ε*σ, ε is randomly sampled 5 times from the N(0, 1) distribution, yielding 5 ΔX values that are averaged to obtain the predicted difference ΔX of the state parameter values. The prediction result for the influence variables and the target value is the parameter value of the current interval plus the solved difference ΔX, after which inverse standardization is carried out according to the standardization scheme of the collected process parameters, and the performance of the constructed state transition model is verified with the test set.
Table 1: and (4) evaluation comparison of the influence variable and the target value.
Figure 66490DEST_PATH_IMAGE052
Table 1 shows the evaluation comparison of different algorithms on the influence variables and the target value (the evaluation index is the mean square error MSE; the smaller the MSE value, the better the model). The probabilistic neural network prediction model constructed with the loss function of the method as the model training index performs better, with an obvious advantage in prediction effect.
The predicted coal consumption is the basis of the recommended control variables, since it represents the reward obtained when an action causing a state transition occurs. The prediction effect for 100 randomly selected groups of coal consumption, shown in fig. 2, indicates that the predicted values are very close to the actual values.
An intelligent agent model capable of outputting the artificially controllable parameter data of the production process is then constructed with reinforcement learning. Since the number of control variables in this embodiment is 6, the action space in the state transition process contains 64 elements, and the following table shows the parameter values corresponding to the action of each control variable.
Table 2: the action of each control variable.
Using the Q-Learning reinforcement learning algorithm with Q(s, a) = r(s, a) + γ·max(Q(s', a')), the influence variables of the current time interval are input, the reward r(s, a) produced by applying an action a in the current state s is trained, and the Q(s, a) table is formed to obtain the optimal policy network, wherein the discount coefficient γ is set to 0.98 and the exploration probability is set to 0.1.
Finally, the control variables of the previous state are input into the constructed intelligent agent model, the control variable values of the next state are output, and the coal consumption of the rotary kiln production system caused by applying the control variables recommended by the agent is compared with the coal consumption of the next state predicted by the state transition model. Fig. 3 shows the implementation effect of the control variables recommended by the agent: for 100 selected groups of recommended process parameter data, the optimized coal consumption per unit time is 0.2893; the control variables recommended by the agent conform to the process of the actual production system, and excellent control variables can be recommended to optimize the target coal consumption in real time.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (9)

1. A real-time optimization method for process parameters fusing a probability network and reinforcement learning is characterized by comprising the following steps:
collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data;
constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data;
the construction of the state transition model includes,
constructing a probability network by using the divided training set;
solving the influence variables and the actual production target value of the state of the immediately following time interval;
acquiring a state transition function and a reward function which can express that the actual target value of the production changes along with the state transition in a high fidelity way; building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by utilizing Q-Learning reinforcement Learning;
and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
2. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 1, wherein: the collection of the process parameter data comprises collecting control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be manually and directly adjusted in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
3. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 2, wherein: the pre-processing and processing of the process parameter data includes,
the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
and the processing of the process parameter data comprises the steps of differentiating two adjacent time intervals of the actual production target value, taking the difference value of the two adjacent time intervals as a new target value, and then aggregating the sample data of a plurality of time intervals in the time sequence.
4. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in any one of claims 1 to 3, wherein: the dividing of the process parameter data comprises dividing a new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
5. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 4, wherein: the establishment of the probability network comprises the setting of a loss function and the training of a probability network model;
setting the loss function loss_p as the logarithmic prediction probability;
the calculation of the logarithmic prediction probability comprises,
loss_p = -log f(X)
wherein X represents a training data set, and f(X) represents a density function of the probability network model;
the output of the training of the probability network model is a Gaussian distribution parameterized by a diagonal covariance;
the calculation of the density function of the probability network model comprises,
f(X) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (X-μ)^T Σ^(-1) (X-μ) )
the calculation of the loss function after substituting the logarithmic prediction probability and simplifying comprises,
loss_p = (X-μ)^T Σ^(-1) (X-μ) + log( (2π)^k |Σ| )
wherein μ represents the mean vector of each attribute, T represents the transpose of the matrix, Σ represents the diagonal covariance matrix, Σ^(-1) represents the inverse of the diagonal covariance matrix, k represents the number of features in X, and |Σ| represents the determinant of the diagonal covariance matrix.
6. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 5, wherein: the solving of the influence variables and target values of the state of the immediately following time interval comprises the selection of a probability network submodel from the model library, the solving of the difference of the influence variables and target values between adjacent time intervals, and the solving of the influence variables of the immediately following time interval;
the selection of the probability network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution of the adjacent time interval influencing variable and target value difference comprises,
ΔX=μ+ε*σ
where Δ X represents the difference between the current state parameter value and the immediate state parameter value, and ε represents the distribution obeying N (0, 1)
Figure FDA0003873587060000031
Solving a random data set formed by the result, wherein sigma represents a standard deviation;
the solving of the difference between the adjacent time interval influencing variable and the target value comprises randomly sampling epsilon from a data set obeying the distribution of N (0, 1) to obtain a plurality of delta X values, and taking the average number to obtain the parameter value of the normalized state after tightening;
the parameter solving of the closely-spaced time interval influence variable comprises the steps of carrying out inverse standardization on a parameter value in the current time plus a solving difference value according to a standardization mode of collected process parameters, and verifying the performance of the constructed state transition model by utilizing the training set.
7. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 6, wherein: the construction of the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the design of the actions causing state transitions comprises differencing each control variable over adjacent time intervals and taking the median δ of all values; independently increasing each control variable by one δ value at each moment is defined as an action a, and the action space in the state transition process contains 2^n elements, wherein n is the number of control variables;
the design of the rewards caused by the actions comprises that the reward value r(s, a) of a given action a transferring from the current state s to the next state s' is the target value difference; each time the control variables are changed, the target value changes correspondingly, and the corresponding change of the target value is the reward after the control variables change.
8. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 7, wherein: the learning process of the intelligent agent model comprises,
searching the minimum value of the TD error, and setting the minimum value of the TD error as a target;
the TD error is calculated as the difference between,
Q(s,a)=r(s,a)+γmax(Q(s',a'))
wherein Q (s, a) represents the revenue expectation obtained by applying action a in the s state, gamma represents the discount coefficient, and Q (s ', a') represents the revenue expectation obtained by applying action a 'in the s' state;
and inputting the influence variables of the current time interval into the state value network and the strategy network by using a Q-Learning reinforcement Learning algorithm, and generating an action a corresponding to the maximum reward value r (s, a) through loop iteration, wherein all states, actions and rewards of the action a form a strategy network.
9. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 8, wherein: the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target value;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting a control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
CN202210989613.7A 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning Active CN115061444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN115061444A CN115061444A (en) 2022-09-16
CN115061444B true CN115061444B (en) 2022-12-09

Family

ID=83208015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989613.7A Active CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN115061444B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021B (en) * 2022-10-26 2023-03-24 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system
CN118642375A (en) * 2024-08-14 2024-09-13 南通理工学院 Self-adaptive control method and system for pyrolysis temperature and oxygen concentration of rotary kiln

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11327475B2 (en) * 2016-05-09 2022-05-10 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US11965946B2 (en) * 2020-12-04 2024-04-23 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An incremental probabilistic neural network for regression and reinforcement learning tasks; Milton Roberto Heinen and Paulo Martins Engel; Docin (豆丁网); 2017-01-11; full text *
Parameterized circuit optimization algorithm based on reinforcement learning; Tang Changcheng (唐长成); China Masters' Theses Full-text Database, Information Science and Technology; 2020-04-30 (No. 4); full text *
Multi-robot collaborative navigation based on deep reinforcement learning; Zhou Shizheng (周世正); China Masters' Theses Full-text Database, Information Science and Technology; 2019-08-31 (No. 8); full text *
Collaborative approximation model of Q-V value functions based on adaptively normalized RBF networks; Liu Quan (刘全) et al.; Chinese Journal of Computers; 2015-07-31; Vol. 38, No. 7, pp. 1386-1396 *

Also Published As

Publication number Publication date
CN115061444A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN115061444B (en) Real-time optimization method for process parameters integrating probability network and reinforcement learning
CN109993270A (en) Lithium ion battery residual life prediction technique based on grey wolf pack optimization LSTM network
CN116596044B (en) Power generation load prediction model training method and device based on multi-source data
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN112967088A (en) Marketing activity prediction model structure and prediction method based on knowledge distillation
CN112910690A (en) Network traffic prediction method, device and equipment based on neural network model
CN115688913A (en) Cloud-side collaborative personalized federal learning method, system, equipment and medium
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN111860787A (en) Short-term prediction method and device for coupling directed graph structure flow data containing missing data
CN112270442A (en) IVMD-ACMPSO-CSLSTM-based combined power load prediction method
CN110929958A (en) Short-term traffic flow prediction method based on deep learning parameter optimization
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN115828990A (en) Time-space diagram node attribute prediction method for fused adaptive graph diffusion convolution network
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN113627594A (en) One-dimensional time sequence data amplification method based on WGAN
CN117668743A (en) Time sequence data prediction method of association time-space relation
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
CN116822722A (en) Water level prediction method, system, device, electronic equipment and medium
Wang et al. Time series prediction with incomplete dataset based on deep bidirectional echo state network
CN114781699B (en) Reservoir water level prediction and early warning method based on improved particle swarm Conv1D-Attention optimization model
CN114997464A (en) Popularity prediction method based on graph time sequence information learning
CN112667394B (en) Computer resource utilization rate optimization method
CN109117491B (en) Agent model construction method of high-dimensional small data fusing expert experience
CN114118567B (en) Power service bandwidth prediction method based on double-channel converged network
CN114841472B (en) GWO optimization Elman power load prediction method based on DNA hairpin variation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant