CN115061444A - Real-time optimization method for technological parameters integrating probability network and reinforcement learning - Google Patents

Real-time optimization method for technological parameters integrating probability network and reinforcement learning

Info

Publication number
CN115061444A
Authority
CN
China
Prior art keywords
value
network
reinforcement learning
model
process parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210989613.7A
Other languages
Chinese (zh)
Other versions
CN115061444B (en)
Inventor
毛旭初
张翔
谢天
陈松
汪江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luculent Smart Technologies Co ltd
Original Assignee
Luculent Smart Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luculent Smart Technologies Co ltd filed Critical Luculent Smart Technologies Co ltd
Priority to CN202210989613.7A priority Critical patent/CN115061444B/en
Publication of CN115061444A publication Critical patent/CN115061444A/en
Application granted granted Critical
Publication of CN115061444B publication Critical patent/CN115061444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
    • G05B19/41865 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by job scheduling, process planning, material flow
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/30 - Nc systems
    • G05B2219/32 - Operator till task planning
    • G05B2219/32252 - Scheduling production, machining, job shop

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Manufacturing & Machinery (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for real-time optimization of process parameters that fuses a probabilistic network with reinforcement learning, comprising the following steps: collecting process parameter data of a production system and preprocessing, processing and dividing the collected data into data sets; constructing a state transition model of adjacent time intervals in the production process from the preprocessed data; building, by reinforcement learning, an agent model that outputs the manually controllable parameter data of the production process; and fusing and applying the state transition model and the agent model to optimize and output the process parameters of the production process in real time. The invention divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, thereby ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency.

Description

Real-time optimization method for technological parameters integrating probability network and reinforcement learning
Technical Field
The invention relates to the technical field of process parameter optimization in production processes, and in particular to a method for real-time optimization of process parameters that integrates a probabilistic network with reinforcement learning.
Background
The rapid development of the Internet of Things and big data technology has promoted the development and application of a new generation of intelligent manufacturing and provides a new paradigm for optimizing process parameters in the production process. Process parameter optimization predicts in advance the parameters that should be fed into the production system in the next time period, ensuring continuous and efficient operation of the production process and promoting cost reduction and efficiency improvement of the production system.
Current parameter optimization methods are implemented with either optimization algorithms or artificial intelligence algorithms. Although both can solve for a set of optimal process parameters for different targets, they have shortcomings. Methods that build a parameter optimization model on an optimization algorithm depend heavily on the logical relationship between parameters and targets, so the constructed model is static, with insufficient disturbance resistance and transfer capability; when the parameter types or targets change, the algorithm of the original model is no longer applicable, and convergence of the solving process is slow and time-consuming. Most methods that build a parameter optimization model on an artificial intelligence algorithm ignore the temporal relationships in the data and cannot search for optimal process parameters along the time sequence, so the constructed model easily becomes detached from the real operation of the system.
To overcome the shortcomings of existing parameter optimization methods based on optimization algorithms and artificial intelligence algorithms, the invention exploits the fact that a probabilistic neural network can fit the data distribution with a high fault tolerance, and that a model trained by reinforcement learning adapts well to its environment and forms positive feedback with the target, while also taking the temporal relationships of the data into account during training, thereby ensuring continuous and efficient operation of the production process and reducing cost while improving efficiency.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems.
Therefore, the technical problem solved by the invention is as follows: existing methods for optimizing the process parameters of a production system rely excessively on experience, have low prediction efficiency, and are insufficiently integrated with the production targets.
In order to solve the above technical problems, the invention provides the following technical scheme: a method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning, comprising: collecting process parameter data of a production system and preprocessing, processing and dividing the process parameter data into data sets; constructing a state transition model of adjacent time intervals in the production process from the preprocessed process parameter data; using reinforcement learning to build an agent model capable of outputting the manually controllable parameter data of the production process; and fusing and applying the state transition model and the agent model to optimize and output the process parameters of the production process in real time.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the collection of the process parameter data comprises collecting the control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables are process parameters that can be adjusted directly and manually during production;
the influence variables are process parameters produced by the effect of the manually input control variables on the production system;
the actual production target value is the production target completed by the production system within a given time interval.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the preprocessing and processing of the process parameter data comprise:
the preprocessing of the process parameter data comprises handling abnormal samples, filling null values and standardizing the data;
and the processing of the process parameter data comprises taking the difference of the actual production target value between two adjacent time intervals as a new target value, and then aggregating the sample data of several time intervals along the time sequence by mean aggregation.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the division of the process parameter data comprises dividing the preprocessed and processed data set into a training set, a validation set and a test set in a certain proportion.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the state transition model comprises:
constructing a probabilistic neural network with the divided training set;
solving the influence variables and the actual production target value of the state of the immediately following time interval (the state of the next time interval given the current state);
and obtaining a state transition function and a reward function that express with high fidelity how the actual production target value changes with the state transition.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the probabilistic neural network comprises setting a loss function and training a probabilistic neural network model;

the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the trained probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the solving of the influence variables and target value of the state of the immediately following time interval comprises selecting a probabilistic neural network sub-model from the model library, solving the difference of the influence variables and target value between adjacent time intervals, and solving the influence variables of the immediately following time interval;

the selection of the probabilistic neural network sub-model comprises randomly selecting a sub-model from the learned model library and obtaining the mean vector and diagonal covariance matrix that it outputs;

the difference of the influence variables and target value between adjacent time intervals is solved as

$$\Delta s = \mu + \sigma \cdot z, \qquad z \sim \mathcal{N}(0,1)$$

where $\Delta s$ denotes the difference between the parameter values of the current state and of the immediately following state, $z$ denotes a random sample drawn from the standard normal distribution $\mathcal{N}(0,1)$, and $\sigma$ denotes the standard deviation;

the solving further comprises randomly sampling $z$ several times from the $\mathcal{N}(0,1)$ distribution to obtain several values of $\Delta s$ and taking their mean as the standardized parameter difference of the immediately following state;

the solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current time, de-standardizing the result according to the standardization applied to the collected process parameters, and verifying the performance of the constructed state transition model with the training set.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the construction of the agent model comprises the design of the actions that cause state transitions and the design of the rewards caused by the actions;

the action design comprises taking the difference of each control variable between adjacent time intervals and taking the median $d$ of all the values; at each moment each control variable can be independently adjusted by $d$, and such an adjustment is defined as an action $a$; the action space of a state transition then contains $2^{n}$ elements, where $n$ is the number of control variables;

the reward design comprises taking the difference of the target value produced when a given action $a$ causes a transition from the current state $s_t$ to the next state $s_{t+1}$ as the reward $r$; every time a control variable changes, the target value changes correspondingly, and that change is the reward obtained after the control variable is changed.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the learning process of the agent model comprises:

searching for the minimum of the TD error and setting the minimization of the TD error as the objective;

the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$;

the reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
As a preferred scheme of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning: the real-time optimization and output of the process parameters of the production process comprise:
collecting the process parameters of the production process in real time at each fixed time interval, selecting the number of time intervals to aggregate according to the actual business requirements and the time sequence, and processing and aggregating the sample data consisting of the selected influence variables and target values;
inputting the processed and aggregated data into the constructed state transition model and outputting the difference of the influence variables and the target value;
and inputting the influence variables of the current time interval into the trained policy network and outputting the control variables, thereby fusing and applying the state transition model and the agent model in the actual production process.
The invention has the following beneficial effects: the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency. Compared with traditional methods for building parameter optimization models, the method has stronger disturbance resistance and transferability, better matches real application scenarios, yields better parameter recommendations, and is applicable to most types of production systems.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
FIG. 1 is a flowchart illustrating a method for real-time optimization of process parameters by fusion of a probabilistic network and reinforcement learning according to an embodiment of the present invention;
fig. 2 is a comparison of the predicted and actual coal consumption for 100 selected groups in the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning according to the second embodiment of the present invention;
fig. 3 is a diagram showing an implementation effect of control variables recommended by an agent in a process parameter real-time optimization method combining a probabilistic network and reinforcement learning according to a second embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1, for an embodiment of the present invention, a method for real-time optimization of process parameters by fusion of a probability network and reinforcement learning is provided, including:
s1: collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data. It should be noted that:
the acquisition of the process parameter data comprises the steps of acquiring control variables, influence variables and actual production target values of the production process at equal time intervals; the control variables comprise process parameters which can be directly adjusted manually in the production process; the influence variables comprise process parameters generated by the influence of manually input control variables on the production system; the actual target value of production includes a production target that the production system has completed at certain time intervals.
Further, the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
further, the processing of the process parameter data comprises that the difference between two adjacent time intervals for producing the actual target value is used as a new target value, then sample data of a plurality of time intervals in the time sequence are aggregated, and the aggregation mode is average value aggregation;
furthermore, the dividing of the process parameter data includes dividing the new preprocessed and processed data set into a training set, a verification set and a test set according to a certain proportion.
S2: and constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data. It should be noted that:
the method comprises the steps of firstly building a probabilistic neural network by using a divided training set, then solving influence variables of a time interval state (the state of the next time interval in the current time interval state) and a production actual target value, and finally obtaining a state transfer function and a reward function which can express that the production actual target value changes along with state transfer in a high-fidelity mode.
Further, the construction of the probabilistic neural network comprises setting a loss function and training a probabilistic neural network model;

the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the trained probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance, whose density function is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix;

the input of the probabilistic neural network model is the attributes of the training data set obtained from the agent model, and the output is the mean and diagonal covariance matrix of the distribution obeyed by the differences of the influence variables; several well-performing probabilistic neural network sub-models are built to form a model library, and the smaller the value of $\mathcal{L}$ on the validation data set, the better the trained probabilistic neural network model.
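A minimal PyTorch sketch of a network that outputs a diagonal-Gaussian prediction, together with the corresponding negative log-likelihood loss with constant terms dropped, as in the simplified loss above. The architecture and all names are assumptions; the 14-input/8-output sizes in the usage comment merely echo the second embodiment.

```python
import torch
import torch.nn as nn

class DiagonalGaussianNet(nn.Module):
    """Toy probabilistic network: maps the aggregated state to the mean and
    log-variance of the change of the influence variables and target value."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 200):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, out_dim)
        self.log_var_head = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu_head(h), self.log_var_head(h)

def gaussian_nll(mu: torch.Tensor, log_var: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of y under N(mu, diag(exp(log_var))), with the
    constant (2*pi)^k term dropped, matching the simplified loss in the text."""
    inv_var = torch.exp(-log_var)
    return (((y - mu) ** 2) * inv_var + log_var).sum(dim=-1).mean()

# Training-step sketch (Adam with lr 0.001, as in the second embodiment):
# model = DiagonalGaussianNet(in_dim=14, out_dim=8)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# mu, log_var = model(x_batch); loss = gaussian_nll(mu, log_var, y_batch)
# loss.backward(); opt.step(); opt.zero_grad()
```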
further, the solving of the influence variables and the target values of the state of the time interval after the tightening comprises the selection of a probabilistic neural network sub-model in a model library, the solving of the difference value of the influence variables and the target values of the adjacent time intervals and the solving of the influence variables of the time interval after the tightening;
the selection of the probabilistic neural network submodel comprises randomly selecting a submodel from a learned model library to obtain a mean vector and a diagonal covariance matrix output by the submodel;
the calculation of the solution of the difference between the adjacent time interval influencing variable and the target value comprises,
Figure 400926DEST_PATH_IMAGE012
wherein the content of the first and second substances,
Figure 650642DEST_PATH_IMAGE013
the difference between the parameter value representing the current state and the value of the immediately subsequent state parameter,
Figure 901495DEST_PATH_IMAGE014
representing compliance
Figure 476833DEST_PATH_IMAGE015
Distribution of (2)
Figure 813136DEST_PATH_IMAGE016
The result is solved to form a random data set,
Figure 132122DEST_PATH_IMAGE017
represents the standard deviation;
it should be noted that the solution of the difference between the influencing variable and the target value of the adjacent time interval is based on compliance
Figure 237481DEST_PATH_IMAGE015
Distribution of (2)
Figure 983720DEST_PATH_IMAGE032
Solving the result to form a random data set, the data set being defined as
Figure 541740DEST_PATH_IMAGE033
Then, then
Figure 664417DEST_PATH_IMAGE033
Is also obeyed
Figure 624283DEST_PATH_IMAGE034
Is distributed and
Figure 541423DEST_PATH_IMAGE035
setting up
Figure 852319DEST_PATH_IMAGE036
From distribution of obeys
Figure 513108DEST_PATH_IMAGE034
Is randomly paired in the data set
Figure 327480DEST_PATH_IMAGE014
Sampling to obtain a plurality of samples
Figure 681101DEST_PATH_IMAGE013
Taking the average value as the parameter value of the normalized state after tightening;
the parameter solving of the time interval influence variables comprises the steps of adding the solved difference value to the parameter value in the current time, carrying out inverse standardization according to a standardization mode of the collected process parameters, and verifying the performance of the constructed state transition model by utilizing a training set.
S3: by using
Figure 213713DEST_PATH_IMAGE030
And (3) building an intelligent agent model capable of outputting artificial controllable parameter data in the production process by reinforcement learning. It should be noted that:
constructing an intelligent agent model, wherein the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design for causing state transition comprises making difference between adjacent time intervals of each control variable, and taking median of all values
Figure 678193DEST_PATH_IMAGE018
Can be increased for each control variable individually at each moment
Figure 347071DEST_PATH_IMAGE018
A value is defined as an action
Figure 871594DEST_PATH_IMAGE019
The action space during a state transition contains elements of
Figure 891502DEST_PATH_IMAGE020
Figure 894093DEST_PATH_IMAGE021
Is the number of control variables;
action-induced reward design includes a target value difference for a given action
Figure 683058DEST_PATH_IMAGE019
From the current state
Figure 112902DEST_PATH_IMAGE022
Transition to the next state
Figure 354528DEST_PATH_IMAGE037
Is awarded
Figure 160810DEST_PATH_IMAGE024
And the target value changes correspondingly every time the control variable is changed, and the changed value is the reward after the control variable is changed.
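The action and reward design can be illustrated as follows. Assigning either +d or -d to every control variable is an assumption that is consistent with a 2^n-element action space (64 elements for the 6 control variables of the second embodiment), and the sign convention of the reward is likewise an assumption, since for a cost-type target such as coal consumption a decrease would normally be rewarded.

```python
import itertools
import numpy as np

def build_action_space(control_history: np.ndarray) -> np.ndarray:
    """control_history: (T, n) array of the n control variables over T intervals.
    The step size d of each control variable is the median of its differences
    between adjacent intervals (taken in absolute value here, an assumption);
    an action assigns +d or -d to every control variable, giving 2**n actions."""
    d = np.median(np.abs(np.diff(control_history, axis=0)), axis=0)
    signs = itertools.product((-1.0, 1.0), repeat=control_history.shape[1])
    return np.array([np.asarray(s) * d for s in signs])

def reward(target_before: float, target_after: float) -> float:
    """Reward = change of the production target caused by the action; the sign
    is chosen so that a reduction of a cost-type target is rewarded."""
    return target_before - target_after
```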
The learning process of the agent model comprises searching for the minimum of the TD error and setting the minimization of the TD error as the objective;

the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$.

The reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
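The text does not name the specific reinforcement learning algorithm, so the following PyTorch fragment is only a generic sketch of minimising the TD error over the discrete action space defined above; every name and interface is assumed.

```python
import torch

def td_loss(q_net, target_q_net, s, a, r, s_next, gamma: float = 0.98) -> torch.Tensor:
    """Mean-squared TD error  delta = r + gamma * max_a' Q_target(s', a') - Q(s, a).
    q_net / target_q_net map a batch of states to one Q-value per discrete action;
    gamma = 0.98 is the discount factor used in the second embodiment."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        bootstrap = target_q_net(s_next).max(dim=1).values
    delta = r + gamma * bootstrap - q_sa
    return (delta ** 2).mean()

def epsilon_greedy(q_net, s, epsilon: float = 0.1) -> int:
    """Exploration as in the second embodiment (exploration probability 0.1)."""
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(q_net(s).shape[-1], (1,)).item())
    return int(q_net(s).argmax(dim=-1).item())
```

A full training loop would roll the state forward with the state transition model of step S2, select actions epsilon-greedily and take optimizer steps on this loss; since the text also mentions a state-value network and a policy network, an actor-critic variant would be equally consistent with the description.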
S4: and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process. It should be noted that:
the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target value;
inputting the processed and aggregated data into a constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting the control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
It should be noted that, in view of the slow convergence of the solving process, low prediction efficiency and insufficient integration with production targets in the prior art, the method adopts a probabilistic neural network that can quickly fit the data distribution with a high fault tolerance, together with a reinforcement-learning-trained model that adapts well to its environment and forms positive feedback with the target, while taking the temporal relationships of the data into account during training. This overcomes the shortcomings of most existing methods based on artificial intelligence algorithms: the model fits and transfers better, the dynamically recommended process parameters are more targeted at the optimization objective and closer to reality, and the production process runs continuously and efficiently, reducing cost and improving efficiency.
The method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning makes full use of historical process parameter data, divides the process parameters into control variables, influence variables and target values, combines them organically, and recommends the controllable process parameters of the production process in real time, ensuring continuous and efficient operation of the production process, reducing cost and improving efficiency. Compared with traditional methods for building parameter optimization models, the method has stronger disturbance resistance and transferability, better matches real application scenarios, yields better parameter recommendations, and is applicable to most types of production systems.
Example 2
Referring to fig. 2 and 3, a second embodiment of the present invention differs from the first in that it provides a verification test of the method for real-time optimization of process parameters fusing a probabilistic network and reinforcement learning. To verify and explain the technical effects of the method, this embodiment compares a conventional technical scheme with the method of the present invention and compares the test results by scientific demonstration to verify the true effect of the method.
Taking the rotary kiln system of the Tai-Gai base in Taiyuan City as an example, data are collected with 1 minute as the unit time. The collected process parameters comprise: the control variables are head coal, tail coal, grate speed, the high-temperature fan high-voltage variable-frequency setting, the head-exhaust high-voltage variable-frequency setting and the tail-exhaust high-voltage variable-frequency setting; the influence variables are secondary air temperature, decomposition furnace temperature, clinker temperature 2, kiln head cover negative pressure, kiln tail negative pressure and decomposition furnace outlet temperature; the target value is coal consumption. The finally collected process parameter data of the rotary kiln production system are multi-dimensional time series data.
Then, carrying out operations of preprocessing, processing and dividing a data set on the collected process parameter data:
pretreatment:
1) processing an abnormal sample: to avoid abnormal value pairsInfluence of subsequent modeling process, utilization
Figure 886493DEST_PATH_IMAGE038
The principle modifies the exception sample to a null value, remaining in range for each parameter datum
Figure 999942DEST_PATH_IMAGE039
A value of (1), wherein
Figure 190752DEST_PATH_IMAGE040
Is shown as
Figure 680639DEST_PATH_IMAGE041
The mean value of the data of the individual parameters,
Figure 273295DEST_PATH_IMAGE042
is shown as
Figure 823225DEST_PATH_IMAGE041
For each standard deviation of the parameter data, the parameter values not in the range are replaced with null values.
2) Filling of null values: use of
Figure 501331DEST_PATH_IMAGE043
In this way, the average of 6 consecutive data is selected for filling, for example, if the value of the 10 th minute of the secondary air temperature is null, the average of the values of the parameters of the 7 th, 8, 9 th and 11 th, 12 th, 13 th minutes of the secondary air temperature is filled into the 10 th time interval.
3) Normalization of the data: obeying all historical parameter data
Figure 529330DEST_PATH_IMAGE044
The normalization process of (1).
Processing: and taking the difference between two adjacent time intervals of the target value as a new target value, and then aggregating the sample data of 10 adjacent minutes in the time sequence, wherein the aggregation mode is an average value.
Dividing the data set: the processed and processed data were processed as per 6: 2: the approach of 2 is divided into a training set, a prediction set, and a validation set.
A state transition model is then constructed. The probabilistic neural network model is learned and trained with the divided training set: the input layer has 14 nodes, the output layer 8, the pattern layer 200 and the summation layer 8; the batch size is 256, the Adam optimizer is used with a learning rate of 0.001, and training runs for 1000 epochs. The outputs are a diagonal covariance matrix $\Sigma$ formed from the standard deviations of the parameters and a row vector $\mu$ formed from the means of the parameters; the probabilistic neural network model with the smallest loss value $\mathcal{L}$ on the validation data set is selected.

According to the formula $\Delta s = \mu + \sigma \cdot z$, $z$ is sampled 5 times at random from the $\mathcal{N}(0,1)$ distribution to obtain 5 values of $\Delta s$, which are averaged to give the predicted difference $\overline{\Delta s}$ of the state parameter values. The prediction of the influence variables and the target value is the parameter value at the current time plus the solved difference $\overline{\Delta s}$, de-standardized according to the standardization applied to the collected process parameters, and the performance of the constructed state transition model is verified with the test set.
Table 1: Evaluation comparison of the influence variables and the target value. (The table is reproduced as an image in the original publication.)
Table 1 compares the evaluation of different algorithms on the influence variables and the target value (the evaluation index is the mean squared error, MSE; the smaller the MSE, the better the model). The probabilistic neural network prediction model constructed with the loss function of this method as the training index performs better, with a clear advantage in prediction effect.
The predicted coal consumption is the basis for recommending the control variables and represents the reward obtained when an action causing a state transition occurs. The prediction effect for 100 randomly selected groups of coal consumption is shown in fig. 2; the predicted values are very close to the actual values.
An agent model capable of outputting the manually controllable parameter data of the production process is then built with reinforcement learning. Since there are 6 control variables in this embodiment, the action space of a state transition contains 64 elements; the following table lists the parameter value corresponding to the action of each control variable.
Table 2: the action of each control variable.
(The table is reproduced as an image in the original publication.)
The reinforcement learning algorithm is then applied: the influence variables of the current time interval are taken as input, and the agent is trained to learn the reward $r$ produced by applying action $a$ in the current state $s$, from which the optimal policy network is obtained; the discount factor $\gamma$ is set to 0.98 and the exploration probability to 0.1.
Finally, the control variables of the previous state are input into the constructed agent model and the control variable values of the next state are output; the coal consumption of the rotary kiln production system caused by applying the control variables recommended by the agent is compared with the coal consumption of the next state predicted by the state transition model. Fig. 3 shows the implementation effect of the control variables recommended by the agent: for the 100 recommended groups of process parameter data, the optimized coal consumption per unit time is 0.2893; the control variables recommended by the agent conform to the process of the actual production system, and excellent control variables can be recommended to optimize the target coal consumption in real time.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (10)

1. A real-time optimization method for process parameters fusing a probability network and reinforcement learning is characterized by comprising the following steps:
collecting technological parameter data of a production system, and carrying out operations of preprocessing, processing and dividing a data set on the technological parameter data;
constructing a state transition model of adjacent time intervals in the production process based on the preprocessed process parameter data;
building, by reinforcement learning, an agent model capable of outputting the manually controllable parameter data of the production process;
and fusing and applying the state transition model and the intelligent agent model to realize the real-time optimization and output of the process parameters in the production process.
2. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 1, wherein: the collection of the process parameter data comprises collecting control variables, influence variables and actual production target values of the production process at equal time intervals;
the control variables comprise process parameters which can be manually and directly adjusted in the production process;
the influence variables comprise process parameters generated by the influence of manually input control variables on the production system;
the actual target value of production comprises a production target which is completed by the production system at a certain time interval.
3. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in claim 2, wherein: the pre-processing and processing of the process parameter data includes,
the preprocessing of the process parameter data comprises processing of abnormal samples, filling of null values and standardization of data;
and the processing of the process parameter data comprises the steps of making a difference between two adjacent time intervals of the actual production target value, taking the difference value of the two adjacent time intervals as a new target value, and then aggregating the sample data of a plurality of time intervals in the time sequence.
4. The method for optimizing process parameters of fusion probability network and reinforcement learning in real time as claimed in any one of claims 1 to 3, wherein: the dividing of the process parameter data comprises dividing a new data set after pretreatment and processing into a training set, a verification set and a test set according to a certain proportion.
5. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 4, wherein: the construction of the state transition model includes,
constructing a probabilistic neural network by using the divided training set;
solving the influence variables and the actual production target values of the state of the immediately-after time interval;
obtaining a state transition function and a reward function which can express the change of the actual target value of the production along with the state transition with high fidelity.
6. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 5, wherein: the construction of the probabilistic neural network comprises the setting of a loss function and the training of a probabilistic neural network model;
the loss function $\mathcal{L}$ is set to the negative log prediction probability, calculated as

$$\mathcal{L}(\theta) = -\sum_{x \in D} \log f_{\theta}(x)$$

where $D$ denotes the training data set and $f_{\theta}$ denotes the density function of the probabilistic neural network model;

the output of the training of the probabilistic neural network model is a Gaussian distribution parameterized by a diagonal covariance;

the density function of the probabilistic neural network model is

$$f_{\theta}(x) = \frac{1}{\sqrt{(2\pi)^{k}\lvert\Sigma\rvert}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right)$$

substituting the density function into the log prediction probability and simplifying, the loss function becomes

$$\mathcal{L}(\theta) = \sum_{x \in D}\left[(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu) + \log\lvert\Sigma\rvert\right] + \mathrm{const}$$

where $\mu$ denotes the mean vector of each attribute, $\mathsf T$ denotes the matrix transpose, $\Sigma$ denotes the diagonal covariance matrix, $\Sigma^{-1}$ its inverse, $k$ the number of features in $D$, and $\lvert\Sigma\rvert$ the determinant of the diagonal covariance matrix.
7. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 6, wherein: the solving of the influence variables and target value of the state of the immediately following time interval comprises selecting a probabilistic neural network sub-model from the model library, solving the difference of the influence variables and target value between adjacent time intervals, and solving the influence variables of the immediately following time interval;

the selection of the probabilistic neural network sub-model comprises randomly selecting a sub-model from the learned model library and obtaining the mean vector and diagonal covariance matrix that it outputs;

the difference of the influence variables and target value between adjacent time intervals is solved as

$$\Delta s = \mu + \sigma \cdot z, \qquad z \sim \mathcal{N}(0,1)$$

where $\Delta s$ denotes the difference between the parameter values of the current state and of the immediately following state, $z$ denotes a random sample drawn from the standard normal distribution $\mathcal{N}(0,1)$, and $\sigma$ denotes the standard deviation;

the solving further comprises randomly sampling $z$ several times from the $\mathcal{N}(0,1)$ distribution to obtain several values of $\Delta s$ and taking their mean as the standardized parameter difference of the immediately following state;

the solving of the influence variables of the immediately following time interval comprises adding the solved difference to the parameter values of the current time, de-standardizing the result according to the standardization applied to the collected process parameters, and verifying the performance of the constructed state transition model with the training set.
8. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 7, wherein: the construction of the intelligent agent model comprises an action design causing state transition and an action-caused reward design;
the action design comprises taking the difference of each control variable between adjacent time intervals and taking the median $d$ of all the values; at each moment each control variable can be independently adjusted by $d$, and such an adjustment is defined as an action $a$; the action space of a state transition then contains $2^{n}$ elements, where $n$ is the number of control variables;

the reward design comprises taking the difference of the target value produced when a given action $a$ causes a transition from the current state $s_t$ to the next state $s_{t+1}$ as the reward $r$; every time a control variable changes, the target value changes correspondingly, and that change is the reward obtained after the control variable is changed.
9. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 8, wherein: the learning process of the intelligent agent model comprises,
searching a minimum value of the TD error, and setting the minimum value of the TD error as a target;
the TD error is calculated as

$$\delta = r + \gamma\,Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

where $Q(s_t, a_t)$ denotes the expected return obtained by applying action $a$ in state $s_t$, $\gamma$ denotes the discount factor, and $Q(s_{t+1}, a_{t+1})$ denotes the expected return obtained by applying action $a_{t+1}$ in state $s_{t+1}$;

the reinforcement learning algorithm inputs the influence variables of the current time interval into the state-value network and the policy network and, through iterative loops, generates the action $a$ corresponding to the maximum reward $r$; all states, actions and rewards constitute the policy network.
10. The method for optimizing process parameters in real time for fusion of a probabilistic network and reinforcement learning according to claim 9, wherein: the real-time optimization and output of the process parameters in the production process comprises,
collecting the technological parameters of the production process in real time by taking each fixed time interval as a unit, selecting the number of time intervals to be aggregated according to the actual business requirements and the time sequence, and carrying out data processing and aggregation on the sample data consisting of the selected influence variables and the target values;
inputting the processed and aggregated data into the constructed state transition model, and outputting a difference value between an influence variable and a target value;
and inputting the influence variable of the current time interval into the trained strategy network, outputting a control variable, and realizing the fusion and application of the state transition model and the intelligent agent model in the actual production process.
CN202210989613.7A 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning Active CN115061444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989613.7A CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN115061444A true CN115061444A (en) 2022-09-16
CN115061444B CN115061444B (en) 2022-12-09

Family

ID=83208015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989613.7A Active CN115061444B (en) 2022-08-18 2022-08-18 Real-time optimization method for process parameters integrating probability network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN115061444B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021A (en) * 2022-10-26 2022-12-06 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025813A1 (en) * 2016-05-09 2019-01-24 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US20220179026A1 (en) * 2020-12-04 2022-06-09 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190025813A1 (en) * 2016-05-09 2019-01-24 Strong Force Iot Portfolio 2016, Llc Methods and systems for intelligent collection and analysis of vehicle data
US20220179026A1 (en) * 2020-12-04 2022-06-09 Max-Planck-Gesellschaft Zur Foerderung Der Wissenschaften E. V. Machine learning based processing of magnetic resonance data, including an uncertainty quantification
CN114692310A (en) * 2022-04-14 2022-07-01 北京理工大学 Virtual-real integration-two-stage separation model parameter optimization method based on Dueling DQN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MILTON ROBERTO HEINEN AND PAULO MARTINS ENGEL: "an incremental probabilistic neural network for regression and reinforcement learning tasks", Docin (豆丁网) *
刘全 et al.: "Q-V value function collaborative approximation model based on adaptive normalized RBF networks", Chinese Journal of Computers (计算机学报) *
周世正: "Multi-robot collaborative navigation based on deep reinforcement learning", China Master's Theses Full-text Database, Information Science and Technology *
唐长成: "Parameterized circuit optimization algorithm based on reinforcement learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439021A (en) * 2022-10-26 2022-12-06 江苏新恒基特种装备股份有限公司 Metal strengthening treatment quality analysis method and system

Also Published As

Publication number Publication date
CN115061444B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN111738512B (en) Short-term power load prediction method based on CNN-IPSO-GRU hybrid model
CN115688913B (en) Cloud edge end collaborative personalized federal learning method, system, equipment and medium
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN113361680A (en) Neural network architecture searching method, device, equipment and medium
CN110751318A (en) IPSO-LSTM-based ultra-short-term power load prediction method
CN116596044B (en) Power generation load prediction model training method and device based on multi-source data
CN112910690A (en) Network traffic prediction method, device and equipment based on neural network model
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN112287990A (en) Model optimization method of edge cloud collaborative support vector machine based on online learning
CN112270442A (en) IVMD-ACMPSO-CSLSTM-based combined power load prediction method
CN115061444B (en) Real-time optimization method for process parameters integrating probability network and reinforcement learning
CN110991621A (en) Method for searching convolutional neural network based on channel number
CN114118567A (en) Power service bandwidth prediction method based on dual-channel fusion network
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN114777192A (en) Secondary network heat supply autonomous optimization regulation and control method based on data association and deep learning
CN117290721A (en) Digital twin modeling method, device, equipment and medium
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
CN117093885A (en) Federal learning multi-objective optimization method integrating hierarchical clustering and particle swarm
CN115357862B (en) Positioning method in long and narrow space
CN116562454A (en) Manufacturing cost prediction method applied to BIM long-short-time attention mechanism network
CN113763710B (en) Short-term traffic flow prediction method based on nonlinear adaptive system
CN115081609A (en) Acceleration method in intelligent decision, terminal equipment and storage medium
Chen Brain Tumor Prediction with LSTM Method
CN114841472B (en) GWO optimization Elman power load prediction method based on DNA hairpin variation
WO2023082045A1 (en) Neural network architecture search method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant