CN113325721B

CN113325721B - Model-free adaptive control method and system for industrial system

Info

Publication number: CN113325721B
Application number: CN202110877921.6A
Authority: CN
Inventors: 罗远哲; 刘瑞景; 赵爱民; 李玉琼; 耿云晓; 刘志明; 易文军; 任光远; 靳晓栋
Original assignee: Zhongchao Weiye Beijing Business Data Technology Service Co ltd; Beijing China Super Industry Information Security Technology Ltd By Share Ltd
Current assignee: Zhongchao Weiye Beijing Business Data Technology Service Co ltd; Beijing China Super Industry Information Security Technology Ltd By Share Ltd
Priority date: 2021-08-02
Filing date: 2021-08-02
Publication date: 2021-11-05
Anticipated expiration: 2041-08-02
Also published as: CN113325721A

Abstract

The invention relates to a model-free self-adaptive control method and system for an industrial system. The method comprises the following steps: acquiring historical monitoring data of various devices in an industrial process; generating a control instruction set by using the controllable class data; the control instruction set comprises a plurality of control instructions generated at the next moment; constructing a prediction simulation model according to the historical monitoring data; training a reinforcement learning-based control model according to the prediction simulation model based on the control instruction set to generate a trained reinforcement learning-based control model; acquiring current monitoring data; and inputting the current monitoring data into the trained control model based on reinforcement learning, adaptively controlling the production process of the industrial system, and outputting the optimal set target of the industrial system. The invention can greatly reduce the trial and error cost and obtain a more effective intelligent control strategy.

Description

Model-free adaptive control method and system for industrial system

Technical Field

The invention relates to the field of industrial intelligent control and reinforcement learning control, in particular to a model-free self-adaptive control method and system for an industrial system.

Background

In recent years, the development of the industrial field is promoted by the rapid development of modern science and technology, and the informatization, automation and intelligent development of the industrial field is mature day by day. With the increasing expansion of industrial production scale, the realization of unmanned intelligent control in a complex industrial scene, how to further reduce the labor cost and the skill training cost of operators, how to separate from human experience intervention, and realize a more accurate and reliable intelligent control strategy becomes a key problem to be solved urgently. The traditional intelligent control technology is only suitable for simple industrial environments, in actual industrial production, a large number of sensors used for monitoring data exist in complex industrial environments, the traditional intelligent control technology cannot well utilize the potential characteristics of the monitoring data, the control method based on machine learning can learn the change rule of the monitoring data, has certain learning capacity and generalization capacity, can extract the objective rule of the production environment from the monitoring data, and sums up the experience and knowledge which cannot be found by human experts.

In machine Learning-based control, a typical method is to use a control law Learning method based on a Reinforcement Learning (RL) algorithm. The monitoring value migration rule in the complex industrial environment can be learned from data through reinforcement learning, a field expert is not needed for designing a control rule, and the method is suitable for complex industrial scenes. And increment learning is carried out on the basis of reinforcement learning, so that the control model has self-adaptive capacity and is closer to the actual industrial production condition in the actual application process. Reinforcement learning has wide application in various industrial fields such as power grid emergency control strategy research [ Liuwei, Zdongxia, Wangxingying, Houjinxiu, Liuliping, deep reinforcement learning-based power grid emergency control strategy research [ J ]. China Motor engineering bulletin, 2018, 38(01): 109-. In the existing actual industrial production control, training and testing of a control strategy in an industrial environment are needed to obtain a self-adaptive model with better performance, and the trial-and-error cost and the research and development cost are too high.

Disclosure of Invention

The invention aims to provide a model-free self-adaptive control method and system for an industrial system, which aim to solve the problems of high trial and error cost and high research and development cost.

In order to achieve the purpose, the invention provides the following scheme:

a model-free adaptive control method for an industrial system, comprising:

acquiring historical monitoring data of various devices in an industrial process; the historical monitoring data comprises controllable data, state data, environmental noise data and target output data; the controllable data comprises the opening degree of a flow valve, the opening degree of a valve, the rotating speed of a frequency converter and the rotating speed of a pump; the state class data comprises pipeline pressure in industrial production; the environmental noise data comprises product information, temperature and humidity of the previous process; the target output class data comprises an object controlled in the production process;

generating a control instruction set by using the controllable class data; the control instruction set comprises a plurality of control instructions generated at the next moment;

constructing a prediction simulation model according to the historical monitoring data;

training a reinforcement learning-based control model according to the prediction simulation model based on the control instruction set to generate a trained reinforcement learning-based control model;

acquiring current monitoring data;

and inputting the current monitoring data into the trained control model based on reinforcement learning, adaptively controlling the production process of the industrial system, and outputting the optimal set target of the industrial system.

Optionally, the generating a control instruction set by using the controllable class data specifically includes:

defining a piece of monitoring data, wherein the monitoring data is historical monitoring data S or current monitoring data

，

For any one of the monitored data, the controllable variable of the controllable class data,

for the system state quantity of the state class data in any one piece of the monitoring data,

for the amount of ambient noise of the ambient noise-like data in any of the monitoring data,

the target output quantity of the target output class data in any piece of monitoring data is S, the historical monitoring data of a continuous time period,

the size of a historical monitoring data set is shown, a control is controllable data, a state is state data, env is environmental noise data, and a goal is target output data;

to the controlled variable from the historical monitoring data S

Collecting and generating

A bar control instruction;

shrinking by clustering

Determining the optimal clustering center number by using Bayesian information criterion according to the scale of the control instructionkAnd all cluster centers in each cluster

As the enhancement basisA control instruction set is generated by an action instruction of the learned control model.

Optionally, the constructing a predictive simulation model according to the historical monitoring data specifically includes:

constructing a plurality of prediction models to predict the system state quantity and the target prediction state output quantity at the next moment

Each variable in (a) is independently predicted; for the prediction of each univariate, a LightGBM algorithm is adopted to construct a prediction model, the maximum number of leaves num _ leaves is 10, the learning rate is 0.8, the feature screening proportion feature _ fraction is 0.9, and l2 regular terms are adopted to reduce overfitting;

dividing the historical monitoring data into 7: 3; wherein 30% of the historical monitoring data is used as a validation set for determining the hyper-parameters of the optimal prediction model;

according to the controllable variable given by the controller and the volume of the environmental noise

And the system state quantity and the target current state output quantity in the historical monitoring data

And integrating a plurality of the prediction models to construct a prediction simulation model.

Optionally, based on the control instruction set, training a reinforcement learning-based control model according to the predictive simulation model, and generating the trained reinforcement learning-based control model specifically includes:

constructing a control model based on reinforcement learning and acquiring the current monitoring data

Setting a control target value

And the historyMonitoring the amount of ambient noise at the next moment in the data

；

The current monitoring data is processed

And setting a control target value

Input to the reinforcement learning-based control model, and output

The profit value of each control instruction is used as probability weight for sampling, and one control instruction in the control instruction set is sampled

；

According to the current monitoring data

And the control instruction

Predicting the system state quantity and the target predicted state output quantity at the next moment by using the prediction simulation model

；

According to the set control target value

And target output quantity at next time

Calculating decision rewardsr；

Based on theDecision rewardsrThe current monitoring data

The control instruction

And the system state quantity and the target predicted state output quantity at the next moment

Training the reinforcement Learning-based control model with a Q-Learning-based time sequence difference loss function to enable the reinforcement Learning-based control model to monitor the current monitoring data

Then, a control instruction for maximizing the future accumulated award is output

；

Monitoring data of the next moment

Replacing the current monitoring data

And training the reinforcement learning-based control model until the average reward of the reinforcement learning-based control model is not increased any more, and determining the trained reinforcement learning-based control model.

Optionally, the time sequence differential loss function is:

wherein the content of the first and second substances,

to a cumulative discount value;ssystem state quantity and target current state output quantity at current moment

，s'Predicting the state output quantity for the system state quantity and the target at the next moment

，

Control instructions for sampling

，

Is at the same times'The control input values that are available for selection in a state,

a learning rate for the reinforcement learning-based control model;Qin order to enhance the learning of the network,

indicating the system state assThe control command is executed as

Under the condition of (1), controlling the optimal long-term income obtained by the strategy in the future;

indicating the system state ass'The control command is executed as

Under the circumstances of (1), long-term gains obtained in the future by the control strategy; system statesIs controlled by

Evolved intos'The obtained single-step control benefit isrTo the network output value

Optimizing to obtain the optimized result of the time sequence difference loss function

。

An industrial system model-free adaptive control system, comprising:

the historical monitoring data acquisition module is used for acquiring historical monitoring data of various devices in the industrial process; the historical monitoring data comprises controllable data, state data, environmental noise data and target output data; the controllable data comprises the opening degree of a flow valve, the opening degree of a valve, the rotating speed of a frequency converter and the rotating speed of a pump; the state class data comprises pipeline pressure in industrial production; the environmental noise data comprises product information, temperature and humidity of the previous process; the target output class data comprises an object controlled in the production process;

the control instruction set generating module is used for generating a control instruction set by using the controllable class data; the control instruction set comprises a plurality of control instructions generated at the next moment;

the prediction simulation model building module is used for building a prediction simulation model according to the historical monitoring data;

the trained reinforcement learning-based control model determining module is used for training a reinforcement learning-based control model according to the prediction simulation model based on the control instruction set to generate the trained reinforcement learning-based control model;

the current monitoring data acquisition module is used for acquiring current monitoring data;

and the self-adaptive control module is used for inputting the current monitoring data into the trained control model based on reinforcement learning, adaptively controlling the production process of the industrial system and outputting the optimal set target of the industrial system.

Optionally, the control instruction set generating module specifically includes:

a parameter definition unit for defining a piece of monitoring data, wherein the monitoring data is historical monitoring data S or current monitoring data

，

a control instruction generation unit for generating a controllable variable from the historical monitoring data S

Collecting and generating

A bar control instruction;

a control instruction set generation unit for employingClustering mode reduction

As an action command of the reinforcement learning-based control model, a control command set is generated.

Optionally, the prediction simulation model building module specifically includes:

a plurality of prediction model construction units for constructing a plurality of prediction models to predict the system state quantity and the target predicted state output quantity at the next time

the dividing unit is used for dividing the historical monitoring data into 7: 3; wherein 30% of the historical monitoring data is used as a validation set for determining the hyper-parameters of the optimal prediction model;

a prediction simulation model construction unit for constructing a prediction simulation model based on the controllable variables and the amount of the environmental noise

Optionally, the trained control model determination module based on reinforcement learning specifically includes:

a reinforcement learning based control model construction unit for constructing a reinforcement learning based control model and acquiring the current monitoring data

Setting a control target value

And the volume of the environmental noise at the next moment in the historical monitoring data

；

A control instruction sampling unit for sampling the current monitoring data

And setting a control target value

Input to the reinforcement learning-based control model, and output

；

A prediction unit for predicting the current monitoring data

And the control instruction

；

A decision reward calculation unit for controlling the target value according to the setting

And target output quantity at next time

Calculating decision rewardsr；

A training unit for rewarding based on the decisionrThe current monitoring data

The control instruction

；

A control model determining unit based on reinforcement learning after training, which is used for determining the monitoring data of the next moment

Replacing the current monitoring data

Optionally, the time sequence differential loss function is:

wherein the content of the first and second substances,

，

Control instructions for sampling

，

indicating the system state assThe control command is executed as

indicating the system state ass'The control command is executed as

。

According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the invention provides a model-free self-adaptive control method and system for an industrial system, which directly utilize sensing monitoring data of the industrial system to establish a prediction simulation model for environment state deduction, simultaneously obtain a set of control instructions in the data preprocessing process, finally utilize a reinforcement learning method to learn a control strategy based on the prediction simulation model, train the control model based on the reinforcement learning, generate the trained control model based on the reinforcement learning, and output the optimal set target of the industrial system, so that the training and the testing of the control strategy in the industrial environment are not needed, and the trial-and-error cost is greatly reduced. And even if the actual industrial equipment generating the training data does not show better control performance, the intelligent control strategy which is more effective than the existing control system or algorithm can be obtained by utilizing the model-free adaptive control method or the system learning control experience of the industrial system provided by the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a flow chart of a model-free adaptive control method for an industrial system according to the present invention;

FIG. 2 is a technical framework diagram of a model-free adaptive control method for an industrial system according to the present invention;

FIG. 3 is a schematic diagram of a predictive simulation model according to the present invention;

FIG. 4 is a schematic diagram of a reinforcement learning network structure according to the present invention;

FIG. 5 is a block diagram of a model-free adaptive control system for an industrial system according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a model-free self-adaptive control method and system for an industrial system, which can greatly reduce the trial and error cost and obtain a more effective intelligent control strategy.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a flowchart of a model-free adaptive control method for an industrial system according to the present invention, and as shown in fig. 1, the model-free adaptive control method for an industrial system includes:

step 101: acquiring historical monitoring data of various devices in an industrial process; the historical monitoring data comprises controllable data, state data, environmental noise data and target output data; the controllable data comprises the opening degree of a flow valve, the opening degree of a valve, the rotating speed of a frequency converter and the rotating speed of a pump; the state class data comprises pipeline pressure in industrial production; the environmental noise data comprises product information, temperature and humidity of the previous process; the target output class data includes objects controlled in the production process.

Firstly, classifying and defining various monitoring data collected in an industrial process from devices such as sensors, motor devices, valve switches and the like, and specifically classifying the monitoring data into the following four types:

1) controllable class: production parameters which allow direct control in the industrial field, such as the opening degree of a flow valve, the opening degree of a valve, the rotating speed of a frequency converter, the rotating speed of a pump and the like which can be controlled in industrial production, are classified into controllable variables, which are hereinafter referred to as control.

2) The state class: the pressure value of the pipeline can not be directly controlled, but the pressure value of the pipeline can not be directly controlled, and the flow of the pipeline can be adjusted only by controlling the pump speed of one section of the pipeline, so that the pressure value of the pipeline can be controlled. Such variables are hereinafter referred to as states.

3) Ambient noise class: the variables are not determined internally by the production system, but only from the outside, including product information of the previous process or external environmental influences such as temperature, moderation, etc., hereinafter abbreviated as env.

4) Target output class: the object to be controlled in the production process is often a key object influencing quality and cost in the production process, and is hereinafter referred to as "good".

In practical applications, as shown in fig. 2, it is first necessary to install sensors at key nodes in the production process to measure the system state quantity s' (state, good) and the environmental noiseThe quantity s (env) is measured and the controllable quantity in the production process can be generally directly obtained from an on-site control system. After the data collection is completed, different time series need to be aligned according to time, and specifically, a linear interpolation method or a gaussian process method can be adopted. Assume that the aligned sequence length is

。

Step 102: generating a control instruction set by using the controllable class data; the control instruction set comprises a plurality of control instructions generated at the next moment.

The step 102 specifically includes: defining a piece of monitoring data, wherein the monitoring data is historical monitoring data S or current monitoring data

，

the size of a historical monitoring data set is shown, a control is controllable data, a state is state data, env is environmental noise data, and a goal is target output data; from the historyTo controlled variable in monitoring data S

Collecting and generating

A bar control instruction; shrinking by clustering

In practical application, after data collection and data alignment are completed, controllable parameters in original monitoring data are extracted, in the case of the scheme, the controllable parameters comprise the opening of a flow valve, the rotating speeds of 2 frequency converters and 1 pump, the data of the controllable units can be derived together with the monitoring data of other sensors, and the first step is that

Bar control instruction

Expressed as shown in formula 1:

(1)

wherein, s (control)_iRepresenting the controllable part of the ith record in the monitored data. Obtained through step 101

A control instruction, which is too large to make a decision by the reinforcement learning modelAnd there may be a large number of similar or identical instructions, the present invention employs clustering to reduce its size.

Specifically, a K-means clustering algorithm is adopted to aggregate similar control instructions into a cluster, only the center of the cluster is used as a control instruction for selection of a reinforcement learning model, and due to different dimensions of different input items, a normalization method is required to be adopted before clustering to make calculation of instruction distances in actions sets more meaningful:

(2)

where mean represents the mean of all data entries calculated and std represents the standard deviation of all data entries calculated. Center of cluster numberkThe value of (b) is measured by referring to a Bayes Information Criterion (BIC) value, and the larger the BIC value is, the better the clustering effect is. BIC is defined as shown in formula 3:

(3)

where L is the sum of the likelihood values of all data points for the class to which they belong. By comparing the number of different clusterskThe optimal number of clusters is obtained by the BIC value of (1):

(4)

components denote the number of clustering centers, and kmeans denotes the process of performing the K-means clustering algorithm. Finally, the mean value of the instructions in each class is used for representing one instruction, so that the average value of the instructions in each class is obtained under the given data setkA control instruction.

(5)

Step 103: and constructing a prediction simulation model according to the historical monitoring data.

The step 103 specifically includes: constructing a plurality of prediction models to predict the system state quantity and the target prediction state output quantity at the next moment

Each variable in (a) is independently predicted; for the prediction of each univariate, a LightGBM algorithm is adopted to construct a prediction model, the maximum number of leaves num _ leaves is 10, the learning rate is 0.8, the feature screening proportion feature _ fraction is 0.9, and l2 regular terms are adopted to reduce overfitting; dividing the historical monitoring data into 7: 3; wherein 30% of the historical monitoring data is used as a validation set for determining the hyper-parameters of the optimal prediction model; according to the controllable variable given by the controller and the volume of the environmental noise

In practical applications, as shown in FIG. 3, it is necessary to construct multiple prediction models to pair

And finally, integrating all independent models together to serve as a complete system simulation prediction model.

For the prediction of each univariate, a LightGBM algorithm is adopted to construct a prediction model, the maximum number of leaves num _ leaves is 10, the learning rate is 0.8, the feature screening proportion feature _ fraction is 0.9, and the l2 regular term is adopted to reduce overfitting.

The historical monitoring data was divided into 7:3, with 30% of the data being used as validation set to determine the optimal model hyper-parameters.

For each one

The predicted dependent variable is integrated with all the models to construct a simulation model of the industrial process, namely, the simulation model of the industrial process can be constructed according to the control quantity or the environmental noise quantity given by the controller

And the current state quantity of the system

Predict what is new

。

Step 104: and training a reinforcement learning-based control model according to the prediction simulation model based on the control instruction set to generate the trained reinforcement learning-based control model.

The step 104 specifically includes: constructing a control model based on reinforcement learning and acquiring the current monitoring data

Setting a control target value

(ii) a The current monitoring data is processed

And setting a control target value

Input to the reinforcement learning-based control model, and output

The profit value of each control instruction is used asSampling probability weight, and sampling one control instruction in the control instruction set

(ii) a According to the current monitoring data

And the control instruction

(ii) a According to the set control target value

And target output quantity at next time

Calculating decision rewardsr(ii) a Reward based on the decisionrThe current monitoring data

The control instruction

Next, an output maximizes the future accumulated awardControl instruction

(ii) a Monitoring data of the next moment

Replacing the current monitoring data

The timing differential loss function is:

wherein the time sequence difference loss function is an iterative optimization function based on Q learning,

to the cumulative discount value, set to 0.95;ssystem state quantity and target current state output quantity at current moment

，

Control instructions for sampling

，

indicating the system state assThe control command is executed as

indicating the system state ass'The control command is executed as

Evolved intos'Assume that the control command is executed as

In the case of (3), long-term gains that the control strategy can achieve in the future. Using iterative bellman equations, using collected system state evolution data, i.e. system statessIs controlled by

Evolved intos'The obtained single-step control benefit isrThereby outputting the value to the network

Optimizing to obtain

。

In practical application, a control model based on reinforcement learning can be trained by using a prediction simulation model constructed by a plurality of LightGBM models, and the specific steps are as follows:

constructing a deep neural network-based reinforcement learning model, and inputting the model into the current system state quantity as shown in FIG. 4

And setting a control target value human (goal), outputting a predicted reward for each action

Wherein

The control command is the ith control command in the control command set Actions, i belongs to n, n is the control command serial number, and the set control target value human (goal) refers to the target value of the goal set by the human.

The network consists of a full connection layer, a RELU nonlinear activation layer, a noise linear layer and a softmax normalization layer. A state value branch V and an action dominance value estimation branch A are respectively introduced into the network, and the accuracy of action value estimation can be improved through experimental verification of the network design.

And (3) rewarding of the reinforcement learning model is defined in sections according to the difference between the predicted value of the target parameter and the artificially set value, wherein the difference value is e:

the calculation method of the prize is as shown in table 1.

TABLE 1 reward definition Table

	Reward for
			10
	6
			2
	0

The segmentation criteria of the difference values are related to the industrial scenario applied by the embodiment, and the parameters have no universality in different industrial scenarios.

Training a reinforcement learning network: the reinforcement learning takes the fragments as a unit, and meanwhile, in order to accelerate the training speed of the model, a parallelization technology is adopted to enable the model to process and learn a plurality of fragments simultaneously, the parallelization quantity is represented by batch _ size =32, and the specific training process is as follows:

randomly fetching batch _ size current monitoring data from actual production data

And the state parameter used for representing the current production situation is used as the starting state of each training segment. Setting an artificially set control target valuehuman(gold), the present invention is exemplified by an industrial thickener underflow concentration control, set to 67.

For each time state, the reinforcement learning network inputs the state parameters

And a control target valuehuman(coarse), output the size of batch _ size as the number of cluster centerskEach value in the vector represents a long-term discount yield in the future brought by selecting one of the control inputs. At the moment, the yield value is converted into the probability distribution of action selection by using a softmax function, a control command s' (control) is sampled from the probability distribution, and the system state at the next moment is predicted by adopting a prediction simulation model.

Will be provided with

S '(control) is used as an input of the obtained prediction simulation model to predict the system state quantity and the target quantity s' (state, good) at the next time.

According to artificially set target valueshuman(gold) and predicted s' (gold) computational decision rewardsrAnd based on the target current state output quantity s (state, good), the target predicted state output quantity s' (state, good), the reward

And a control input s' (control), training control model parameters by using a Q-Learning based time sequence difference loss function so as to enable the reinforcement Learning model to be given

Lower, output awardsrAs large as s' (control) as possible. The timing difference loss function is expressed as:

(6)

wherein the content of the first and second substances,

，

Control instructions for sampling

，

indicating the system state assThe control command is executed as

indicating the system state ass'The control command is executed as

Evolved intos'Assume that the control command is executed as

Optimizing to obtain

。

And (4) replacing s (control, state, env, good) with s' (control, state, env, good), and repeatedly training the control model based on reinforcement learning, wherein in the training process of the control model, the average reward obtained by the model in continuous 50 iterations is not increased, which indicates that the model parameters reach a convergence state.

Step 105: and acquiring current monitoring data.

Step 106: and inputting the current monitoring data into the trained control model based on reinforcement learning, adaptively controlling the production process of the industrial system, and outputting the optimal set target of the industrial system.

And deploying the trained reinforcement learning model on a DCS engineer station or a high-performance computing server of an industrial field, and deploying a model inference program into a Web service supporting access of a RestFul protocol.

And accessing industrial system state quantities s (control, state, env, good) such as a sensor monitoring value, a controllable unit state value, an external environment quantity and the like at regular time according to a corresponding data acquisition interval during control model training by using an industrial control system data acquisition protocol, such as an OPC UA protocol.

Inputting s (control, state, env, good) and artificial set value into control model, selecting from solution resultkAnd (4) a command with the largest future potential income estimation value in the candidate actions is written into the control system by utilizing an industrial control protocol to complete control.

Fig. 5 is a structural diagram of a model-free adaptive control system of an industrial system according to the present invention, and as shown in fig. 5, a model-free adaptive control system of an industrial system includes:

a historical monitoring data acquisition module 501, configured to acquire historical monitoring data of various devices in an industrial process; the historical monitoring data comprises controllable data, state data, environmental noise data and target output data; the controllable data comprises the opening degree of a flow valve, the opening degree of a valve, the rotating speed of a frequency converter and the rotating speed of a pump; the state class data comprises pipeline pressure in industrial production; the environmental noise data comprises product information, temperature and humidity of the previous process; the target output class data includes objects controlled in the production process.

A control instruction set generating module 502, configured to generate a control instruction set by using the controllable class data; the control instruction set comprises a plurality of control instructions generated at the next moment.

The control instruction set generating module 502 specifically includes: a parameter definition unit for defining a piece of monitoring data, wherein the monitoring data is historical monitoring data S or current monitoring data

，

the size of a historical monitoring data set is shown, a control is controllable data, a state is state data, env is environmental noise data, and a goal is target output data; a control instruction generation unit for generating a controllable variable from the historical monitoring data S

Collecting and generating

A bar control instruction; a control instruction set generation unit for narrowing down by clustering

And a predictive simulation model constructing module 503, configured to construct a predictive simulation model according to the historical monitoring data.

The prediction simulation model building module 503 specifically includes: a plurality of prediction model construction units for constructing a plurality of prediction models to predict the system state quantity and the target predicted state output quantity at the next time

Each variable in (a) is independently predicted; for the prediction of each univariate, a LightGBM algorithm is adopted to construct a prediction model, the maximum number of leaves num _ leaves is 10, the learning rate is 0.8, the feature screening proportion feature _ fraction is 0.9, and l2 regular terms are adopted to reduce overfitting; the dividing unit is used for dividing the historical monitoring data into 7: 3; wherein 30% of the historical monitoring data is used as a validation set for determining the hyper-parameters of the optimal prediction model; a prediction simulation model construction unit for constructing a prediction simulation model based on the controllable variables and the amount of the environmental noise

A trained reinforcement learning based control model determining module 504, configured to train a reinforcement learning based control model according to the predictive simulation model based on the control instruction set, and generate the trained reinforcement learning based control model.

The trained control model determination module 504 based on reinforcement learning specifically includes: a reinforcement learning based control model construction unit for constructing a reinforcement learning based control model and acquiring the current monitoringData of

Setting a control target value

(ii) a A control instruction sampling unit for sampling the current monitoring data

And setting a control target value

Input to the reinforcement learning-based control model, and output

(ii) a A prediction unit for predicting the current monitoring data

And the control instruction

(ii) a A decision reward calculation unit for controlling the target value according to the setting

And target output quantity at next time

Calculating decision rewardsrA training unit for rewarding based on the decisionrThe current monitoring data

The control instruction

(ii) a A control model determining unit based on reinforcement learning after training, which is used for determining the monitoring data of the next moment

Replacing the current monitoring data

The timing differential loss function is:

wherein the content of the first and second substances,

，

Control instructions for sampling

，

indicating the system state assThe control command is executed as

indicating the system state ass'The control command is executed as

Evolved intos'The obtained single-step control benefit isrAssume that the control command is executed as

Optimizing to obtain

。

And a current monitoring data obtaining module 505, configured to obtain current monitoring data.

And the adaptive control module 506 is used for inputting the current monitoring data into the trained control model based on reinforcement learning, adaptively controlling the production process of the industrial system and outputting the optimal set target of the industrial system.

Aiming at the limitation that the traditional intelligent control technology is only suitable for simple industrial environment, the invention provides a control method based on the combination of machine learning and reinforcement learning. The method can excavate objective rules of the production environment from the monitoring data by utilizing the strong self-learning capability and generalization capability of the method, and converts the objective rules into an intelligent control strategy with good control precision, thereby being capable of separating from the human intervention of field experts and control experts.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A model-free adaptive control method for an industrial system, comprising:

the building of the prediction simulation model according to the historical monitoring data specifically comprises the following steps:

Integrating a plurality of the prediction models to construct a prediction simulation model;

acquiring current monitoring data;

2. The model-free adaptive control method for industrial systems according to claim 1, wherein the generating a set of control instructions using the controllable class data comprises:

，

to the controlled variable from the historical monitoring data S

Collecting and generating

A bar control instruction;

shrinking by clustering

3. The model-free adaptive control method for the industrial system according to claim 2, wherein the training the reinforcement learning-based control model according to the predictive simulation model based on the control instruction set to generate the trained reinforcement learning-based control model specifically comprises:

Setting a control target value

；

The current monitoring data is processed

And setting a control target value

Input to the reinforcement learning-based control model, and output

；

According to the current monitoring data

And the control instruction

；

According to the set control target value

And target output quantity at next time

Calculating decision rewardsr；

Reward based on the decisionrThe current monitoring data

The control instruction

；

Replacing the current monitoring data s (control, state, env, good) with the monitoring data s' (control, state, env, good) at the next moment, training the reinforcement learning-based control model until the average reward of the reinforcement learning-based control model is not increased any more, and determining the trained reinforcement learning-based control model.

4. The model-free adaptive control method for industrial systems according to claim 3, wherein the timing differential loss function is:

wherein the content of the first and second substances,

，

Control instructions for sampling

，

indicating the system state assThe control command is executed as

indicating the system state ass'The control command is executed as

。

5. A model-free adaptive control system for an industrial system, comprising:

the prediction simulation model building module specifically comprises:

6. The model-free adaptive control system for industrial systems according to claim 5, wherein the control instruction set generation module specifically comprises:

，

Collecting and generating

A bar control instruction;

a control instruction set generation unit for narrowing down by clustering

7. The model-free adaptive control system for industrial systems according to claim 6, wherein the trained reinforcement learning-based control model determination module specifically comprises: