CN116954162A - Method and apparatus for generating control strategy for industrial system

Info

Publication number: CN116954162A
Application number: CN202310491988.5A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: industrial system, network, information, control, controlling
Inventors: 刘浏, 刘子轩, 赵沛霖
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 19/00 - Programme-control systems
    • G05B 19/02 - Programme-control systems electric
    • G05B 19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS], computer integrated manufacturing [CIM]
    • G05B 19/41885 - Total factory control characterised by modeling or simulation of the manufacturing system
    • G05B 2219/00 - Program-control systems
    • G05B 2219/30 - NC systems
    • G05B 2219/32 - Operator till task planning
    • G05B 2219/32339 - Object oriented modeling, design, analysis, implementation, simulation language

Abstract

A method, apparatus, electronic device, and computer-readable storage medium for generating a control strategy for an industrial system are disclosed. The method comprises: acquiring current state information generated by controlling the industrial system based on the control strategy of the previous state; predicting, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state; generating, based on the first control information and equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state; correcting the second control information based on inequality constraints for controlling the safety risk of the industrial system to generate a control strategy for the industrial system in the current state; and controlling the industrial system in the current state in the application environment of the industrial system based on the control strategy.

Description

Method and apparatus for generating control strategy for industrial system
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a method, apparatus, training method, electronic device, and computer-readable storage medium for generating control strategies for industrial systems.
Background
With the advancement of industry, a new round of industrial revolution is underway, driven by new technologies such as information technology and artificial intelligence. Various industries are converting to digitalization, intelligence, and automation, entering a new stage of modern industry. However, as the complexity of industrial systems increases, so does the difficulty of controlling them.
To meet the increasing demands placed on such systems, industry has proposed using artificial-intelligence-based methods to analyze the state of industrial systems and formulate appropriate control strategies. For example, in application scenarios such as automation equipment control, logistics resource scheduling, and power grid scheduling, artificial-intelligence-based methods can help people formulate a suitable control strategy.
However, while current industrial control schemes use artificial intelligence for automation, they do not fully guard against the safety risks an industrial system, especially a microgrid system, faces when subjected to incorrect operating behavior. Current industrial control schemes are mainly based on reinforcement learning, which tends to ignore safety-risk constraints in pursuit of better returns, resulting in undesirable behaviors. Accordingly, current artificial-intelligence-based industrial control schemes need to be improved to ensure system safety.
Disclosure of Invention
Embodiments of the present disclosure provide a method, apparatus, training method, electronic device, and computer-readable storage medium for generating a control strategy for an industrial system.
The disclosed embodiments provide a method of generating a control strategy for an industrial system, comprising: acquiring current state information generated by controlling the industrial system based on the control strategy of the previous state; predicting, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state; generating, based on the first control information and equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state; correcting the second control information based on inequality constraints for controlling the safety risk of the industrial system to generate a control strategy for the industrial system in the current state; and controlling the industrial system in the current state in the application environment of the industrial system based on the control strategy.
An embodiment of the present disclosure provides an apparatus for generating a control strategy for an industrial system, comprising: an acquisition module configured to acquire current state information generated by controlling the industrial system based on the control strategy of the previous state; a policy network configured to predict, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state; a completion module configured to generate, based on the first control information and equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state; a correction module configured to correct the second control information based on inequality constraints for controlling the safety risk of the industrial system, generating a control strategy for the industrial system in the current state; and a control module configured to control the industrial system in the current state in the application environment of the industrial system based on the control strategy.
The embodiment of the disclosure provides an electronic device, comprising: a processor; and a memory, wherein the memory stores a computer executable program that, when executed by the processor, performs the method described above.
The disclosed embodiments provide an apparatus comprising: a processor; and a memory storing computer instructions which, when executed by the processor, implement the above-described method.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the aspects described above or the methods provided in their various alternative implementations.
Various embodiments of the present disclosure complete and adjust the predicted action sequence for controlling an industrial system based on the equality constraints and inequality constraints for controlling the safety risk of the industrial system, thereby ensuring that the control strategy for the industrial system satisfies its safety-risk constraints. Further, the training process of the policy network according to some embodiments additionally uses these equality and inequality constraints to construct an augmented loss, implementing gradient backpropagation through the safety-risk constraints via implicit differentiation and yielding a more robust and reliable policy network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. The drawings in the following description are only exemplary embodiments of the present disclosure.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present disclosure.
Fig. 2 is an example schematic diagram illustrating a scenario in which a policy network performs inference and training in accordance with an embodiment of the present disclosure.
FIG. 3 is a flow chart illustrating a method of generating a control strategy for an industrial system according to an embodiment of the present disclosure.
Fig. 4 is a schematic flowchart illustrating a micro grid scheduling method according to an embodiment of the present disclosure.
FIG. 5 is an interactive schematic diagram illustrating an industrial system, a control strategy, and an application environment according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram illustrating interactions of a policy network, a completion module, and a correction module during a training process according to an embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a neural network training method for industrial system control, according to an embodiment of the present disclosure.
FIG. 8 is yet another interaction diagram illustrating a policy network, a completion module, and a correction module in a training process according to an embodiment of the present disclosure.
Fig. 9 shows a schematic diagram of an electronic device according to an embodiment of the disclosure.
FIG. 10 illustrates an architectural diagram of a computing device according to an embodiment of the present disclosure.
Fig. 11 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of those steps and elements are omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish the descriptions, and are not to be construed as indicating or implying relative importance or order.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
Industrial control (factory control) mainly refers to the use of computer technology, microelectronics technology, and electrical means to make the production and manufacturing processes of factories more automated, efficient, accurate, controllable, and visible. Industrial control involves industrial system circuit control (e.g., automation equipment control), industrial resource scheduling (e.g., logistics resource scheduling, grid scheduling), and the like. Currently, artificial-intelligence-based methods are used in an increasing number of industrial control scenarios. For example, the state of an industrial control system can be analyzed with a neural network to quickly formulate a suitable control strategy for human reference, effectively reducing the manual analysis workload.
More specifically, embodiments of the present disclosure provide industrial control schemes that relate primarily to sequential decision techniques and safe reinforcement learning techniques based on neural networks.
Sequential decision techniques are decision methods for optimizing stochastic or uncertain dynamic systems, widely used in machine learning and reinforcement learning. The artificial intelligence acting as the decision maker selects, based on existing information and possible future outcomes, an optimal decision scheme through a series of decisions. Sequential decision making is characterized by the fact that the system under study is dynamic, decisions are made in sequence, and the state the system may exhibit in the next step is random or uncertain. The sequential decision process starts from the initial state; after making the optimal decision at each moment, the decision maker observes the state that actually appears next, and repeats this until the final step. In an industrial system, every decision (also called an operation or action) that the artificial agent, as the decision maker, makes within a short time must comply with safety regulations, and every state of the whole industrial system must be safe and stable. In addition, the long-term benefit of the industrial system should be relatively optimized across multiple conditions. Taking a microgrid as an example, an artificial agent controlling the microgrid often needs to make an optimal decision within ten minutes, considering the various factors that may affect its operating state (e.g., weather, whether new energy is used, battery capacity, rates), in order to optimize the microgrid's long-term benefit (e.g., power-generation efficiency over a whole day).
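As an illustrative sketch of this sequential decision process (a minimal Python loop; agent, environment, and horizon are hypothetical stand-ins rather than anything defined in this disclosure):

```python
# Illustrative sequential-decision loop: the decision maker acts at each
# step, then observes the (random or uncertain) state that actually appears.
def run_episode(agent, environment, horizon):
    state = environment.reset()                   # initial state
    total_benefit = 0.0
    for t in range(horizon):                      # decisions made in sequence
        action = agent.decide(state)              # optimal decision at time t
        state, reward = environment.step(action)  # next state is uncertain
        total_benefit += reward                   # long-term benefit accumulates
    return total_benefit
```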
Reinforcement learning in an industrial system is a machine learning method that learns how to maximize rewards by interacting with the application environment of the industrial system. Compared with traditional operations-optimization algorithms, reinforcement learning can respond more quickly in real time and maximize long-term benefit, making it better suited to the scheduling requirements of industrial systems. For example, a microgrid control system adopting reinforcement learning can play a large role in real-time microgrid dispatching by learning how to optimize the efficiency of the power system, reduce cost, and reduce environmental impact. In addition, the neural network in a reinforcement learning algorithm has a fast forward pass and requires no complex solving, so the microgrid scheduling problem can be handled more efficiently.
Safe reinforcement learning (Safe RL) is a method that, under the reinforcement learning framework and with the safety and reliability of the industrial system as the primary considerations, generates control strategies that ensure the accuracy and reliability of an (intelligent) industrial system's behavior. In an industrial system, a scheduling algorithm based on safe reinforcement learning can take the system's safety risk and reliability into account to ensure that the system does not malfunction or cause safety problems. For a microgrid, safe reinforcement learning can learn how to maximize rewards during reinforcement learning while the microgrid's primary safety risks are guarded against. For example, in real-time microgrid scheduling, safe reinforcement learning may consider how to ensure the stability of the power system and prevent grid faults, so as to ensure the microgrid's stable operation.
The various models to which the present disclosure relates may be based on artificial intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to sense the environment, acquire knowledge, and use that knowledge to obtain optimal results by simulating, extending, and expanding human intelligence. In other words, artificial intelligence is an integrated technology of computer science that aims to enable machines to react in a manner similar to human intelligence. For example, the policy network of the present disclosure can determine control information for an industrial system based on understanding the system's state information, much as a control expert of the industrial system would observe and understand it. Such a model realizes the functions of understanding an industrial system and generating control information for it through the study of the design principles and implementation methods of various intelligent machines.
Alternatively, the various models that may be used in the embodiments of the present disclosure below may be artificial intelligence models, in particular artificial-intelligence-based neural network models. Typically, such a neural network model is implemented as a loop-free (acyclic) graph in which neurons are arranged in different layers: the model includes an input layer and an output layer separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating output in the output layer. Network nodes (i.e., neurons) are connected to nodes in adjacent layers via edges, and there are no edges between nodes within a layer. Data received at the nodes of the input layer propagates to the nodes of the output layer via hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit.
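A minimal sketch of such a layered, loop-free network, assuming PyTorch purely for illustration (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Input layer -> hidden layer -> output layer; edges connect only adjacent
# layers, and there are no edges between nodes within the same layer.
model = nn.Sequential(
    nn.Linear(16, 64),   # input layer to hidden layer
    nn.ReLU(),           # activation layer
    nn.Linear(64, 8),    # hidden layer to output layer
)

state = torch.randn(1, 16)   # e.g., encoded state information
output = model(state)       # e.g., predicted control information
```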
The scheme provided by the embodiment of the disclosure relates to artificial intelligence, machine learning and other technologies, and is specifically described by the following embodiment.
First, an application scenario of a method according to an embodiment of the present disclosure and a corresponding apparatus or the like will be described with reference to fig. 1. Fig. 1 shows a schematic diagram of an application scenario 100, in which a server 110 and a plurality of terminals 120 are schematically shown, according to an embodiment of the present disclosure.
The policy network (including a first policy network and a second policy network), the reward network (including a first reward network and a second reward network), and the like of the embodiments of the present disclosure may be integrated in various electronic devices, such as the server 110 or any of the plurality of terminals 120 in Fig. 1. For example, the policy network may be integrated in the terminal 120. The terminal 120 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal computer (PC), a smart speaker, a smart watch, or the like, but is not limited thereto. For another example, the model for generating control strategies may also be integrated in the server 110. The server 110 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which the present disclosure does not limit.
It can be appreciated that the device performing inference with the policy network according to the embodiments of the present disclosure may be a terminal, a server, or a system comprising a terminal and a server. The method of generating a control strategy according to embodiments of the present disclosure can be executed on the terminal, on the server, or jointly by the terminal and the server.
The policy network provided by the embodiments of the present disclosure may also be part of a cloud service of an industrial system. Cloud technology in an industrial system refers to a hosting technology that unifies hardware, software, network, and other resources in the industrial field to realize the computation, storage, processing, and sharing of data. Cloud technology is a general term for industrial network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is flexible and convenient to use on demand. Taking a microgrid as an example, the background service of a microgrid scheduling system needs large amounts of computing and storage resources, and as the microgrid scheduling system grows more complex, its data requires stronger backend system support, so the system is better implemented using cloud technology.
Artificial intelligence cloud services have also begun to find widespread use in industrial systems. Artificial intelligence cloud services are commonly referred to as AIaaS (AI as a Service). This is the mainstream service mode of current artificial intelligence platforms: an AIaaS platform splits common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all developers of industrial systems can access one or more of the platform's artificial intelligence services through application programming interfaces (APIs), and some sophisticated developers can deploy, operate, and maintain proprietary artificial intelligence cloud services using the AI frameworks and AI infrastructure provided by the platform.
Fig. 2 is an example schematic diagram illustrating a scenario 200 for inference and training of various neural network models, according to an embodiment of the present disclosure.
During the training phase, the server may train the various neural network models (e.g., the policy and reward networks) according to embodiments of the present disclosure based on a training sample set. After training is completed, the server may deploy each trained neural network model (e.g., the policy network) to one or more servers (or cloud services) to provide artificial intelligence services related to controlling the industrial system.
In the inference phase, it is assumed that an application (e.g., a microgrid operation application) that interacts with the server 110 for industrial control has been installed on the user terminal 120 that controls the industrial system. The user terminal 120 may indicate current state information of the industrial system to that application. For example, taking a microgrid as the industrial system, the current state information of the microgrid includes, but is not limited to, power generation, energy storage, load, operating parameters, environmental information, grid-tie information, user demand information, and the like.
The user terminal 120 may then transmit a control information acquisition request through the network to the server 110 corresponding to the application, to request that the neural network deployed on the server 110 generate control information. After receiving the request, the server 110 processes it using the trained policy network to obtain current control information for the current state information, and then feeds back a control information response, including the generated control information, to the user terminal 120. Thereafter, the user terminal 120 may determine the next state information based on the current control information and the current state information. At this point, the user terminal 120 controlling the industrial system has obtained the operation recommended by the neural network and the effect performing it would produce, and the operator of the user terminal 120 decides whether to perform the corresponding operation.
It is noted that the training sample data shown in Fig. 2 may also be updated in real time. For example, the user may score the control information: if the user considers the current control information safe and efficient, the user may give it a higher score, and the server 110 may treat the current-state-information/current-control-information pair as a positive sample for training the neural network model for industrial control in real time. If the user gives the current control information a lower score, the server 110 may treat the pair as a negative sample.
The training sample set shown in fig. 2 may also be set in advance. For example, referring to fig. 2, a server may obtain training data from a database and then generate a training sample set for training a neural network for industrial system control. Of course, the present disclosure is not limited thereto.
How to make the control data output by an automated industrial system under the reinforcement learning framework conform to safety constraints remains a challenging problem. Two technical schemes have been proposed in industry and academia: (1) directly optimizing the cost function and (2) cost-function projection. Both have limitations.
The scheme of directly optimizing the cost function uses a cost function to estimate the long-term cost of a state-action pair, constructs a cost-function constraint by setting a threshold, and finally solves the reinforcement learning objective function using dual optimization. Since this scheme relies on a cost function to estimate long-term cost, it suffers from estimation error. Meanwhile, the threshold-based scheme can only handle soft constraints, not hard constraints.
The cost-function projection scheme separates the process by which the policy network learns safety from the traditional reinforcement learning policy optimization process: after a policy maximizing long-term rewards is obtained, the policy is projected into a safe domain through the cost-function constraint. This scheme readily outputs constraint-violating actions when it encounters states outside the training distribution, and thus readily generates constraint-violating control information in practical applications.
The present disclosure provides, on this basis, a method of generating a control strategy for an industrial system, comprising: acquiring current state information generated by controlling the industrial system based on the control strategy of the previous state; predicting, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state; generating, based on the first control information and equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state; correcting the second control information based on inequality constraints for controlling the safety risk of the industrial system to generate a control strategy for the industrial system in the current state; and controlling the industrial system in the current state in the application environment of the industrial system based on the control strategy.
The present disclosure also provides an apparatus for generating a control strategy for an industrial system, comprising: an acquisition module configured to acquire current state information generated by controlling the industrial system based on the control strategy of the previous state; a policy network configured to predict, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state; a completion module configured to generate, based on the first control information and equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state; a correction module configured to correct the second control information based on inequality constraints for controlling the safety risk of the industrial system, generating a control strategy for the industrial system in the current state; and a control module configured to control the industrial system in the current state in the application environment of the industrial system based on the control strategy.
The present disclosure also provides a training method for training the above apparatus, the training method comprising: generating, with the apparatus under training, a control strategy for the current state information, a reward value for the control strategy, and next state information of the industrial system; storing the current state information, the control strategy, the reward value, and the next state information in an experience replay pool as an experience transfer sequence; sampling a batch of experience transfer sequences from the experience replay pool and training the reward network under training on that batch so that the value of the reward loss function converges; and training the policy network under training with the reward network under training, the batch of experience transfer sequences, the equality constraints for controlling the safety risk of the industrial system, and the inequality constraints for controlling the safety risk of the industrial system, so that the value of the augmented loss converges.
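A condensed sketch of one round of this training procedure (hypothetical method names; this follows the four steps above under the assumption of a generic replay pool and trainable reward network, not a verified implementation):

```python
import random

def training_round(apparatus, reward_net, env, replay_pool, batch_size=64):
    # 1. Use the apparatus under training to generate a control strategy,
    #    its reward value, and the next state of the industrial system.
    state = env.current_state()
    strategy = apparatus.generate(state)           # predict + complete + correct
    reward, next_state = env.apply(strategy)

    # 2. Store the experience transfer sequence in the replay pool.
    replay_pool.append((state, strategy, reward, next_state))

    # 3. Sample a batch and train the reward network until its loss converges.
    batch = random.sample(replay_pool, min(batch_size, len(replay_pool)))
    reward_net.fit(batch)

    # 4. Train the policy network with the reward network and the safety
    #    constraints so that the augmented loss converges; gradients pass
    #    through the constraints via implicit differentiation.
    apparatus.policy.fit(batch, reward_net,
                         use_equality_constraints=True,
                         use_inequality_constraints=True)
```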
Various embodiments of the present disclosure complete and adjust the predicted action sequence for controlling an industrial system based on the equality constraints and inequality constraints for controlling the safety risk of the industrial system, thereby ensuring that the control strategy for the industrial system satisfies its safety-risk constraints. Further, the training process of the policy network according to some embodiments additionally uses these equality and inequality constraints to construct an augmented loss, implementing gradient backpropagation through the safety-risk constraints via implicit differentiation and yielding a more robust and reliable policy network.
All or a portion of the embodiments of the present disclosure are described in more detail below in conjunction with Figs. 3-11.
FIG. 3 is a flow chart illustrating a method 30 of generating a control strategy for an industrial system according to an embodiment of the present disclosure. Fig. 4 is a schematic flow chart diagram illustrating a micro grid scheduling method 40 according to an embodiment of the present disclosure.
The method 30 of generating a control strategy according to embodiments of the present disclosure may be applied to any electronic device. It is understood that the electronic device may be any of various hardware devices, such as a personal digital assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server, etc. For example, the electronic device may be the server or user terminal of Fig. 1. Hereinafter, the present disclosure is described taking a server as an example; those skilled in the art will understand that the present disclosure is not limited thereto.
In the disclosed embodiments, hard constraints refer to strict constraints, or sets of constraints, placed on control strategies for industrial systems. These are constraints the control strategy cannot violate; violating them leads to a safety risk for the industrial system. For example, in the task of a cleaning robot cleaning a room, hard constraints may include limitations such as not damaging furniture and not colliding with obstacles. As another example, in a microgrid control task, hard constraints may include conservation of energy, the current in each wire not exceeding a maximum value, and so on. Hard constraints are therefore among the important factors guaranteeing the feasibility and safety of the control tasks of industrial systems.
For example, the method 30 according to an embodiment of the present disclosure may optionally include the following operations S300 to S303. Of course, the disclosure is not so limited.
First, current state information generated by controlling the industrial system based on the control strategy of the previous state is acquired in operation S300. In operation S301, first control information for controlling the industrial system is predicted based on the current state information of the industrial system.
For example, an industrial system according to the present disclosure is any system involving automated control, which may include various production equipment, mechanical and automated equipment, production flow control systems, electrical control systems, and the like.
For example, the (current) state information of the industrial system is any information reflecting the (current) operating state of the industrial system, which may include the operating status of the respective devices, the progress of the production process, product quality, and the like. Control information for an industrial system (including the first control information and the second control information detailed later) refers to information used to control and monitor the various components in the industrial system, which may include measured values of parameters such as temperature, pressure, flow, and speed, as well as information used to control these parameters. Of course, the disclosure is not so limited.
Optionally, in an industrial automation control application scenario, the state information of the industrial system may include: power parameters of the industrial system (e.g., voltage, current, power), current information of each controller (e.g., normal operation, disabled, current control signals), operational status information of each actuator (e.g., information that an action such as translation or rotation is being performed), and the like. The first control information corresponding to the current state information may include: adjustment amounts of the industrial system's power parameters (e.g., voltage, current, and power adjustment amounts), control signal changes for each controller (e.g., control signal adjustment amounts), and the like. Of course, the present disclosure is not limited thereto.
Optionally, in an industrial resource scheduling application scenario, the state information of the industrial system may include: the current resource amount of each resource allocation node, the resource allocation restriction information of each resource allocation node, and the like. The first control information corresponding to the current state information may include: the resource adjustment amount for each resource allocation node. Of course, the present disclosure is not limited thereto.
Optionally, in a scenario where the industrial system is a robotic system, the state information of the industrial system may include: positional information of the robot, the robot's body angles, and the like; the first control information corresponding to the current state information may include: the current and voltage input to the robot's motors, the angles its joint motors should rotate by, etc. Of course, the present disclosure is not limited thereto.
For example, the first control information is a set of information for controlling a part of the components in the industrial system in the current state. In an example embodiment according to the present disclosure, the industrial system may include n components (n > 1), and the first control information may correspond to control information of m of them (n > m). Since the first control information does not cover all components, sufficient control of the industrial system cannot be achieved by the first control information alone. Of course, the present disclosure is not limited thereto.
Alternatively, a policy network (or any neural network), described in detail later, may be used to predict the first control information $z_t$ for controlling the industrial system based on the current state information of the industrial system. The first control information predicted by the policy network tends to drive the long-term benefit of the industrial system toward higher values. However, since the first control information is obtained via a policy network (or any neural network), it is not only incomplete, but some of the control information in it may not fully satisfy the hard constraints. It must therefore be completed and corrected under the combined action of the inequality constraints and equality constraints described in detail later, thereby generating a control strategy that satisfies the safety-risk constraints. Of course, the disclosure is not so limited.
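As a sketch of this prediction step (PyTorch assumed; dimensions are illustrative), the policy network outputs only the m controllable dimensions of an n-component system:

```python
import torch
import torch.nn as nn

n_components, m = 8, 5               # m < n: only part of the system is predicted

policy = nn.Sequential(              # a stand-in for the policy network
    nn.Linear(32, 128),
    nn.ReLU(),
    nn.Linear(128, m),               # first control information z_t (m values)
    nn.Tanh(),                       # keep the raw predictions bounded
)

s_t = torch.randn(1, 32)             # current state information
z_t = policy(s_t)                    # controls m of the n components only
```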
Next, in operation S302, second control information for controlling the industrial system is generated based on the first control information and the equality constraints for controlling the safety risk of the industrial system.
For example, an equality constraint for controlling the safety risk of the industrial system is a constraint the industrial system should satisfy that can be expressed as an equation. Once the values on the two sides of the equation are unequal, the industrial system presents a safety risk. For example, in an industrial system such as a robotic system, the physical body width of the robot is typically a fixed constant that cannot change during movement; if it changes, the robot disintegrates. For another example, in an industrial automation control application scenario, the equality constraints include: the total number of controllers is a fixed value, and so on. For another example, in an industrial resource scheduling application scenario, the equality constraints include: the total supplied power is equal to the total consumed power; the total number of available devices on the production line equals a fixed value; the sum of the power supplied by each power supply device equals the total supplied power; and so on. Of course, the present disclosure is not limited thereto.
For example, the second control information is a set of information for controlling all components in the industrial system in the current state. When all components in the industrial system perform the actions corresponding to the second control information, the industrial system reaches the next state under the combined action of the control information and environmental factors. In an example embodiment according to the present disclosure, the industrial system may include n components (n > 1), and the second control information corresponds to control information for all n of them. Of course, the present disclosure is not limited thereto.
Alternatively, a completion module, described in detail later, may be used to generate the second control information $\hat{a}_t = [z_t, \tilde{z}_t]$ for controlling the industrial system based on the first control information $z_t$ and the equality constraints for controlling the safety risk of the industrial system, where the completed portion $\tilde{z}_t$ is determined so that $h_s(z_t, \tilde{z}_t) = 0$ always holds. Here $h_s(\cdot)$ is the mathematical representation of the equality constraints for controlling the safety risk of the industrial system, and the role of the completion module is to make this equality always hold. Alternatively, $h_s(\cdot)$ may be an implicit function. An implicit function is a function determined by a functional relationship among one or more variables whose specific values may not be directly obtainable and must be solved for through other equations or conditions. For example, one instance of $h_s(\cdot)$ in a microgrid is the power-balance relation in which the total supplied power minus the total consumed power equals zero; given $z_t$, this relation implicitly determines $\tilde{z}_t$. Of course, the disclosure is not so limited.

Thereby, the second control information $\hat{a}_t$ necessarily satisfies all the equality constraints for controlling the safety risk of the industrial system. However, since the first control information is obtained via the policy network (or any neural network), part of the second control information, which includes the first control information, may not fully satisfy the inequality constraints. The second control information therefore needs to be corrected under the inequality constraints detailed later, so as to generate a control strategy that satisfies the safety-risk constraints. Of course, the disclosure is not so limited.
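Taking the power-balance equality constraint above as the example, a minimal sketch of such a completion step (the constraint and tensor shapes are illustrative; here $\tilde{z}_t$ happens to be solvable in closed form, though in general it may require a numerical solve):

```python
import torch

def complete(z_t, total_load):
    """Completion module sketch: append the value that makes the equality
    constraint h_s(z_t, z_tilde) = sum(z_t) + z_tilde - total_load = 0 hold."""
    z_tilde = total_load - z_t.sum(dim=-1, keepdim=True)   # solve h_s(...) = 0
    return torch.cat([z_t, z_tilde], dim=-1)               # second control info

# Five predicted outputs plus one completed output: all six components are
# now controlled, and total generation exactly matches the 2.0 load.
a_hat = complete(torch.tensor([[0.3, 0.5, 0.1, 0.4, 0.2]]), total_load=2.0)
```

Because the completed value here is an explicit, differentiable function of $z_t$, gradients can flow back through the completion; when $h_s(\cdot)$ is truly implicit, the implicit differentiation described later plays that role.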
Next, in operation S303, the second control information is corrected based on the inequality constraints for controlling the safety risk of the industrial system, and a control strategy for the industrial system is generated.
For example, inequality constraints for controlling the safety risk of the industrial system are constraints the industrial system should satisfy that can be expressed as inequalities. Once the values on the two sides of an inequality fail to satisfy the relation set by the operator (e.g., greater than (or equal to), less than (or equal to)), the industrial system presents a safety risk. For example, in an industrial system such as a robotic system, the robot's movement speed may need to reach a certain threshold, otherwise the robot may fall. For another example, in an industrial automation control application scenario, the inequality constraints include: the operating humidity of a device is less than a predetermined threshold, the operating temperature is less than a predetermined threshold, etc. For another example, in an industrial resource scheduling application scenario, the inequality constraints include: the power each supply device can provide should be no greater than the upper limit of its capacity, and so on. Of course, the present disclosure is not limited thereto.
Alternatively, a correction module, described in detail later, may be used to correct the second control information based on the inequality constraints for controlling the safety risk of the industrial system, generating the control strategy $a_t$ for the industrial system. Alternatively, the control strategy can be solved for iteratively, where $k_{test}$ denotes the number of gradient-descent steps (iteration steps) the correction module performs during operation. In each iteration, a correction amount for the second control information is determined based on the inequality constraints for controlling the safety risk of the industrial system; the second control information corresponding to this iteration is corrected with that correction amount; and the corrected second control information is taken as the second control information for the next iteration.

For example, in each iteration the correction module can determine the iteration gradient corresponding to this iteration by implicit differentiation, based on the inequality constraints for controlling the safety risk of the industrial system, and then determine the correction amount for the second control information from that iteration gradient. Through $k_{test}$ iterations, the correction module can adjust the control strategy of the industrial system to satisfy both the equality and the inequality constraints on safety risk, while still enabling the industrial system to obtain a high long-term benefit. Of course, the disclosure is not so limited.

For example, the process of calculating the correction amount for the second control information in each iteration includes: determining, based on the inequality constraints for controlling the safety risk of the industrial system, the partial-derivative information corresponding to the second control information of this iteration, the partial-derivative information indicating the rates of change corresponding to all components in the industrial system for this iteration; determining the iteration gradient of this iteration based on that partial-derivative information; and determining the correction amount for the second control information based on the iteration gradient of this iteration.

More specifically, since $h_s(\cdot)$ above may be an implicit function, it is not explicitly differentiable, and an explicit expression of $\tilde{z}_t$ as a function of $z_t$ cannot be obtained directly; $\partial \tilde{z}_t / \partial z_t$ therefore cannot be solved for directly. Differentiating the identity $h_s(z_t, \tilde{z}_t) = 0$ instead yields

$$\frac{\partial h_s}{\partial z_t} + \frac{\partial h_s}{\partial \tilde{z}_t}\,\frac{\partial \tilde{z}_t}{\partial z_t} = 0, \tag{1}$$

and equation (2) can be derived from equation (1) to solve for the gradient used to correct the second control information:

$$\frac{\partial \tilde{z}_t}{\partial z_t} = -\left(\frac{\partial h_s}{\partial \tilde{z}_t}\right)^{-1} \frac{\partial h_s}{\partial z_t}. \tag{2}$$

It can be seen that the gradient for correcting the second control information $\hat{a}_t = [z_t, \tilde{z}_t]$ can be expressed through first partial-derivative information $\partial \mathcal{L} / \partial z_t$ and second partial-derivative information $\partial \mathcal{L} / \partial \tilde{z}_t$, where $\mathcal{L}$ denotes the loss of long-term benefit. The first and second partial-derivative information together constitute the partial-derivative information corresponding to the second control information: the first characterizes the rate of change of the loss of long-term benefit due to the second control information with respect to $z_t$, and the second characterizes its rate of change with respect to $\tilde{z}_t$. Thus, the partial-derivative information corresponding to the second control information can indicate, in any one iteration, the rates of change corresponding to all components in the industrial system.

In one example, the correction module can use Newton's iteration method to solve, at each iteration and based on the inequality constraints for controlling the safety risk of the industrial system, for the correction amounts: a correction $\Delta z_t$ for $z_t$ and a correction $\Delta \tilde{z}_t$ for $\tilde{z}_t$, where $g_s(\cdot)$ is the mathematical representation of the inequality constraints for controlling the safety risk of the industrial system. The correction amounts $\Delta z_t$ and $\Delta \tilde{z}_t$ thereby represent an iteration gradient based on the inequality constraints. Newton's iteration method is a numerical method for finding approximate solutions of equations; it approaches the solution of an equation through successive iterations. Optionally, while using Newton's iteration method, the correction module takes the second control information $\hat{a}_t$ as the initial value and corrects the second control information at learning rate $\gamma$ until the number of iterations reaches an upper limit (e.g., $k_{test}$) or the control strategy satisfies the inequality constraints. That is, in each iteration the second control information is corrected using

$$\hat{a}_t \leftarrow \hat{a}_t - \gamma\,[\Delta z_t, \Delta \tilde{z}_t]. \tag{3}$$

The second control information, as the operand of the iterative process, thus iterates continuously in a direction that satisfies the inequality constraints. By using Newton's iteration method, the correction module can solve for a higher-accuracy control strategy with a faster convergence speed. Of course, the disclosure is not so limited.
Alternatively, in the actual inference process, $k_{test}$ may be set to a larger value to ensure satisfaction of the inequality constraints. Of course, the present disclosure is not limited thereto.
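A simplified sketch of the correction loop (a plain penalty-gradient variant rather than a full Newton step; the constraint g_s, the learning rate, and k_test below are illustrative):

```python
import torch

def correct(a_hat, g_s, k_test=50, gamma=0.1):
    """Correction module sketch: iteratively push the completed control
    toward the feasible set g_s(a) <= 0, starting from a_hat."""
    a = a_hat.clone().requires_grad_(True)
    for _ in range(k_test):
        violation = torch.relu(g_s(a)).sum()      # only violated terms count
        if violation.item() == 0.0:               # constraints already satisfied
            break
        (grad,) = torch.autograd.grad(violation, a)
        a = (a - gamma * grad).detach().requires_grad_(True)  # update as in (3)
    return a.detach()

# Example inequality constraint: each component's output stays below its cap.
g_s = lambda a: a - 1.0    # g_s(a) <= 0  <=>  every entry of a is <= 1.0
a_t = correct(torch.tensor([[1.3, 0.5, 0.9, 1.1, 0.2, 0.0]]), g_s)
```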
Optionally, the method 30 further includes an operation S304, in which the industrial system is controlled in the application environment of the industrial system based on the control strategy, and a reward value for the control strategy and the next state information of the industrial system are acquired. Optionally, the application environment of the industrial system includes the environmental conditions in which the industrial system actually operates, such as weather conditions and product market prices. The present disclosure is not limited thereto.
Next, a specific example of a method 30 according to an embodiment of the present disclosure in a micro grid scheduling application scenario is described in connection with fig. 4.
As shown in Fig. 4, in operation S401, first control information for controlling the microgrid system is predicted based on the current state information of the microgrid system. In operation S402, second control information for controlling the microgrid system is generated based on the first control information and the equality constraints for controlling the safety risk of the microgrid system. In operation S403, the second control information is corrected based on the inequality constraints for controlling the safety risk of the microgrid system, and a control strategy for the microgrid system is generated. Optionally, in operation S404, the microgrid system is controlled in the application environment of the microgrid system based on the control strategy, and a reward value for the control strategy and the next state information of the microgrid system are acquired.
For example, a microgrid is a small grid system consisting of distributed energy sources, an energy storage system, and loads. Microgrid scheduling refers to the coordinated scheduling of each component in the microgrid to achieve stable, economical, and reliable operation of the microgrid.
For example, the microgrid state information may include: distributed energy generation information (e.g., power generation capacity, output power, and operating state of solar, wind, biomass, fuel-cell, and similar sources); non-distributed energy generation information (e.g., thermal power generation information, nuclear power generation information); energy storage system information (e.g., charge and discharge state, capacity, and remaining energy of devices such as batteries that store energy or heat); load information (e.g., real-time load, predicted load, adjustable load); microgrid grid-tie information (e.g., grid-connected or off-grid); operating parameter information (e.g., voltage, current, frequency); environmental information (e.g., temperature, wind speed, sunlight); power market information (e.g., electricity prices, market demand, policy and regulatory constraints); and user demand information (e.g., critical-load priority, energy supply demand).

For example, the microgrid control information may include: non-distributed energy generation control information (e.g., thermal power generation scheduling information, nuclear power generation scheduling information), energy storage system control information (e.g., energy storage information, discharge information), load control information (e.g., load adjustment information), microgrid grid-tie control information, and the like. It should be appreciated that since only a portion of the parameters can be controlled in the microgrid scheduling scenario, the dimension of the microgrid control information is typically smaller than the dimension of the microgrid state information.
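A compact sketch of how these two information sets might be organized (the field names are illustrative groupings of the items listed above; the control structure is deliberately lower-dimensional than the state structure):

```python
from dataclasses import dataclass

@dataclass
class MicrogridState:                  # observed information, not all controllable
    distributed_generation: dict       # solar, wind, biomass, fuel-cell output, ...
    conventional_generation: dict      # thermal and nuclear generation information
    storage: dict                      # charge/discharge state, capacity, remaining energy
    load: dict                         # real-time, predicted, and adjustable load
    grid_tied: bool                    # grid-connected or off-grid
    operating: dict                    # voltage, current, frequency
    environment: dict                  # temperature, wind speed, sunlight
    market: dict                       # electricity prices, demand, policy constraints
    user_demand: dict                  # critical-load priority, supply demand

@dataclass
class MicrogridControl:                # only the controllable subset
    conventional_schedule: dict        # thermal/nuclear scheduling information
    storage_control: dict              # energy storage and discharge commands
    load_adjustment: dict              # load control information
    grid_tie_command: bool             # grid-connection control
```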
In one specific example, methods 30 and 40 may be characterized in terms of the following pseudocode.
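A Python-style sketch of such a loop, following the steps of methods 30 and 40 (all identifiers are illustrative):

```python
# Inference over M settlement periods, generating N control strategies each.
for episode in range(M):                    # e.g., one day of microgrid operation
    state = env.reset()                     # initial state information
    for t in range(N):
        z_t = policy(state)                 # S401: predict first control information
        a_hat = complete(z_t)               # S402: complete under equality constraints
        a_t = correct(a_hat, g_s, k_test)   # S403: correct under inequality constraints
        state, reward = env.step(a_t)       # S404: apply the strategy, observe reward
```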
Here M represents the number of periods over which long-term benefit needs to be settled, episode denotes one such period, and N represents the number of control strategies that need to be generated within each such period. For example, in the microgrid scenario, the period for settling long-term benefit may be one day. Of course, the present disclosure is not limited thereto.
thus, various embodiments of the present disclosure are able to supplement and adjust a predicted sequence of actions for controlling an industrial system (e.g., a microgrid) based on equality constraints and inequality constraints for controlling the safety risk of the industrial system (e.g., a microgrid), thereby ensuring the safety risk of a control strategy for the industrial system.
To more clearly describe the nature of the policy network, completion module, and correction module potentially used in the examples of Figs. 3 and 4, a more detailed description of them is given below with reference to Figs. 5-11.
FIG. 5 is an interactive schematic diagram illustrating an industrial system, a control strategy, and an application environment according to an embodiment of the present disclosure.
As shown in Fig. 5, an acquisition module (not shown), a policy network, a completion module, and a correction module may be used to obtain the control strategy corresponding to the current state. The policy network, completion module, and correction module are components of the apparatus that generates a control strategy for the industrial system. Specifically, the acquisition module is configured to acquire current state information generated by controlling the industrial system based on the control strategy of the previous state. The policy network is configured to predict, based on the current state information of the industrial system, first control information for controlling a part of the components in the industrial system in the current state. The completion module is configured to generate, based on the first control information and the equality constraints for controlling the safety risk of the industrial system, second control information for controlling all components in the industrial system in the current state. The correction module is configured to correct the second control information based on the inequality constraints for controlling the safety risk of the industrial system, generating the control strategy for the industrial system in the current state. And the control module is configured to control the industrial system in the current state in the application environment of the industrial system based on the control strategy.
After the control strategy corresponding to the current state is acquired, the application environment P of the industrial system can be used to determine the next state information from the current state and the control information. In this way, each state is associated with the previous state and the operation executed in the previous state, thereby achieving sequential decision-making in the industrial system.
Further, the purpose of the policy network pi, the completion module, and the correction module is: the control information corresponding to a plurality of consecutive states should enable the industrial system as a whole to obtain the maximum benefit, while ensuring that the control strategy corresponding to each state satisfies the hard constraints (namely, the equality constraint and the inequality constraint for controlling the safety risk of the industrial system).
The policy network pi, the completion module, and the correction module together constitute an agent capable of making sequential decisions. The agent determines a {state, control strategy} data pair at each time step t, and the control objective of the industrial system is to make the total reward corresponding to the sequence {(s_1, z_1), (s_2, z_2), ..., (s_t, z_t)} the highest. Here t is a positive integer; in the sequence, s_1 represents the first state information of the industrial system and z_1 represents the first control strategy for that state information, s_2 represents the second state information of the industrial system and z_2 represents the second control strategy for that state information, ..., and s_t represents the t-th state information of the industrial system and z_t represents the t-th control strategy for that state information.
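In standard reinforcement-learning notation, this control objective can be written as the following maximization; the discount factor gamma is an assumption, as the disclosure states only that the total reward should be highest:

    \max_{\pi}\ \mathbb{E}_{z_t \sim \pi(\cdot \mid s_t)}
    \left[ \sum_{t=1}^{T} \gamma^{\,t-1}\, r(s_t, z_t) \right]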
Each control strategy can be predicted using the policy network pi, the completion module, and the correction module. For example, with the first state information s_1 as the input to the policy network pi, the joint processing of the policy network pi, the completion module, and the correction module outputs the control strategy corresponding to the first state information s_1, i.e., the first control strategy z_1.
Based on the current state information and the control information predicted for that state information, the application environment of the industrial system may be used to determine the next state information. For example, the second state information s_2 may be determined from the first state information s_1 and the control strategy z_1 by means of the actual application environment P.
It should be appreciated that the actual application environment P may include a measured real application environment and/or a simulation of the real application environment using simulation software. Of course, the present disclosure is not limited thereto.
For example, for a microgrid scheduling application scenario, the next microgrid state information may be determined from the current microgrid state information and the current microgrid control information predicted for it, using simulation software such as grid operation learning software (Grid2Op), city learning software (CityLearn), etc.
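Grid2Op exposes the usual gym-style interaction loop, as in the hedged illustration below; the environment name is one of Grid2Op's bundled examples and its availability in a given installation is an assumption:

    import grid2op

    env = grid2op.make("l2rpn_case14_sandbox")   # an example environment shipped with Grid2Op
    obs = env.reset()                            # current grid state information
    action = env.action_space()                  # calling the action space yields a "do nothing" action
    obs, reward, done, info = env.step(action)   # next state determined by current state + control information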
The policy network pi, the completion module, and the correction module are described in more detail below in conjunction with FIGS. 6-11.
FIG. 6 is a schematic diagram illustrating the interaction of the policy network, the completion module, and the correction module during training according to an embodiment of the present disclosure. FIG. 7 is a flowchart illustrating a neural network training method 70 for industrial system control according to an embodiment of the present disclosure. Neural networks for industrial system control include, but are not limited to, the policy network, the completion module, and the correction module. The trained policy network, completion module, and correction module can then be used to perform the method 30 or the method 40 according to embodiments of the present disclosure.
As shown in FIG. 6, the neural network for industrial system control includes an acquisition module, the policy network, the completion module, and the correction module. The neural network for industrial system control optionally includes a reward network for assisting in the training of the policy network. The inputs of the reward network are the current state information and the control strategy for the current state information, and the output is the expected cumulative reward that the industrial system can obtain in the current state by adopting the control strategy.
Optionally, the reward network is a Q network (which may also be referred to as a critic network in some documents). The Q network is a value-function-based neural network that maps states and control strategies (control actions) to a numerical representation, i.e., a Q value. The Q value represents the expected cumulative reward that may be obtained by taking a specific control strategy z_t in the current state s_t. A "reward" (or "reward coefficient") according to embodiments of the present disclosure refers to a value by which control strategies may be quantitatively evaluated. Optionally, the "reward" may quantitatively evaluate the diversity and quality of the control strategy, and may also quantitatively evaluate the similarity between the control strategy and data determined by a human expert. Of course, the present disclosure is not limited thereto.
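As a minimal sketch, such a critic can be implemented as a small feed-forward network; the layer sizes below are assumptions, not values from the disclosure:

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Critic sketch: maps (state, control strategy) pairs to a scalar Q value.
        Layer sizes are illustrative assumptions, not values from the disclosure."""

        def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([s, z], dim=-1))   # Q(s, z)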
On this basis, the policy network, the completion module, the correction module, and the reward network constitute a Q-learning network under a reinforcement learning framework. In the training process described in detail below, through the interplay between the policy network, the completion module, and the correction module on one side and the reward network on the other, the parameters of the neurons in both the policy network and the reward network are continuously adjusted, so that the control strategies made by the policy network, the completion module, and the correction module allow the reward network to obtain the highest Q value.
For example, in a walking robot application scenario, the Q value may be the degree of completion of the motion of the robot over a period of time, and the control strategy indicates the output voltage of the joint motor of the robot at each time step. As another example, in a micro-grid scheduling application scenario, the Q-value may be the total revenue of the micro-grid over a period of time (e.g., an entire day), the control strategy indicating control information for various components of the micro-grid (e.g., wind turbines, thermal generators, energy storage units, etc.).
In the neural network training process for industrial system control shown in FIGS. 6 and 7, an update interval of C time steps is first set before training, and an experience playback pool D is initialized. Then the neurons phi in the policy network pi_phi are randomly initialized, and the neurons theta in the reward network Q_theta are likewise randomly initialized. The experience playback pool D is a cache used during training to store the interaction data of the policy network, the completion module, and the correction module with the application environment. As described in detail below, the data generated by this interaction with the environment includes the current state information s_t, the control strategy z_t, the reward value r_t for the control strategy, the next state information s_{t+1}, and so on. The quadruple {current state information s_t, control strategy z_t, reward value r_t for the control strategy, next state information s_{t+1}} is also known as an experience transfer sequence. The experience playback pool D stores a number of experience transfer sequences generated during training, so that the reward network Q_theta can be trained by randomly sampling batches of experience transfer sequences from it, which improves the efficiency and stability of the training process. Optionally, the size of the experience playback pool D is typically fixed, and new data overwrites the earliest data when the cache is full. Of course, the present disclosure is not limited thereto.
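A minimal sketch of such a fixed-size pool follows; the capacity is an assumed hyperparameter:

    import random
    from collections import deque

    class ExperiencePool:
        """Experience playback pool D sketch: fixed capacity, oldest data overwritten first."""

        def __init__(self, capacity: int = 100_000):
            self.buffer = deque(maxlen=capacity)   # deque drops the oldest entry when full

        def store(self, s, z, r, s_next):
            self.buffer.append((s, z, r, s_next))  # one experience transfer sequence (quadruple)

        def sample(self, batch_size: int):
            return random.sample(self.buffer, batch_size)  # random mini-batch for training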
Then, the number M of periods in which long-term revenue needs to be settled is set, together with the number N of control strategies that need to be generated within one such period. The training of the reward network Q_theta and the policy network pi_phi is thus completed over M x N iterations. At the same time, the iteration count k_train of the correction module needs to be set. Optionally, k_train may be less than or equal to k_test. According to the theory of Newton's iteration method, a limited number of gradient descent steps does not necessarily guarantee convergence to the optimal value (i.e., full satisfaction of the inequality constraints); during training, however, if the initial value is already close enough to the optimal value, the gradient descent process becomes sufficiently efficient. Thus, even with a smaller k_train, a control strategy that simultaneously satisfies the equality constraint and the inequality constraint can be obtained.
Next, in operation S701, during training, the control strategy z_t for the current state information s_t, the reward value r_t for the control strategy, and the next state information s_{t+1} of the industrial system are predicted.
Optionally, in operation S701, the in-training policy network pi_phi may be used to predict, based on the current state information s_t of the industrial system, the first control information z_t ~ pi_phi(z_t|s_t) for controlling the industrial system. Then, using the completion module, second control information for controlling the industrial system is generated based on the first control information and the equality constraint for controlling the safety risk of the industrial system. Next, using the correction module, the second control information is corrected based on the inequality constraint for controlling the safety risk of the industrial system, generating the control strategy for the industrial system. Then, based on the control strategy, the control module is used to simulate the control of the industrial system in the application environment of the industrial system, and the reward value r_t for the control strategy and the next state information s_{t+1} of the industrial system are obtained.
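A minimal numeric sketch of the complete-then-correct steps follows. It assumes a linear equality constraint A_p z_p + A_c z_c = b with an invertible block A_c for the completed components, and implements the correction as k_train gradient steps on a ReLU penalty; the disclosed implementation additionally backpropagates through these steps via implicit differentiation, which is not shown here:

    import numpy as np

    def complete(z_partial, A_p, A_c, b):
        """Completion sketch: fill in the remaining components so that the
        equality constraint A_p @ z_partial + A_c @ z_rest = b holds exactly.
        Linearity and an invertible A_c are simplifying assumptions."""
        z_rest = np.linalg.solve(A_c, b - A_p @ z_partial)
        return np.concatenate([z_partial, z_rest])          # second control information

    def correct(z, g, grad_g, k_train=5, lr=0.05):
        """Correction sketch: k_train gradient-descent steps on the penalty
        sum(ReLU(g(z))**2), pushing z toward the feasible set g(z) <= 0."""
        for _ in range(k_train):
            violation = np.maximum(g(z), 0.0)               # ReLU(g(z)), zero where satisfied
            z = z - lr * (grad_g(z).T @ (2.0 * violation))  # gradient of the squared penalty
        return z                                            # corrected control strategy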
Optionally, to increase the robustness and stability of the trained neural networks, in operation S701 the control strategy may also be perturbed with Gaussian noise, and the perturbed control strategy is then input to the control module, so that the control of the industrial system is simulated in the application environment of the industrial system based on the perturbed control strategy, and the reward value r_t for the control strategy and the next state information s_{t+1} of the industrial system are obtained. That is, when generating the action, random noise following a Gaussian distribution is added to the action. This randomness allows the policy to explore the state space more diversely, with a greater chance of finding the optimal solution. Of course, the disclosure is not so limited.
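A one-line illustration of this perturbation; the noise scale and clipping bounds are assumed hyperparameters:

    import numpy as np

    def perturb(z, sigma=0.1, z_min=-1.0, z_max=1.0):
        """Add zero-mean Gaussian exploration noise and clip back to an assumed feasible range."""
        return np.clip(z + np.random.normal(0.0, sigma, size=z.shape), z_min, z_max)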
Next, in operation S702, the current state information s_t, the control strategy, the reward value r_t, and the next state information s_{t+1} may be stored into the experience playback pool D as an experience transfer sequence. As described above, the experience transfer sequence may be a quadruple of the form {current state information s_t, control strategy z_t, reward value r_t for the control strategy, next state information s_{t+1}}. Of course, the present disclosure is not limited thereto.
As training proceeds, the experience playback pool D stores a number of experience transfer sequences that serve as training samples for the reward network and the policy network.
Then, in operation S703, a batch of experience transfer sequences may be sampled from the experience playback pool D, and the in-training reward network is trained based on this batch of experience transfer sequences so that the value of the reward loss function L_critic converges.
Optionally, in some embodiments of the present disclosure, a batch of experience transfer sequences may be sampled from the experience playback pool D by random sampling; the batch may include N experience transfer sequences, where N is a small positive integer. Because the data generated by the interaction of the reward network or the policy network with the environment is usually highly correlated, training directly on such data may overfit the reward network or the policy network and degrade the training effect. Randomly sampling a batch of experience transfer sequences reduces the risk of overfitting, improves the utilization of the data, accelerates the learning process, and improves training stability. Of course, the present disclosure is not limited thereto.
Optionally, the sampling of a batch of experience transfer sequences can be expressed as formula (4) and formula (5), where c, E, and sigma are constants and D_t is the experience playback pool at time step t. Of course, the present disclosure is not limited thereto.
Optionally, the reward loss function L_critic is the loss function corresponding to the reward network. As described above, the goal of the reward network is to learn a Q value that evaluates the value of taking the control strategy z_i in the state s_i. The reward loss function L_critic can take various forms, and the disclosure is not limited in this regard. Optionally, the reward loss function L_critic may consist of two parts: a TD error and a regularization term. The TD error characterizes the difference between the estimated Q value and its target value; the target value can be calculated from the Q estimate corresponding to the next state s_{i+1} and the reward r_i. The TD error may be calculated as a mean square error (MSE) or a mean absolute error (MAE). The regularization term is used to control the complexity of the model to avoid overfitting; it may take the form of L1 regularization, L2 regularization, or another form of regularization. By minimizing the reward loss function L_critic, the reward network Q_theta can gradually approach the true Q value, thereby improving the quality of its evaluations.
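A sketch of the MSE (TD-error) variant of L_critic; the discount factor gamma, the use of a target network for bootstrapping, and the omission of the optional regularization term are assumptions:

    import torch
    import torch.nn.functional as F

    def critic_loss(Q, Q_target, policy, batch, gamma=0.99):
        """TD-error reward loss (MSE form): y_i = r_i + gamma * Q'(s_{i+1}, pi(s_{i+1}))."""
        s, z, r, s_next = batch                        # r assumed shaped like the Q output
        with torch.no_grad():
            z_next = policy(s_next)                    # control strategy for the next state
            y = r + gamma * Q_target(s_next, z_next)   # target value from s_{i+1} and r_i
        return F.mse_loss(Q(s, z), y)                  # mean squared TD error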
Then, in operation S704, the in-training policy network pi_phi may be trained using the in-training reward network Q_theta, the batch of experience transfer sequences, the equality constraint h_s for controlling the safety risk of the industrial system, and the inequality constraint g_s for controlling the safety risk of the industrial system, so that the value of the enhancement loss converges.
Optionally, the enhancement loss includes at least one of a policy loss L_policy, an equality constraint loss L_eq (also called completion loss), and an inequality constraint loss L_ineq (also called correction loss). Of course, the present disclosure is not limited thereto.
Optionally, the policy loss corresponding to the control strategy generated by the in-training policy network can be determined using the in-training reward network. The policy loss L_policy can be obtained by minimizing a KL divergence, namely the following formula (6):

L_policy = E_{z_t ~ pi_phi} [ alpha * log pi_phi(z_t | s_t) - Q_theta(s_t, z_t) ]    (6)

where pi_phi is the policy function, alpha is a hyperparameter, and E_{pi_phi} denotes the expectation under pi_phi. The goal of this loss function is to maximize the Q value Q_theta(s_t, z_t) while minimizing the KL divergence between the policy function pi_phi and the target distribution.
Optionally, the equality constraint loss may be determined based on the equality constraint for controlling the safety risk of the industrial system. The equality constraint loss L_eq can be characterized as L_eq = || h_s(z_t) ||^2, i.e., the squared magnitude of the equality-constraint residual at the generated control strategy.
Optionally, the inequality constraint loss may be determined based on the inequality constraint for controlling the safety risk of the industrial system. The inequality constraint loss L_ineq can be characterized as L_ineq = || ReLU(g_s(z_t)) ||^2, where ReLU(·) is the rectified linear unit activation function, defined as f(x) = max(0, x), with x the input and f(x) the output; the loss therefore penalizes only violated inequality constraints.
The enhancement loss can thus be written as formula (7), indicating that the enhancement loss is a weighted sum of the policy loss, the completion loss, and the correction loss:

L = L_policy + lambda_1 * L_eq + lambda_2 * L_ineq    (7)

where lambda_1 and lambda_2 are hyperparameters.
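A sketch combining the three terms of formula (7); the sampling interface policy.sample, the entropy coefficient alpha, and the weights are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def enhancement_loss(Q, policy, s, h, g, lam_1=1.0, lam_2=1.0, alpha=0.2):
        """Weighted sum of the policy loss (6), completion loss, and correction loss (7)."""
        z, log_prob = policy.sample(s)                         # z ~ pi_phi(.|s) with log-density
        policy_loss = (alpha * log_prob - Q(s, z)).mean()      # KL-form policy loss, formula (6)
        eq_loss = h(z).pow(2).sum(dim=-1).mean()               # ||h_s(z)||^2, completion loss
        ineq_loss = F.relu(g(z)).pow(2).sum(dim=-1).mean()     # ||ReLU(g_s(z))||^2, correction loss
        return policy_loss + lam_1 * eq_loss + lam_2 * ineq_loss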
It can be seen that, in the training process of the policy network according to some embodiments of the present disclosure, the enhancement loss is constructed from the equality constraint and the inequality constraint for controlling the safety risk of the industrial system, so that the gradients of these safety-risk constraints are backpropagated by means of implicit differentiation, thereby obtaining a more robust and reliable policy network.
Next, some details of the training method described with reference to fig. 6 and 7 are further described with reference to fig. 8. FIG. 8 is yet another interaction diagram illustrating a policy network, a completion module, and a correction module in a training process according to an embodiment of the present disclosure.
Referring to FIG. 8, the exemplary training method 70 may employ a training scheme based on the Twin Delayed Deep Deterministic Policy Gradient (TD3), which achieves more efficient, lower-error training by using a clipped double-Q learning scheme and a delayed update strategy.
Optionally, as shown in FIG. 8, the training method 70 employing the double-Q learning scheme and the delayed update strategy divides the reward network into two sub-reward networks, namely a first reward network and a second reward network. The first and second reward networks are neural networks of identical structure but different parameters. During training, the first reward network may be updated at every time step, while the second reward network is updated every several time steps (e.g., every C time steps), at which point the neural network parameters of the second reward network may be updated to be the same as those of the first reward network. Here, the first reward network learns a first Q-value function and the second reward network learns a second Q-value function, so that the two are updated in the clipped double-Q learning fashion. Specifically, each time the first reward network is updated, the target Q value may be calculated using the second reward network, the Q value in the current state is then calculated using the first reward network, and finally the parameters of the first reward network are updated using the target Q value and the current Q value. In this way, the estimation error of the Q-value function can be reduced and performance improved.
With continued reference to FIG. 8, the policy network may likewise be divided into two sub-policy networks, namely a first policy network and a second policy network. The first and second policy networks are neural networks of the same structure but different parameters. During training, the first policy network may be updated at every time step, while the second policy network is updated every several time steps (e.g., every C time steps), at which point the neural network parameters of the second policy network are updated to be the same as those of the first policy network. Here, the first policy network pi(z_t|s_t; phi) is updated using the enhancement loss described above, while the second policy network synchronizes its neuron parameters with the first policy network every C time steps, which may further improve performance.
Specifically, as shown in FIG. 8, at each time step in the process of training the reward network and the policy network: a first predicted value of the control strategy is generated using the first policy network, and the first reward network predicts a first reward value based on this first predicted value; a second predicted value of the control strategy is generated using the second policy network, and the second reward network predicts a second reward value based on this second predicted value; the value of the reward loss function corresponding to the time step is determined based on the first reward value and the second reward value; and the neural network parameters of the first reward network are adjusted based on that value. Then, the first reward value is updated using the parameter-adjusted first reward network; the value of the enhancement loss corresponding to the time step is determined based on the updated first reward value; and the neural network parameters of the first policy network are adjusted based on the value of the enhancement loss.
The method 70 for training the policy network may be described in pseudocode form.
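A Python-style sketch consistent with the description of FIGS. 6-8 is given below; all identifiers are illustrative assumptions rather than the disclosed pseudocode:

    # Sketch of training method 70 (TD3-style; names are illustrative).
    # Setup: policy pi and its target copy, twin reward networks and their
    # target copies, experience playback pool D, update interval C.
    for episode in range(M):
        s = env.reset()
        for t in range(N):
            z = correct(complete(pi(s)))         # predict -> complete -> correct
            z = perturb(z)                       # Gaussian exploration noise
            s_next, r, done, _ = env.step(z)     # simulate control in the environment
            D.store(s, z, r, s_next)             # store the experience transfer sequence
            batch = D.sample(batch_size)
            update_reward_networks(batch)        # minimize L_critic (first reward network)
            update_policy_network(batch)         # minimize the enhancement loss (first policy network)
            if t % C == 0:
                sync_target_networks()           # copy parameters to the second (target) networks
            s = s_next
            if done:
                break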
It can be seen that, in the training process of the policy network according to some embodiments of the present disclosure, the enhancement loss is constructed from the equality constraint and the inequality constraint for controlling the safety risk of the industrial system, so that gradient backpropagation based on the safety-risk constraints is achieved by means of implicit differentiation, thereby obtaining a more robust and reliable policy network.
According to yet another aspect of the present disclosure, there is also provided an electronic device for implementing the method 30, 40, or 70 according to an embodiment of the present disclosure or for hosting the apparatus according to an embodiment of the present disclosure. FIG. 9 shows a schematic diagram of an electronic device 2000 in accordance with an embodiment of the present disclosure.
As shown in fig. 9, the electronic device 2000 may include one or more processors 2010, and one or more memories 2020. Wherein said memory 2020 has stored therein computer readable code which, when executed by said one or more processors 2010, can perform a method as described above.
The processor in embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The methods, operations, and logical blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of computing device 3000 shown in FIG. 10. As shown in FIG. 10, computing device 3000 may include a bus 3010, one or more CPUs 3020, a read only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in processing and/or communication of the methods provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in FIG. 10 is merely exemplary, and one or more components of the computing device shown in FIG. 10 may be omitted as practical needs dictate when implementing different devices.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. Fig. 11 shows a schematic diagram of a storage medium 4000 according to the present disclosure.
As shown in FIG. 11, the computer storage medium 4020 has computer readable instructions 4010 stored thereon. When the computer readable instructions 4010 are executed by a processor, a method according to an embodiment of the disclosure described with reference to the above figures may be performed. The computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The disclosed embodiments also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method according to an embodiment of the present disclosure.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are illustrative only and are not limiting. Those skilled in the art will understand that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and such modifications should fall within the scope of the disclosure.

Claims (15)

1. A method of generating a control strategy for an industrial system, comprising:
acquiring current state information generated by controlling the industrial system based on a control strategy of the previous state;
Predicting first control information for controlling a part of components in the industrial system in a current state based on the current state information of the industrial system;
generating second control information for controlling all components in the industrial system in a current state based on the first control information and an equality constraint for controlling the safety risk of the industrial system;
correcting the second control information based on inequality constraints for controlling safety risk of the industrial system, generating a control strategy for the industrial system in a current state, and
and controlling the industrial system in the current state under the application environment of the industrial system based on the control strategy.
2. The method of claim 1, wherein the modifying the second control information based on inequality constraints for controlling security risks of the industrial system, generating a control strategy for the industrial system comprises:
iteratively modifying the second control information to generate a control strategy for the industrial system, wherein, in each iteration,
determining a correction amount corresponding to the second control information based on an inequality constraint for controlling a safety risk of the industrial system;
Correcting the second control information corresponding to the iteration based on the correction amount corresponding to the second control information; and
and taking the corrected second control information as second control information in the next iteration.
3. The method of claim 2, wherein said determining, in each iteration, a correction amount corresponding to the second control information based on an inequality constraint for controlling a safety risk of the industrial system comprises:
determining partial derivative information corresponding to second control information corresponding to the iteration based on inequality constraints for controlling safety risk of the industrial system, the partial derivative information indicating a rate of change corresponding to all components in the industrial system for the iteration;
determining an iteration gradient corresponding to the iteration based on partial derivative information corresponding to second control information corresponding to the iteration; and
and determining the correction amount corresponding to the second control information based on the iteration gradient corresponding to the iteration.
4. The method of claim 1, wherein the industrial system is a micro-grid system,
the microgrid state information comprises: at least one of distributed energy power generation information, non-distributed energy power generation information, energy storage system information, load information, micro-grid-connected information, operation parameter information, environment information, electric power market information and user demand information;
The microgrid control information comprises: at least one of non-distributed energy power generation control information, energy storage system control information, load control information and micro-grid connection control information.
5. The method of claim 1, wherein the first control information for controlling a part of the components in the industrial system in the current state is predicted by a policy network, the second control information for controlling all the components in the industrial system in the current state is generated by a completion module, the control policy for the industrial system in the current state is generated by a modification module, and the industrial system in the current state is controlled by a control module in an application environment of the industrial system.
6. The method of claim 5, wherein the training of the policy network comprises:
predicting a control strategy for current state information, a reward value for the control strategy, and next state information of the industrial system during training;
storing the current state information, the control strategy, the rewarding value and the next state information into an experience playback pool as experience transfer sequences;
sampling a batch of experience transfer sequences from an experience playback pool, and training a reward network in training based on the batch of experience transfer sequences so as to enable the value of the reward loss function to be converged; and
Training the training policy network with the training reward network, the batch experience transfer sequence, the equality constraint for controlling the safety risk of the industrial system, and the inequality constraint for controlling the safety risk of the industrial system such that the value of the enhancement loss converges.
7. The method of claim 6, wherein inputs of the rewards network are the current status information and the control strategy for the current status information, and outputs are desired cumulative rewards that the industrial system can obtain in the current status with the control strategy.
8. The method of claim 6, wherein the training the strategic network such that the value of the enhancement loss converges comprises:
determining strategy loss corresponding to a control strategy generated by a strategy network in training by utilizing the reward network in training;
determining an equality constraint loss based on the equality constraint for controlling a security risk of the industrial system;
determining an inequality constraint loss based on the inequality constraint for controlling the safety risk of the industrial system; and
the enhancement loss is determined based on the policy loss, the equality constraint loss, and the inequality constraint loss.
9. The method of claim 6, wherein,
the reward network comprises a first reward network and a second reward network, wherein in the training process, the first reward network is updated in each time step, and the second reward network is updated in each multiple time steps; or alternatively
The policy network comprises a first policy network and a second policy network, wherein in the training process, the first policy network is updated at each time step, and the second policy network is updated at every plurality of time steps.
10. The method of claim 9, wherein, during the training process,
updating the neural network parameters of the second bonus network to be the same as the neural network parameters of the first bonus network every a plurality of time steps; or alternatively
And updating the neural network parameters of the second strategy network to be the same as the neural network parameters of the first strategy network every a plurality of time steps.
11. The method of claim 10, wherein, in each time step in the training process,
generating a first predicted value of a control strategy by using the first strategy network, and predicting a first rewarding value based on the first predicted value by using the first rewarding network;
Generating a second predicted value of the control strategy by using the second strategy network, and predicting a second prize value based on the second predicted value by using the second prize network;
determining a value of a reward loss function corresponding to the time step based on the first reward value and the second reward value; and
and adjusting the neural network parameters of the first reward network based on the value of the reward loss function corresponding to the time step.
12. The method of claim 11, wherein, in each time step in the training process,
updating a first rewarding value by using a first rewarding network after parameter adjustment;
determining a value of the enhancement loss corresponding to the time step based on the updated first prize value; and
and adjusting a neural network parameter of the first policy network based on the value of the enhancement loss.
13. An apparatus for generating a control strategy for an industrial system, comprising:
an acquisition module configured to: acquiring current state information generated by controlling the industrial system based on a control strategy of the previous state;
a policy network configured to: predicting first control information for controlling a part of components in the industrial system in a current state based on the current state information of the industrial system;
A complement module configured to: generating second control information for controlling all components in the industrial system in a current state based on the first control information and an equality constraint for controlling the safety risk of the industrial system;
a correction module configured to: correcting the second control information based on inequality constraints for controlling the safety risk of the industrial system, generating a control strategy for the industrial system in the current state; and
a control module configured to control the industrial system in a current state in an application environment of the industrial system based on the control strategy.
14. An electronic device, comprising:
a processor; and
a memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the method of any of claims 1-12.
15. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1-12.
CN202310491988.5A 2023-05-04 2023-05-04 Method and apparatus for generating control strategy for industrial system Pending CN116954162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310491988.5A CN116954162A (en) 2023-05-04 2023-05-04 Method and apparatus for generating control strategy for industrial system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310491988.5A CN116954162A (en) 2023-05-04 2023-05-04 Method and apparatus for generating control strategy for industrial system

Publications (1)

Publication Number Publication Date
CN116954162A true CN116954162A (en) 2023-10-27

Family

ID=88445085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310491988.5A Pending CN116954162A (en) 2023-05-04 2023-05-04 Method and apparatus for generating control strategy for industrial system

Country Status (1)

Country Link
CN (1) CN116954162A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422417A (en) * 2023-11-07 2024-01-19 东莞市维能新能源有限公司 Industrial energy storage system management method and system
CN117422417B (en) * 2023-11-07 2024-05-07 东莞市维能新能源有限公司 Industrial energy storage system management method and system

Legal Events

Date Code Title Description
PB01 Publication